Newswise — Since adopting electronic health records (EHRs), medical universities, hospitals and other health institutions have amassed enormous databases containing a diverse array of information, such as diagnoses, medications and lab results.

While such databases promise to serve as rich resources for clinical research, the data tends to be difficult, time-intensive and costly to analyze. A new project funded by the National Science Foundation (NSF) aims to change that.

“As available now, databases of electronic health records are diverse and massive, but they are also messy and heterogeneous. There’s a lot of noise,” said Jimeng Sun, associate professor at Georgia Tech’s School of Computational Science and Engineering. “Our charge is to find ways to make the information more robust and easier to read, thus leading to meaningful clinical concepts without extensive labor and time.”

As part of the four-year, $2.1 million NSF research project, data analytics teams from Georgia Tech and the University of Texas at Austin will develop algorithms and methods to convert EHR data into meaningful clinical concepts, or phenotypes, focused on diseases and specific health traits. Vanderbilt University will provide initial EHR data and phenotype validation.

The resulting phenotypes will be refined and adapted in conjunction with data from Northwestern University so that they can be used across multiple health institutions.

In addition to Sun, who serves as the lead principal investigator of the project, the team includes Bradley Malin and Joshua Denny, associate professors of biomedical informatics and computer science at Vanderbilt; Joydeep Ghosh, professor of electrical and computer engineering at Texas; and Abel Kho, associate professor of medicine-biomedical informatics at Northwestern.

Past efforts to create phenotypes from data tended to be costly and time-intensive. Several challenges face physicians and researchers in developing scalable phenotype methods. These include accurate patient representations, working with data across multiple dimensions, sufficient expert refinement and adaptability across multiple health institutions.

“Traditionally it takes six to 18 months to develop an algorithm for a single phenotype, which is too long,” Denny said. “There is also a tremendous need for developing high-throughput phenotyping methods that can directly model the interactions among heterogeneous information sources.”

The project will focus on three specific applications. The first is a system to accurately and effectively identify patients, even those with multiple symptoms and health traits, for clinical research and for developing predictive models for health studies.

The project can also provide effective phenotypes for genome-wide association studies (GWAS). At present, health researchers can work with only one phenotype at a time, but this project will enable researchers to quickly study multiple phenotypes jointly. Finally, the identified phenotypes can help analyze specific patient risks, such as the key health factors exhibited by Type 2 diabetes patients.

In addition to developing the algorithms and methods, the professors will try to develop new health analytics curricula for a massive open online course (MOOC) and for tutorial sessions at conferences.

This research is supported by the National Science Foundation (NSF) under Award 1418511. Any conclusions or opinions are those of the authors and do not necessarily represent the official views of the NSF.
