Abstract
Background/Aims The Electronic Medical Records and Genomics (eMERGE) Network is a national consortium of nine institutions supported by the National Human Genome Research Institute (NHGRI) to study genetic correlates of disease by pooling data from local biorepositories and electronic data ecosystems. Three HMO Research Network (HMORN) sites participate. Twenty-one of more than 40 planned genome-wide association studies (GWAS) have been completed without patient contact, covering chronic, cognitive, cardiovascular, gastrointestinal, hematologic, infectious, and other phenotypes. Transportable phenotype algorithms rely entirely on EMR data: structured data and, optionally, clinical text processed with natural language processing (NLP). Salient themes include bioinformatics, genomic medicine, privacy, and community engagement.
Methods Algorithms defining phenotype cases and controls are developed iteratively at a primary site in conjunction with one or two secondary sites. Implementation in SAS, KNIME, and Python, together with random manual chart review at multiple sites, establishes an algorithm’s positive predictive value (PPV) and portability. Validated algorithms, published as site-agnostic pseudo code documents on a secure Web site, are implemented with local tailoring at the remaining sites; data are pooled for analysis.
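As an illustration of the validation step, the following minimal Python sketch draws a random sample of algorithm-flagged patients for manual chart review and computes the PPV as the fraction of reviewed charts confirmed as true cases. The function names, patient identifiers, sample size, and review outcomes are hypothetical and do not represent eMERGE code or data.

```python
import random


def sample_for_review(flagged_ids, n, seed=0):
    """Draw a reproducible simple random sample of algorithm-flagged patients."""
    rng = random.Random(seed)
    ids = sorted(flagged_ids)
    return rng.sample(ids, min(n, len(ids)))


def positive_predictive_value(review_results):
    """PPV = charts confirmed as true cases / all flagged charts reviewed."""
    if not review_results:
        raise ValueError("no reviewed charts")
    confirmed = sum(1 for is_true_case in review_results.values() if is_true_case)
    return confirmed / len(review_results)


# Hypothetical input: 200 patients flagged as cases by a phenotype algorithm.
flagged = [f"P{i:03d}" for i in range(1, 201)]
to_review = sample_for_review(flagged, n=50)

# Hypothetical reviewer outcome: 46 of the 50 sampled charts are confirmed.
review = {pid: (i < 46) for i, pid in enumerate(to_review)}

ppv = positive_predictive_value(review)
print(f"PPV = {ppv:.2f}")
```

Repeating the same calculation on samples reviewed at secondary sites would give a simple per-site comparison of portability.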
Results Pseudo code documents are an efficient way to communicate the logic and content of phenotype algorithms across sites when a study requires data that are not available in multi-site standardized formats (such as the Virtual Data Warehouse, VDW) or that must be obtained using NLP. Iterative, random-sample chart validation is an important method for developing robust transportable algorithms. Business intelligence rules systems such as KNIME simplify implementation of complex algorithms and NLP. A Web site for sharing pseudo code and validation results aids communication.
Conclusions Experiences from the eMERGE Network offer valuable lessons for conducting multi-site studies in the HMORN when non-VDW and/or NLP-derived data are required.




