Abstract

Background/Aims The central goal of Population-Based Research Optimizing Screening through Personalized Regimens (PROSPR), a recently funded NCI initiative, is to develop multi-site, transdisciplinary research to improve the screening process for breast, colon, and cervical cancer. To support this goal, we aim to collect, document, and manage data for the entire colorectal cancer (CRC) screening process at Group Health (GH), an integrated health system and PROSPR Research Center. We describe the data sources, data types, and collection methods being used to assemble the broad range of information this effort requires on patients, providers, tests, pathology, treatment, and outcomes.

Methods To characterize the CRC screening process for GH members enrolled from 1993 to 2015, we employed administrative databases, previous CRC studies, data partnerships, and GH’s EpicCare-based electronic medical record (EMR). These resources contain both structured data and unstructured text, so multiple collection methods are required, including programmatic extraction, natural language processing (NLP), and manual abstraction.

Results We are programmatically extracting demographic information on patients and providers from well-established administrative databases. Information on stool-based tests is extracted from lab databases and EpicCare. Colonoscopy and corresponding pathology notes are available as unstructured text in EpicCare for GH-performed procedures, and we are employing NLP to extract information on family history, test indication, and results from these notes. Scanned notes from contracted colonoscopy providers require manual abstraction; however, through a partnership with our largest contracted provider, we receive electronic transfers of this information as structured data, minimizing manual review. For colonoscopies occurring prior to GH’s 2005 implementation of EpicCare, we rely on data from five previously conducted CRC studies. Treatment information is extracted from pharmacy and utilization databases, and CRC outcomes are available as structured data through partnerships with our local cancer registries.

Conclusions Under the auspices of an ambitious initiative such as PROSPR, documenting the entire screening process can be achieved by creating a comprehensive data collection system that coordinates all available data sources and maximizes their value with appropriate collection methods. Efficiencies can be gained by using data from prior studies and developing external data partnerships for access to higher-quality data.
