Abstract
Background/Aims Using administrative data, such as electronic health record (EHR) data, in population-based research saves resources but has potential limitations. Few studies have fully assessed the validity and efficiency of EHR-retrieved data. We developed ICD-9-based algorithms and operational processes to evaluate the applicability of EHR-derived cancer data.
Methods We retrieved data recorded between 01/01/2002 and 12/30/2011 from 4 different EHR sources and developed 3 ICD-9-based diagnostic algorithms (1+, 2+ and 5+). Women were classified as having breast cancer, endometrial cancer, or benign breast conditions (BBC). One trained abstractor manually reviewed medical records and recorded data in a structured database. Every 10th observation was selected and reviewed. Basic descriptive statistical analyses were conducted; observations with questionable values were flagged for re-evaluation. The final dataset was considered the “gold standard” and used to validate the algorithms and to assess the duration between the diagnostic and administrative dates.
Results A total of 1,056 women were included in this study. Of these, 189 were diagnosed with breast cancer and 40 with endometrial cancer. An additional 268 women had BBC. For breast cancer, the first algorithm yielded a sensitivity of 95.2% and a specificity of 96.4%; the second, a sensitivity of 94.2% and a specificity of 97.6%; and the third, a sensitivity of 89.0% and a specificity of 97.9%. The same algorithms yielded similar sensitivity and specificity for endometrial cancer. For BBC, the first algorithm yielded a sensitivity of 82.5% and a specificity of 56.7%; the second, a sensitivity of 73.1% and a specificity of 71.9%; and the third, a sensitivity of 38.8% and a specificity of 90.5%. The average duration between diagnostic and administrative dates for incident breast cancer, endometrial cancer, and BBC was 0.65, 0.01, and 0.31 years, respectively.
Conclusions Our initial findings confirm the validity and potential utility of EHR data for population-based cancer research. The “2+ ICD-9 Coding System” algorithm yielded the most efficient process. The lower sensitivity and specificity observed for BBC may be attributable to its wider pathologic spectrum. The relatively short interval between the EHR and diagnostic dates suggests that the two dates can be used interchangeably without introducing bias.
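
The count-based algorithms and their validation against the gold standard can be illustrated with a minimal sketch. It assumes that "1+", "2+" and "5+" mean at least 1, 2 or 5 occurrences of a qualifying ICD-9 code per woman; the abstract does not define the thresholds in more detail. The function names, the record layout, and the toy data are hypothetical. The prefix "174" is the ICD-9 category for malignant neoplasm of the female breast and is used here only as an example, since the abstract does not list the study's code sets.

```python
from collections import Counter


def count_matching_codes(records, prefixes):
    """Count qualifying ICD-9 code occurrences per patient.

    records: iterable of (patient_id, icd9_code) pairs (hypothetical layout).
    prefixes: ICD-9 code prefixes that qualify for the condition of interest.
    """
    counts = Counter()
    for patient_id, code in records:
        if code.startswith(tuple(prefixes)):
            counts[patient_id] += 1
    return counts


def classify(patient_ids, counts, min_codes):
    """Flag a patient as a case when she has at least min_codes qualifying codes."""
    return {pid: counts.get(pid, 0) >= min_codes for pid in patient_ids}


def sensitivity_specificity(predicted, gold):
    """Compare algorithm output with the manually abstracted gold standard."""
    tp = fp = tn = fn = 0
    for pid, truth in gold.items():
        pred = predicted[pid]
        if truth and pred:
            tp += 1
        elif truth and not pred:
            fn += 1
        elif not truth and pred:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy data, invented for illustration only (not study data).
records = [
    ("a", "174.9"), ("a", "174.9"),  # two qualifying codes
    ("b", "174.1"),                  # one qualifying code
    ("c", "610.1"),                  # non-qualifying code
    ("d", "174.4"),                  # one qualifying code, not a true case
]
gold = {"a": True, "b": True, "c": False, "d": False}

counts = count_matching_codes(records, prefixes=["174"])
for threshold in (1, 2, 5):
    predicted = classify(gold.keys(), counts, min_codes=threshold)
    sens, spec = sensitivity_specificity(predicted, gold)
    print(f"{threshold}+: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Raising the threshold in this sketch trades sensitivity for specificity, which is the same direction of change reported in the Results across the 1+, 2+ and 5+ algorithms.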




