Abstract PS2-28: Using Natural Language Processing to Explore Schemes for Assessing Disease Risk from Free-Text Radiology Reports

  • December 2008
  • 148.4
  • DOI: https://doi.org/10.3121/cmr.6.3-4.148-c

Abstract

Background: Radiology reports are a form of clinical text containing rich, useful information that is difficult to extract by automated means. The Cancer Text Information Extraction System (caTIES) is open-source software that uses natural language processing (NLP) techniques to identify concepts from standardized medical thesauri in clinical text. Information derived from clinical text via NLP may help clinicians assess clinical risk of disease. We are exploring automated concept-coding of radiological exams of the abdomen in an effort to develop schemes for ordinal classification of clinical risk of diseases such as ovarian cancer.

Methods: We designated 63,681 pelvic ultrasound exams of 44,704 women performed during 1997–2006 as cases or controls. Cases included 214 exams of 188 women who received a pathologically confirmed ovarian cancer diagnosis within 1 year of the exam; controls included 63,467 exams of 44,516 women who did not. We divided all case exams and 10,000 randomly selected control exams into split-half development and validation samples (107 cases and 5,000 controls each). The full text of the development sample reports was concept-coded using caTIES. Guided by domain expertise and trial and error, we developed an algorithm to identify cases and classified each report according to it. Expert review of the 107 case reports and of random samples of 300 false-positive and 300 true-negative controls is ongoing, as is algorithm modification. The validation sample remains unused.
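
The split-half sampling described above can be illustrated with a minimal sketch. It assumes simple random assignment of exams; the abstract does not say how the split was drawn, and the function name, identifiers, and seed below are hypothetical.

```python
import random

def split_half(case_ids, control_ids, n_controls=10_000, seed=0):
    """Split all cases and a random subsample of controls into two equal halves.

    Assumption: exams are assigned by simple random shuffling. The seed is
    arbitrary and only makes the sketch reproducible.
    """
    rng = random.Random(seed)
    cases = list(case_ids)
    rng.shuffle(cases)
    # rng.sample returns the chosen controls in random order, so slicing
    # the result in half is itself a random split.
    controls = rng.sample(list(control_ids), n_controls)
    half_cases = len(cases) // 2
    half_controls = n_controls // 2
    development = {"cases": cases[:half_cases], "controls": controls[:half_controls]}
    validation = {"cases": cases[half_cases:], "controls": controls[half_controls:]}
    return development, validation

# Synthetic identifiers sized like the study (214 case exams, 63,467 control exams).
case_exams = [f"case-{i}" for i in range(214)]
control_exams = [f"ctrl-{i}" for i in range(63_467)]
dev, val = split_half(case_exams, control_exams)
print(len(dev["cases"]), len(dev["controls"]), len(val["cases"]), len(val["controls"]))
# prints: 107 5000 107 5000
```

Holding the validation half out entirely, as the authors do, keeps it available for an unbiased check once the algorithm is finalized.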

Results: A simple algorithm employing the concepts ‘mass,’ ‘simple,’ ‘hemorrhagic,’ and ‘resolution’ achieved a sensitivity of 63% and a specificity of 93% in the development sample. Expert review indicates the algorithm may be improved by (1) focusing exclusively on concepts in the report’s impression section, (2) incorporating additional concepts to identify true positives and exclude false positives, (3) developing custom NLP rules to associate organ systems with the concepts that refer to them, and (4) excluding concepts expressed in the future or past tense.
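
As a rough illustration of how a concept-based rule can be scored, the sketch below flags a report when it contains ‘mass’ and none of ‘simple,’ ‘hemorrhagic,’ or ‘resolution,’ then computes sensitivity and specificity. The abstract does not state how the four concepts were combined, so this rule is an assumption, the function names are hypothetical, and the toy reports are invented solely to exercise the code.

```python
# Assumed roles for the four concepts; the actual algorithm is not given in the abstract.
SUSPICIOUS = {"mass"}
REASSURING = {"simple", "hemorrhagic", "resolution"}

def flag_report(concepts):
    """Hypothetical rule: flag if 'mass' is present and no reassuring concept is."""
    found = {c.lower() for c in concepts}
    return bool(found & SUSPICIOUS) and not (found & REASSURING)

def sensitivity_specificity(predictions, labels):
    """Return (sensitivity, specificity) for paired boolean predictions and labels."""
    tp = sum(p and y for p, y in zip(predictions, labels))
    fn = sum((not p) and y for p, y in zip(predictions, labels))
    tn = sum((not p) and (not y) for p, y in zip(predictions, labels))
    fp = sum(p and (not y) for p, y in zip(predictions, labels))
    return tp / (tp + fn), tn / (tn + fp)

# Invented toy data: (concepts coded in the report, is_case).
reports = [
    ({"mass"}, True),                  # flagged, case      -> true positive
    ({"mass", "simple"}, True),        # not flagged, case  -> false negative
    ({"mass", "hemorrhagic"}, False),  # not flagged        -> true negative
    ({"resolution"}, False),           # not flagged        -> true negative
    ({"mass"}, False),                 # flagged, control   -> false positive
    (set(), False),                    # not flagged        -> true negative
]
predictions = [flag_report(concepts) for concepts, _ in reports]
labels = [is_case for _, is_case in reports]
sens, spec = sensitivity_specificity(predictions, labels)
print(sens, spec)  # prints: 0.5 0.75
```

If the reported figures refer to the development sample of 107 cases and 5,000 controls, a sensitivity of 63% corresponds to roughly 67 detected cases and a specificity of 93% to roughly 350 false-positive controls.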

Conclusions: Progress to date is encouraging for developing schemes for ordinal classification of clinical risk of disease. Opportunities for applying NLP to clinical text in medical research are numerous, and caTIES appears to be a promising tool, especially if customized to perform domain-specific NLP tasks.

  • Received September 11, 2008.