Abstract

Background/Aims Mammographic findings such as a mass may be associated with breast cancer risk, but these data are only available in free-text reports and require resource-intensive manual abstraction. We developed and tested a Natural Language Processing (NLP) algorithm to extract mammographic findings (mass, calcification, asymmetric density, and architectural distortion) from free-text mammography reports.

Methods We identified 92,947 reports for women receiving screening and diagnostic mammography at Group Health in 2007–2008. We developed an NLP algorithm based on Perl Regular Expressions in SAS v9.2. The algorithm identifies words indicating mammography findings (mass, distortion, asymmetry and calcification) and their related words denoting laterality, negation, family history, personal history and uncertainty. It also sets three flags indicating possible errors of the NLP algorithm. An experienced abstractor manually reviewed a random sample of 50 mammography reports to test and refine the NLP algorithm.
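
The keyword-plus-modifier approach described above can be sketched roughly as follows. This is a minimal Python illustration, not the authors' SAS code: the patterns in `FINDINGS`, `NEGATION` and `LATERALITY` and the function `extract_findings` are hypothetical, and the family history, personal history, uncertainty and error-flag logic is omitted.

```python
import re

# Hypothetical key-word patterns; the actual SAS PRX patterns are not given in the abstract.
FINDINGS = {
    "mass": re.compile(r"\bmass(?:es)?\b", re.IGNORECASE),
    "calcification": re.compile(r"\b(?:micro)?calcifications?\b", re.IGNORECASE),
    "asymmetry": re.compile(r"\basymmetr(?:y|ies|ic)\b", re.IGNORECASE),
    "distortion": re.compile(r"\bdistortions?\b", re.IGNORECASE),
}
# Assumed modifier vocabularies, for illustration only.
NEGATION = re.compile(r"\b(?:no|not|without)\b", re.IGNORECASE)
LATERALITY = re.compile(r"\b(left|right|bilateral)\b", re.IGNORECASE)


def extract_findings(report):
    """Return one record per key word found, with sentence-level modifiers."""
    results = []
    # Split on whitespace that follows sentence-ending punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", report.strip()):
        negated = bool(NEGATION.search(sentence))
        side = LATERALITY.search(sentence)
        for name, pattern in FINDINGS.items():
            if pattern.search(sentence):
                results.append({
                    "finding": name,
                    "laterality": side.group(1).lower() if side else None,
                    "negated": negated,
                })
    return results
```

For example, `extract_findings("There is a mass in the left breast. No suspicious calcifications.")` yields a non-negated left-sided mass and a negated calcification.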

Results The algorithm correctly identified a mass on 46/50 reports, calcifications on 48/50 reports, asymmetric density on 50/50 reports, and architectural distortion on 48/50 reports. The NLP algorithm misinterpreted sentences such as, “there are calcifications with no other asymmetry.” In this sentence it incorrectly associated the negation word “no” with the key word “calcifications.” Building more refined rules on the association between negation words and key words will improve the accuracy.
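
One possible refinement of the negation rule can be sketched as below. This is an assumption for illustration, not the rule the authors adopted: the first function reproduces the reported error by letting any negation word negate every key word in the sentence, and the second applies a negation word only to key words that follow it. The names `KEYWORD`, `NEGATION` and both functions are hypothetical.

```python
import re

# Hypothetical patterns; not taken from the authors' SAS implementation.
KEYWORD = re.compile(
    r"\b(mass(?:es)?|calcifications?|asymmetr(?:y|ies|ic)|distortions?)\b",
    re.IGNORECASE,
)
NEGATION = re.compile(r"\b(?:no|not|without)\b", re.IGNORECASE)


def negated_sentence_level(sentence):
    """Naive rule: a negation word anywhere negates every key word in the sentence."""
    has_negation = bool(NEGATION.search(sentence))
    return {m.group(1).lower(): has_negation for m in KEYWORD.finditer(sentence)}


def negated_forward_scope(sentence):
    """Refined rule: a negation word negates only key words that come after it."""
    negation_starts = [m.start() for m in NEGATION.finditer(sentence)]
    return {
        m.group(1).lower(): any(pos < m.start() for pos in negation_starts)
        for m in KEYWORD.finditer(sentence)
    }
```

On the sentence “There are calcifications with no other asymmetry.” the naive rule marks both key words as negated, while the forward-scope rule marks only “asymmetry.” The forward-scope rule is still crude: in a sentence such as “no mass but calcifications are present” it would negate both key words, so clause boundaries would also need handling.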

Conclusions This NLP algorithm holds promise for accurate and fast identification of findings from free-text mammography reports. It can be shared across institutions and illustrates what can be done with free-text radiology reports beyond mammography. Depending on available resources, manual review may still be necessary for reports with a high probability of error.
