C-C4-01: Rapid Exploration of Large Clinical Text Corpora for Information Extraction Feasibility Studies

Clinical Medicine & Research, November 2011, 9 (3-4) 168; DOI: https://doi.org/10.3121/cmr.2011.1020.c-c4-01

Abstract

Background/Aims Large amounts of information are “buried” in unstructured clinical text such as chart notes, pathology reports and radiology reports. Through electronic medical record systems, much of this text is available for computer-aided analysis. Determining the specific language used in clinical text to express content of interest is an important early step in text-mining efforts.

Methods We copied approximately 29.2 million clinical documents from our Epic Clarity database and other data sources to a secure SQL Server 2008 database and added a full-text index on the textual content. Details will be discussed. To query and view clinical text, we developed a Clinical Text Explorer application in Microsoft Access. None of the clinical text documents are de-identified, so IRB approval is required for use. Features include an intuitive interface for testing and refining search schemes; fast retrieval of chart documents containing specified text; review of either random or “best match” samples of documents to inform estimates of the sensitivity and specificity of a search; and highlighting of terms of interest to facilitate visual scanning.
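The random-sample review described above could feed a simple estimate like the following. This is a minimal sketch, not the Clinical Text Explorer implementation (which was built in Microsoft Access); the function names, document IDs and reviewer labels are all hypothetical. It computes the share of reviewed hits judged relevant (positive predictive value), which is only one input to sensitivity and specificity estimates, since those also require reviewing documents the search did not return.

```python
import random


def sample_hits(doc_ids, k, seed=0):
    """Return a reproducible random sample of up to k document IDs."""
    rng = random.Random(seed)  # fixed seed so a review sample can be redrawn
    ids = list(doc_ids)
    return rng.sample(ids, min(k, len(ids)))


def positive_predictive_value(labels):
    """Fraction of reviewed documents that the reviewer marked relevant.

    `labels` maps a document ID to True (relevant) or False (irrelevant).
    """
    if not labels:
        raise ValueError("no reviewed documents")
    relevant = sum(1 for is_relevant in labels.values() if is_relevant)
    return relevant / len(labels)


# Hypothetical usage: 100 hits, 10 drawn for manual review.
review_sample = sample_hits(range(100), 10, seed=1)

# Hypothetical reviewer judgments for four documents.
example_labels = {101: True, 102: False, 103: True, 104: True}
example_ppv = positive_predictive_value(example_labels)
```

Fixing the seed keeps the sample stable across sessions, so a second reviewer can be shown the same documents.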

Results Clinical Text Explorer allows researchers to quickly and easily identify patients who could not have been identified reliably by searching structured data alone. The following example describes the iterative process of defining the best search terms to find records mentioning results of the Oncotype DX test for breast cancer. In less than an hour we determined that the search “oncotype dx” was too narrow, while “oncot*” was too broad (drawing in records with words like “oncotech”). The search “oncotyp*” was the most comprehensive without losing specificity. When limiting the search to test results, we found that adding the criterion that “oncotyp*” occur near “score” eliminated most irrelevant documents, while also requiring “recurrence” narrowed the results too far. This search returned substantially more records than were found by searching structured lab-result data alone.
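The trade-offs among these search terms can be illustrated with a small sketch. The abstract does not give the actual query syntax, so the code below emulates phrase, prefix-wildcard and proximity matching in plain Python rather than using SQL Server full-text search. The function names, the 10-token proximity window and the four sample notes are assumptions made up for illustration; they are not taken from the study data.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_positions(tokens, term):
    """Token positions matching a term; a trailing * means prefix match."""
    term = term.lower()
    if term.endswith("*"):
        stem = term[:-1]
        return [i for i, tok in enumerate(tokens) if tok.startswith(stem)]
    return [i for i, tok in enumerate(tokens) if tok == term]


def matches(text, term):
    """True if any token in the text matches the term."""
    return bool(term_positions(tokenize(text), term))


def contains_phrase(text, phrase):
    """True if the phrase's tokens appear consecutively in the text."""
    tokens, wanted = tokenize(text), tokenize(phrase)
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))


def near(text, term_a, term_b, window=10):
    """True if the two terms occur within `window` tokens of each other."""
    tokens = tokenize(text)
    pos_a = term_positions(tokens, term_a)
    pos_b = term_positions(tokens, term_b)
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)


# Invented sample notes (hypothetical, for illustration only).
NOTES = [
    "Oncotype DX recurrence score was 18.",
    "OncotypeDX score 22, low risk.",
    "Sample sent to Oncotech for assay.",
    "Patient asked about Oncotype testing at next visit.",
]

phrase_hits = [n for n in NOTES if contains_phrase(n, "oncotype dx")]
broad_hits = [n for n in NOTES if matches(n, "oncot*")]
stem_hits = [n for n in NOTES if matches(n, "oncotyp*")]
result_hits = [n for n in NOTES if near(n, "oncotyp*", "score")]
```

On these invented notes, the exact phrase misses the run-together spelling, the short stem picks up the unrelated "Oncotech" note, and the proximity criterion drops the note that mentions the test without reporting a result.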

Conclusions The ease with which complex searches over large amounts of clinical text can be executed by this application eliminates barriers to text exploration posed by conventional methods such as regular expressions in SAS, allowing the domain expert (epidemiologist, physician, chart abstractor, etc.) to directly evaluate the results and refine the search.
