PS1-15: A Method for Discovering Variant Spellings of Terms of Interest in Clinical Text

  • December 2010,
  • 183.2;
  • DOI: https://doi.org/10.3121/cmr.2010.943.ps1-15

Abstract

Background and Aims: Mining information from unstructured clinical text requires knowledge of synonyms and variant spellings corresponding to the term of interest. Lexicons such as the Unified Medical Language System identify recognized synonyms but are less helpful in identifying variant spellings or locality-specific language. Natural language processing algorithms that rely on dictionary look-ups frequently miss terms because of variant spellings, and such misses account for as much as 25% of algorithm failures. This problem is not easily resolved deductively or through domain expertise. We describe an inductive, empirically driven method that identifies synonyms and variant spellings of a concept of interest.

Methods: The conceptual basis for this method is the principle that terms with similar meanings tend to be found in similar textual contexts. Accordingly, sets of words found with high frequency in the context of two synonyms will overlap with high probability. Call the word of interest the target term; call the high-frequency words around it context terms; call words most likely to be synonyms/variants synonym candidates. A four-step process can be used to identify variants. Step 1: Identify the target term’s highest-frequency context terms. Step 2: Identify terms found with high frequency near each of the context terms. Step 3: Eliminate Step 1 and Step 2 terms that appear more often outside the context than in it. Step 4: Manually review the remaining synonym candidates to identify true synonyms/variants. Using SAS, we illustrate the method for the target term HER2/NEU in Group Health chart notes for 2008, representing 200,913,281 lines of clinical text.
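
Steps 1 through 3 can be illustrated with a minimal Python sketch. This is not the authors' SAS implementation. The function names, the whitespace tokenizer, the fixed token window that defines "context", and the toy notes are all assumptions made for illustration. The abstract does not state how context was delimited, or exactly where the Step 3 filter was applied. The sketch applies the filter to both the context terms and the candidates.

```python
from collections import Counter


def tokenize(line):
    # Assumption: uppercase and split on whitespace, so "/" and "-" stay inside tokens.
    return line.upper().split()


def split_counts(lines, term, window=3):
    """Count each token's occurrences inside and outside `window` tokens of `term`."""
    term = term.upper()
    in_ctx, out_ctx = Counter(), Counter()
    for line in lines:
        tokens = tokenize(line)
        near = set()  # token positions that fall inside some window around `term`
        for i, tok in enumerate(tokens):
            if tok == term:
                near.update(range(max(0, i - window), min(len(tokens), i + window + 1)))
        for j, tok in enumerate(tokens):
            if tok == term:
                continue
            (in_ctx if j in near else out_ctx)[tok] += 1
    return in_ctx, out_ctx


def enriched_terms(lines, term, window=3, top_n=None, min_freq=2):
    """Steps 1 and 3: frequent context terms seen more often in context than out of it."""
    in_ctx, out_ctx = split_counts(lines, term, window)
    keep = []
    for tok, n_in in in_ctx.most_common(top_n):
        # n_in > n_out is the same test as in/out ratio > 1, without dividing by zero.
        if n_in >= min_freq and n_in > out_ctx[tok]:
            keep.append(tok)
    return keep


def synonym_candidates(lines, target, window=3, top_n=20, min_freq=2):
    """Steps 1-3: collect terms enriched near the target's enriched context terms."""
    candidates = set()
    for ctx in enriched_terms(lines, target, window, top_n, min_freq):
        # Step 2, filtered as in Step 3; no top_n cap at this level.
        candidates.update(enriched_terms(lines, ctx, window, None, min_freq))
    candidates.discard(target.upper())
    return sorted(candidates)


# Hypothetical toy notes, invented for illustration only.
notes = (
    ["HER2/NEU OVEREXPRESSION CONFIRMED"] * 2
    + ["HER2/NEU OVEREXPRESSION SEEN"]
    + ["HER-2 OVEREXPRESSION NOTED"] * 2
    + ["PATIENT CONFIRMED APPOINTMENT BY PHONE"] * 3
)
print(synonym_candidates(notes, "HER2/NEU", window=2))  # ['HER-2', 'NOTED']
```

In this toy run, OVEREXPRESSION is the only context term that survives the filter, and it surfaces HER-2 as a candidate. It also surfaces NOTED, a false positive of the kind that the manual review in Step 4 is meant to remove.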

Results: Four of the top 20 context terms near HER2/NEU had ratios of in_context_occurrence/out_of_context_occurrence > 1: OVEREXPRESSION, ONCOGENE, OVEREXPRESSED, C-ERBB-2. These context terms occurred 1196, 617, 540, and 170 times, respectively, and produced lists of 205, 119, 143, and 54 of their own context terms, respectively (excluding words with frequency < 2). Manual review yielded a final set of 23 synonyms and spelling variants, including HER-2, HER/2, HER-2-NEU, HER2/NUE (note spelling: “NUE”), FISH, IHC, and C-ERB-B-2 among others.

Conclusion: The method described provides an easily implemented approach to identifying synonyms/variants of terms of interest that can be used to enhance strategies for identifying concepts in clinical text.

  • Received May 27, 2010.
  • Accepted May 27, 2010.