Abstract
Background and Aims: An intervention trial that required recruiting new cancer cases prompted us to investigate text classifiers for identifying pathology reports that describe malignancies. Text classifiers are among the simplest Natural Language Processing methods and are particularly useful for filtering large streams of documents. Our classifiers allowed study staff to concentrate their time on the reports most likely to yield recruitable subjects. The proposed presentation will describe our use of text classifiers as part of an efficient computer-assisted recruitment pipeline.
Methods: We gathered corpora of pathology reports for breast, lung and colorectal tissue, in which each report carried a ‘gold standard’ assessment of whether it discussed a malignancy. We randomly divided each corpus into a 75% training subsample and a 25% evaluation subsample, trained the classifier on the training subsample, and compared its categorizations to the gold standard in the evaluation subsample. Repeating this training/evaluation cycle generated a distribution of predictive value statistics, which gave us insight into the classifiers’ sensitivity to the particular random division of corpus reports. More importantly, the repeated cycle allowed us to experiment with tweaks to the classifiers and to the basic text processing that preceded them, and to evaluate quickly whether each tweak contributed to the classifier’s predictive power.
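The repeated training/evaluation cycle can be sketched as follows. This is a minimal illustration, not the study's implementation: the abstract does not name the classifier, so the multinomial naive Bayes model here is an assumption, as are the names `NaiveBayes`, `evaluate_once`, `repeated_evaluation`, and the invented `TOY_CORPUS` entries.

```python
import math
import random
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (assumed model)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def predict(self, text):
        """Return True if the report is scored as describing a malignancy."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in (True, False):
            # Smoothed log prior plus smoothed log likelihood of each token.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            denom = sum(self.word_counts[label].values()) + len(self.vocab) + 1
            for tok in tokenize(text):
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            scores[label] = score
        return scores[True] > scores[False]


def evaluate_once(corpus, rng, train_fraction=0.75):
    """One random 75%/25% split; returns (PPV, NPV), None where undefined."""
    docs = list(corpus)
    rng.shuffle(docs)
    cut = int(len(docs) * train_fraction)
    clf = NaiveBayes()
    for text, label in docs[:cut]:
        clf.train(text, label)
    tp = fp = tn = fn = 0
    for text, label in docs[cut:]:
        predicted = clf.predict(text)
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and not label:
            tn += 1
        else:
            fn += 1
    ppv = tp / (tp + fp) if (tp + fp) else None
    npv = tn / (tn + fn) if (tn + fn) else None
    return ppv, npv


def repeated_evaluation(corpus, n_rounds=100, seed=0):
    """Repeat the split/train/evaluate cycle to get a distribution of results."""
    rng = random.Random(seed)
    return [evaluate_once(corpus, rng) for _ in range(n_rounds)]


# Invented toy reports, for illustration only: (text, gold-standard label).
TOY_CORPUS = [
    ("invasive ductal carcinoma identified", True),
    ("adenocarcinoma present in colon biopsy", True),
    ("squamous cell carcinoma of the lung", True),
    ("malignant cells consistent with carcinoma", True),
    ("benign fibroadenoma no malignancy", False),
    ("hyperplastic polyp negative for malignancy", False),
    ("benign lung tissue with inflammation", False),
    ("no evidence of malignancy", False),
]

if __name__ == "__main__":
    for ppv, npv in repeated_evaluation(TOY_CORPUS, n_rounds=5):
        print("PPV:", ppv, "NPV:", npv)
```

Changing `tokenize` or the model and rerunning `repeated_evaluation` with the same seed is one way to compare a tweak against the baseline across many random splits rather than a single one.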
Results: Several enhancements contributed to the accuracy of the system. In particular, using multi-word phrases in addition to individual words as features increased the classifier’s predictive power. To our surprise, ‘stemming’ individual words (that is, reducing them to their root forms) had no perceptible effect. Once the classifiers were optimized, we were able to cover approximately 1600 pathology reports per week in approximately 30 minutes of a Research Specialist’s (RS) time. As an added benefit, when the RS found that a report had been flagged in error, the classifier was trained on that report, thereby improving its future performance.
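To illustrate what using multi-word phrases as features can look like, the sketch below emits every single word and every adjacent word pair from a report. The function name `ngram_features` and the `max_n` parameter are hypothetical, and the abstract does not say how phrases were actually selected in the study.

```python
def ngram_features(text, max_n=2):
    """Return all word n-grams of length 1..max_n, in order of n then position."""
    tokens = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features


if __name__ == "__main__":
    print(ngram_features("no evidence of malignancy"))
```

With phrases available, a feature such as "no evidence" or "of malignancy" can carry information that the isolated word "malignancy" does not, which is one plausible reason phrase features could help on reports that mention a malignancy only to rule it out.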
Conclusions: Text classifiers are an effective tool for optimizing the use of staff time in rapidly ascertaining cancer cases for recruitment. Furthermore, because the classifiers’ training (including the corrections made during the study) is easily saved to a file on disk, it becomes an independent asset, useful for future studies that need rapid ascertainment of cancer cases.
- Received May 27, 2010.
- Accepted May 27, 2010.

