Abstract
Background/Aims: N-grams are an established statistical tool for processing streams of naturally occurring tokens. "Tokens" can be words in speech or a document (as in natural language processing applications), nucleotide sequences (for genetic applications) or, we will argue, health service events (HSEs) such as diagnoses, procedures, and pharmacy fills occurring over time. We propose applying N-gram analysis to streams of HSEs observed in HMO administrative data, both as a method for analyzing actual health services, and as a means of generating realistic simulated data that are not subject to privacy or IRB concerns. Such simulated data could be used for applications where using real data would be too risky, or for developing code prior to actual IRB approval.
Methods: N-grams are based on counting token co-occurrence in a reference dataset. The n=2 case is easiest to explain. By counting the number of times each token X is preceded by token Y in the reference dataset, a matrix of conditional frequencies is generated. This matrix is then normalized and smoothed, resulting in a matrix of conditional probabilities that any given X is preceded by Y.
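The counting and normalization step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the event codes are hypothetical, the smoothing shown is simple add-alpha (Laplace) smoothing, and the matrix is oriented as the forward conditional P(next event | current event), which is the form used for generation in the Results section.

```python
from collections import defaultdict

def bigram_probabilities(stream, alpha=1.0):
    """Count how often each token follows each other token in the
    reference stream, then apply add-alpha (Laplace) smoothing and
    normalize each row into conditional probabilities P(next | current)."""
    vocab = sorted(set(stream))
    counts = defaultdict(lambda: defaultdict(int))
    for prev, cur in zip(stream, stream[1:]):
        counts[prev][cur] += 1
    probs = {}
    for prev in vocab:
        # Smoothed row total: observed count plus alpha for every vocab entry,
        # so unseen pairs still receive a small nonzero probability.
        total = sum(counts[prev].values()) + alpha * len(vocab)
        probs[prev] = {cur: (counts[prev][cur] + alpha) / total
                       for cur in vocab}
    return probs

# A toy reference stream of HSEs for illustration only.
stream = ["checkup", "flu_dx", "rx_fill", "checkup", "rx_fill", "checkup"]
probs = bigram_probabilities(stream)
```

Each row of the resulting matrix sums to one, so it can be used directly as sampling weights when generating simulated streams.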
Results: The resulting matrix will be interesting to study in its own right. While many of the high-probability co-occurrences will be obvious and well-known, there certainly will be some (particularly those where the time intervening between events X and Y is long) that are not, and that will bear investigation. Thus, the matrix is a tool for hypothesis generation. Further, the matrix can be used to generate realistic streams of simulated HSEs. A starting token is chosen at random, using the matrix probabilities as weights (e.g., if a checkup visit occurs twice as frequently as a starting HSE as a flu diagnosis does, then the checkup visit would be approximately twice as likely to be chosen as the starting token). Once the starting token is chosen, a next token is randomly drawn from a weighted sample based on the starting token, and so on, each token providing the sample weights for the choice of the next token. In this way, any amount of data can be manufactured.
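The generation procedure above can be sketched as a simple weighted random walk over the matrix. The event names, starting weights, and conditional probabilities below are invented for illustration; in practice both would come from the reference dataset.

```python
import random

# Hypothetical conditional probabilities P(next event | current event),
# i.e. one normalized row of the N-gram matrix per current event.
probs = {
    "checkup": {"checkup": 0.2, "flu_dx": 0.3, "rx_fill": 0.5},
    "flu_dx":  {"checkup": 0.1, "flu_dx": 0.1, "rx_fill": 0.8},
    "rx_fill": {"checkup": 0.6, "flu_dx": 0.2, "rx_fill": 0.2},
}
# Hypothetical marginal starting frequencies (e.g., a checkup visit is
# twice as common a starting HSE as a flu diagnosis).
start_weights = {"checkup": 0.5, "flu_dx": 0.25, "rx_fill": 0.25}

def simulate_stream(length, rng=random):
    """Draw a starting HSE from the marginal weights, then repeatedly
    draw the next HSE using the current token's row as sample weights."""
    events = list(start_weights)
    token = rng.choices(events, weights=[start_weights[e] for e in events])[0]
    stream = [token]
    for _ in range(length - 1):
        nxt = list(probs[token])
        token = rng.choices(nxt, weights=[probs[token][e] for e in nxt])[0]
        stream.append(token)
    return stream

simulated = simulate_stream(10)
```

Because each draw depends only on the current token, arbitrarily long streams can be produced at negligible cost, which is what makes the approach attractive for manufacturing test data.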
Conclusions: Applying N-gram analysis to HSE streams will offer unique opportunities to investigate event co-occurrence, and will provide simulated data useful for testing purposes.
- Received September 11, 2008.