Abstract
Background/Aims Data joins have historically relied on deterministic linkage techniques, which require exact data matches to succeed. For data extracted from disparate sources, requiring an exact match may be too limiting. Probability-based linkage has moved from an arcane art that required expensive third-party software to a technique with multiple free and commercial options available. We will walk through the theory underlying the probability-based linkage used by many record linkage applications today.
Methods We describe the history of record linkage, the causes of linkage errors, the need for probability-based techniques for match assessment, and the mathematical framework of probabilistic linkage. We then discuss matching best practices and the limitations of matching ability and match assessment.
Results The audience will better understand how to choose linking variables, how to limit the search space for comparisons, and why pre-processing linkage variables matters, including comparison by permutation of variables, phonetic codes and edit-based measures.
Conclusions With a better understanding of its inner workings, limitations and potential pitfalls, the audience will be able to evaluate and use the available freeware and commercial linkage software more effectively. An advanced audience will be in a better position to create their own probabilistic matching algorithms.




