PS3-13: Re-Identification Risk Associated with Sharing Linked Genomic and Phenotypic Data from the Kaiser Permanente Research Program on Genes, Environment and Health (RPGEH)

  • September 2013,
  • 148.3;
  • DOI: https://doi.org/10.3121/cmr.2013.1176.ps3-13

Abstract

Background/Aims It is now understood that conventional de-identification methods such as the HIPAA Safe Harbor standard do not guarantee anonymity of patient records, which may be vulnerable to a variety of attacks aimed at re-identifying confidential information. We present an analytic framework for evaluating these risks quantitatively, so that privacy and scientific utility can be balanced explicitly. As a concrete example, we examine the implications for patient privacy of plans to deposit over 70,000 full-genome genotypes and associated clinical data in the federally managed dbGaP data repository, as a component of an NIH-funded study conducted by the Research Program on Genes, Environment, and Health (RPGEH) at the Kaiser Permanente Northern California Division of Research (KPNC DOR). Risks are examined from multiple perspectives, and risk reduction strategies are discussed.

Methods Two analytic approaches are described: (1) “k-anonymization”, which computes risk based only on the distribution of cell sizes in the disclosed dataset; and (2) “k-map”, which takes account of the characteristics of potential reference datasets (e.g., voter rolls, disease registries) that may be available to the attacker. Probabilities of re-identification were computed using a random sample of records from actual study participants, assuming disclosure of the following phenotypic attributes: 5-year age group, sex, race (5 categories), and a set of 22 ICD-9-defined common diseases. For method 2, the KPNC EMR was used as a proxy for a highly informative reference dataset.
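The two approaches can be illustrated with a minimal sketch, which is not the study's actual code. It assumes each record is reduced to a tuple of the disclosed attributes, and a "cell" is the set of records sharing the same tuple. The function name `cell_size_fractions`, the thresholds, and the toy records are all hypothetical. Under the k-anonymization view, cell sizes are counted within the disclosed dataset; under the k-map view, they are counted in the reference dataset.

```python
from collections import Counter


def cell_size_fractions(disclosed, reference=None, thresholds=(2, 5, 10)):
    """Fraction of disclosed records whose cell size is below each threshold.

    With reference=None, cell sizes are counted within the disclosed data
    (k-anonymization view). Otherwise they are counted in the reference
    population (k-map view). A disclosed record that is absent from the
    reference gets cell size 0 and so falls below every threshold.
    """
    counts = Counter(reference if reference is not None else disclosed)
    sizes = [counts[record] for record in disclosed]
    n = len(sizes)
    return {t: sum(size < t for size in sizes) / n for t in thresholds}


# Hypothetical toy records: (age group, sex, race code, disease flags).
disclosed = [
    ("40-44", "F", "A", (1, 0)),
    ("40-44", "F", "A", (1, 0)),
    ("60-64", "M", "B", (0, 1)),
    ("25-29", "F", "C", (0, 0)),
]

# Hypothetical reference population that contains the disclosed records
# plus other people who share some of the same attribute combinations.
reference = (
    disclosed
    + [("60-64", "M", "B", (0, 1))] * 4
    + [("25-29", "F", "C", (0, 0))]
)

k_anon = cell_size_fractions(disclosed)           # cells within disclosed data
k_map = cell_size_fractions(disclosed, reference)  # cells within reference data

print("k-anonymization:", k_anon)
print("k-map:          ", k_map)
```

In this toy example, the threshold 2 corresponds to unique records (cell size 1). Because the reference population is at least as large as the disclosed sample, cells can only grow under the k-map view, so the fraction of records in small cells can only stay the same or decrease.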

Results The first method tended to yield very conservative estimates of risk: 9.5% of subjects in the disclosed dataset had unique phenotypic attributes, while 18% were in cells of size <5 and 24% were in cells of size <10. Method 2, which factors in the characteristics of potential reference datasets, yielded substantially lower levels of risk: 2% of subjects were unique, 4% were in cells of size <5, and 6% were in cells of size <10.

Conclusions Assessment of re-identification risk of disclosed genomic-phenotypic data is complex, involving differing stakeholders’ perspectives, attack types, and characteristics of both the disclosed data and the surrounding information environment. However, reasonable assumptions can be made which allow quantitative estimates of risk, and suggest strategies for risk reduction.
