Law and Order. CSI. N.C.I.S. Who hasn’t watched a crime investigation show on TV? A detective finds a tiny sample of blood at a crime scene. Just before a commercial break, the team runs a DNA analysis on the computer and submits the results to the national database. We all know what happens next – after the commercial, the computer identifies a suspect.
Re-identification of anonymous specimens makes for great TV drama, but it has some disturbing implications for the confidentiality of genomic sequencing data in biomedical research once it becomes feasible to sequence a person’s entire genome at moderate expense.
How can an anonymous specimen be identified?
A “de-identified” or anonymous specimen can be converted into an identified one if the genomic sequence data can be compared with a database that contains both sequence data and donor identities. For example, unknown donors of biological materials found at a crime scene might be identified through the DNA forensic database at the Federal Bureau of Investigation (FBI). As of August 2009, the FBI had DNA profiles on over 7.3 million persons convicted of or accused of certain crimes. The FBI database uses short tandem repeats (STRs) at 13 loci to identify individuals and thus cannot identify someone from currently available information about single nucleotide polymorphisms (SNPs) in a research specimen. However, when full genomic sequencing in research subjects becomes feasible, STRs at the FBI loci will be apparent. Donors of research samples will be identifiable if they are in the FBI database.
Companies that offer genome-wide testing to the public for ancestry testing or for personal or “recreational genomics” have databases that contain over 200,000 SNPs on each individual, together with identifiers. It is possible to identify a person on the basis of 80 independent SNPs. Hence, participants in whole-genome sequencing research can be identified if their data are in one of these commercial biobanks. The confidentiality of a research subject’s genomic sequencing data therefore depends on the security of databases that contain sequence data linked to identifiers. Confidentiality is only as strong as the weakest link in the chain of security. There is little information available to the public on what security standards are in place for forensic and commercial genomic databases and how those standards are enforced.
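To make the linkage step concrete, here is a minimal sketch in Python of how a "de-identified" genotype could be matched against a database that stores genotypes together with identities. Everything in it is hypothetical and for illustration only: the `reidentify` function, the `customer-N` identifiers, the 0/1/2 genotype coding, and the randomly generated profiles are assumptions, not any real database's format. The default threshold of 80 shared SNPs is taken from the figure quoted above. Real matching would also have to tolerate genotyping errors and missing calls, which this sketch ignores.

```python
import random


def make_profile(rng, snp_ids):
    """Build a toy genotype: each SNP is coded 0, 1, or 2 (made-up data)."""
    return {snp: rng.choice((0, 1, 2)) for snp in snp_ids}


def reidentify(sample, database, min_shared=80):
    """Return the single identity whose profile agrees with `sample` at every
    shared SNP, or None if there is no match or the match is ambiguous.

    Profiles sharing fewer than `min_shared` SNPs with the sample are skipped.
    """
    matches = []
    for identity, profile in database.items():
        shared = [snp for snp in sample if snp in profile]
        if len(shared) < min_shared:
            continue  # too little overlap to attempt a comparison
        if all(sample[snp] == profile[snp] for snp in shared):
            matches.append(identity)
    return matches[0] if len(matches) == 1 else None


# Hypothetical identified database: five customers, 200 SNPs each.
rng = random.Random(0)
snp_ids = [f"snp{i}" for i in range(200)]
database = {f"customer-{n}": make_profile(rng, snp_ids) for n in range(5)}

# A research sample with no name attached, covering 100 of the same SNPs.
research_sample = {snp: database["customer-3"][snp] for snp in snp_ids[:100]}

# The same donor, but with only 10 SNPs available.
small_sample = {snp: database["customer-3"][snp] for snp in snp_ids[:10]}

print(reidentify(research_sample, database))
print(reidentify(small_sample, database))
```

The point of the sketch is that the research sample carries no name at all; the identity comes entirely from the other database. That is why the protection of a research subject's data depends on how every such database is secured.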
There are two general approaches to protecting the confidentiality of identifiable research data: informed consent when specimens are donated, and security/confidentiality protections after donation. The possibility of re-identifying research participants raises two ethical issues regarding full genome sequencing in research. First, when specimens are collected for genomics research, what should donors be told about full genomic sequencing and the risk of re-identification? Second, what security provisions should be in place when full genomic sequencing data are shared with other researchers?
How should we obtain consent for full genomic sequencing?
There are many existing biological specimens (often associated with rich clinical data) on which full genomic sequencing might be carried out for research purposes. Using such existing specimens reduces the cost and the time to carry out genomics research.
Federal regulations allow existing biological materials and clinical data to be used for research without consent if no HIPAA identifiers are included in the data set. The underlying rationale is that there is no risk of breaching confidentiality and no one would object to the use of anonymized materials for research. But these regulations were written years ago, before it became possible to re-identify the donor of a “de-identified” sample using full genomic sequencing. Should we continue to let such “de-identified” specimens be used without consent in research that involves full genomic sequencing?
In other research studies employing full genomic sequencing, investigators will collect new samples from participants, often with detailed clinical information. What should participants be told during the consent process about the genomic sequencing and the risk of re-identification? For example, would it be OK for the researcher to simply tell participants that the specimens will be used in “genetics research,” without specifically discussing full genomic sequencing and the risk of re-identification?
How should investigators share genomic sequence data with other researchers?
Collecting biological specimens and detailed clinical data is difficult, and whole-genome sequencing currently is expensive. Sharing whole genomic sequencing data with other investigators would really help move research along. For example, participants in one study might be used as controls in a study of another disease or condition. Yet sharing data brings up concerns about confidentiality and consent for future studies. There are several approaches to sharing sequence data from genome-wide association studies:
- Investigators may download individual-level genotype and phenotype data as de-identified data in encrypted files. Researchers must obtain approval for their project from a Data Access Committee and agree to data use restrictions and security measures. This arrangement would be most convenient for researchers, but it would be difficult to enforce the security provisions.
- Investigators may access and analyze genomic sequence data on a secure website but not be permitted to download data. For example, researchers might be allowed access to data only at a central site with strict security.
- Investigators may have genomic sequence data analyzed only at designated biostatistical centers with strict security, rather than by a biostatistician on the researcher’s team. This arrangement would provide the strongest security for the data, but would make it more difficult for investigators to carry out their research.
These options all involve difficult trade-offs between confidentiality of genomic data and efficiency of research. What do you think the best trade-off would be?