CTSI Blogs

The Research Ethics Blog by Bernard Lo, MD

Genomic Sequencing: Identifying the Anonymous

Law and Order. CSI. N.C.I.S. Who hasn’t watched a crime investigation show on TV? A detective finds a tiny sample of blood at a crime scene. Just before a commercial break, the team runs a DNA analysis on the computer and submits the results to the national database. We all know what happens next – after the commercial, the computer identifies a suspect.

Re-identification of anonymous specimens makes for great TV drama, but it has disturbing implications for the confidentiality of genomic sequencing data in biomedical research once it becomes feasible to sequence a person’s entire genome at moderate expense.

How can an anonymous specimen be identified?

A “de-identified” or anonymous specimen can be converted into an identified one if the genomic sequence data can be compared with a database that contains both sequence data and donor identities. For example, unknown donors of biological materials found at a crime scene might be identified through the DNA forensic database at the Federal Bureau of Investigation (FBI). As of August 2009, the FBI had DNA profiles on over 7.3 million persons convicted of or accused of certain crimes. The FBI database uses short tandem repeats (STRs) at 13 loci to identify individuals and thus cannot identify someone from currently available information about single nucleotide polymorphisms (SNPs) in a research specimen. However, when full genomic sequencing in research subjects becomes feasible, STRs at the FBI loci will be apparent. Donors of research samples will be identifiable if they are in the FBI database.
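The matching step described above can be sketched in a few lines of Python. This is only a toy illustration: the locus names, the repeat counts, the identities, and the `find_donor` helper are all invented for the example, and real forensic matching handles partial profiles, mixtures, and laboratory error in ways this sketch ignores.

```python
from typing import Dict, Optional, Tuple

# A profile maps a locus name to an unordered pair of repeat counts.
# All loci, counts, and identities below are hypothetical.
Profile = Dict[str, Tuple[int, int]]


def normalize(profile: Profile) -> Profile:
    """Sort each allele pair so that (9, 7) and (7, 9) compare equal."""
    return {locus: tuple(sorted(alleles)) for locus, alleles in profile.items()}


def find_donor(sample: Profile, database: Dict[str, Profile]) -> Optional[str]:
    """Return the first identity whose stored profile agrees with the sample at every stored locus."""
    sample_n = normalize(sample)
    for identity, stored in database.items():
        stored_n = normalize(stored)
        if stored_n and all(
            sample_n.get(locus) == alleles for locus, alleles in stored_n.items()
        ):
            return identity
    return None


# An identified database (toy stand-in for a forensic database).
identified_db = {
    "person_A": {"LOCUS_1": (7, 9), "LOCUS_2": (12, 12), "LOCUS_3": (15, 18)},
    "person_B": {"LOCUS_1": (8, 9), "LOCUS_2": (11, 13), "LOCUS_3": (15, 16)},
}

# STR calls read out of a "de-identified" research sample after full sequencing.
research_sample = {
    "LOCUS_1": (9, 8),
    "LOCUS_2": (11, 13),
    "LOCUS_3": (16, 15),
    "LOCUS_4": (20, 22),  # extra locus not in the database; ignored by the match
}

print(find_donor(research_sample, identified_db))
```

The point of the sketch is that nothing about the research sample itself names the donor; the identity comes entirely from the second database.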

Companies that offer genome-wide testing to the public for ancestry testing or for personal or “recreational genomics” have databases that contain over 200,000 SNPs on each individual, together with identifiers. It is possible to identify a person on the basis of 80 independent SNPs. Hence, participants in whole-genome sequencing research can be identified if their data are in one of these commercial biobanks. The confidentiality of a research subject’s genomic sequencing data therefore depends on the security of databases that contain sequence data linked to identifiers. Confidentiality is only as strong as the weakest link in the chain of security. There is little information available to the public on what security standards are in place for forensic and commercial genomic databases and how those standards are enforced.
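A rough calculation suggests why a few dozen SNPs can be enough to single out one person. The sketch below is a back-of-the-envelope estimate, not the method behind the figure of 80 cited above. It assumes, purely for illustration, that the SNPs are independent, that each has two alleles at a frequency of 0.5, that genotypes follow Hardy-Weinberg proportions, and that the two people compared are unrelated. Real SNPs are usually less informative than this, so the true numbers are less extreme.

```python
def genotype_match_probability(allele_freq: float) -> float:
    """Chance that two unrelated people share a genotype at one two-allele SNP.

    Assumes Hardy-Weinberg genotype proportions (an assumption of this sketch).
    """
    p = allele_freq
    q = 1.0 - p
    genotype_freqs = (p * p, 2 * p * q, q * q)
    # Two people match if both carry the same genotype.
    return sum(f * f for f in genotype_freqs)


def random_match_probability(allele_freq: float, n_snps: int) -> float:
    """Chance of matching at every one of n_snps independent SNPs."""
    return genotype_match_probability(allele_freq) ** n_snps


per_snp = genotype_match_probability(0.5)
all_80 = random_match_probability(0.5, 80)

print(f"match probability at one SNP: {per_snp}")
print(f"match probability at 80 SNPs: {all_80:.2e}")
```

Under these idealized assumptions, a chance match at a single SNP is common, but a chance match at all 80 is so improbable that a full match in a database almost certainly points to the same person.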

There are two general approaches to protecting the confidentiality of identifiable research data: informed consent when specimens are donated, and security/confidentiality protections after donation. The possibility of re-identifying research participants raises two ethical issues regarding full genome sequencing in research. First, when specimens are collected for genomics research, what should donors be told about full genomic sequencing and the risk of re-identification? Second, what security provisions should be in place when full genomic sequencing data are shared with other researchers?

How should we obtain consent for full genomic sequencing?

There are many existing biological specimens (often associated with rich clinical data) on which full genomic sequencing might be carried out for research purposes. Using such existing specimens reduces the cost and the time to carry out genomics research.

Federal regulations allow existing biological materials and clinical data to be used for research without consent if no HIPAA identifiers are included in the data set. The underlying rationale is that there is no risk of breaching confidentiality and no one would object to the use of anonymized materials for research. But these regulations were written years ago, before it became possible to re-identify the donor of a “de-identified” sample using full genomic sequencing. Should we continue to let such “de-identified” specimens be used without consent in research that involves full genomic sequencing?

In other research studies employing full genomic sequencing, investigators will collect new samples from participants, often with detailed clinical information. What should participants be told during the consent process about the genomic sequencing and the risk of re-identification? For example, would it be ok for the researcher to simply tell participants that the specimens will be used in “genetics research,” without specifically discussing full genomic sequencing and the risk of re-identification?

How should investigators share genomic sequence data with other researchers?

Collecting biological specimens and detailed clinical data is difficult, and whole-genome sequencing currently is expensive. Sharing whole genomic sequencing data with other investigators would really help move research along. For example, participants in one study might be used as controls in a study of another disease or condition. Yet sharing data brings up concerns about confidentiality and consent for future studies. There are several approaches to sharing sequence data from genome-wide association studies:

  • Investigators may download individual-level genotype and phenotype data as de-identified data in encrypted files. Researchers must obtain approval for their project from a Data Access Committee and agree to data use restrictions and security measures. This arrangement would be most convenient for researchers, but it would be difficult to enforce the security provisions.
  • Investigators may access and analyze genomic sequence data on a secure website but not be permitted to download data. For example, researchers might be allowed access to data only at a central site with strict security.
  • Investigators may analyze genomic sequence data using only designated biostatistical centers with strict security, rather than by a biostatistician on the researcher’s team. This arrangement would provide the strongest security for the data, but would make it more difficult for investigators to carry out their research.

These options all involve difficult trade-offs between confidentiality of genomic data and efficiency of research. What do you think the best trade-off would be?


An important component of consent to clinical research is that it be informed. As this article illustrates, the nature of genetic research and its changing technologies make it difficult to ensure truly informed consent to repeated use of participants’ genomic data in studies beyond the one initially consented to. Although efficiency of research is of course nice for researchers, the primary concern should be confidentiality and protection of subject privacy. The onus is on the researchers to ensure the methods of protecting subject privacy are as strict as possible. Sent by Emily T.

I think that it is really important to keep the patient’s confidentiality. Donors for genomics research should know that there is a risk of re-identification. This point has to be clearly discussed with them, and they must then agree and give their consent for the specimens to be collected. It is not enough to tell them that their specimens are going to be used in genetics research; all possible consequences of donating specimens and the risks have to be discussed. On the other hand, to be able to learn and do good research, researchers usually have to share information; this is necessary. But the key point is to share information while keeping the maximum level of confidentiality for the donor. Maybe a way of doing this is to keep the genotype and phenotype data in encrypted files, placed on a secure website where only the genetic information is kept. All the identifiers for these data could be kept in a separate center with a strict security system, and the researchers would be able to access the data but not download them. I don’t think that it would be necessary to keep away the biostatistician on the researcher’s team, as this would hinder the research itself, but at least I would make sure that the identifiers are separate from the genetic data. There must be a balance between efficiency of research and confidentiality (there are always risks when patients’ information is collected), keeping in mind that the donor’s confidentiality is more important than the results of any study and that this information in the wrong hands could be catastrophic. Sent by T. Castillo

I think the genomics problem is an example of a more general problem: it seems likely that many of our "deidentified" data (i.e., without the HIPAA identifiers) could nonetheless be reidentified by a sufficiently persistent attacker. General medical and research ethics, as well as HIPAA and other legal regimes, seek to guard privacy and provide patients and research subjects with an absolute assurance of confidentiality. The desire for certainty is understandable, but I doubt it is generally achievable. If the facts are as I suspect--and that is not certain!--I'm not sure what conclusions should be drawn about appropriate responses in law, in data handling, and in disclosure.

I believe that with new developments in technology, and in turn changes in the capabilities gained from these new discoveries, the rules and regulations must also change. Policies on patient confidentiality and informed consent must remain stringent and take into account new discoveries/developments of relevance. On one hand, "de-identified" specimens are valuable sources of data; on the other, a paternalistic approach to informed consent negates the very meaning of the process. Withholding the potential harms/risks to participants, such as their identities being known and linked to in-depth records of their clinical data, circumvents the purpose of informed consent. The participant's autonomy should be respected, and they should be allowed to decide whether or not their information, be it genetic or clinical, is made available for research and to what degree this information will be disseminated and used. Of the several approaches listed regarding sharing of information and protocol, I foresee the first option being the most widely accepted by researchers. The key is that the burden of responsibility is on the researchers for the maintenance of a high level of security. The data obtained must be treated with the same level of caution as social security numbers or medical record numbers, with serious repercussions for breaches in security. Strict regulations regarding data management, handling and transfer must be upheld and enforced at an institutional level; again, with serious repercussions for failing to abide by these rules and regulations.

As others have said, the central issues here are informed consent and confidentiality. The only reason why these issues pose new challenges is that the technology has changed. With the widespread utilization of the internet, for instance, people have multiple identifiers that are constantly at risk. To be able to utilize the amazing resources the internet has to offer, companies have developed ways to protect these identifiers. In the same way, I think that clinical research, and specifically the field of genomics, has to evolve to address these risks to the confidentiality of the original donor. In ethics, one of the principal tenets is to do no harm, so while we see the ability to learn so much from these genetic sequences, we have to remember that we have to protect those whom we employ in our research. I do think that there is something to be said for making this pursuit of knowledge more feasible, and so a secure server that would allow researchers to access but not download the information might take care of the portability issues, making it more difficult to link the data with databases that have the ability to re-identify subjects. To echo what has been said, the other issue is informed consent. I do think that individuals should be informed of the risks of participating in a study in which their genetic material is sampled. I think, however, that the risk should not be overestimated. While it is possible for a genetic sequence to be re-identified, it is likely rare, and rarer still if efforts are made to put strict limitations on who accesses the data and in what way it is used.

When enrolling patients in a study we are asking them to participate in an endeavor that usually will not directly benefit them, that is not a part of standard clinical practice and that will potentially expose them to risks. Therefore, it is incredibly important that consent for clinical research be informed. That being said, in genomics research, the possibility of re-identifying an anonymous or de-identified specimen poses significant challenges to achieving truly informed consent. Many of those challenges stem from the fact that the confidentiality of a research subject’s genomic sequencing data does not depend solely on those conducting the research study. It relies heavily on the security of databases that contain sequence data linked to identifiers. Because of these additional databases, the people running the study cannot put into place measures to fully protect the confidentiality of a patient’s data once the specimens are donated, and therefore cannot fully assure patients that their genome sequencing data will remain protected. When specimens are collected for genomics research, it is the responsibility of the researchers to inform study participants of the risk of re-identification of their specimens. Unfortunately, with little public information on the security standards in place for forensic and commercial genomic databases, it may be difficult for researchers to adequately address the risk of re-identification and even more so for participants to find this information on their own. Perhaps the agencies with genomic databases should be required to disclose information on their security standards. Regarding the sharing of genomic information with other researchers, while this could help expedite research significantly and avoid repetition, the issues of confidentiality and informed consent must be carefully considered.
To ensure informed consent, when a participant consents to the original study he should be informed of the possibility that his genomic sequencing data may be distributed to and used by other researchers. A secure website that allows other researchers to access but not download genomic sequencing data would help to maintain the confidentiality of study participants while allowing researchers to share valuable data.

Part of the issue in re-identifying someone who donates for genomic studies, one that goes beyond research, is the link between a database of profiles held by a law-enforcement agency like the FBI and a research database. If there is the potential for a volunteer to be re-identified, they should be made aware of that during the consent process. But at what point would it be ok for the FBI or some other government body to take that de-identified data and try to re-identify a person? Although we are wary of a big-brother type of government, many things that already happen to us create unique identifiers. For example, to work at the VA, you have to get fingerprinted. Now your fingerprints are in a national database and can be linked back to you. To play devil's advocate, how different would it really be if the VA took a blood sample and had your genomic sequence on file instead of just your fingerprint?

Just as with any other type of epidemiologic study, participant confidentiality should be of paramount importance to a researcher. As long as genomic data is anonymous when it is published or shared, the risk of re-identification should be very low. In my opinion, it is wrong for governments to take extraordinary measures to match genomic information to the donor and use the information in criminal investigations, but if this happens, it is the researcher who is at fault. Researchers and companies that offer recreational genomics services should be aware of this potential problem and take measures to ensure that confidentiality is maintained.