Published On Mar 09, 2017
Under Lock and Key?
Genetic databases have helped medicine make great leaps forward. But is it really possible to keep the identities behind those genes a secret?
In 2008 James Watson became the second person to publish his whole genome online, a legacy especially significant because of his major role in discovering the structure of DNA more than a half century earlier. Watson made his genes public “to encourage the development of an era of personalized medicine,” which would come about when researchers had enough data from genetic volunteers to begin teasing out the connections between genes and diseases.
But Watson explained that he was choosing to black out a tiny section of his DNA first—a segment that contained the APOE gene, which can be linked with Alzheimer’s disease. Before long, specialists were able to deduce whether or not he carried the riskier version of that gene, since the trail of genetic markers pointing to it turned out to be more far-reaching than the segment he had deleted. The Nobel laureate then asked that an even wider region of his genome be made private, but by then the sensitive information was, to those in the know, already out of the bag.
Watson proved to be right about the value of genetic data. Hundreds of thousands of genomes have now been sequenced, and sifting through them has led to leaps ahead in the detection, treatment and prevention of disease. But the privacy breach that vexed Watson casts a shadow too. The problem of genomic privacy has no easy fix, and may in fact be getting worse.
Most volunteers start off with their identities more secure than Watson’s was. When genetic information is used by researchers, it is generally stripped of names and addresses and subject to tight controls over who can see it. But every sequenced genome is, by definition, more personal than a fingerprint. And it carries an intimate index of ancestry, sickness and health that can be compromising if it gets out.
Discovering the identity of the donor by looking at the data is easier than expected, says Yaniv Erlich, assistant professor of computer science at Columbia University. In an academic effort to test current identity protections, Erlich and other researchers were able to re-identify anonymous DNA donors from research datasets. The process involved a cross-search against genetic information freely available on the Internet, made public through genealogy and other consumer genetics websites. Users of these sites post their genetic sequences in the hope of being matched to relatives they might not know about, and many also make their genomes available to researchers.
Even if one donor hasn’t made data available on such a site, a close relative might have. A recent article in the Journal of Genetic Counseling, co-authored by identical twins, explored the ethical implications of one twin getting their genome sequenced without the consent of the other twin. Even third cousins may share up to 2% of their DNA, which means that a determined sleuth might start with your genome and turn up your relative’s data, then deduce your name through other clues your genome holds, such as hair color or ancestry.
“You might not even know these relatives,” says Erlich. “There are so many people with their data online now, and you have no control over what they do.”
The march of progress poses a problem too. While you might have given the OK to publish certain parts of your healthy genome, new discoveries may point to risk factors in your sequence that you didn’t know about at the time—or, in the case of Watson, new clues and correlations may pop up that unmask the areas you had wanted to keep hidden.
The consequences of having a name associated with a dangerous genetic health condition can be devastating for the patient. The Genetic Information Nondiscrimination Act (GINA) makes it illegal for employers and health insurance companies to use genetic information in a discriminatory way. But such use would be difficult to prove, and the law may be troublesome to enforce. “GINA doesn’t have teeth,” says Dov Greenbaum, an intellectual property attorney and adjunct professor at Yale University who has studied and written on issues of genomic privacy. Other kinds of insurers (providers of life, long-term-care and disability insurance, for example) are not covered under GINA and presumably could use genetic information to deny coverage to those at risk of developing dangerous conditions.
And while all patients’ medical data is usually protected by the strict guidelines of the federal Health Insurance Portability and Accountability Act (HIPAA), “anonymized” data in a research study is exempt from those rules. If the data is not directly connected to personal information, HIPAA doesn’t apply—which means the patient’s protection is only as good as the research team’s commitment to keep it safe.
The most widespread response from research institutions to the possibility of revealing patient identities has been to keep genetic data locked up tight. The National Institutes of Health has a number of genomic databases, but grants access only after a review process. And so far, hackers haven’t come looking for genetic data. “But your medical records are worth 20 times your credit card on the black market,” says George Church, a professor of genetics at Harvard Medical School. “So it’s only a matter of time before your genome is a prime target for hacking.”
Church believes that it’s naïve to promise complete privacy to genomic testing participants—at least today. So he’s taken the path of transparency with his Personal Genome Project, which has collected genomic, health and trait data from more than 5,000 volunteers. Those genomes are openly available to the public. And while most of the data has been anonymized to the best of the project’s ability, participants must take a test about their knowledge of the risks of releasing their genomes, and score 100% before they are allowed to participate.
But evolving technical approaches may help to ensure privacy. Laura Rodriguez, director of the Division of Policy, Communications and Education at NIH, has worked on the agency’s policy for sharing genomic research data. While she also sees the need for clear communication with participants about the risks and benefits, she is hopeful about advances in cybersecurity. “We need to keep thinking through and innovating around technical solutions,” she says.
One of those solutions might be to throw a bit of digital sand into genomic data. Earlier this year, researchers from MIT and Indiana University in Bloomington described a method of adding “differential privacy” to genomic datasets. This slightly scrambles the data, so that individual DNA sequences become smudged, and individuals can’t be identified. Rodriguez says that this kind of approach could be especially useful if it were tied to a system of tiered access. The first tier of a database could be accessible to a large number of researchers, but would have this smudging widely applied. Researchers could submit broad queries to see whether the data fit a research hypothesis. They would get back imperfect but statistically meaningful results that would allow them to determine whether to take a closer look. Access to the second tier would require passing a number of tests for data protection and safety, but the data there would be clean.
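To make the idea of "smudged" first-tier answers concrete, here is a minimal sketch of one standard textbook form of differential privacy, the Laplace mechanism applied to a counting query. It is not the MIT and Indiana University method, whose details are not given here. The function names, the toy cohort and the privacy parameter `epsilon` are all assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at zero.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_carrier_count(risk_allele_counts, epsilon, rng=None):
    """Answer "how many people carry the risk allele?" with added noise.

    risk_allele_counts: hypothetical list with one entry per participant,
        giving the number of copies (0, 1 or 2) of some risk allele.
    epsilon: assumed privacy parameter; smaller values mean more noise.
    """
    rng = rng or random.Random()
    true_count = sum(1 for copies in risk_allele_counts if copies > 0)
    # Adding or removing one participant changes a count by at most 1,
    # so the noise scale for this query is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Toy cohort of 1,000 participants (made-up data, not real genotypes).
    cohort = [0, 1, 2, 1, 0, 0, 2, 1, 0, 1] * 100
    print(noisy_carrier_count(cohort, epsilon=0.5))
```

Each query returns a slightly different number, but across a large cohort the answer stays close to the true count, which is the sense in which first-tier results would be imperfect yet statistically meaningful. A smaller `epsilon` hides any one participant better at the cost of noisier answers.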
But Erlich and others fear that differential privacy could compromise biomedical research, since it would be harder to work with this smudged data. Mark Gerstein, co-director of the Yale program in Computational Biology and Bioinformatics, favors the system used by Genomics England, a U.K. government repository of 100,000 genomes. That data sits in a protected enclave whose incoming and outgoing traffic is carefully monitored. Researchers must apply for credentials to study and manipulate that data. Gerstein believes that scientists would be better off testing their starting hypotheses on small amounts of pure data than on large quantities of imperfect data.
Whatever the solution, it must come soon. According to estimates in 2014, the number of sequenced genomes may be doubling every 12 months, with 1.6 million genomes sequenced by 2017—which means that genomic privacy could soon be everyone’s problem.