Scientists uncover the identities of anonymous DNA donors using freely available web searches.
By Ruth Williams from The Scientist
A person donating their DNA sequence anonymously for research purposes may in fact be identified by a few simple web searches, according to a paper published today (January 17) in Science. But rather than trying to protect anonymity, some scientists believe efforts should instead be focused on educating DNA donors and on legislating against the misuse of sequence data.
“The paper is a nice example of how simple it is to re-identify de-identified samples and that the reliance on de-identification as the mechanism of ensuring privacy and avoiding misuse is one that is not viable,” said Nita Farahany, a professor of law and research at Duke University in Durham, North Carolina, who was not involved in the study.
Participants in public sequencing projects are told that their anonymity is not 100 percent guaranteed, but the risk of a person’s identity being discovered was perceived to be minuscule, explained Yaniv Erlich, a computational geneticist at the Whitehead Institute for Biomedical Research in Cambridge, Massachusetts, who led the study. However, a 2005 Washington Post article about a teenage boy who tracked down his biological sperm-donor father via online genealogy searches suggested the risk may be significant. According to the article, the boy had submitted a sample of his own DNA to a genealogy service that used repeat sequences from his Y-chromosome to search its sequence databases for related males. Although the search did not uncover his father directly, it did find weak matches to two men who, importantly, shared a surname. Along with his father’s place and date of birth (information released to the mother), the likely surname enabled the boy to find and contact his father.
“We heard about this story and we thought, wow, this could be a threat for [the privacy of] personal genomes,” said Erlich.
To see how easy it might be to discover the identity of DNA donors, his team built software for retrieving Y-chromosome repeat information from whole genome sequences. With those repeat sequences, they could perform genealogy searches. “We thought, cool, let’s try it on the genome of Craig Venter,” said Erlich. “And it worked!”
They searched the available genealogical sequence database at Ysearch.org and, sure enough, the strongest match by far was to someone named Venter from Lincolnshire in England. The surname, together with Craig Venter’s known age and state of residence (two pieces of information commonly accompanying anonymous genome sequences), was then used to search the online public record, USsearch.com. The search came up with just two possible people, and one was Craig Venter.
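The two-step logic described above, a surname guess from Y-chromosome repeats followed by a narrowing search on age and state, can be illustrated with a small sketch. This is not the team's software: the function names, the distance measure, the surnames, and every repeat count and record below are invented for illustration, and a real genealogy database or public records service would be queried quite differently.

```python
from dataclasses import dataclass


@dataclass
class GenealogyRecord:
    surname: str
    markers: dict  # Y-chromosome repeat marker -> repeat count (toy values)


@dataclass
class PublicRecord:
    name: str
    surname: str
    birth_year: int
    state: str


def marker_distance(a, b):
    """Sum of absolute repeat-count differences over markers both profiles share.

    A simple stand-in for whatever scoring a real genealogy service uses.
    """
    shared = set(a) & set(b)
    if not shared:
        return float("inf")
    return sum(abs(a[m] - b[m]) for m in shared)


def infer_surname(query, database):
    """Return the surname of the closest profile in the toy database."""
    best = min(database, key=lambda rec: marker_distance(query, rec.markers))
    return best.surname


def narrow_candidates(surname, birth_year, state, records):
    """Keep only public records matching surname, birth year, and state."""
    return [
        r for r in records
        if r.surname == surname and r.birth_year == birth_year and r.state == state
    ]


# Hypothetical data: repeat counts, surnames, and people are all made up.
query_profile = {"DYS19": 14, "DYS390": 24, "DYS391": 11}
genealogy_db = [
    GenealogyRecord("Alder", {"DYS19": 14, "DYS390": 24, "DYS391": 10}),
    GenealogyRecord("Birch", {"DYS19": 16, "DYS390": 22, "DYS391": 11}),
]
public_records = [
    PublicRecord("Sam Alder", "Alder", 1950, "CA"),
    PublicRecord("Lee Alder", "Alder", 1982, "CA"),
    PublicRecord("Pat Birch", "Birch", 1950, "CA"),
]

likely_surname = infer_surname(query_profile, genealogy_db)
candidates = narrow_candidates(likely_surname, 1950, "CA", public_records)
print(likely_surname, [c.name for c in candidates])
```

The point of the sketch is that neither step is discriminating on its own: the surname guess leaves many possible people, and age plus state leaves many more, but the intersection can be very small.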
Taking the experiment further, Erlich and his colleagues used their software to retrieve Y-chromosome information from the anonymous DNA sequences of male participants in a public sequencing project and showed that, using the same methods, they could accurately determine the identities of multiple individuals. They could even identify anonymous women donors related to the males, by virtue of family tree data accompanying the genome sequences and the ability to search online public records. The important point, said Erlich, is that “everything was publicly available. We didn’t break into any database. We didn’t need any special passwords.”
Although the authors find the probability of discovering someone’s identity is still low, the study raises the question of whether more should be done to protect donors’ anonymity. But George Church, professor of genetics at Harvard Medical School, who was not involved in the study, thinks there is little point. “You can keep trying to adjust the protocols” - information about participants’ ages might be kept private, for example - “but that’s kind of putting a bandage on it.... It’s only going to get easier to re-identify [anonymous sequences], not harder,” he said.
Although the Genetic Information Nondiscrimination Act in the United States prohibits employers and health insurance companies from discriminating on the basis of genetic information, “there is still a fear of the unknown,” said Brad Malin, a professor of biomedical informatics and computer science at Vanderbilt University in Nashville, Tennessee, who is worried that the study will frighten away members of the public from participating in genome sequencing projects. “It is important to highlight these problems, but at the same time, when you highlight them it is very difficult to temper the result,” he said.
Farahany agreed. “What we need to do is better educate people about the facts,” she said. Furthermore, she added, efforts might be better spent on regulating the use of sequence data, rather than ensuring anonymity. “That’s where we should focus our legal and ethical analyses,” she said - “not on trying to prevent the flow of information, but on trying to prevent the misuse of information.”
M. Gymrek et al., “Identifying Personal Genomes by Surname Inference,” Science, 339: 321-324, 2013.