Two decades after the draft sequence of the human genome was unveiled to great fanfare, a team of 99 scientists has finally deciphered the entire thing. They have filled in vast gaps and corrected a long list of errors in previous versions, giving us a new view of our DNA.
The consortium has posted six papers online in recent weeks in which they describe the full genome. These hard-sought data, now under review by scientific journals, will give scientists a deeper understanding of how DNA influences risks of disease, the scientists say, and how cells keep it in neatly organized chromosomes instead of molecular tangles.
For example, the researchers have uncovered more than 100 new genes that may be functional, and have identified millions of genetic variations between people. Some of those differences probably play a role in diseases.
For Nicolas Altemose, a postdoctoral researcher at the University of California, Berkeley, who worked on the team, the view of the complete human genome feels something like the close-up pictures of Pluto from the New Horizons space probe.
“You could see every crater, you could see every color, from something that we only had the blurriest understanding of before,” he said. “This has just been an absolute dream come true.”
Experts who were not involved in the project said it will enable scientists to explore the human genome in much greater detail. Large chunks of the genome that had been simply blank are now deciphered so clearly that scientists can start studying them in earnest.
“The fruit of this sequencing effort is amazing,” said Yukiko Yamashita, a developmental biologist at the Whitehead Institute for Biomedical Research at the Massachusetts Institute of Technology.
A century ago, scientists knew that genes were spread across 23 pairs of chromosomes, but these strange, wormlike microscopic structures remained largely a mystery.
By the late 1970s, scientists had gained the ability to pinpoint a few individual human genes and decode their sequence. But their tools were so crude that hunting down a single gene could take up an entire career.
Toward the end of the 20th century, an international network of geneticists decided to try to sequence all the DNA in our chromosomes. The Human Genome Project was an audacious undertaking, given how much there was to sequence. Scientists knew that the twin strands of DNA in our cells contained roughly three billion pairs of letters, a text long enough to fill hundreds of books.
When that team began its work, the best technology the scientists could use sequenced bits of DNA just a few dozen letters, or bases, long. Researchers were left to put them together like the pieces of a vast jigsaw puzzle. To assemble the puzzle, they looked for fragments with identical ends, meaning that they came from overlapping portions of the genome. It took years for them to gradually assemble the sequenced fragments into larger swaths.
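The jigsaw idea can be sketched in a few lines of Python. This is a toy illustration only, not the software the Human Genome Project used: the function names, the minimum overlap of three letters and the made-up fragments are all assumptions chosen for the example. It repeatedly joins the two fragments whose ends match over the longest stretch.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` letters exists.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair again and again until none overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return frags  # nothing left to join
        # Glue the pair together, writing the shared letters only once.
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)


# Three made-up fragments cut from one short made-up sequence.
reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
print(greedy_assemble(reads))
```

With these three fragments the sketch rebuilds a single 15-letter sequence. Real assemblers face millions of fragments, sequencing errors and repeated stretches, so they rely on far more elaborate methods.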
The White House announced in 2000 that scientists had finished the first draft of the human genome, and details of the project were published the following year. But long stretches of the genome remained unknown, while scientists struggled to figure out where millions of other bases belonged.
It turned out that the genome was a very hard puzzle to put together from small pieces. Many of our genes exist as multiple copies that are nearly identical to each other. Sometimes the different copies carry out different jobs. Other copies — known as pseudogenes — are disabled by mutations. A short fragment of DNA from one gene might fit just as well into the others.
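A small Python sketch shows why near-identical copies cause trouble. The sequences and the helper name below are invented for illustration. A short fragment taken from the part two copies share matches both of them equally well, while a longer fragment that covers the one differing letter fits in only one place.

```python
def find_all(genome: str, read: str) -> list[int]:
    """Return every start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Made-up example: two copies of a "gene" that differ by a single letter.
copy_a = "ACGTTGCAAGGCTTAC"
copy_b = "ACGTTGCATGGCTTAC"  # differs from copy_a at the ninth letter
genome = "TTTT" + copy_a + "CCCCCC" + copy_b + "GGGG"

short_read = "ACGTTGCA"        # lies entirely in the shared stretch
long_read = "TTGCATGGCTTACGG"  # spans the differing letter and beyond

print(find_all(genome, short_read))  # matches inside both copies
print(find_all(genome, long_read))   # matches inside copy_b only
```

The short fragment turns up in two places, so an assembler cannot tell which copy it came from. The long fragment turns up in one.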
And genes only make up a small percentage of the genome. The rest of it can be even more baffling. Much of the genome is made up of virus-like stretches of DNA that exist largely just to make new copies of themselves that get inserted back into the genome.
In the early 2000s, scientists got a little better at putting together the genome puzzle from its tiny pieces. They made more fragments, read them more accurately, and developed new computer programs to assemble them into bigger chunks of the genome.
Periodically, researchers would unveil the latest, best draft of the human genome — known as the reference genome. Scientists used the reference genome as a guide for their own sequencing efforts. For example, clinical geneticists would catalog disease-causing mutations by comparing genes from patients to the reference genome.
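In its simplest form, that comparison amounts to lining up a patient's sequence against the reference and noting where the letters differ. The Python sketch below is a deliberately simplified illustration with made-up sequences and a hypothetical function name. It handles only single-letter substitutions in sequences of equal length, whereas clinical pipelines must also deal with insertions, deletions and alignment.

```python
def substitutions(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List (position, reference letter, sample letter) wherever the two differ.

    Assumes both sequences are already aligned and have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (i, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Made-up sequences for illustration.
reference_gene = "ATGGCGTACGTTAGC"
patient_gene = "ATGGCGTACATTAGC"

print(substitutions(reference_gene, patient_gene))
```

If the reference itself is wrong or blank at a position, a comparison like this can report a difference that is not really there, or miss one that is.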
The newest reference genome came out in 2013. It was a lot better than the first draft, but it was a long way from complete. Eight percent of it was simply blank.
“There’s basically an entire human chromosome that had gone missing,” said Michael Schatz, a computational biologist at Johns Hopkins University.
In 2019, two scientists — Adam Phillippy, a computational biologist at the National Human Genome Research Institute, and Karen Miga, a geneticist at the University of California, Santa Cruz — founded the Telomere-to-Telomere Consortium to complete the genome.
Dr. Phillippy admitted that part of his motivation for such an audacious project was that the gaps annoyed him. “They were just really bugging me,” he said. “You take a beautiful landscape puzzle, pull out a hundred pieces, and look at it. That’s very bothersome to a perfectionist.”
Dr. Phillippy and Dr. Miga put out a call for scientists to join them to finish the puzzle. They ended up with 99 scientists working directly on sequencing the human genome, and dozens more pitching in to make sense of the data. The researchers worked remotely through the pandemic, coordinating their efforts over Slack, a messaging app.
“It was a surprisingly nice ant colony,” Dr. Miga said.
The consortium took advantage of new machines that can read stretches of DNA reaching tens of thousands of bases long. The researchers also invented techniques to figure out where particularly mysterious repeating sequences belonged in a genome.
All told, the scientists added or fixed more than 200 million base pairs in the reference genome. They can now say with confidence that the human genome measures 3.05 billion base pairs long.
Within those new sequences of DNA, the scientists discovered more than 2,000 new genes. Most appear to be disabled by mutations, but 115 of them look as if they can produce proteins — the function of which scientists may need years to figure out. The consortium now estimates that the human genome contains 19,969 protein-coding genes.
With a complete genome finally assembled, the researchers could take a better look at the variation in DNA from one person to the next. They discovered more than two million new spots in the genome where people differ. Using the new genome also helped them to avoid identifying disease-linked mutations where none actually exist.
“It’s a great advance for the field,” said Dr. Midhat Farooqi, the director of molecular oncology at Children’s Mercy, a hospital in Kansas City, Mo., who was not involved in the project.
Dr. Farooqi has started using the genome for his research into rare childhood diseases, aligning DNA from his patients against the newly filled gaps to search for mutations.
Switching to the new genome may be a challenge for many clinical labs, however. They’ll have to shift all of their information about the links between genes and diseases to a new map of the genome. “There will be a big effort, but it will take a couple years,” said Dr. Sharon Plon, a medical geneticist at Baylor College of Medicine in Houston.
Dr. Altemose plans on using the complete genome to explore a particularly mysterious region in each chromosome known as the centromere. Instead of storing genes, centromeres anchor proteins that move chromosomes around a cell as it divides. The centromere region contains thousands of repeated segments of DNA.
In their first look, Dr. Altemose and his colleagues were struck by how different centromere regions can be from one person to another. That observation suggests that centromeres have been evolving rapidly, as mutations insert new pieces of repeating DNA into the regions or cut other pieces out.
While some of this repeating DNA may play a role in pulling chromosomes apart, the researchers have also found new segments — some of them millions of bases long — that don’t appear to be involved. “We don’t know what they’re doing,” Dr. Altemose said.
But now that the empty zones of the genome are filled in, Dr. Altemose and his colleagues can study them up close. “I’m really excited moving forward to see all the things we can discover,” he said.