The Rockefeller University » Technology Pipeline and Policies

From 2015 to 2017, the G10K consortium worked with the major sequencing and assembly companies (e.g., Illumina(opens in new window), Pacific Biosciences(opens in new window), Oxford Nanopore(opens in new window), Bionano(opens in new window), 10X Genomics(opens in new window), NRGene(opens in new window), Dovetail Genomics(opens in new window), Phase Genomics(opens in new window), Arima Genomics)(opens in new window), major sequencing centers (BGI(opens in new window), Broad Institute(opens in new window), Sanger Institute(opens in new window), Washington University Genome Center)(opens in new window), major public genome archive and annotation centers (NCBI(opens in new window), Ensembl(opens in new window), UCSC(opens in new window)), and experts in academia and government (NIH, NSF) to test, improve, and generate new approaches for producing the highest quality, error-free, 3rd generation reference genome assemblies achievable for the least cost possible. For the first time, we tested each of these technologies on one individual animal, a bird (hummingbird or zebra finch) and a mammal (goat or human), such that our analyses were not hampered by the common problem of multiple variables changing simultaneously, which has plagued previous comparative genome technology efforts.

Our goal with these approaches is to generate a genome with a metric minimum contig N50 of 1 million bp (1Mb), scaffold N50 of 10Mb, 90% of the genome assembled into chromosomes confirmed by 2 independent sources, a base-call quality error of QV40 (no more than 1 nucleotide error in 10,000 bp), and haplotype phased. We call this a 3.4.2.QV40 phased metric, where the first three numbers are the exponents of the N50 contig, N50 scaffold, and level of chromosomal assembly.

To date, of the over 335 vertebrate genomes in the public NCBI database, only 9 meet our 3.4.2.QV40 metric. Of these 9, 7 were done by members of our G10K group using the above approaches. The other 2 are human and mouse and were only brought to this quality level after billions (human) and millions (mouse) of dollars were spent for continuous correction of these assemblies. However, none of these 9 genomes have been phased and have errors related to haplotype collapsing.

A major challenge for saving species and conducting high-quality genomic research has been finding cost-effective and sufficient technology to generate high-quality genomes. We worked with industry partners to develop unprecedented high-resolution genome sequencing methods at significantly lower costs than current, less robust technologies.

The Current Pipeline

The current pipeline (Figure 1) to meet the 3.4.2.QV40 phased metric with the fewest errors currently achievable consists of a combination of the following approaches:

60X PacBio long-reads for an initial phased contig assembly (30X/haplotype);
68X coverage of 10X Genomics-linked reads for intermediate-range scaffolding and further phasing (34X/haplotype);
80X Bionano optical maps to correct scaffolding errors and for further scaffolding;
68X Hi-C linked reads for long-range scaffolding;
PacBio Jelly algorithm to fill gaps using long-reads;
10X Genomics Illumina short-reads for base-call accuracy polishing; and
Assembly algorithms that merge multiple data types without creating haplotype errors

DNA nexus link

Full VGP pipeline. From sample processing, sequence data generation, assembly algorithms, annotation, to uploading onto public databases. The pipeline is a 4-way partnership between the sequence data generators and assemblers, DNAnexus, AWS, and annotation centers. Figure modified from Mark Mooney (DNAnexus).

Rationale

High-quality error-free genome assemblies and annotations are necessary as current 1st and 2nd generation genome sequencing approaches generate numerous errors that cause a variety of problems in downstream analyses. Parts of genes are missing, and some are incorrectly assembled, while others are completely missing from the assemblies despite pieces found in the raw sequence reads. Due to these fragmented, error-prone assemblies, researchers have had to clone, re-sequence, and correct individual genes. In some cases, the gene structures are too complex, too long, or too closely related, preventing even the Sanger-based higher quality 1st generation methods from correcting genome assemblies. In many other instances, investigators do not even know that they are working with incorrect gene sequences and structures, impacting many scientific findings and scientific progress.

Policies

External Use Embargo Policy

The Vertebrate Genomes Project at Rockefeller Vertebrate Genomes Project Plan Technology Pipeline and Policies Vertebrate Genome Lab(opens in new window)What Have We Accomplished?Vertebrate Genomes Project Photo Gallery

Upcoming Events

(opens in new window)

Join and Give to the Vertebrate Genome Laboratory

There are many ways to help support this project! You can help support this work by providing a financial gift that will be used towards mapping the genomes of these species.

(opens in new window)

Other ideas? We love ideas! Please direct all inquires below.

contact@genomeark.org (opens in new window)