The power of PacBio long sequencing reads

Genome assembly

What is valuable when it comes to assembling sequencing reads into a long contiguous sequence? Here at SNPsaurus we recently had an interesting bacterial sample come through that helped illustrate what factors drive a successful outcome.

All about that base (pair)

“Human science fragments everything in order to understand it” Leo Tolstoy, War and Peace
“SNPsaurus didn’t fragment this sample any further, Leo!”

We ask for high-quality DNA when working on a genome assembly project. We check the DNA on a Fragment Analyzer, and high-quality DNA is often 30kb long or more. What happens when the DNA that arrives isn’t that long? Is all lost? We thought we’d see if we could make something of this sample, which had DNA averaging 7kb.

We skipped the shearing step and gave the sample a little extra sequencing on a PacBio Sequel. The reads were assembled with Canu, then polished and corrected with Arrow. To our surprise, the results looked great!

-- Found 353643 reads.
-- Found 1468835527 bases (293.76 times coverage)
-- contigs: 1 sequences, total length 4804266 bp

One contig is the goal, meaning that there were no regions that caused the assembler to give up and start on another contig. The contig length was similar to the reference genome length of the closest match, another sign that the assembly went well.
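For readers who want to check the arithmetic behind those log lines, here is a minimal sketch. The 5 Mb genome size is our inference from the logged numbers (bases divided by reported coverage), not a setting quoted in this post, and the function names are made up for illustration.

```python
# Assumption: a 5 Mb genome-size estimate, inferred from the logged
# bases/coverage ratio. It is not stated anywhere in the post.
ASSUMED_GENOME_SIZE_BP = 5_000_000


def coverage(total_bases: int, genome_size_bp: int = ASSUMED_GENOME_SIZE_BP) -> float:
    """Average read depth: sequenced bases divided by genome size."""
    return total_bases / genome_size_bp


def mean_read_length(total_bases: int, n_reads: int) -> float:
    """Average read length in bp."""
    return total_bases / n_reads


# Numbers copied from the full-data log above.
full_run = {"reads": 353_643, "bases": 1_468_835_527}

print(f"coverage: {coverage(full_run['bases']):.2f}X")
print(f"mean read length: {mean_read_length(full_run['bases'], full_run['reads']):.0f} bp")
```

The same two functions can be pointed at any of the subsampled runs below by swapping in their read and base counts.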

Since we had produced close to 300X read depth for this sample, we had a chance to go back and test a few subsamplings of the data. First, we went low on read number and depth (~75X), and the resulting assembly was not as good:

-- Found 92782 reads.
-- Found 384994639 bases (76.99 times coverage)
-- contigs: 6 sequences, total length 4819845 bp (including 2 repeats)

Next we doubled the coverage and collapsed those 6 contigs into 3:

-- Found 185497 reads.
-- Found 770680608 bases (154.13 times coverage)
-- contigs: 3 sequences, total length 4807463 bp (including 1 repeat)

So, more reads are better… but why exactly? We next went back to low read depth (~70X), but only allowed PacBio subreads of 5kb or longer. That did even better than doubling the coverage!

-- Found 49428 reads.
-- Found 345746348 bases (69.14 times coverage)
-- contigs: 2 sequences, total length 4804360 bp (including 0 repeats)

To make the same point from the other direction, we took about 170X read depth, but only from PacBio subreads of 5kb or shorter… and the results were not that good:

-- Found 263738 reads.
-- Found 839832486 bases (167.96 times coverage)
-- contigs: 26 sequences, total length 4879780 bp (including 7 repeats)
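The length-filtered subsamples above can be sketched in a few lines of Python. This is only an illustration of the idea, not the tooling we ran: the function names, the `(read id, sequence)` tuple layout, and the random-shuffle subsampling are all assumptions made for the example.

```python
import random
from typing import Iterable, List, Tuple

Read = Tuple[str, str]  # (read id, sequence); a hypothetical in-memory layout


def filter_by_length(
    reads: Iterable[Read], min_len: int = 0, max_len: float = float("inf")
) -> List[Read]:
    """Keep reads whose length lies in [min_len, max_len] (both inclusive)."""
    return [read for read in reads if min_len <= len(read[1]) <= max_len]


def subsample_to_coverage(
    reads: Iterable[Read], target_coverage: float, genome_size_bp: int, seed: int = 0
) -> List[Read]:
    """Randomly pick reads until their bases reach target_coverage x genome size.

    If the pool holds fewer bases than requested, every read is returned.
    """
    rng = random.Random(seed)
    shuffled = list(reads)
    rng.shuffle(shuffled)

    budget = target_coverage * genome_size_bp
    kept: List[Read] = []
    total_bases = 0
    for read in shuffled:
        if total_bases >= budget:
            break
        kept.append(read)
        total_bases += len(read[1])
    return kept
```

Filtering first with `min_len=5000` (or `max_len=5000`) and then subsampling to a coverage target mirrors the order of the experiments described here.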

The conclusion is that while giving this sample lots of reads was helpful, it was most helpful because it allowed more long reads to be generated, even though they were a minority of the output. This is where the PacBio Sequel shines for assembly projects: it produces long reads that can span repeats and help hook together the genome. The shorter reads help for generating a low-error consensus, but are less helpful for piecing it all together.

The long and the short of it

For comparison, we did a “best case” assembly from Illumina-style reads by creating 300 bp reads (simulating a paired-end 150 bp run) from the finished assembly at high depth and assembling those:

-- contigs: 443 sequences, total length 4678598 bp, 3 contigs longer than 100 kb
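One simple way to generate that kind of idealized read set is sketched below. It is a stand-in for illustration only: the function name, the default depth, and the choice of uniformly random start positions are assumptions, and the sketch treats the genome as linear and the reads as error-free.

```python
import random
from typing import List


def simulate_reads(
    genome: str, read_len: int = 300, depth: float = 50.0, seed: int = 0
) -> List[str]:
    """Sample error-free, fixed-length reads at uniformly random positions.

    The genome is treated as a linear string, so reads never wrap around
    the ends (a simplification for a circular bacterial chromosome).
    """
    if len(genome) < read_len:
        raise ValueError("genome is shorter than the requested read length")

    rng = random.Random(seed)
    # Number of reads needed so that n_reads * read_len ~ depth * genome length.
    n_reads = int(depth * len(genome) / read_len)
    last_start = len(genome) - read_len

    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        reads.append(genome[start:start + read_len])
    return reads
```

Because the reads carry no errors and cover the genome evenly, any assembly built from them is about as good as 300 bp reads can do, which is what makes the comparison a “best case”.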

So we love Illumina for nextRAD genotyping by sequencing, but for assembly… PacBio long reads have some unique advantages that lead to a more complete genome.


How low can you go?

A common project design question is “how much information do I need?” The usual response is “as much as possible!” But this is perhaps informed as much by tradition as by actual need. Information also has several dimensions, for example the number of loci assayed and the read depth at each locus.

A little history of genotyping by sequencing

The early genotyping by sequencing methods were attempts at replicating the standards of the time as closely as possible. Fixed-content genotyping arrays ruled, and these delivered high-quality genotype calls for heterozygous alleles, so next-gen sequencing methods tried to emulate these types of data. The GBS method broke away from this by making it standard to sequence many loci lightly and infer the nearby genotypes.

Bears and bears and bears, oh my!

So researchers have become more comfortable with assaying many markers at low depth, essentially getting high-quality data from just one of the two homologous chromosomes (in a diploid). A great example of this, and an interesting read, is:

Genomic Evidence for Island Population Conversion Resolves Conflicting Theories of Polar Bear Evolution

from Beth Shapiro’s group at UCSC. After light sequencing of polar, brown and black bears, the data were downsampled so that only one allele was chosen at each locus, even if two alleles were present. They were then able to apply informative population statistics to the downsampled data, such as assessing genetic diversity (spoiler: polar bears aren’t very diverse), quantifying admixture using the D-statistic, and using the data for simulations of gene flow.
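To make the “one allele per locus” idea and the D-statistic concrete, here is a generic sketch. It is not the paper’s pipeline: the function names and the string-per-sample data layout are assumptions for illustration. The D-statistic itself is the standard ABBA-BABA ratio, D = (ABBA − BABA) / (ABBA + BABA), computed over sites where the populations P1, P2, P3 and an outgroup carry exactly two alleles.

```python
import random
from typing import Sequence


def sample_one_allele(observed_alleles: Sequence[str], rng: random.Random) -> str:
    """Pick a single allele at random from those observed at one locus."""
    return rng.choice(list(observed_alleles))


def d_statistic(p1: str, p2: str, p3: str, outgroup: str) -> float:
    """ABBA-BABA D-statistic from one sampled allele per site per sample.

    Each argument is a string with one allele per site, all the same length.
    The outgroup allele is treated as ancestral ("A"), the other as derived ("B").
    """
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if len({a, b, c, o}) != 2:
            continue  # only biallelic sites are informative
        if a == o and b == c and b != o:
            abba += 1  # P1 ancestral; P2 and P3 share the derived allele
        elif b == o and a == c and a != o:
            baba += 1  # P2 ancestral; P1 and P3 share the derived allele
    if abba + baba == 0:
        raise ValueError("no ABBA or BABA sites found")
    return (abba - baba) / (abba + baba)
```

A D near zero is what you expect without gene flow, while an excess of ABBA or BABA sites points to admixture between P3 and one of the other two populations.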

A lot from a little

The paradigm of getting a little information about a lot of loci is a useful one. Sometimes input DNA amounts are scarce, or the DNA is damaged and low quality. These issues can prevent the creation of a fully complex sequencing library. But “scans” of the genome like the paper above are still possible, and can be incredibly useful for providing new insights into long-studied populations of ecological, environmental and evolutionary importance.