What is valuable when it comes to assembling sequencing reads into a long contiguous sequence? Here at SNPsaurus we recently had an interesting bacterial sample come through that helped illustrate what factors drive a successful outcome.
All about that base (pair)
We ask for high quality DNA for working on a genome assembly project. We check out the DNA on a Fragment Analyzer, and high quality DNA is often 30kb long or more. What happens when the DNA that arrives isn’t that long? Is all lost? We thought we’d see if we could make something of this sample, which had DNA averaging 7kb.
We skipped the shearing step and gave the sample a little extra sequencing on a PacBio Sequel. The reads were assembled with Canu, then polished and corrected with Arrow. To our surprise, the results looked great!
-- Found 353643 reads.
-- Found 1468835527 bases (293.76 times coverage)
-- contigs: 1 sequences, total length 4804266 bp
One contig is the goal, meaning that there were no regions that caused the assembler to give up and start on another contig. The contig length was similar to the reference genome length of the closest match, another sign that the assembly went well.
Since we had produced close to 300X read depth for this sample, we had a chance to go back and test a few subsamplings of the data. First, we went slightly low on the read number and depth (~75X), and the resulting assembly was not as good:
-- Found 92782 reads.
-- Found 384994639 bases (76.99 times coverage)
-- contigs: 6 sequences, total length 4819845 bp (including 2 repeats)
Next we doubled the coverage and collapsed those 6 contigs into 3:
-- Found 185497 reads.
-- Found 770680608 bases (154.13 times coverage)
-- contigs: 3 sequences, total length 4807463 bp (including 1 repeat)
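As a quick sanity check on the numbers in those log lines, here is a minimal Python sketch of the arithmetic behind them: read depth is total bases divided by genome size, and mean read length is total bases divided by read count. The read and base counts are copied from the logs above. The 5 Mb genome size is an assumption, inferred from the ratio of bases to reported coverage rather than taken from the run configuration, and the function names are just for illustration.

```python
# Read and base counts copied from the assembler log lines quoted above.
RUNS = {
    "full data set": (353643, 1468835527),
    "~75X subsample": (92782, 384994639),
    "~150X subsample": (185497, 770680608),
}

# Assumption: a 5 Mb genome size, inferred from bases / reported coverage.
GENOME_SIZE = 5_000_000


def depth(bases, genome_size=GENOME_SIZE):
    """Read depth: total sequenced bases divided by genome size."""
    return bases / genome_size


def mean_read_length(reads, bases):
    """Average read length in bp."""
    return bases / reads


for label, (reads, bases) in RUNS.items():
    print(
        f"{label}: {depth(bases):.2f}X depth, "
        f"mean read length {mean_read_length(reads, bases):.0f} bp"
    )
```

Under that assumption the computed depths land within a rounding step of the logged values, and the mean read length comes out at about 4.15 kb in all three runs, so these subsamples differ in how many reads they contain, not in how long the reads are.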
So, more reads are better… but why exactly? We next went back to low read depth, but only allowed PacBio subreads of 5kb or longer. It did even better!
-- Found 49428 reads.
-- Found 345746348 bases (69.14 times coverage)
-- contigs: 2 sequences, total length 4804360 bp (including 0 repeats)
To make the point clear, we took more than 150X of read depth, but used only the PacBio subreads of 5kb or shorter, and the results were not that good:
-- Found 263738 reads.
-- Found 839832486 bases (167.96 times coverage)
-- contigs: 26 sequences, total length 4879780 bp (including 7 repeats)
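The two length-restricted runs above come down to a simple filter on subread length. The Python sketch below shows one way such a filter could look. It is not the exact tooling used for this post; it assumes plain 4-line FASTQ records, and `read_fastq` and `filter_by_length` are hypothetical helper names.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) from 4-line FASTQ records.

    Assumption: sequences are not wrapped across multiple lines.
    """
    it = (line.rstrip("\n") for line in lines)
    while True:
        chunk = list(islice(it, 4))
        if len(chunk) < 4:
            return  # end of input (or a truncated final record)
        header, seq, _plus, qual = chunk
        yield header, seq, qual


def filter_by_length(records, min_len=0, max_len=None):
    """Keep reads whose length is within [min_len, max_len], inclusive."""
    for header, seq, qual in records:
        if len(seq) < min_len:
            continue
        if max_len is not None and len(seq) > max_len:
            continue
        yield header, seq, qual


# Toy input standing in for a real subread file.
toy = [
    "@r1", "ACG", "+", "III",
    "@r2", "ACGTA", "+", "IIIII",
    "@r3", "ACGTACGT", "+", "IIIIIIII",
]

long_reads = list(filter_by_length(read_fastq(toy), min_len=5))
short_reads = list(filter_by_length(read_fastq(toy), max_len=5))
print(len(long_reads), "reads at or above the cutoff")
print(len(short_reads), "reads at or below the cutoff")
```

On real data the cutoff would be 5000 rather than 5, with `min_len=5000` for the long-read run and `max_len=5000` for the short-read run, and the input would be an open file handle instead of a list.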
The conclusion is that while giving this sample lots of reads was helpful, it was most helpful because it allowed more long reads to be generated, even though they were a minority of the output. This is where the PacBio Sequel shines for assembly projects: it produces long reads that can span repeats and help hook together the genome. The shorter reads help for generating a low-error consensus, but are less helpful for piecing it all together.
The long and the short of it
We did a “best case” assembly from Illumina reads, by creating 300 bp reads (simulating a paired-end 150 bp run) from the assembly at high depth and assembling that:
-- contigs: 443 sequences, total length 4678598 bp, 3 contigs longer than 100 kb
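For readers who want to try a similar comparison, here is a rough Python sketch of how error-free 300 bp reads could be drawn from a finished assembly at a chosen depth. It is an illustration under stated assumptions, not the procedure used for the numbers above: start positions are sampled uniformly, the sequence is treated as linear rather than circular, no sequencing errors are added, and the function name, depth, and seed are hypothetical.

```python
import random


def simulate_reads(genome, read_len=300, depth=50, seed=0):
    """Sample error-free reads of read_len bp to roughly the target depth.

    Assumptions: uniform start positions, linear sequence, no errors.
    """
    if len(genome) < read_len:
        raise ValueError("genome is shorter than the read length")
    rng = random.Random(seed)
    # Number of reads needed so that reads * read_len ~= depth * genome length.
    n_reads = int(depth * len(genome) / read_len)
    last_start = len(genome) - read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # inclusive on both ends
        reads.append(genome[start:start + read_len])
    return reads


# Toy 3 kb "genome" standing in for the real assembly.
toy_rng = random.Random(1)
toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(3000))

reads = simulate_reads(toy_genome, read_len=300, depth=20)
print(len(reads), "simulated reads of", len(reads[0]), "bp")
```

Because every simulated read is an exact substring of the assembly, a test like this is a best case: any fragmentation in the resulting assembly comes from read length, not from sequencing error.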