Improving quality of high-throughput sequencing reads

5.3.4 Evaluation Results for PacBio Error Correction Tools

Because PacBio reads have a high error rate, error correction outputs can still contain many uncorrected bases.

Therefore, most PacBio error correction tools generate two types of reads:

(1) trimmed reads that only contain corrected regions in input reads and

(2) untrimmed reads that include both corrected and uncorrected regions in input reads.

PBcR produces only trimmed reads, whereas LSC and Proovread generate both trimmed and untrimmed reads; the two read sets were assessed separately.

For LoRDEC, trimmed reads were generated from the untrimmed reads using lordec-trim-split, which is included in the LoRDEC package.
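
The relation between untrimmed and trimmed reads can be illustrated with a minimal sketch. It assumes a convention in which corrected bases are written in uppercase and uncorrected bases in lowercase; that convention and the function name trim_and_split are assumptions made for illustration, and the code is not the implementation of lordec-trim-split or of any other tool evaluated here.

```python
import re


def trim_and_split(untrimmed_read: str, min_length: int = 1) -> list[str]:
    """Return the maximal runs of uppercase (assumed corrected) bases.

    Lowercase bases are treated as uncorrected: they are dropped, and the
    read is split wherever such a region interrupts a corrected one.
    """
    corrected_regions = re.findall(r"[A-Z]+", untrimmed_read)
    return [region for region in corrected_regions if len(region) >= min_length]


# One untrimmed read with two corrected regions separated by uncorrected bases.
untrimmed = "acgACGTTGCAtgcaGGATCCacg"
print(trim_and_split(untrimmed))  # ['ACGTTGCA', 'GGATCC']
```

Under this assumption, one untrimmed read can yield several shorter trimmed reads, which is why trimming tends to raise accuracy while lowering the number of retained bases and the read lengths.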

Accuracy of PacBio Error Correction Tools

Figure 5.5A compares the percentage similarity of the outputs from the PacBio error correction methods for P1. The percentage similarity of the input reads was 76.6 percent before error correction, and every output improved on this value. Among the four tools, all except LSC achieved a percentage similarity above 95 percent for the trimmed reads. For the untrimmed reads, LoRDEC and Proovread generated more accurate reads than LSC. Except for the untrimmed LoRDEC reads, Illumina read coverage had almost no impact on percentage similarity.

Figure 5.5B and Figure 5.5C show the read coverage and NG50 of the outputs of the compared tools. The two charts have similar shapes, and the values tended to be high where the percentage similarity in Figure 5.5A was low. The trimmed LoRDEC reads and the PBcR outputs improved substantially as Illumina read coverage increased. The trimmed reads from Proovread also improved, but the values saturated at 30X coverage.
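
As a reminder of how NG50 responds to trimming, the following minimal sketch computes it from a list of read lengths. It assumes the usual definition (the largest length L such that reads of length at least L together contain at least half of the genome length), which this section does not restate; the function name and the example numbers are hypothetical.

```python
def ng50(read_lengths: list[int], genome_length: int) -> int:
    """Return NG50 under the usual definition, or 0 if it is undefined.

    Reads are visited from longest to shortest; NG50 is the length of the
    read at which the running total first reaches half the genome length.
    """
    half_genome = genome_length / 2
    running_total = 0
    for length in sorted(read_lengths, reverse=True):
        running_total += length
        if running_total >= half_genome:
            return length
    return 0  # the reads do not contain half of the genome length


# Hypothetical example: splitting long reads into shorter pieces lowers NG50.
genome_length = 20_000
print(ng50([8000, 6000, 4000, 2000], genome_length))        # 6000
print(ng50([4000, 4000, 3000, 3000, 2000], genome_length))  # 3000
```

Because NG50 is normalized by genome length rather than by the total number of output bases, a tool that discards many bases during trimming is penalized twice: the surviving reads are shorter, and more of them are needed to reach half the genome length.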

Figure 5.6 compares percentage similarity, read coverage, and NG50 for P2-40X and P2-40X-EF, the error-free version of P2-40X. Both the trimmed Proovread reads and the trimmed LoRDEC reads showed high percentage similarity. The percentage similarity and read coverage of the untrimmed Proovread reads were almost the same as those of the trimmed Proovread reads. However, the NG50 of the trimmed Proovread reads was shorter than that of the untrimmed Proovread reads. LoRDEC generated trimmed reads with high percentage similarity, but it removed too many bases, so the read coverage and NG50 of the read set became much lower than those of the original input reads.

For all three metrics, P2-40X-EF made no meaningful difference compared with P2-40X. This indicates that sequencing errors in Illumina reads matter little when Illumina read coverage is about 40X.

Alignment Results for PacBio Error Correction Tools

We aligned the input PacBio reads and their error correction results using BWA with the “-x pacbio” option and evaluated the alignment results. Before error correction, over 95 percent of the P1 PacBio reads and over 98 percent of the P2 PacBio reads could already be aligned to the reference sequences, so the fraction of aligned reads did not improve much after error correction.

Figure 5.7 shows the ratio of the number of reads aligned without any mismatches or indels to the total number of corrected reads. The ratio was 0 for both P1 and P2 before error correction, and some error correction methods improved it substantially. For P1, over 50 percent of the trimmed reads from PBcR and Proovread could be aligned to the reference sequence without any differences. Proovread also showed a good result for P2. However, PBcR generated much worse results for P2 than for P1. The ratio for the LSC trimmed reads for P1 was 0.3 percent, and no untrimmed LSC read could be aligned to the reference sequence without differences. Among the untrimmed corrected reads, those from Proovread had the best quality: 4.3 percent and 14.5 percent of the reads could be aligned without mismatches or indels for P1 and P2, respectively.
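
The ratio above could be computed from SAM records roughly as follows. This is a sketch under stated assumptions, not the evaluation script used for Figure 5.7: it assumes each record carries an NM (edit distance) tag, counts only primary alignments, and treats a read as perfectly aligned only when its CIGAR string is a single match operation with no clipping. The function names are hypothetical.

```python
import re

# A CIGAR made of exactly one match operation, e.g. "1500M" or "1500=".
SINGLE_MATCH_CIGAR = re.compile(r"^\d+[M=]$")


def is_perfect_alignment(sam_line: str) -> bool:
    """True if the record is mapped end to end with no mismatches or indels."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar, tags = int(fields[1]), fields[5], fields[11:]
    if flag & 0x4:  # read is unmapped
        return False
    if not SINGLE_MATCH_CIGAR.match(cigar):  # indels or clipping present
        return False
    return "NM:i:0" in tags  # assumed tag: edit distance to the reference


def perfect_alignment_ratio(sam_lines: list[str], total_reads: int) -> float:
    """Ratio of perfectly aligned reads to the total number of corrected reads."""
    perfect_reads = set()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.split("\t")
        if int(fields[1]) & 0x900:  # skip secondary and supplementary records
            continue
        if is_perfect_alignment(line):
            perfect_reads.add(fields[0])
    return len(perfect_reads) / total_reads


records = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\t*\tNM:i:0",
    "r2\t0\tchr1\t200\t60\t4M1I5M\t*\t0\t0\tACGTACGTAC\t*\tNM:i:1",
    "r3\t0\tchr1\t300\t60\t10M\t*\t0\t0\tACGTACGTAC\t*\tNM:i:2",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\t*",
]
print(perfect_alignment_ratio(records, total_reads=4))  # 0.25
```

Requiring an unclipped, single-operation CIGAR is a strict choice; if soft-clipped ends were tolerated, the ratio would be higher, so the exact criterion should be fixed before comparing tools.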

Memory Usage and Runtime of PacBio Error Correction Tools

Figure 5.8A summarizes the memory usage of the PacBio error correction methods. LoRDEC was the most memory-efficient method and could correct all the reads with under 1 GB of memory. Memory usage of LSC was sensitive to Illumina read coverage: correcting P1-40X required twice as much memory as correcting P1-20X. PBcR corrected errors with relatively little memory for P1, but its memory usage increased fourfold from P1 to P2. Memory usage of Proovread was constant for all the inputs because Proovread splits PacBio reads into small chunks (20 MB in the experiments).

Figure 5.8B shows the runtime of the tools. LoRDEC was much faster than the others, and the difference grew as genome size and Illumina read coverage increased. The runtime of LSC was moderate for P1, but LSC could not finish error correction for P2 even when allowed 40 times the runtime it needed for P1. The runtime of PBcR was sensitive to both genome length and Illumina read coverage. Proovread was the slowest of the assessed tools for P1, but it was less sensitive to genome size than PBcR and became the second fastest for P2.