
Error-correction tools: proovread

This post works through the paper that introduced proovread, to understand how the tool works internally.

Original paper: proovread: large-scale high-accuracy PacBio correction through iterative short read consensus

Abstract

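Motivation: Second-generation sequencing-by-synthesis technologies currently dominate. Their reads are accurate but short, which complicates downstream analysis. Single-molecule real-time (SMRT) sequencing has recently addressed this by producing very long reads, but its high error rate has held back adoption. Hybrid approaches that combine short reads (SRs) and long reads (LRs) have therefore been developed, but the current implementations depend heavily on high-end hardware, which limits their use.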

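Results: We developed a hybrid correction pipeline that runs flexibly on anything from an ordinary desktop machine to a large cluster. In genome and transcriptome tests it reached accuracies of up to 99.9% and outperformed all existing hybrid correction software, while also yielding longer reads and higher throughput.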

Introduction

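Over the past decade, second-generation technologies have rewritten the history of sequencing. Today, a single run of a HiSeq2500 can generate as much as 600 Gb of high-quality output data, which covers a human genome 200-fold. The reads are short, however, which makes assembly hard, especially in repetitive regions. Many SR assemblers have emerged in response, including ALLPATHS-LG (Gnerre et al., 2011), the Celera Assembler (Miller et al., 2008; Myers et al., 2000) and SOAPdenovo (Li et al., 2010).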
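The 200-fold figure is simply the run output divided by the genome size. A minimal sketch of that arithmetic follows; the 3 Gb human genome size is an assumption added here for illustration and is not stated in the paper excerpt, and the function name is hypothetical.

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Average depth of coverage = sequenced bases / genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


HISEQ_RUN_BASES = 600e9    # 600 Gb per HiSeq2500 run, as quoted above
HUMAN_GENOME_BASES = 3e9   # assumption: roughly 3 Gb haploid human genome

print(fold_coverage(HISEQ_RUN_BASES, HUMAN_GENOME_BASES))  # 200.0
```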

Repeats longer than the SRs cannot be resolved. The best current assembly strategy combines short reads with long-insert libraries and additional fosmid sequencing.

Then SMRT sequencing arrived. With the latest chemistry, this approach delivers reads >4 kb, and it shows no sequence bias. Pacific Biosciences' third-generation sequencer, the PacBio RS II, generates to date up to 400 Mb per sequencing run.

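LR accuracy is much lower: about 99% for second-generation reads versus only 80%-85% for third-generation reads. The error profile also differs. Although Illumina reads mainly contain miscalled bases with increasing frequency toward read ends, SMRT generates primarily insertions (10%) and deletions (5%) in a random pattern (Ross et al., 2013). SMRT can also produce circular consensus sequencing (CCS) reads, but doing so shortens the reads and gives up the main advantage of third-generation sequencing.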
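To get a feel for what a 10% insertion and 5% deletion rate does to a read, the error profile can be imitated with a small simulation. This is only a rough sketch under simplifying assumptions: errors are drawn independently per base, inserted bases are uniform over A/C/G/T, and substitutions are ignored. The function name and parameters are hypothetical and are not part of proovread or of any PacBio tool.

```python
import random


def simulate_smrt_errors(seq, ins_rate=0.10, del_rate=0.05, seed=None):
    """Return a copy of seq with random insertions and deletions.

    Each position independently gets a random base inserted before it
    with probability ins_rate, and is dropped with probability del_rate.
    """
    rng = random.Random(seed)
    noisy = []
    for base in seq:
        if rng.random() < ins_rate:
            noisy.append(rng.choice("ACGT"))  # spurious inserted base
        if rng.random() < del_rate:
            continue                          # original base is lost
        noisy.append(base)
    return "".join(noisy)


reference = "ACGTACGTACGTACGTACGT"
print(simulate_smrt_errors(reference, seed=1))
```

Because the errors are indels rather than substitutions, the noisy read drifts out of register with the reference, which is why correcting such reads needs gapped alignment rather than a simple position-by-position comparison.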

Two approaches are currently used to correct SMRT reads:

(i) The hierarchical genome-assembly process (HGAP) uses shorter SMRT reads contained within longer reads to generate pre-assemblies and to calculate consensus sequences (Chin et al., 2013). (Drawback: it needs 80- to 100-fold coverage.)

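(ii) PacBioToCA (Koren et al., 2012) and LSC (Au et al., 2012) use Illumina SRs in a hybrid approach to correct SMRT reads. These approaches result in higher quality LRs. (Drawbacks: they need substantial computing resources; PacBioToCA lost >40% of the data and, being integrated into the WGS assembler, is awkward to invoke; LSC only handles transcriptome data.)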
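The hybrid idea in (ii) can be illustrated with a toy majority vote: accurate short reads are laid over the long read, and each position takes the base that most of them agree on. This is not the algorithm of proovread, PacBioToCA or LSC. It assumes the short reads have already been placed at known offsets without gaps, so it can only fix substitutions, whereas real SMRT errors are mostly indels and need gapped alignment. All names below are hypothetical.

```python
from collections import Counter


def consensus_correct(long_read, placed_short_reads):
    """Correct long_read by per-position majority vote.

    placed_short_reads is a list of (offset, sequence) pairs giving the
    gap-free position of each short read on the long read.
    Positions not covered by any short read keep their original base.
    """
    votes = [Counter() for _ in long_read]
    for offset, short_read in placed_short_reads:
        for i, base in enumerate(short_read):
            pos = offset + i
            if 0 <= pos < len(long_read):
                votes[pos][base] += 1

    corrected = []
    for original, column in zip(long_read, votes):
        if column:
            corrected.append(column.most_common(1)[0][0])
        else:
            corrected.append(original)  # no short-read evidence here
    return "".join(corrected)


noisy_long_read = "ACGTTCGA"  # position 3 should be A
short_reads = [(0, "ACGA"), (2, "GATC"), (3, "ATCG")]
print(consensus_correct(noisy_long_read, short_reads))  # ACGATCGA
```

Even in this toy form, the quality of the result depends on how many short reads cover each position, which is one reason hybrid correction is sensitive to short-read coverage.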

Advantages of this method:

(i) run on standard computers as well as computer grids and

(ii) can be easily adapted to different use cases.

Obviously, these objectives should not be at the cost of accuracy, length of corrected reads or throughput.

Implementation
