How to call SNPs with GATK

1. What is GATK?

The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyse next-generation resequencing data.

The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance.

Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

2. How do you call SNPs with GATK?

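The input for SNP calling is a set of preprocessed BAM files (the preprocessing is covered in a separate post). The tool used is HaplotypeCaller. Suppose I have BAM files such as these four:

LC17-1_L005.sorted.rmp.rg.recal.bam

LC17-2_L008.sorted.rmp.rg.recal.bam

RC17-1_L003.sorted.rmp.rg.recal.bam

RC17-3_L004.sorted.rmp.rg.recal.bam

All of them have been processed to meet GATK's requirements, and all of them belong to sample C17 (the command below uses eight such files in total). I now want to call SNPs for sample C17. The command is as follows: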

 java -jar ./GenomeAnalysisTK.jar -nct 50 -T HaplotypeCaller -R RAP_cDAN.fasta \
   -I LC17-1_L002.sorted.rmp.rg.recal.bam -I LC17-1_L005.sorted.rmp.rg.recal.bam \
   -I LC17-2_L006.sorted.rmp.rg.recal.bam -I LC17-2_L008.sorted.rmp.rg.recal.bam \
   -I LC17-3_L002.sorted.rmp.rg.recal.bam -I RC17-1_L003.sorted.rmp.rg.recal.bam \
   -I RC17-2_L004.sorted.rmp.rg.recal.bam -I RC17-3_L004.sorted.rmp.rg.recal.bam \
   -o gatk.vcf

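The lines above have to be read by the shell as one single command, which is why each line ends with a backslash line continuation. The tool selected with -T is GATK's HaplotypeCaller.

-R is followed by the reference sequence and each -I by one BAM file (all of these BAM files belong to the same sample). -o is followed by the name of the output file.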
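GATK identifies the sample of each read from the SM field of the @RG (read group) lines in the BAM header, so it can be worth confirming that every input file really carries the same sample name before calling. The following is only a sketch: the function name rg_samples is hypothetical, and it assumes the header text is supplied on standard input, for example from samtools view -H.

```shell
# Read SAM/BAM header text on stdin and print the distinct SM (sample)
# values found on @RG lines, one per line.
rg_samples() {
    awk -F '\t' '
        $1 == "@RG" {
            # scan the tab-separated TAG:VALUE fields for the SM tag
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SM:/) print substr($i, 4)
        }
    ' | sort -u
}

# Possible usage, assuming samtools is installed and the BAM files are in
# the current directory (left commented out so the block has no side effects):
#   for f in *.recal.bam; do samtools view -H "$f" | rg_samples; done | sort -u
```

If the loop in the comment prints more than one name, the inputs would be treated as more than one sample.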

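-nct specifies the number of CPU threads, but at the time of writing multithreading does not actually work here, so only one CPU is used.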
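Because every input file needs its own -I flag, the command grows quickly as more lanes are added. As a minimal sketch, a small shell helper can assemble the same kind of command from a list of BAM files and print it for inspection instead of running it. The function name build_hc_cmd is hypothetical, file names are assumed to contain no spaces, and -nct is left out in line with the remark above.

```shell
# Print (do not run) a HaplotypeCaller command for one sample.
# Usage: build_hc_cmd <reference.fasta> <output.vcf> <bam> [<bam> ...]
build_hc_cmd() {
    ref=$1
    out=$2
    shift 2
    cmd="java -jar ./GenomeAnalysisTK.jar -T HaplotypeCaller -R $ref"
    # add one -I flag per BAM file
    for bam in "$@"; do
        cmd="$cmd -I $bam"
    done
    printf '%s\n' "$cmd -o $out"
}

# Dry run with two of the files from the example; this only prints the command.
build_hc_cmd RAP_cDAN.fasta gatk.vcf \
    LC17-1_L005.sorted.rmp.rg.recal.bam \
    LC17-2_L008.sorted.rmp.rg.recal.bam
```

Printing first makes it easy to check the file list by eye before pasting the command into a terminal.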

The resulting file is gatk.vcf.
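HaplotypeCaller reports indels as well as SNPs, so the VCF usually contains both. As a rough sketch for a first look at the result, the hypothetical helper count_snps below counts records whose REF allele and every ALT allele are a single base. It reads an uncompressed VCF on standard input and is a simple filter under those assumptions, not a replacement for GATK's own variant selection tools.

```shell
# Count SNP records in an uncompressed VCF read from stdin.
# A record counts as a SNP if REF is one base and every ALT allele is one base.
count_snps() {
    awk -F '\t' '
        /^#/ { next }                    # skip meta and header lines
        length($4) == 1 {                # column 4 is REF
            n = split($5, alt, ",")      # column 5 is ALT, comma separated
            ok = 1
            for (i = 1; i <= n; i++)
                if (alt[i] !~ /^[ACGTN]$/) ok = 0
            if (ok) count++
        }
        END { print count + 0 }
    '
}
```

It could then be applied to the output with count_snps < gatk.vcf.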