首页 > 代码库 > vcffilter 工具bug以及解决办法

vcffilter 工具bug以及解决办法

1,使用说明:

usage: vcffilter [options] <vcf file>

options:    

-f, --info-filter     specifies a filter to apply to the info fields of records, removes alleles which do not pass the filter    

-g, --genotype-filter specifies a filter to apply to the genotype fields of records    

-s, --filter-sites    filter entire records, not just alleles    

-t, --tag-pass        tag vcf records as positively filtered with this tag, print all records    

-F, --tag-fail        tag vcf records as negatively filtered with this tag, print all records    

-A, --append-filter   append the existing filter tag, don‘t just replace it    

-a, --allele-tag      apply -t on a per-allele basis.  adds or sets the corresponding INFO field tag    

-v, --invert          inverts the filter, e.g. grep -v    

-o, --or              use logical OR instead of AND to combine filters    

-r, --region          specify a region on which to target the filtering, requires a BGZF compressed file which has been indexed with tabix.  any number of regions may be specified.

Filter the specified vcf file using the set of filters. Filters are specified in the form "<ID> <operator> <value>:

 -f "DP > 10"  # for info fields  -g "GT = 1|1" # for genotype fields  -f "CpG"  # for ‘flag‘ fields

Operators can be any of: =, !, <, >, |, &

Any number of filters may be specified.  They are combined via logical AND unless --or is specified on the command line.  Obtain logical negation through the use of parentheses, e.g. ! "( DP = 10 )"

For convenience, you can specify "QUAL" to refer to the quality of the site, even though it does not appear in the INFO fields.

2,举例:

以上是说明,我举个例子,假如你对C18.vcf文件筛选的要求是:质量要大于30,深度要大于10,类型为SNP,那么执行如下命令:

vcffilter -f "QUAL > 30 & DP > 10 & TYPE = snp" C18.vcf > C18.filtered.vcf

会得到经过筛选过的vcf文件。

3,BUG:

为了检验经过筛选的文件是否符合要求,自己写一个脚本来进行比较一下,

 1 f0 = open(C18.vcf, r)
 2             filtername = i.split(.)[0]+.filtered.vcf
 3             f1 = open(filtername, w)
 4             for i in f0:
 5                 if i.startswith(#):
 6                     f1.write(i)
 7                 else:
 8                     j = i.split()
 9                     if (float(j[5]) > 30 and j[7].split(=)[-1] == snp
10                         and int(j[9].split(:)[1]) > 10) :
11                         f1.write(i)

运行脚本后发现,和软件结果不同,脚本产生的文件比软件产生的结果要小。

经过比较发现,软件将变异类型为 complex,snp 转化为了snp,放到了结果文件中。这种类型我们暂时不需要考虑,所以软件是存在bug的!

4,总结。

筛选vcf很简单,只需要对每行进行遍历,将符合条件的行数写入新文件即可。

自己写几行代码就可以了。

by freemao

FAFU.

free_mao@qq.com