The VCF Format Explained

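VCF is a text file format for describing SNP, indel, and SV calls. GATK offers the best support for it; samtools also writes its results as VCF, although its output differs slightly from GATK's.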

Here is an example VCF file:

##fileformat=VCFv4.0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
chr1 873762 . T G 5231.78 PASS AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
chr1 877664 rs3828047 A G 3931.66 PASS AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD=0.1185 GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
chr1 899282 rs28548431 C T 71.77 PASS AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
chr1 974165 rs9442391 T C 29.84 LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255

As the example shows, a VCF file has two parts: the header (comment) lines, which start with "#", and the body lines, which do not.
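The split between the two parts can be sketched in a few lines of Python. This is only a minimal illustration: the function name `split_vcf` is made up for this article, and the code splits on any whitespace because the example here is space-separated, whereas real VCF files are tab-delimited.

```python
def split_vcf(text):
    """Split VCF text into (meta_lines, column_names, records)."""
    meta, columns, records = [], [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("##"):
            meta.append(line)  # meta-information lines
        elif line.startswith("#"):
            columns = line.lstrip("#").split()  # the column-header line
        else:
            records.append(line.split())  # one variant per body line
    return meta, columns, records


example = """##fileformat=VCFv4.0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
chr1 899282 rs28548431 C T 71.77 PASS AC=1;AF=0.50;AN=2;DB;DP=4 GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
"""

meta, columns, records = split_vcf(example)
print(meta)
print(columns)
print(records[0][:5])
```

For real work a dedicated VCF library is the safer choice; the point here is only that the leading "#" is what separates the header from the body.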

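Notably, the header carries a lot of descriptive information about the file. In fact, reading the header alone is enough to work out what every row and column means, even without this article. We start with the structure of the body of a VCF file, shown below: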

[header lines]

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878

chr1 873762 . T G 5231.78 PASS [annotations] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255

chr1 877664 rs3828047 A G 3931.66 PASS [annotations] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0

chr1 899282 rs28548431 C T 71.77 PASS [annotations] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26

chr1 974165 rs9442391 T C 29.84 LowQual [annotations] GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255

The header lines have been dropped above, leaving only the column-header line that names each column. Each line of the body describes one variant.

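CHROM and POS: the name of the reference sequence and the position of the variant. For an indel, POS is the position of the first base of the REF allele, which under the VCF specification is the base immediately before the inserted or deleted sequence.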

ID: the identifier of the variant. If the SNP has an ID in dbSNP, for example, it is given in this column; if not, '.' marks it as a novel variant.

REF and ALT: the base(s) in the reference sequence and the base(s) of the variant.

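QUAL: a Phred-scaled quality value expressing how likely it is that a variant exists at this site; the higher the value, the more likely the variant is real. It is computed as QUAL = -10 * log10(1 - p), where p is the probability that the variant exists. From the formula, a value of 10 corresponds to an error probability of 0.1, that is, a 90% probability that the site is a variant.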
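The formula is easy to check numerically. The sketch below just applies QUAL = -10 * log10(1 - p) in both directions; the function names are made up for illustration.

```python
import math


def prob_variant_to_phred(p):
    """Phred-scaled QUAL for a variant probability p (0 <= p < 1)."""
    return -10.0 * math.log10(1.0 - p)


def phred_to_prob_variant(qual):
    """Probability that the site is a variant, given a Phred-scaled QUAL."""
    return 1.0 - 10 ** (-qual / 10.0)


print(prob_variant_to_phred(0.9))    # the p = 90% case from the text
print(phred_to_prob_variant(10))     # and back again
print(phred_to_prob_variant(29.84))  # QUAL of the LowQual record in the example
```

Even the LowQual record, with QUAL 29.84, corresponds to a variant probability of roughly 0.999 under this formula, which is one reason QUAL alone is not a sufficient filter.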

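FILTER: filtering on the QUAL value alone is not enough, and GATK can apply other filtering methods as well. If the variant passes the filters, this field is "PASS"; if the variant is unreliable, the field is neither "PASS" nor ".".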
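Reading the FILTER column might look like the sketch below. Two things here are assumptions taken from the VCF specification rather than from the text above: "." is treated as "no filters were applied", and several failed filters are assumed to be joined with ";". The function name is hypothetical.

```python
def filter_status(filter_field):
    """Return (status, failed_filters) for a FILTER column value."""
    if filter_field == "PASS":
        return "passed", []
    if filter_field == ".":
        return "unfiltered", []  # assumption: '.' means no filters were applied
    # Anything else names the filter(s) that the record failed.
    return "failed", filter_field.split(";")


print(filter_status("PASS"))
print(filter_status("LowQual"))
print(filter_status("."))
```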

INFO: this column holds the detailed annotations of the variant. It contains a lot, so it is covered in detail further down.

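FORMAT and NA12878: together, these two columns give the genotype information for the sample 'NA12878'. 'NA12878' is the sample name, which is taken from the SM tag of the @RG line in the BAM file.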

chr1 873762 . T G [clipped] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255

chr1 877664 rs3828047 A G [clipped] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0

chr1 899282 rs28548431 C T [clipped] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26

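Look at the last two columns above. They correspond to each other: the first gives the format, and the second gives the data laid out in that format.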
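Because the two columns line up key by key, pairing them is a one-liner. A minimal sketch (the function name `parse_sample` is made up here):

```python
def parse_sample(format_field, sample_field):
    """Pair each FORMAT key with the matching value in the sample column."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # zip stops at the shorter list, so a sample column with fewer
    # values than FORMAT keys still parses without error.
    return dict(zip(keys, values))


sample = parse_sample("GT:AD:DP:GQ:PL", "0/1:173,141:282:99:255,0,255")
print(sample)
```

The values are left as strings on purpose; what each one means, and how to convert it, depends on the key.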

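GT: the genotype of the sample. Two numbers separated by '/' give the genotype of a diploid sample: 0 means the sample carries the REF allele, 1 means it carries the ALT allele, and 2 means it carries a second ALT allele. So 0/0 means the site is homozygous and matches REF; 0/1 means the site is heterozygous, carrying both the REF and the ALT allele; 1/1 means the site is homozygous for the ALT allele.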
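The three cases can be told apart mechanically. The sketch below is illustrative only; besides the '/' separator described above, it also accepts '|' (phased genotypes) and '.' (missing calls), both of which come from the VCF specification rather than from this article.

```python
import re


def classify_genotype(gt):
    """Classify a GT string such as '0/1' into a zygosity label."""
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    if alleles[0] == "0":
        return "homozygous reference"
    return "homozygous variant"


for gt in ["0/0", "0/1", "1/1", "1/2"]:
    print(gt, classify_genotype(gt))
```

Note that 1/2 counts as heterozygous here: the sample carries two different ALT alleles and no REF allele.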

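AD and DP: AD (allele depth) is the read coverage of each allele in the sample. For a diploid sample it is two comma-separated values, the first for the REF allele and the second for the ALT allele. DP (depth) is the read coverage of the sample at this site.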
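A common use of AD is to look at what fraction of the reads supports each allele. A minimal sketch, with a made-up function name:

```python
def allele_fractions(ad_field):
    """Fraction of reads supporting each allele, from an AD string like '173,141'."""
    depths = [int(x) for x in ad_field.split(",")]
    total = sum(depths)
    if total == 0:
        return [0.0] * len(depths)  # no reads at all: avoid dividing by zero
    return [d / total for d in depths]


print(allele_fractions("173,141"))  # heterozygous record from the example
print(allele_fractions("0,105"))    # homozygous variant record
```

The fractions are computed from the AD values themselves rather than from DP, because the two need not agree: in the first example record the AD values add up to 314 while the sample DP is 282.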

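GQ: the genotype quality. It is a Phred-scaled value expressing how likely it is that the called genotype is correct at this site; the higher the value, the more likely the genotype. It is computed as GQ = -10 * log10(1 - p), where p is the probability that the genotype is correct.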

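PL: the Phred-scaled likelihoods of the three given genotypes (0/0, 0/1, 1/1). The probabilities of these three genotypes sum to 1. Unlike the values above, a larger PL means that genotype is less likely: PL = -10 * log10(p), where p is the probability of the genotype. GATK rescales the values so that the most likely genotype has PL 0, as in the example records above.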
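To get a feel for the numbers, the PL values can be turned back into probabilities. The sketch below inverts PL = -10 * log10(p) and then rescales so the three values sum to 1; treating the rescaled likelihoods as probabilities assumes a flat prior over the genotypes, which is a simplification made here for illustration, not something GATK is claimed to do.

```python
def pl_to_probabilities(pl_field):
    """Convert a PL string like '103,0,26' into probabilities summing to 1."""
    pls = [int(x) for x in pl_field.split(",")]
    likelihoods = [10 ** (-pl / 10.0) for pl in pls]  # invert the Phred scale
    total = sum(likelihoods)
    return [lk / total for lk in likelihoods]  # assumption: flat prior


probs = pl_to_probabilities("103,0,26")
for genotype, p in zip(["0/0", "0/1", "1/1"], probs):
    print(genotype, p)
```

For the record with PL 103,0,26 nearly all of the probability lands on 0/1, which matches the GT value 0/1 reported for that record.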

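The INFO column carries the most information of all. It consists of "tag=value" entries separated by ";". Many of the tags are documented in the header of the VCF file. The tags are explained below.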

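AC, AF and AN: AC (allele count) is the number of copies of the ALT allele; AF (allele frequency) is the frequency of the allele; AN (allele number) is the total number of alleles. For a single diploid sample, genotype 0/1 means the sample is heterozygous: the allele count is 1 (only one of the two alleles at this site is mutated), the allele frequency is 0.5 (only 50% of the alleles at this site are mutated), and the total allele number is 2. Genotype 1/1 means the sample is homozygous: the allele count is 2, the allele frequency is 1, and the total allele number is 2.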
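The same arithmetic extends to several samples. The sketch below counts alleles across a list of GT strings; the function name and the `alt_index` parameter are made up for illustration, and missing alleles ('.') are simply left out of the total, which is an assumption of this sketch.

```python
import re


def allele_counts(genotypes, alt_index=1):
    """Return (AC, AF, AN) for one ALT allele across a list of GT strings."""
    alleles = [
        allele
        for gt in genotypes
        for allele in re.split(r"[/|]", gt)
        if allele != "."  # assumption: missing alleles are not counted
    ]
    an = len(alleles)                   # total number of called alleles
    ac = alleles.count(str(alt_index))  # copies of the chosen ALT allele
    af = ac / an if an else 0.0
    return ac, af, an


print(allele_counts(["0/1"]))  # the heterozygous case from the text
print(allele_counts(["1/1"]))  # the homozygous case from the text
print(allele_counts(["0/1", "1/1", "0/0"]))
```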

DP: the read depth at the site, counted after some reads have been filtered out.

Dels: fraction of reads containing spanning deletions. In SNP and indel calling results, a record that has this tag with a value of 0 is a SNP; a record without the tag is an indel.

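FS: the Phred-scaled p-value from Fisher's exact test for strand bias. The smaller the value, the better. When filtering, a threshold somewhere between FS < 10 and FS < 20 is typical.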
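Since FS is Phred-scaled, the thresholds translate directly into p-values: FS = 10 is p = 0.1 and FS = 20 is p = 0.01. The sketch below shows that conversion and a simple hard filter. It assumes the INFO column has already been parsed into a dict of tag to value; the function names, the default threshold of 20, and the choice to keep records that lack an FS tag are all assumptions of this example.

```python
def fs_to_pvalue(fs):
    """Convert a Phred-scaled FS value back to the underlying p-value."""
    return 10 ** (-fs / 10.0)


def passes_strand_bias(info, max_fs=20.0):
    """Keep a record if its FS is below max_fs; records without FS are kept."""
    if "FS" not in info:
        return True
    return float(info["FS"]) < max_fs


print(fs_to_pvalue(10), fs_to_pvalue(20))
print(passes_strand_bias({"FS": "3.5"}))
print(passes_strand_bias({"FS": "45.2"}))
```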

HaplotypeScore: consistency of the site with at most two segregating haplotypes

InbreedingCoeff: inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation

MLEAC: maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed

MLEAF: maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed

QD: variant confidence/quality by depth

RPA: number of times tandem repeat unit is repeated, for each allele (including reference)

RU: tandem repeat unit (bases)

ReadPosRankSum: Z-score from Wilcoxon rank sum test of ALT vs. REF read position bias

STR: variant is a short tandem repeat
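The "tag=value" layout of the INFO column is simple enough to parse by hand. The sketch below is one way to do it; the function name is made up, values are left as strings, and tags that appear without a value (such as DB in the example records) are stored as True.

```python
def parse_info(info_field):
    """Parse an INFO string such as 'AC=2;AF=1.00;DB;DP=105' into a dict."""
    info = {}
    if info_field == ".":
        return info  # '.' means the column is empty
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        # Tags without '=' are flags: their presence is the information.
        info[key] = value if sep else True
    return info


info = parse_info("AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;QD=37.44")
print(info["AC"], info["AF"], info["DB"])
print("FS" in info)
```

Which type each value should be converted to (integer, float, list) is stated in the ##INFO lines of the header, so a fuller parser would read those first.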
