Monday, 23 May 2011

Next Generation Sequencing

The problem: variants have been called for two different species against a single reference genome, and the two sets of calls need to be combined to find the variants that differ between the species.

First, edit each filtered VCF file to remove the sample name from the header. The program is intended to find differences between samples of the same species, so the differing sample names get in the way.

vi varX.flt.vcf
vi varY.flt.vcf
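
The manual edit above can also be scripted. This is a minimal sketch, assuming single-sample VCF files with the sample name in column 10 of the #CHROM header line, and assuming that giving both files the same placeholder name is enough. The function name rename_vcf_sample and the placeholder SAMPLE are made up for illustration.

```shell
# Hypothetical helper: overwrite the sample name (column 10 of the #CHROM
# header line) with a placeholder. Every other line is printed unchanged.
rename_vcf_sample() {
    # $1 = uncompressed VCF file, $2 = new sample name
    awk -v name="$2" '
        BEGIN { FS = OFS = "\t" }               # VCF columns are tab-separated
        /^#CHROM/ && NF >= 10 { $10 = name }    # touch only the column header line
        { print }
    ' "$1"
}

# Usage (writes a new file instead of editing in place):
#   rename_vcf_sample varX.flt.vcf SAMPLE > varX.renamed.vcf
#   rename_vcf_sample varY.flt.vcf SAMPLE > varY.renamed.vcf
```

Writing to a new file keeps the original calls untouched in case the header needs checking later.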

Next, use bgzip and tabix from Heng Li to compress and index each file. bgzip replaces each .vcf file with a .vcf.gz file, and tabix writes a .tbi index next to it.

bgzip varX.flt.vcf
bgzip varY.flt.vcf
tabix -p vcf varX.flt.vcf.gz
tabix -p vcf varY.flt.vcf.gz

Next, use the vcftools script vcf-isec to find the complement of each dataset against the other. These are the variants that are unique to each species.

vcf-isec -c varX.flt.vcf.gz varY.flt.vcf.gz | bgzip -c > unique_varX.vcf.gz
vcf-isec -c varY.flt.vcf.gz varX.flt.vcf.gz | bgzip -c > unique_varY.vcf.gz
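
To make the idea of a complement concrete, here is a rough cross-check in plain awk. It is only an illustration and not how vcf-isec works internally: it assumes uncompressed VCF text, compares records by chromosome and position alone, and ignores alleles. The function name positions_only_in_first is made up.

```shell
# Hypothetical helper: print the records of the first VCF whose CHROM:POS
# does not occur in the second VCF. Header lines are dropped from the output.
positions_only_in_first() {
    # $1 = VCF of interest, $2 = VCF to subtract (both uncompressed)
    awk -F '\t' '
        /^#/ { next }                                 # skip header lines in both passes
        pass == 1 { seen[$1 ":" $2] = 1; next }       # pass 1: remember positions to subtract
        !(($1 ":" $2) in seen)                        # pass 2: keep positions not seen before
    ' pass=1 "$2" pass=2 "$1"
}

# Usage on decompressed copies (the file names x.vcf and y.vcf are examples):
#   gzip -dc varX.flt.vcf.gz > x.vcf
#   gzip -dc varY.flt.vcf.gz > y.vcf
#   positions_only_in_first x.vcf y.vcf | wc -l
```

If the count is far from the number of records in unique_varX.vcf.gz, that is a hint to look more closely at how the two files were compared.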

You can also get the numbers for a Venn diagram of the overlap of variants between the different species.

vcf-compare var0.flt.vcf.gz var15.flt.vcf.gz > venn.out

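You can also extract the variants that overlap. Here -n +2 keeps positions present in at least two files, and -o prints only the entries from the left-most file.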

vcf-isec -o -n +2 var15.flt.vcf.gz var0.flt.vcf.gz | bgzip -c > overlap_var15.vcf.gz
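
A quick way to sanity-check the output files is to count their variant records. The sketch below relies only on the fact that bgzip output can be read by gzip. The function name count_vcf_records is made up for illustration.

```shell
# Hypothetical helper: count the non-header records in a bgzip/gzip-compressed VCF.
count_vcf_records() {
    # $1 = compressed VCF file (.vcf.gz)
    gzip -dc "$1" | awk '!/^#/ { n++ } END { print n + 0 }'   # n + 0 prints 0 for empty input
}

# Usage:
#   count_vcf_records overlap_var15.vcf.gz
#   count_vcf_records unique_varX.vcf.gz
```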
