Monday 23 May 2011

Next Generation Sequencing

The problem: you have variant calls from two different species, each called against the same reference genome, and you want to find the variants that differ between the two species.

First, edit the filtered VCF files to remove the sample name. This is needed because the program is intended to find differences between samples of the same species.


vi varX.flt.vcf
vi varY.flt.vcf
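
If you would rather not edit the header by hand, the step can be scripted. This is only a sketch under assumptions: it treats each file as a single-sample VCF and reads "removing the sample name" as replacing column 10 of the #CHROM header line with a shared placeholder. The function name normalize_sample_name and the placeholder SAMPLE are made up for illustration.

```shell
# Hypothetical helper: rewrite the sample column name (column 10) on the
# #CHROM header line to a shared placeholder. All other lines pass through
# unchanged.
normalize_sample_name() {
    # $1 = uncompressed VCF file, $2 = placeholder name (default: SAMPLE)
    awk -v name="${2:-SAMPLE}" 'BEGIN { FS = OFS = "\t" }
        /^#CHROM/ && NF >= 10 { $10 = name }   # only the column header line
        { print }' "$1"
}
```

For example, normalize_sample_name varX.flt.vcf > varX.tmp && mv varX.tmp varX.flt.vcf would apply it in place. Check the header afterwards with grep '^#CHROM' varX.flt.vcf.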


Next, use bgzip and tabix (by Heng Li) to compress and index each file. bgzip replaces each file with a .gz version, which tabix then indexes.


bgzip varX.flt.vcf
bgzip varY.flt.vcf
tabix -p vcf varX.flt.vcf.gz
tabix -p vcf varY.flt.vcf.gz


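Next, use the vcftools script vcf-isec to find the complement of each dataset with respect to the other. With -c, vcf-isec outputs the positions present in the first file but missing from the others, so each command below gives the variants unique to one species.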


vcf-isec -c varX.flt.vcf.gz varY.flt.vcf.gz | bgzip -c > unique_varX.vcf.gz
vcf-isec -c varY.flt.vcf.gz varX.flt.vcf.gz | bgzip -c > unique_varY.vcf.gz
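
As a rough cross-check of what the complement means, the same set difference can be computed on uncompressed copies with plain awk. This is a sketch, not a replacement for vcf-isec: it assumes that matching on CHROM and POS alone is good enough, and the function name unique_positions is made up for illustration.

```shell
# Hypothetical helper: print CHROM:POS keys present in the first VCF but
# absent from the second. Both arguments are uncompressed VCF files.
unique_positions() {
    # awk reads the second VCF first, so its positions are known before
    # the first VCF is scanned.
    awk -F '\t' '
        /^#/ { next }                                   # skip header lines
        FILENAME == ARGV[1] { seen[$1 ":" $2] = 1; next }  # positions in VCF 2
        !(($1 ":" $2) in seen) { print $1 ":" $2 }      # VCF 1 minus VCF 2
    ' "$2" "$1"
}
```

Since bgzip output is gzip-compatible, gzip -dc varX.flt.vcf.gz > X.vcf gives you an uncompressed copy to feed in. The number of lines printed by unique_positions X.vcf Y.vcf can then be compared with the number of records in unique_varX.vcf.gz.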


You can also get the numbers for a Venn diagram of the variant overlap between the two species. Note that vcf-compare reports the counts as text; it does not draw the diagram.


vcf-compare varX.flt.vcf.gz varY.flt.vcf.gz > venn.out
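
To sanity-check the Venn numbers, the three regions can be counted directly from uncompressed copies of the two files. Again this is only a sketch: it keys each record on CHROM and POS (so several records at one position count once), it expects two different file paths, and the function name venn_counts and the output labels are made up for illustration.

```shell
# Hypothetical helper: count positions found only in the first VCF, only in
# the second VCF, and in both. Both arguments are uncompressed VCF files.
venn_counts() {
    awk -F '\t' '
        /^#/ { next }                                # skip header lines
        FILENAME == ARGV[1] { a[$1 ":" $2] = 1; next }  # positions in file 1
        { b[$1 ":" $2] = 1 }                         # positions in file 2
        END {
            for (k in a) { if (k in b) both++; else only_a++ }
            for (k in b) { if (!(k in a)) only_b++ }
            printf "only_first\t%d\nonly_second\t%d\nboth\t%d\n", only_a, only_b, both
        }
    ' "$1" "$2"
}
```

The three counts should be close to the Venn numbers in venn.out, although vcf-compare may treat multi-allelic sites or duplicate records differently.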


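You can also extract the variants that the two species share. Here -n +2 keeps positions present in two or more files, and -o prints only the entries from the left-most file.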


vcf-isec -o -n +2 varY.flt.vcf.gz varX.flt.vcf.gz | bgzip -c > overlap_varY.vcf.gz
