Data Deluge: Annotating Next Generation Sequencing

Monday, 23 May 2011

Annotating Next Generation Sequencing

To annotate a VCF file using vcf-annotate from vcftools, you first need to create a tab-delimited file containing the annotations.

This should have the following format:


#CHR FROM TO Annotation
5 53719508 53936990 ENSGALG00000011590; gene_name=TMEM179; gene_type=KNOWN_protein_coding
5 54024011 54038102 ENSGALG00000011608; gene_name=INF2; gene_type=KNOWN_BY_PROJECTION_protein_coding
5 54073641 54096083 ENSGALG00000011618; gene_name=ADSSL1; gene_type=KNOWN_protein_coding
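
A file in this layout could be produced from a GFF with awk and sort. The following is only a sketch: the input name region.gff, its three sample records, and the choice of the gene feature type are assumptions made for illustration, and the output is called annotate.gff only to match the file name used in this post.

```shell
# Hypothetical three-record GFF (tab-separated), deliberately out of order.
printf '5\tensembl\tgene\t54024011\t54038102\t.\t+\t.\tENSGALG00000011608\n'  > region.gff
printf '5\tensembl\texon\t53719508\t53719900\t.\t+\t.\tENSGALG00000011590\n' >> region.gff
printf '5\tensembl\tgene\t53719508\t53936990\t.\t+\t.\tENSGALG00000011590\n' >> region.gff

# Keep a single feature type, print CHR, FROM, TO and the attribute column,
# then sort by chromosome and numeric start position.
{
  printf '#CHR\tFROM\tTO\tANNOTATION\n'
  awk -F'\t' -v OFS='\t' '$3 == "gene" { print $1, $4, $5, $9 }' region.gff |
    sort -t "$(printf '\t')" -k1,1 -k2,2n
} > annotate.gff

cat annotate.gff
```

Filtering on one feature type and sorting by start position keeps the records in the order that indexing expects.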

I made this file by editing a GFF file for the region of interest. The file then needs to be compressed with bgzip and indexed with tabix. tabix requires the records to be sorted by position, so if you build the file from a GFF you will need a separate annotation file for each feature type; otherwise the records will not appear in order in the file.


bgzip annotate.gff
tabix -s 1 -b 2 -e 3 annotate.gff.gz
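
tabix will only index a file whose records are grouped by chromosome and sorted by start position, so it can be worth checking the order before the bgzip step. One possible check is sketched below with awk; check_sorted is a hypothetical helper (not part of tabix or vcftools), and the two small sample files exist only to demonstrate it.

```shell
# check_sorted FILE: succeed only if the non-comment lines are grouped by
# chromosome (column 1) and ascending by start (column 2) within each group.
check_sorted() {
  awk -F'\t' '
    /^#/ { next }
    $1 != chr {
      if ($1 in seen) { bad = NR; exit }   # chromosome seen earlier, not contiguous
      seen[$1] = 1; chr = $1; prev = 0
    }
    $2 + 0 < prev { bad = NR; exit }       # start position went backwards
    { prev = $2 + 0 }
    END { if (bad) { print "out of order at line " bad; exit 1 } }
  ' "$1"
}

# Hypothetical sample files: one in order, one not.
printf '#CHR\tFROM\tTO\tANNOTATION\n5\t100\t200\tgeneA\n5\t300\t400\tgeneB\n' > sorted.tab
printf '#CHR\tFROM\tTO\tANNOTATION\n5\t300\t400\tgeneB\n5\t100\t200\tgeneA\n' > unsorted.tab

check_sorted sorted.tab && echo "sorted.tab: ok"
check_sorted unsorted.tab || echo "unsorted.tab: needs sorting"
```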

Then the annotations can be added to the VCF file.


zcat input.vcf.gz | vcf-annotate -a annotate.gff.gz -d key=INFO,ID=ANN,Number=1,Type=String,Description='My custom annotation' -c CHROM,FROM,TO,INFO/ANN > out.vcf

Once the file is annotated you can use grep to pull out the lines with the required annotation.
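
As a rough illustration of the grep step, the sketch below keeps the VCF header lines together with the records carrying one annotation. The contents of out.vcf here are a hypothetical stand-in, and it assumes the Ensembl gene ID ends up as the value of the ANN tag in the INFO column; the exact text depends on what your annotation file contained.

```shell
# Hypothetical stand-in for the annotated out.vcf (tab-separated).
printf '##fileformat=VCFv4.1\n' > out.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> out.vcf
printf '5\t53720000\t.\tA\tG\t50\t.\tANN=ENSGALG00000011590\n' >> out.vcf
printf '5\t54030000\t.\tC\tT\t60\t.\tANN=ENSGALG00000011608\n' >> out.vcf
printf '5\t54100000\t.\tG\tA\t40\t.\tDP=9\n' >> out.vcf

# Keep header lines (starting with #) plus records with the wanted annotation,
# so the result is still a valid VCF.
grep -E '^#|ANN=ENSGALG00000011590' out.vcf > selected.vcf

cat selected.vcf
```

Keeping the header lines means the filtered file can still be read by other VCF tools; drop the `^#|` part of the pattern if only the matching records are wanted.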

2 comments:

Jeremy Leipzig, 13 June 2011 at 12:52
good tip. thanks!

Unknown, 11 July 2012 at 05:49
Hi Andy,

So I am trying to annotate my VCF file using the GFF annotation file with my gene models, but I keep getting the following error:

Broken VCF header, no column names?

I am not sure how to fix this, since this is the vcf output that I got from samtools & bcftools. Have you got any suggestions?