Data Deluge: May 2011

Monday, 23 May 2011

Annotating Next Generation Sequencing

To annotate a vcf file using vcftools you first need to create a file containing the annotations.

This should have the following format:


#CHR FROM TO Annotation
5 53719508 53936990 ENSGALG00000011590; gene_name=TMEM179; gene_type=KNOWN_protein_coding
5 54024011 54038102 ENSGALG00000011608; gene_name=INF2; gene_type=KNOWN_BY_PROJECTION_protein_coding
5 54073641 54096083 ENSGALG00000011618; gene_name=ADSSL1; gene_type=KNOWN_protein_coding

I got this file from editing a GFF file for the region of interest. This file then needs to be zipped with bgzip and indexed with tabix. You will need a different annotation file for each feature if you edit a gff as otherwise they will not appear in order in the file.


bgzip annotate.gff 
tabix -p gff annotate.gff.gz

The the annotations can be added to the file.


cat input.vcf.gz | vcf-annotate -a annotate.gff.gz -d key=INFO,ID=ANN,Number=1,Type=Integer,Description='My custom annotation' -c CHROM,FROM,TO,INFO/ANN > out.vcf

Once the file is annotated you can use grep to pull out the lines with the required annotation.

Next Generation Sequencing

The problem is combining variant calls from different species to a reference genome in order to find the variants between the two species.

First you need to edit the filtered vcf files to remove the sample name - as the program is intended to find differences between samples of the same species.


vi varX.flt.vcf 
vi varY.flt.vcf

Next you need to use bgzip and tabix from Heng Li to get a compressed and indexed datafile.

 
bgzip varX.flt.vcf 
bgzip varY.flt.vcf 
tabix -p vcf varX.flt.vcf.gz 
tabix -p vcf varY.flt.vcf.gz

Next you can use the vcftools function vcf-isec to find the complements of the two datasets. These will be the variants that are unique to the different species.


vcf-isec -c varX.flt.vcf.gz varY.flt.vcf.gz | bgzip -c > unique_varX.vcf.gz
vcf-isec -c varY.flt.vcf.gz varX.flt.vcf.gz | bgzip -c > unique_varY.vcf.gz

You can also create a Venn diagram of the overlap of variants between the different species.


vcf-compare var0.flt.vcf.gz var15.flt.vcf.gz > venn.out

And also look at the overlap in variants


vcf-isec -o -n +2 var15.flt.vcf.gz var0.flt.vcf.gz | bgzip -c > overlap_var15.vcf.gz

Thursday, 5 May 2011

Permutating miRBase

You can also user seqinr to permutate sequences.

For the miRBase example the code is:


permutation(mature[[1]],modele='base',frame=0)

GC Content Analysis of miRBase

I want to create a dataset for supervised learning based on miRBase. So I need to know the statistical properties of miRBase. Having used seqinr to import miRBase into R, I need to carry out some analysis. GC content is a first place to start as many of the sequences I am familiar with have a high AT content and so an AT bias might be a factor affecting the learning process.

There are 19724 sequences in the database. I could have used length(mature) instead of hard coding the number. You also need to initialise out as a variable first.


> for (x in c(1:19724)) out <- c(out, GC(mature[[x]])) > hist (out)

So I am happy with this as it is normally distributed and so I do not need to take any special care in the GC/AT content for my training sets.

Sequence Analysis

There is an R package for biological sequence analysis called seqinr.


> install.packages("seqinr")
> library("seqinr")
> mature <- read.fasta(file="mature.fa") # This reads in the database of miRNA fragments> length(mature) # How many entries are there in the database
> mirna <-mature[[1]] # Load the first sequence into mirna> table(mirna)
mirna
a g t 
5 8 9 
> count(mirna, 1) # Counts the number of bases

a c g t 
5 0 8 9 

> count(mirna, 2) # Counts the number of two letter words

aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt 
 0  0  4  1  0  0  0  0  1  0  2  5  4  0  2  2

You can keep going up to longer and longer words but beyond 6 is going to be a waste of computing time as it will be very sparsely populated.