Monday, 23 May 2011

Annotating Next Generation Sequencing

To annotate a vcf file using vcftools you first need to create a file containing the annotations.

This should have the following format:

#CHR FROM TO Annotation
5 53719508 53936990 ENSGALG00000011590; gene_name=TMEM179; gene_type=KNOWN_protein_coding
5 54024011 54038102 ENSGALG00000011608; gene_name=INF2; gene_type=KNOWN_BY_PROJECTION_protein_coding
5 54073641 54096083 ENSGALG00000011618; gene_name=ADSSL1; gene_type=KNOWN_protein_coding

I got this file from editing a GFF file for the region of interest. This file then needs to be zipped with bgzip and indexed with tabix. You will need a different annotation file for each feature if you edit a gff as otherwise they will not appear in order in the file.

bgzip annotate.gff
tabix -p gff annotate.gff.gz

The the annotations can be added to the file.

cat input.vcf.gz | vcf-annotate -a annotate.gff.gz -d key=INFO,ID=ANN,Number=1,Type=Integer,Description='My custom annotation' -c CHROM,FROM,TO,INFO/ANN > out.vcf

Once the file is annotated you can use grep to pull out the lines with the required annotation.

Next Generation Sequencing

The problem is combining variant calls from different species to a reference genome in order to find the variants between the two species.

First you need to edit the filtered vcf files to remove the sample name - as the program is intended to find differences between samples of the same species.

vi varX.flt.vcf
vi varY.flt.vcf

Next you need to use bgzip and tabix from Heng Li to get a compressed and indexed datafile.

bgzip varX.flt.vcf
bgzip varY.flt.vcf
tabix -p vcf varX.flt.vcf.gz
tabix -p vcf varY.flt.vcf.gz

Next you can use the vcftools function vcf-isec to find the complements of the two datasets. These will be the variants that are unique to the different species.

vcf-isec -c varX.flt.vcf.gz varY.flt.vcf.gz | bgzip -c > unique_varX.vcf.gz
vcf-isec -c varY.flt.vcf.gz varX.flt.vcf.gz | bgzip -c > unique_varY.vcf.gz

You can also create a Venn diagram of the overlap of variants between the different species.

vcf-compare var0.flt.vcf.gz var15.flt.vcf.gz > venn.out

And also look at the overlap in variants

vcf-isec -o -n +2 var15.flt.vcf.gz var0.flt.vcf.gz | bgzip -c > overlap_var15.vcf.gz

Thursday, 5 May 2011

Permutating miRBase

You can also user seqinr to permutate sequences.

For the miRBase example the code is:


GC Content Analysis of miRBase

I want to create a dataset for supervised learning based on miRBase. So I need to know the statistical properties of miRBase. Having used seqinr to import miRBase into R, I need to carry out some analysis. GC content is a first place to start as many of the sequences I am familiar with have a high AT content and so an AT bias might be a factor affecting the learning process.

There are 19724 sequences in the database. I could have used length(mature) instead of hard coding the number. You also need to initialise out as a variable first.

> for (x in c(1:19724)) out <- c(out, GC(mature[[x]])) > hist (out)

So I am happy with this as it is normally distributed and so I do not need to take any special care in the GC/AT content for my training sets.

Sequence Analysis

There is an R package for biological sequence analysis called seqinr.

> install.packages("seqinr")
> library("seqinr")
> mature <- read.fasta(file="mature.fa") # This reads in the database of miRNA fragments> length(mature) # How many entries are there in the database
> mirna <-mature[[1]] # Load the first sequence into mirna> table(mirna)
a g t
5 8 9
> count(mirna, 1) # Counts the number of bases

a c g t
5 0 8 9

> count(mirna, 2) # Counts the number of two letter words

aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt 
0 0 4 1 0 0 0 0 1 0 2 5 4 0 2 2

You can keep going up to longer and longer words but beyond 6 is going to be a waste of computing time as it will be very sparsely populated.

Wednesday, 27 April 2011

Why I love working behind a Websense Firewall

There is nothing more useful than a Websense Firewall. I especially like the way it continuously helps you to do your job, the way it gives you the best possible support. The way it ... Here is todays brilliantly blocked site.

So I cannot access Mendeley for a paper I would have liked to have read. Very clever and a big waste of my time having to find it another way.

Thursday, 24 February 2011

Plan to automate miRbase

Need to check for a new release of miRbase and then download it

It is located at:

Need to check the data-stamp on the directory that CURRENT points to and then download it if it is newer than the date-stamp of the last update on the local machine. On average 6 monthly.

Need to get the mature.fa.gz file from this directory

Then this needs to be moved to the local /usr/share/ncbi/data directory
Then it needs to be gunziped
Then it needs to be converted to a blast database with the formatdb command.

Grep gga from mature.fa > gga_mature.fa selects out those from Gallus gallus.

To use
megablast -d /usr/share/ncbi/data/mature.fa -i sequences.fa -o sequences.out -W 7 -D 2 -p 100 [-e 1000]

Monday, 31 January 2011

Useful tools for linux

ps -eo pid,pmem,cmd
Shows the memory usage of the processes that are currently running.
Updates the locate database.
locate String
Very fast find routine.

Thursday, 27 January 2011

Possibly Useful - code verification for modules

Monday, 17 January 2011


If the tar contains the binaries

groupadd mysql
useradd -r -g mysql mysql
cd /usr/local
ln -s /usr/local/src/msyql-VERSION mysql
cd mysql
chown -R mysql .
chgro -R mysql .
scripts/mysql_install.db --user=mysql
chown -R root .
chown -R mysql data
bin/mysqld_safe --user=mysql &
cp /usr/local/mysql/support-files/mysql.server /etc/init.d/mysql.server


./configure --prefix=/usr/local/pgsql
gmake install
adduser postgres
mkdir /usr/local/pgql/data
su - postgres
/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l logfile start


/usr/local/pgsql/bin/createdb test
/usr/local/pgsql/bin/psql test

Preparing for Drupal

Making sure Apache and MySQL run on start-up

chkconfig mysqld on

look at the rc.local file for apachectl

disable SE Linux

Copy drupal files into /var/www/html

cp default/default.settings.php default/settings.php
mkdir default/files
chmod a+w default/settings.php
chmod -R a+w default/files

Configuring Apache/PHP

./configure --prefix=/usr/local/apache --enable-so
make install
vi /usr/local/apache/conf/httpd.conf
Add the places from the webpages - usually this will not actually be in the same place as you are running apache and it is most commonly /var/www/html

/usr/local/apache/bin/apachectl -k start


copy the config.nice file from the working version of the server to the new version directory. Use a new prefix and a different port to test the configuration - change the Listen directive to specify the new port.
make install
usr/local/apache/bin/apachectl -k graceful-stop
usr/local/apache/bin/apachectl -k start

add a call to apachectl to the rc.local file so that it autostarts.


./configure --prefix=/usr/local/php \
--with-apxs2=/usr/local/apache/bin/apxs \
--with-config-file-path=/usr/local/php \
--with-gd \
--with-mysql=/usr/local/mysql/ \
--with-mysqli=mysqlnd \
--with-pdo-mysql=mysqlnd \

/usr/local/apache/bin/apachectl stop
make install

Edit httpd.conf to add
LoadModule php5_module modules/
AddType application/x-httpd-php .php
<Files *.php>
SetOutputFilter PHP
SetInputFilter PHP

Edit php.ini for pgsql and mysql lines removes the ; but not on the dll lines.

Sunday, 16 January 2011

Setting up MySQL

When you add MySQL to your Linux (in my case Fedora) system it has a root account but no password! This is not a very good idea and so the first thing that you need to do is create a root account and password. You do this using the mysqladmin tool as root.

# mysqladmin -u root password new_password

Where new_password is the new password (not just the word new_password).

You can now login to the MySQL using the root username and password and create a new account. 

# mysql -u root -p
mysql> CREATE USER 'drupal'@'localhost' IDENTIFIED BY 'password';

You need to the exit (use quit) and use mysqladmin again to create the database.
# mysqladmin -u root -p create drupalbase
# mysql -u root -p