Data Deluge: 2011

Monday, 23 May 2011

Annotating Next Generation Sequencing

To annotate a vcf file using vcftools you first need to create a file containing the annotations.

This should have the following format:


#CHR FROM TO Annotation
5 53719508 53936990 ENSGALG00000011590; gene_name=TMEM179; gene_type=KNOWN_protein_coding
5 54024011 54038102 ENSGALG00000011608; gene_name=INF2; gene_type=KNOWN_BY_PROJECTION_protein_coding
5 54073641 54096083 ENSGALG00000011618; gene_name=ADSSL1; gene_type=KNOWN_protein_coding

I got this file from editing a GFF file for the region of interest. This file then needs to be zipped with bgzip and indexed with tabix. You will need a different annotation file for each feature if you edit a gff as otherwise they will not appear in order in the file.


bgzip annotate.gff 
tabix -p gff annotate.gff.gz

The the annotations can be added to the file.


cat input.vcf.gz | vcf-annotate -a annotate.gff.gz -d key=INFO,ID=ANN,Number=1,Type=Integer,Description='My custom annotation' -c CHROM,FROM,TO,INFO/ANN > out.vcf

Once the file is annotated you can use grep to pull out the lines with the required annotation.

Next Generation Sequencing

The problem is combining variant calls from different species to a reference genome in order to find the variants between the two species.

First you need to edit the filtered vcf files to remove the sample name - as the program is intended to find differences between samples of the same species.


vi varX.flt.vcf 
vi varY.flt.vcf

Next you need to use bgzip and tabix from Heng Li to get a compressed and indexed datafile.

 
bgzip varX.flt.vcf 
bgzip varY.flt.vcf 
tabix -p vcf varX.flt.vcf.gz 
tabix -p vcf varY.flt.vcf.gz

Next you can use the vcftools function vcf-isec to find the complements of the two datasets. These will be the variants that are unique to the different species.


vcf-isec -c varX.flt.vcf.gz varY.flt.vcf.gz | bgzip -c > unique_varX.vcf.gz
vcf-isec -c varY.flt.vcf.gz varX.flt.vcf.gz | bgzip -c > unique_varY.vcf.gz

You can also create a Venn diagram of the overlap of variants between the different species.


vcf-compare var0.flt.vcf.gz var15.flt.vcf.gz > venn.out

And also look at the overlap in variants


vcf-isec -o -n +2 var15.flt.vcf.gz var0.flt.vcf.gz | bgzip -c > overlap_var15.vcf.gz

Thursday, 5 May 2011

Permutating miRBase

You can also user seqinr to permutate sequences.

For the miRBase example the code is:


permutation(mature[[1]],modele='base',frame=0)

GC Content Analysis of miRBase

I want to create a dataset for supervised learning based on miRBase. So I need to know the statistical properties of miRBase. Having used seqinr to import miRBase into R, I need to carry out some analysis. GC content is a first place to start as many of the sequences I am familiar with have a high AT content and so an AT bias might be a factor affecting the learning process.

There are 19724 sequences in the database. I could have used length(mature) instead of hard coding the number. You also need to initialise out as a variable first.


> for (x in c(1:19724)) out <- c(out, GC(mature[[x]])) > hist (out)

So I am happy with this as it is normally distributed and so I do not need to take any special care in the GC/AT content for my training sets.

Sequence Analysis

There is an R package for biological sequence analysis called seqinr.


> install.packages("seqinr")
> library("seqinr")
> mature <- read.fasta(file="mature.fa") # This reads in the database of miRNA fragments> length(mature) # How many entries are there in the database
> mirna <-mature[[1]] # Load the first sequence into mirna> table(mirna)
mirna
a g t 
5 8 9 
> count(mirna, 1) # Counts the number of bases

a c g t 
5 0 8 9 

> count(mirna, 2) # Counts the number of two letter words

aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt 
 0  0  4  1  0  0  0  0  1  0  2  5  4  0  2  2

You can keep going up to longer and longer words but beyond 6 is going to be a waste of computing time as it will be very sparsely populated.

Wednesday, 27 April 2011

Why I love working behind a Websense Firewall

There is nothing more useful than a Websense Firewall. I especially like the way it continuously helps you to do your job, the way it gives you the best possible support. The way it ... Here is todays brilliantly blocked site.

So I cannot access Mendeley for a paper I would have liked to have read. Very clever and a big waste of my time having to find it another way.

Thursday, 24 February 2011

Plan to automate miRbase

Need to check for a new release of miRbase and then download it

It is located at:
ftp://mirbase.org/pub/mirbase/

Need to check the data-stamp on the directory that CURRENT points to and then download it if it is newer than the date-stamp of the last update on the local machine. On average 6 monthly.

Need to get the mature.fa.gz file from this directory

Then this needs to be moved to the local /usr/share/ncbi/data directory
Then it needs to be gunziped
Then it needs to be converted to a blast database with the formatdb command.

Grep gga from mature.fa > gga_mature.fa selects out those from Gallus gallus.

To use
megablast -d /usr/share/ncbi/data/mature.fa -i sequences.fa -o sequences.out -W 7 -D 2 -p 100 [-e 1000]

Monday, 31 January 2011

Useful tools for linux

ps -eo pid,pmem,cmd: Shows the memory usage of the processes that are currently running.
updatedb: Updates the locate database.
locate String: Very fast find routine.

Thursday, 27 January 2011

Possibly Useful

http://osticket.com/index.php
http://drupal.org/project/support
http://drupal.org/documentation/modules/statistics
http://drupal.org/project/coder - code verification for modules

Monday, 17 January 2011

MySQL

If the tar contains the binaries



groupadd mysql

useradd -r -g mysql mysql

cd /usr/local

ln -s /usr/local/src/msyql-VERSION mysql

cd mysql

chown -R mysql .

chgro -R mysql .

scripts/mysql_install.db --user=mysql

chown -R root .

chown -R mysql data

bin/mysqld_safe --user=mysql &

.bin/mysql_secure_installation

cp /usr/local/mysql/support-files/mysql.server /etc/init.d/mysql.server

Postgresql



./configure --prefix=/usr/local/pgsql

gmake

gmake install

adduser postgres

mkdir /usr/local/pgql/data

su - postgres

/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data

/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l logfile start

Testing



/usr/local/pgsql/bin/createdb test

/usr/local/pgsql/bin/psql test

Preparing for Drupal

Making sure Apache and MySQL run on start-up



chkconfig mysqld on

look at the rc.local file for apachectl

disable SE Linux

Copy drupal files into /var/www/html



cp default/default.settings.php default/settings.php

mkdir default/files

chmod a+w default/settings.php

chmod -R a+w default/files

Configuring Apache/PHP

./configure --prefix=/usr/local/apache --enable-so
make
make install
vi /usr/local/apache/conf/httpd.conf
Add the places from the webpages - usually this will not actually be in the same place as you are running apache and it is most commonly /var/www/html

/usr/local/apache/bin/apachectl -k start

Upgrading

copy the config.nice file from the working version of the server to the new version directory. Use a new prefix and a different port to test the configuration - change the Listen directive to specify the new port.

./config.nice

make

make install

usr/local/apache/bin/apachectl -k graceful-stop

usr/local/apache/bin/apachectl -k start

add a call to apachectl to the rc.local file so that it autostarts.

PHP



./configure --prefix=/usr/local/php \

--with-apxs2=/usr/local/apache/bin/apxs \

--with-config-file-path=/usr/local/php \

--with-gd \

--with-mysql=/usr/local/mysql/ \

--with-mysqli=mysqlnd \

--with-pdo-mysql=mysqlnd \

--with-pgsql=/usr/local/pgsql



make

/usr/local/apache/bin/apachectl stop

make install



Edit httpd.conf to add

LoadModule php5_module modules/libphp5.so

AddType application/x-httpd-php .php

<Files *.php>

SetOutputFilter PHP

SetInputFilter PHP

</Files>

Edit php.ini for pgsql and mysql lines removes the ; but not on the dll lines.

Sunday, 16 January 2011

Setting up MySQL

When you add MySQL to your Linux (in my case Fedora) system it has a root account but no password! This is not a very good idea and so the first thing that you need to do is create a root account and password. You do this using the mysqladmin tool as root.

# mysqladmin -u root password new_password

Where new_password is the new password (not just the word new_password).

You can now login to the MySQL using the root username and password and create a new account.

# mysql -u root -p
...
mysql> CREATE USER 'drupal'@'localhost' IDENTIFIED BY 'password';

You need to the exit (use quit) and use mysqladmin again to create the database.
# mysqladmin -u root -p create drupalbase
# mysql -u root -p
...
mysql> GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, INDEX, ALTER, LOCK TABLES, CREATE TEMPORARY TABLES ON drupalbase.* TO 'drupal'@'localhost' IDENTIFIED BY 'password';