Friday 13 September 2013

Final Stages of Using Qiime for Ion Torrent Data

Once you have the within-sample (alpha) rarefaction curves you can calculate the diversity between the bar-coded samples (beta diversity). This gives a measure of the distance between the samples that can then be used for clustering.

beta_diversity_through_plots.py -i otus/otu_table.biom -m mapping.txt -o wf_bdiv_even4/ -t otus/rep_set.tre -e 4

Where the last argument (-e) is the sampling depth: every sample is rarefied down to this number of reads so that they are all evenly sampled, and samples with fewer reads are excluded. You need to try a series of cut-offs to find out which works best, but keep it less than the median number of reads per sample. If you choose the median then samples below the median are dropped (you will lose half of your data, but if you are expecting a large number of blank results this might be OK), so only do this if you have a lot of skew and the mean is much larger than the median.
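
One way to compare a few depths is to loop over them in the shell. This is a minimal sketch; the depth values are only illustrative and should be picked from your own per-sample read counts:

# Run the beta diversity workflow at several even-sampling depths
# so the resulting plots can be compared side by side.
for depth in 4 10 25 50; do
    beta_diversity_through_plots.py -i otus/otu_table.biom -m mapping.txt \
        -o wf_bdiv_even${depth}/ -t otus/rep_set.tre -e ${depth}
done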

This will give a series of 2D and 3D principal coordinates (PCoA) plots that can be used to look for any clustering of the samples. Clusters should be visually identifiable and well defined.

Qiime can also carry out a jack-knife analysis to test how robust the clustering is to repeated subsampling of the data. This produces a hierarchical clustering tree in which similar samples sit on closely related nodes. Clusters are represented by clades (a group of leaves with a common root).

jackknifed_beta_diversity.py -i otus/otu_table.biom -t otus/rep_set.tre -m mapping.txt -o wf_jack50 -e 50

Finally you can make biplots, which overlay the most abundant taxa on the principal coordinates plot so you can see which taxa separate the clusters.

make_3d_plots.py -i wf_bdiv_even4/unweighted_unifrac_pc.txt -m mapping.txt -t wf_taxa_summary/otu_table_L3.txt --n_taxa_keep 5 -o 3d_biplot

The --n_taxa_keep 5 argument tells the program how many of the most abundant taxa to display for comparison with the clusters.

Thursday 12 September 2013

Using Qiime for Microbiome Analysis of Ion Torrent Data

First read my last post about getting Fasta and quality files from the Fastq files. If there are multiple samples in the run then there will be a single file containing all of the data with the different bar-codes attached.

The process to follow is the same as for 454 data.

Your sample should be made up of reads containing:

  1. An adaptor (A) sequence
  2. A barcode sequence
  3. A linker primer sequence 
  4. The actual target sequence
  5. A reverse primer sequence
  6. An adaptor (B) sequence

First you need to create a mapping file telling Qiime about the barcodes, linkers, adaptors and any experimental conditions you might want to add.

#SampleID BarcodeSequence LinkerPrimerSequence Treatment ReversePrimer Description
1 CTAAGGTAAC CCTACGGGAGGCAGCAG control ATTACCGCGGCTGCTGG oral swab

This file is easiest to create in a spreadsheet and then save as a tab-delimited text file. You should then check the file to make sure that it is in the right format.

check_id_map.py -m Mapping.txt -o mapping_output

If you are doing a properly blinded experiment then you should not know what the treatments are, so the treatment should just be given a number or a simple code.

The first stage of the Qiime pipeline is splitting the data by barcode and removing any low quality or ambiguous reads.

split_libraries.py -M3 -l12 -L250 -b variable_length -m Mapping.txt -f Example.fna -q Example.qual -o split_library_output

The command line arguments allow a mismatch in the primer of 3, a minimum sequence length of 12, a maximum length of 250, and variable-length barcodes. This minimum length is far too short; for a real run with quality control it should be > 120.
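
For a real run you might therefore use something like the following; the cut-offs here are only a suggestion and should be adapted to your read-length distribution:

split_libraries.py -M 3 -l 120 -L 250 -b variable_length -m Mapping.txt -f Example.fna -q Example.qual -o split_library_output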

It is important to check the output to make sure you are getting the number of sequences you expect from the data. The output is in the directory split_library_output.
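
A quick sanity check is to count the reads that survived filtering (seqs.fna is the demultiplexed output used in the next step):

grep -c '>' split_library_output/seqs.fna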

Look at the histograms.txt file. If you get something like this then you have a problem.

# bins raw sequence lengths, length of sequences that pass quality filters before processing, and lengths of sequences that pass quality filters post processing.
Length Raw Before After
0 4602 0 28
10 130658 0 58
20 15425 10 55
30 46920 32 46
40 118695 58 28
50 95443 57 43
60 67658 39 102
70 65570 27 124
80 78656 54 99
90 101635 117 82
100 94073 114 108
110 99076 100 80
120 93229 76 116
130 88529 114 145
140 79631 94 506
150 79257 133 1991
160 114626 124 539
170 408446 1709 3027
180 768141 810 27
190 293266 1429 1
200 1051049 2103 1
210 31571 5 2
220 2318 0 0
230 1175 1 0
240 623 2 0
250 534 0 0
260 309 0 0
270 315 0 0
280 213 0 0
290 136 0 0
300 75 0 0
310 47 0 0
320 32 0 0
330 45 0 0
340 29 0 0
350 33 0 0
360 51 0 0
370 5 0 0

The problem here is that the quality control is removing almost all of the data before it is split into the different samples and the barcode sequences are removed. In this case there is a lot of junk sequence from short reads without primers. Looking at the top of the split_library_log.txt file makes this clear:

Number raw input seqs 3932096
Length outside bounds of 12 and 250 26356
Num ambiguous bases exceeds limit of 6 0
Missing Qual Score 0
Mean qual score below minimum of 25 1222172
Max homopolymer run exceeds limit of 6 7327
Num mismatches in primer exceeds limit of 3: 2491654

Next Qiime will cluster the sequences together based on a threshold of 97% identity. It will then take a representative member of each cluster and compare it to the taxonomic database to create a list of Operational Taxonomic Units (OTUs). Qiime carries out all of the necessary steps including alignments and creating a clustering tree of the OTUs using the following command:

pick_de_novo_otus.py -i split_library_output/seqs.fna -o otus
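
On a multi-core machine the workflow can be run in parallel. A sketch, assuming your Qiime version supports the parallel flags (-a switches on parallel mode, -O sets the number of jobs):

pick_de_novo_otus.py -i split_library_output/seqs.fna -o otus -a -O 4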

You can see a summary of the output with the following command:

print_biom_table_summary.py -i otus/otu_table.biom

This is an example of the output for a run with a small number of sequences:

Num samples: 45
Num observations: 1313
Total count: 7180.0
Table density (fraction of non-zero values): 0.0461
Table md5 (unzipped): c3d9bcf2d50c7477f7eadd115f5a0b99

Counts/sample summary:
 Min: 3.0
 Max: 1831.0
 Median: 53.0
 Mean: 159.555555556
 Std. dev.: 334.228834154

The OTUs can be visually displayed as a heat-map showing their relative abundance.

make_otu_heatmap_html.py -i otus/otu_table.biom -o otus/OTU_Heatmap/

This produces a webpage where hovering over or clicking on the cells gives more details about the samples.

Alternatively you can create a network for visualisation that can be viewed using Cytoscape (Networks are useful in some contexts but the relatedness of nodes here is unclear and so I prefer not to use this form of visualisation).

make_otu_network.py -m mapping.txt -i otus/otu_table.biom -o otus/OTU_Network

The OTUs are then used to cluster the samples. Samples which have the same OTUs present should be clustered together.

summarize_taxa_through_plots.py -i otus/otu_table.biom -o wf_taxa_summary -m mapping.txt

These can then be viewed as area charts under the wf_taxa_summary/taxa_summary_plots folder. The area charts are an awkward representation because they are drawn as continuous plots, so the bar charts, which are discrete, should be used instead.

Ecologists calculate the within-sample (alpha) and between-sample (beta) diversities to show the range of species found in the samples. The Shannon index is not calculated by default, so if you want it you have to tell Qiime using the following command.

echo "alpha_diversity:metrics shannon,PD_whole_tree,chao1,observed_species" > alpha_params.txt

The diversity and rarefaction are then calculated using the following script.

alpha_rarefaction.py -i otus/otu_table.biom -m mapping.txt -o wf_arare/ -p alpha_params.txt -t otus/rep_set.tre

The curve approaches an asymptote as the sampling approaches complete coverage of all of the species in the sample.
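
The workflow writes the curves to an HTML page that you can open in a browser; the path below is what the Qiime 1 versions I have used produce, so check your output directory if it differs:

firefox wf_arare/alpha_rarefaction_plots/rarefaction_plots.html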


Next Generation Sequencing for Microbiome Analysis - Getting Started

There are a number of online pipelines for the analysis of ribosomal sequence data in microbiome studies. Another possibility is to use a pipeline installed on a local machine. The weakness of this approach is that you need to keep the rRNA databases up to date to make sure you get the best results. Possible locally installed pipelines are Pangaea and Qiime (pronounced chime).

An initial challenge can be that the files you get from next generation sequencing are not compatible with the pipelines, either because they use a different quality encoding to the one expected, or because the pipelines cannot work with fastq files.

Transforming Fastq Files

One tool for transforming fastq files is the FASTX-Toolkit. Another simpler and faster method, but with much less functionality, is to use BioPython (you need to install it first, but that is relatively simple on Linux systems like Ubuntu). For the FASTX-Toolkit, if you are using Ion Torrent data do not forget to include the -Q 33 flag to show that the quality scores use the Sanger (Phred+33) offset rather than the tool's default.
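
As a sketch, the toolkit's fastq_to_fasta tool does the conversion (the file names here are placeholders):

fastq_to_fasta -Q 33 -i filename.fastq -o output.fasta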

The BioPython code for creating the fasta files from the fastq files is:

from Bio import SeqIO
SeqIO.convert("filename.fastq", "fastq", "output.fasta", "fasta")


To create the accompanying quality file the command is:
SeqIO.convert("filename.fastq", "fastq", "output.qual", "qual")

Tuesday 8 January 2013

R in Ubuntu

Keeping up to Date

One of the most frustrating things about using distribution packages rather than installing from source is making sure that the programs are up to date. This is especially true for rapidly changing software such as R and Bioconductor. So the first thing needed is to add the CRAN repository to those searched for updates.

You can do this by adding the following to the /etc/apt/sources.list file using your favourite editor (do not forget to use sudo)

deb http://cran.ma.imperial.ac.uk/bin/linux/ubuntu precise/
deb-src http://cran.ma.imperial.ac.uk/bin/linux/ubuntu precise/
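
After adding the repository, refresh the package list and install (or upgrade) R itself:

sudo apt-get update
sudo apt-get install r-base r-base-dev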

For R to install the rgl package properly you also need to make sure that you have the OpenGL headers installed (gl.h and glu.h). You can do this by installing the following packages.

sudo apt-get install build-essential libsdl1.2debian libsdl1.2-dev libgl1-mesa-dev libglu1-mesa-dev libsdl-image1.2 libsdl-image1.2-dev

When you install the packages it is best to do so as root, as some core packages need to be updated, so run R with sudo:

sudo R

install.packages("multicore")
install.packages("sem")
install.packages("leaps")

install.packages("e1071")
install.packages("aplpack")

Bioconductor

You need the curl and XML development libraries to connect to databases.

sudo apt-get install libcurl4-gnutls-dev
sudo apt-get install libxml++2.6-dev


source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite("ShortRead")
biocLite("Rsamtools")
biocLite("nucleR")
biocLite("SRAdb")

biocLite("edgeR")


Monday 23 May 2011

Annotating Next Generation Sequencing

To annotate a vcf file using vcftools you first need to create a file containing the annotations.

This should have the following format:


#CHR FROM TO Annotation
5 53719508 53936990 ENSGALG00000011590; gene_name=TMEM179; gene_type=KNOWN_protein_coding
5 54024011 54038102 ENSGALG00000011608; gene_name=INF2; gene_type=KNOWN_BY_PROJECTION_protein_coding
5 54073641 54096083 ENSGALG00000011618; gene_name=ADSSL1; gene_type=KNOWN_protein_coding


I got this file by editing a GFF file for the region of interest. The file then needs to be compressed with bgzip and indexed with tabix. If you edit a GFF you will need a separate annotation file for each feature type, as otherwise the entries will not appear in position order in the file.
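
If tabix complains that the file is not sorted, you can sort the body on chromosome and start coordinate before compressing. A sketch assuming the CHR/FROM/TO layout shown above:

# Keep the header line at the top, sort the rest by chromosome then start.
(grep '^#' annotate.gff; grep -v '^#' annotate.gff | sort -k1,1 -k2,2n) > annotate.sorted.gff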


bgzip annotate.gff
tabix -p gff annotate.gff.gz


Then the annotations can be added to the file.


zcat input.vcf.gz | vcf-annotate -a annotate.gff.gz -d key=INFO,ID=ANN,Number=1,Type=String,Description='My custom annotation' -c CHROM,FROM,TO,INFO/ANN > out.vcf


Once the file is annotated you can use grep to pull out the lines with the required annotation.
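
For example, to pull out the lines tagged with one of the genes from the annotation file above:

grep 'gene_name=INF2' out.vcf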

Next Generation Sequencing

The problem is combining variant calls from two different species against the same reference genome in order to find the variants that differ between the species.

First you need to edit the filtered vcf files to remove the sample names, as the program is intended to find differences between samples of the same species.


vi varX.flt.vcf
vi varY.flt.vcf
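
If you would rather not edit the files interactively, sed can rename the sample column in the #CHROM header line instead; your_sample_name is a placeholder for whatever appears in your files:

sed -i '/^#CHROM/ s/your_sample_name/sample/' varX.flt.vcf
sed -i '/^#CHROM/ s/your_sample_name/sample/' varY.flt.vcf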


Next you need to use bgzip and tabix (from Heng Li's tabix package) to get a compressed and indexed data file.


bgzip varX.flt.vcf
bgzip varY.flt.vcf
tabix -p vcf varX.flt.vcf.gz
tabix -p vcf varY.flt.vcf.gz


Next you can use the vcftools script vcf-isec to find the complements of the two datasets. These will be the variants that are unique to each species.


vcf-isec -c varX.flt.vcf.gz varY.flt.vcf.gz | bgzip -c > unique_varX.vcf.gz
vcf-isec -c varY.flt.vcf.gz varX.flt.vcf.gz | bgzip -c > unique_varY.vcf.gz


You can also create a Venn diagram of the overlap of variants between the different species.


vcf-compare varX.flt.vcf.gz varY.flt.vcf.gz > venn.out


And also look at the variants that overlap between the two sets.


vcf-isec -o -n +2 varX.flt.vcf.gz varY.flt.vcf.gz | bgzip -c > overlap.vcf.gz

Thursday 5 May 2011

Permutating miRBase

You can also use seqinr to permute sequences.

For the miRBase example the code is:

library(seqinr)  # provides permutation(); mature comes from reading the miRBase fasta with read.fasta()
permutation(mature[[1]], modele = 'base', frame = 0)