Data Deluge: September 2013

Friday, 13 September 2013

Final Stages of Using Qiime for Ion Torrent Data

Once you have the rarefaction curves for within the sample you can now calculate them for between the bar-coded samples. This will give a measure of distance between the samples that can then be used for clustering.

beta_diversity_through_plots.py -i otus/otu_table.biom -m mapping.txt -o wf_bdiv_even4/ -t otus/rep_set.tre -e 4

Where the last term is a cutoff for the minimum number of reads in a sample (must be even) - this is the sampling depth. You need to try a series of cut-offs to find out which works best, but keep it less than the median. If you choose the median then values below the median are ignored (you will lose half of you data but if you are expecting a large number of blank results this might be OK) so only do this if you have lots of skew and the mean is much larger than the median.

This will give a series of 2D and 3D Principle Components plots that can be used to look for any clustering of the samples. Clusters should be visually identifiable and well defined.

Qiime can also carry out a jack-knife analysis to test the clustering dependence on omitting some of the samples. This produces a phylogenetic tree where like samples are on closely related nodes. Clusters are represented by clades (a group of leaves with a common root).

jackknifed_beta_diversity.py -i otus/otu_table.biom -t otus/rep_set.tre -m mapping.txt -o wf_jack50 -e 50

Finally you can make Biplots to show the most important axes between the clusters.

make_3d_plots.py -i wf_bdiv_even4/unweighted_unifrac_pc.txt -m mapping.txt -t wf_taxa_summary/otu_table_L3.txt --n_taxa_keep 5 -o 3d_biplot

The 5 tells the program how many top level taxa to display for comparison to the clusters.

Thursday, 12 September 2013

Using Qiime for Microbiome Analysis of Ion Torrent Data

First read my last post about getting Fasta and quality files from the Fastq files. If there are multiple samples in the run then there will be a single file containing all of the data with the different bar-codes attached.

The process to follow is the same as for 454 data.

You sample should be made up of read containing:

An adaptor (A) sequence
A barcode sequence
A linker primer sequence
The actual target sequence
A reverse primer sequence
An adaptor (B) sequence

First you need to create a mapping file telling Qiime about the barcodes, linkers adaptors and any experimental conditions you might want to add.

#SampleID BarcodeSequence LinkerPrimerSequence Treatment ReversePrimer Description
1 CTAAGGTAAC CCTACGGGAGGCAGCAG control ATTACCGCGGCTGCTGG oral swab

This file is easiest to create in a spreadsheet and then save it as a tab delimited text file. You should then check this file to make sure that it is in the right format.

check_id_map.py -m Mapping.txt -o mapping_output

If you are doing a properly blinded experiment then you should not know what the treatments are and so this should be just given a number or a simple code.

The first stage of the Qiime pipeline is splitting the sample by barcodes removing any low quality or ambiguous reads.

split_libraries.py -M3 -l12 -L250 -b variable_length -m Mapping.txt -f Example.fna -q Example.qual -o split_library_output

The command line arguments allow a mismatch in the primer of 3 a minimum sequence length of 12 and maximum of 250 and allow for variable barcode lengths. This minimum length is too short and for a real run with quality control it should be > 120.

It is important to check the output to make sure you are getting the number of sequences you expect from the data. The output is in the directory split_library_output.

Look at the histograms.txt file. If you get something like this then you have a problem.

# bins raw sequence lengths, length of sequences that pass quality filters before processing, and lengths of sequences that pass quality filters post processing.
Length Raw Before After
0 4602 0 28
10 130658 0 58
20 15425 10 55
30 46920 32 46
40 118695 58 28
50 95443 57 43
60 67658 39 102
70 65570 27 124
80 78656 54 99
90 101635 117 82
100 94073 114 108
110 99076 100 80
120 93229 76 116
130 88529 114 145
140 79631 94 506
150 79257 133 1991
160 114626 124 539
170 408446 1709 3027
180 768141 810 27
190 293266 1429 1
200 1051049 2103 1
210 31571 5 2
220 2318 0 0
230 1175 1 0
240 623 2 0
250 534 0 0
260 309 0 0
270 315 0 0
280 213 0 0
290 136 0 0
300 75 0 0
310 47 0 0
320 32 0 0
330 45 0 0
340 29 0 0
350 33 0 0
360 51 0 0
370 5 0 0

There might be a problem because the quality control is removing almost all of your data before it puts it into the different splits and removes the barcode sequences. In this case there is a lot of junk sequence because of short reads witjout primers. Looking at the top of the split_library_log.txt file makes this clear

Number raw input seqs 3932096
Length outside bounds of 12 and 250 26356
Num ambiguous bases exceeds limit of 6 0
Missing Qual Score 0
Mean qual score below minimum of 25 1222172
Max homopolymer run exceeds limit of 6 7327
Num mismatches in primer exceeds limit of 3: 2491654

Next Qiime will cluster the sequences together based on a threshold of 97% identity. It will then take a representative member of each cluster and compare it to the taxonomic database to create a list of Operational Taxonomic Units (OTUs). Qiime carries out all of the necessary steps including alignments and creating a clustering tree of the OTUs using the following command:

pick_de_novo_otus.py -i split_library_output/seqs.fna -o otus

You can see a summary of the output with the following command:

print_biom_table_summary.py -i otus/otu_table.biom

This is an example output for a sample with a small number of sequences

Num samples: 45
Num observations: 1313
Total count: 7180.0
Table density (fraction of non-zero values): 0.0461
Table md5 (unzipped): c3d9bcf2d50c7477f7eadd115f5a0b99

Counts/sample summary:
Min: 3.0
Max: 1831.0
Median: 53.0
Mean: 159.555555556
Std. dev.: 334.228834154

The OTUs can be visually displayed as a heat-map showing their relative abundance.

make_otu_heatmap_html.py -i otus/otu_table.biom -o otus/OTU_Heatmap/

This produces a webpage that can be hovered over or clicked on to provide more details about the samples.

Alternatively you can create a network for visualisation that can be viewed using Cytoscape (Networks are useful in some contexts but the relatedness of nodes here is unclear and so I prefer not to use this form of visualisation).

make_otu_network.py -m mapping.txt -i otus/otu_table.biom -o otus/OTU_Network

The OTUs are then used to cluster the samples. Samples which have the same OTUs present should be clustered together.

summarize_taxa_through_plots.py -i otus/otu_table.biom -o wf_taxa_summary -m mapping.txt

These can then be viewed as area charts under the wf_taxa_summary/taxa_summary_plots folder. The area charts are a bit of an awkward representation as they are shown as continuous plots and so the bar charts which are discrete should be used.

Ecologists calculate the within sample (alpha) and between samples (beta) diversities, to show the range of species found in the samples. If you want to use the Shannon Index you have to tell Qiime that you want to calculate it using the following command.

echo "alpha_diversity:metrics shannon,PD_whole_tree,chao1,observed_species" > alpha_params.txt

The diversity and rarefaction are then calculated using the following script.

alpha_rarefaction.py -i otus/otu_table.biom -m mapping.txt -o wf_arare/ -p alpha_params.txt -t otus/rep_set.tre

The curve is assymptotic to having a complete sampling of all of the species in the sample

Next Generation Sequencing for Microbiome Analysis - Getting Started

There are a number of online pipelines for the analysis of ribosomal samples for carrying out microbiome studies. Two of the most frequently used ones are:

RDP
MG-Rast

Another possibility is using a pipeline installed on a local machine. The weakness of this approach is that you need to keep the r-RNA databases up-to-date to make sure you get the best results. Possible locally installed pipelines are Pangaea and Qiime (pronounced chime).

An initial challenge can be that the files you get from Next Generation sequencing might not be compatible with the pipelines either because they have a different quality measure to that expected, or because the pipelines cannot work with fastq files.

Transforming Fastq Files

One tool for transforming fastq files is the FASTX-Toolkit. Another simpler and faster method but with much less functionality is to use BioPython (you need to install it first but that is relatively simple on Linux systems like Ubuntu). For the FASTX toolkit if you are using Ion Torrent data do not forget to include the -Q 33 flag to show that a quality string different to the default is being used.

The BioPython code for creating the fasta files from the fastq files is:

SeqIO.convert("filename.fastq", "fastq", "output.fasta", "fasta")

To create the accompanying quality file the command is:
SeqIO.convert("filename.fastq", "fastq", "output.qual", "qual")