First read my last post about getting Fasta and quality files from the Fastq files. If there are multiple samples in the run then there will be a single file containing all of the data with the different bar-codes attached.
The process to follow is the same as for
454 data.
You sample should be made up of read containing:
- An adaptor (A) sequence
- A barcode sequence
- A linker primer sequence
- The actual target sequence
- A reverse primer sequence
- An adaptor (B) sequence
First you need to create a mapping file telling Qiime about the barcodes, linkers adaptors and any experimental conditions you might want to add.
- #SampleID BarcodeSequence LinkerPrimerSequence Treatment ReversePrimer Description
- 1 CTAAGGTAAC CCTACGGGAGGCAGCAG control ATTACCGCGGCTGCTGG oral swab
This file is easiest to create in a spreadsheet and then save it as a tab delimited text file. You should then check this file to make sure that it is in the right format.
check_id_map.py -m Mapping.txt -o mapping_output
If you are doing a properly blinded experiment then you should not know what the treatments are and so this should be just given a number or a simple code.
The first stage of the Qiime pipeline is splitting the sample by barcodes removing any low quality or ambiguous reads.
split_libraries.py -M3 -l12 -L250 -b variable_length -m Mapping.txt -f Example.fna -q Example.qual -o split_library_output
The command line arguments allow a mismatch in the primer of 3 a minimum sequence length of 12 and maximum of 250 and allow for variable barcode lengths. This minimum length is too short and for a real run with quality control it should be > 120.
It is important to check the output to make sure you are getting the number of sequences you expect from the data. The output is in the directory split_library_output.
Look at the histograms.txt file. If you get something like this then you have a problem.
# bins raw sequence lengths, length of sequences that pass quality filters before processing, and lengths of sequences that pass quality filters post processing.
Length Raw Before After
0 4602 0 28
10 130658 0 58
20 15425 10 55
30 46920 32 46
40 118695 58 28
50 95443 57 43
60 67658 39 102
70 65570 27 124
80 78656 54 99
90 101635 117 82
100 94073 114 108
110 99076 100 80
120 93229 76 116
130 88529 114 145
140 79631 94 506
150 79257 133 1991
160 114626 124 539
170 408446 1709 3027
180 768141 810 27
190 293266 1429 1
200 1051049 2103 1
210 31571 5 2
220 2318 0 0
230 1175 1 0
240 623 2 0
250 534 0 0
260 309 0 0
270 315 0 0
280 213 0 0
290 136 0 0
300 75 0 0
310 47 0 0
320 32 0 0
330 45 0 0
340 29 0 0
350 33 0 0
360 51 0 0
370 5 0 0
There might be a problem because the quality control is removing almost all of your data before it puts it into the different splits and removes the barcode sequences. In this case there is a lot of junk sequence because of short reads witjout primers. Looking at the top of the split_library_log.txt file makes this clear
Number raw input seqs 3932096
Length outside bounds of 12 and 250 26356
Num ambiguous bases exceeds limit of 6 0
Missing Qual Score 0
Mean qual score below minimum of 25 1222172
Max homopolymer run exceeds limit of 6 7327
Num mismatches in primer exceeds limit of 3: 2491654
Next Qiime will cluster the sequences together based on a threshold of 97% identity. It will then take a representative member of each cluster and compare it to the taxonomic database to create a list of Operational Taxonomic Units (OTUs). Qiime carries out all of the necessary steps including alignments and creating a clustering tree of the OTUs using the following command:
pick_de_novo_otus.py -i split_library_output/seqs.fna -o otus
You can see a summary of the output with the following command:
print_biom_table_summary.py -i otus/otu_table.biom
This is an example output for a sample with a small number of sequences
Num samples: 45
Num observations: 1313
Total count: 7180.0
Table density (fraction of non-zero values): 0.0461
Table md5 (unzipped): c3d9bcf2d50c7477f7eadd115f5a0b99
Counts/sample summary:
Min: 3.0
Max: 1831.0
Median: 53.0
Mean: 159.555555556
Std. dev.: 334.228834154
The OTUs can be visually displayed as a heat-map showing their relative abundance.
make_otu_heatmap_html.py -i otus/otu_table.biom -o otus/OTU_Heatmap/
This produces a webpage that can be hovered over or clicked on to provide more details about the samples.
Alternatively you can create a network for visualisation that can be viewed using Cytoscape (Networks are useful in some contexts but the relatedness of nodes here is unclear and so I prefer not to use this form of visualisation).
make_otu_network.py -m mapping.txt -i otus/otu_table.biom -o otus/OTU_Network
The OTUs are then used to cluster the samples. Samples which have the same OTUs present should be clustered together.
summarize_taxa_through_plots.py -i otus/otu_table.biom -o wf_taxa_summary -m mapping.txt
These can then be viewed as area charts under the wf_taxa_summary/taxa_summary_plots folder. The area charts are a bit of an awkward representation as they are shown as continuous plots and so the bar charts which are discrete should be used.
Ecologists calculate the within sample (alpha) and between samples (beta) diversities, to show the range of species found in the samples. If you want to use the Shannon Index you have to tell Qiime that you want to calculate it using the following command.
echo "alpha_diversity:metrics shannon,PD_whole_tree,chao1,observed_species" > alpha_params.txt
The diversity and rarefaction are then calculated using the following script.
alpha_rarefaction.py -i otus/otu_table.biom -m mapping.txt -o wf_arare/ -p alpha_params.txt -t otus/rep_set.tre
The curve is assymptotic to having a complete sampling of all of the species in the sample