Thursday 25 November 2010

Dante-R

Dante-R is a program for normalising mass spectrometry data. It is downloaded as an executable installer from the Pacific Northwest National Laboratory.

Note that this is not an R package archive that you manage from within R - it is a standalone executable. The one tricky point with installation is making sure that R is on your path. In my case, on Windows 7, the path had not been amended to include R and so I had to add it manually.

To do this I added the following:
;C:\Program Files (x86)\R\R-2.10.1\bin

You can edit the path from the Control Panel (System -> Advanced system settings -> Environment Variables).
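If you are unsure which directory to add, R itself can tell you where its binaries live; running the following inside an R session prints the bin directory to add (the exact path will depend on your R version):

R.home("bin")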

Once the path is added, the Dante-R installer will install all of the required R packages - in particular GTK support.

Dante-R has been used for unlabelled mass-spec data with large numbers of samples; in cases with fewer samples, as when processing labelled data, the methods might not be robust [private discussion with the author], and so this has to be checked.

Monday 25 October 2010

Annotation

One of the best features of Bioconductor is the built-in annotation of microarray probes. This makes it very easy to query a large number of databases to test out any potential biological explanations for a particular expression profile.

The following code snippet takes the top hits below a p-value threshold after correction for multiple testing (if Benjamini-Hochberg is used this threshold is the false discovery rate) and produces an HTML file that summarises the significant results.

library("KEGG.db")
library("GO.db")
library("annaffy")
library("XML")
library("annotate")
library("hgu133a.db")

afarms1<-topTable(fit2farms, coef=1, adjust="BH", n=100, p.value=0.01)
gnfarms <-as.character(afarms1$ID)
symfa1 <-getSYMBOL(gnfarms, "hgu133a")
symfa1
atest <- aafTableAnn(gnfarms, "hgu133a.db", aaf.handler())
saveHTML(atest, file="reportfarms1.html"
)

Sunday 24 October 2010

Linear Models and Testing

Perhaps the best reason for using R to build complex statistical models is the strength of its linear model packages, which make building complex models reasonably straightforward, even with multiple groups. This matters after cluster analysis, where you are unlikely to end up with only two clusters.

The package for constructing linear models from microarray data is limma.

The following code constructs a model for three different clusters (cluster 0 is used to label all the arrays that did not fall into one of the three identified clusters). First a design matrix is constructed, which defines which comparisons will be carried out. The basic model is then fitted, and a second model with the between-cluster contrasts is specified using the contrast matrix. This model is then fitted with empirical Bayes moderation of the standard errors (which is most effective at small sample sizes).

library(limma)
# Cluster membership as a factor; cluster 0 holds the unassigned arrays
f <- factor(f2LCgcrma1$Cluster, levels = c("0", "1", "2", "3"))
design <- model.matrix(~0 + f)
colnames(design) <- c("cl0", "cl1", "cl2", "cl3")
# Fit the basic model, then the pairwise contrasts between the three clusters
fitgcrma <- lmFit(f2LCgcrma1, design)
contrast.matrix <- makeContrasts(cl3-cl2, cl3-cl1, cl2-cl1, levels=design)
fit2gcrma <- contrasts.fit(fitgcrma, contrast.matrix)
fit2gcrma <- eBayes(fit2gcrma)
# Top-ranked genes for each contrast, BH-adjusted
tab1gcrma <- topTable(fit2gcrma, coef=1, adjust="BH")
tab2gcrma <- topTable(fit2gcrma, coef=2, adjust="BH")
tab3gcrma <- topTable(fit2gcrma, coef=3, adjust="BH")

Saturday 23 October 2010

Finding the Clusters

When is a cluster real, and when is it an artefact of the analysis or of noise in the data? You want consistency of the dendrograms between methods and also across different levels of filtering. Confidence in the clustering falls as contradictions accumulate between methods and between filtering cut-offs. Here are some ideas about what you should be looking at:
  1. What is the effect of the different normalisation methods? - compare rma with gcrma for example
  2. What is the effect of filter cut-off? - try samples of 2000 and 500 genes
  3. Is the clustering affected by using different distance measures? - Manhattan vs Euclidean (see the sketch after this list)
  4. Is the clustering affected by the clustering algorithm? - agglomerative or divisive
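
As a minimal sketch of the comparison in point 3 (assuming a filtered expression set such as fLCgcrma1 from the clustering posts below), the dendrograms from two distance measures can be drawn side by side and compared by eye:

library(Biobase)
e <- t(exprs(fLCgcrma1))   # arrays as rows, genes as columns
hc.euc <- hclust(dist(e, method = "euclidean"), method = "average")
hc.man <- hclust(dist(e, method = "manhattan"), method = "average")
par(mfrow = c(1, 2))
plot(hc.euc, main = "Euclidean", cex = 0.5)
plot(hc.man, main = "Manhattan", cex = 0.5)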

Friday 22 October 2010

Adding the Clustering to the Expression Data

The easiest way to add the clustering to the expression data is to edit a separate file which can be used to update the phenoData element of the filtered expression set. This will be needed for subsequent statistical testing, and it is easier to do it this way than to try to edit individual phenoData elements. The file has to be a text CSV file, which can be created in any spreadsheet program. This is then read into R and attached to each of the normalised and filtered data sets.


pd<- read.AnnotatedDataFrame("pheno_cluster.csv",sep="\t",quote = "\"'")
phenoData(fLCgcrma1)<-pd

Thursday 21 October 2010

Clustering the Arrays

If you are looking at biological variation you may want to cluster the arrays. When analysing a study of 54 tissue-culture arrays from lung cancer cell lines, the arrays were clustered after filtering, using Euclidean distances between the log2 expression levels.


dgcrma1 <- dist(log2(t(exprs(fLCgcrma1))),method="euclidian")
hcgcrma1 <- hclust(dgcrma1, method = "average")
plot(hcgcrma1)


The cluster dendrograms from the different normalisation methods can then be compared and a consensus devised. In this case RMA performed slightly worse than FARMS and GCRMA, as the clusters conserved between the other two methods showed some variation in RMA. The factor that makes the biggest difference to the ordering is the number of genes used in the clustering, so dendrograms are very sensitive to the filtering step.

Wednesday 20 October 2010

Filtering the Genes

Carrying out a large number of comparisons for all genes is impossible because of the problems with false positives and the need for correction terms for multiple testing. There is also the curse of dimensionality to consider if machine learning or clustering techniques are going to be used.

As the number of samples is much smaller than the number of genes, you have to filter the data before trying any statistical tests or machine learning. What you look for is gene variability: housekeeping genes should have the same expression levels in all cells and will show no variability. You also need a measure that is robust to outliers; the range (maximum expression in the dataset minus minimum expression in the dataset) is not robust on its own, so a ratio measure is also needed. This is a script that combines both the ratio and difference tests of gene expression variability and applies them using the genefilter package.


mmfilt1 <- function(r = 2, d = 2, na.rm = TRUE) {
function(x) {
minval <- min(x, na.rm = na.rm)
maxval <- max(x, na.rm = na.rm)
(maxval/minval > r) && (maxval - minval > d)
}
}
mmfun1 <- mmfilt1()
ffun1 <- filterfun(mmfun1)
frma1 <- genefilter(eLCrma1, ffun1)
sum(frma1)


You need to adjust r and d until sum(frma1) gives you the number of genes you want. In this case I wanted about 1500 probes, as I am looking for genes that differ between multiple groups.
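
For example, tightening both thresholds and re-counting (the values here are hypothetical):

ffun2 <- filterfun(mmfilt1(r = 3, d = 2.5))
frma2 <- genefilter(eLCrma1, ffun2)
sum(frma2)   # repeat until this is close to the target of ~1500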

The output of this routine is a logical vector: it lists the probes, with TRUE if they are to be included and FALSE if not. It has to be applied to the original expression set to get the filtered expression data.

fLCrma <-LCrma[frma, ]

Tuesday 19 October 2010

Normalisation - Detailed or Quick and Dirty

There is always the question of which method is best for normalisation. There is no firm rule, and different reviews have suggested different preferences. For Affymetrix arrays, MAS 5.0 is definitely not the best normalisation technique.

The problem is that, while there are hundreds of different ways of normalising data, the proof comes at the end of the analysis, when you find interesting biology. So it is probably best to run a quick and simple normalisation such as RMA and then focus on the later stages of the analysis.

In this case I have two sets of Affymetrix data, raw and raw1, and I processed them with rma, gcrma and farms to get six normalised datasets.

library("gcrma")
library("farms")
LCrma <- rma(raw)
LCrma1 <- rma(raw1)
LCgcrma1 <- gcrma(raw1)
LCgcrma <- gcrma(raw)
LCfarms <- q.farms(raw)
LCfarms1 <- q.farms(raw1)


You can then look at histograms and boxplots of the normalised data to see how well the normalisation has performed.

eLCfarms1<-exprs(LCfarms1)
hist(eLCfarms1)
boxplot(eLCfarms1, outline = FALSE, col="lightblue")


Monday 18 October 2010

NUSE and RLE Plots

These are diagnostic plots that can be created with the affyPLM package.

NUSE is the Normalised Unscaled Standard Error and RLE is the Relative Log Expression. These comparisons only work within a dataset - the same arrays and the same experimental protocol - not between datasets. Arrays are outliers if they have very different NUSE values from all of the others, or if they have a wider distribution.

library("affyPLM")
dataPLM <- fitPLM(raw)
boxplot(dataPLM, main="NUSE",ylim = c(0.95, 1.10),
  outline = FALSE, col="lightblue",las=3,
  whisklty=0, staplelty=0)

Mbox(dataPLM, main="RLE", ylim = c(-0.4, 0.4),
  outline = FALSE, col="mistyrose", las=3,
  whisklty=0, staplelty=0)

In my particular case one array, GSM372776.CEL, has poor values for both RLE and NUSE. This can be removed from subsequent analysis by creating a new object without it.

badArray <- match("GSM372776.CEL", sampleNames(raw))
raw1 <- raw[, -badArray]

Sunday 17 October 2010

Measuring Between Array Distances

This will show you if you have any arrays where the median expression levels are very different from one another. This might indicate problems with annealing and reproducibility.


library("genefilter")
library("latticeExtra")
# Pairwise between-array distances on the log2 scale
dd <- dist2(log2(exprs(raw)))
diag(dd) <- 0
dd.row <- as.dendrogram(hclust(as.dist(dd)))
row.ord <- order.dendrogram(dd.row)
# Heatmap of the distance matrix with the dendrogram drawn along the top
legend <- list(top=list(fun=dendrogramGrob,
    args=list(x=dd.row, side="top")))
lp <- levelplot(dd[row.ord, row.ord],
    scales=list(cex=0.5, x=list(rot=90)), xlab="",
    ylab="", legend=legend)
lp


For my dataset this gave the following output.

Saturday 16 October 2010

Quality Control of Microarrays

Once you have your raw data it is important to look at some array quality metrics. These will tell you whether there are any problems with the array experiments, such as problems with the arrays themselves or with sample degradation. Each microarray manufacturer has its own set of built-in tests, which are described in the array documentation.

To use the Bioconductor packages for assessing microarray quality you need to install the arrayQualityMetrics package.

biocLite("arrayQualityMetrics")
library(arrayQualityMetrics)
arrayQualityMetrics(expressionset = normalized_rma, outdir = "QCnorm", spatial = FALSE,
+ force = TRUE, intgroup = "Type", grouprep = TRUE)


This is limited by memory, which is why I limit the number of CEL files to the first 20 (as in the qc() call below). The processing then takes a while to run all the diagnostics. It worked in R 2.10, but with R 2.12 and Bioconductor 2.7 there is a problem in that it does not automatically load the required libraries.

There are also quality control measures for Affymetrix data in the simpleaffy package. This is run with the function qc().

library("simpleaffy")
rawqc <- qc(raw[,1:20])
plot (rawqc, cex=0.5)

This shows that there is an appreciable amount of RNA degradation on many of the arrays, and this is affecting the 3'/5' ratio of actin. Many of the actin measures are actually outliers. This might suggest problems with the experimental protocol used to set up the arrays.
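
The numbers behind the plot can also be inspected directly; ratios() is the simpleaffy accessor for the 3'/5' ratios on the QCStats object (the exact column names may vary between versions):

head(ratios(rawqc))   # columns include the actin and GAPDH 3'/5' log ratios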

Another alternative is the affyQCReport package, which produces a PDF report of the quality control data.

library("arrayQualityMetrics")
QCReport(raw[,1:20])

Friday 15 October 2010

Micro-array Normalisation

The Affycomp II benchmarking study of GeneChip expression measures showed that, whilst RMA outperformed MAS 5.0, on certain measures GCRMA and FARMS outperformed both [Microarray data analysis: from disarray to consolidation and consensus, D.B. Allison, X. Cui, G.P. Page and M. Sabripour, Nature Reviews Genetics (2006) 7, 55-65].

It is possible that the appropriate method for normalisation will be dataset- and biological-question-dependent: in some cases expression will be much stronger than in others, and in some cases differences will be small, so more sensitive methods are needed at the risk of losing specificity.

Luckily FARMS and GCRMA are both available as R packages, so it is relatively easy to process the arrays using all of the potential methods; the normalised values can then be taken forward to the subsequent analysis, where their different effects can be evaluated.

Thursday 14 October 2010

A deeper look at the underlying statistical models for high-throughput data.

Rocke and Durbin 2001


The additive-multiplicative model for microarray data was first presented by Rocke and Durbin in 2001 [A Model for Measurement Error for Gene Expression Arrays, D.M. Rocke and B. Durbin, J. Comp. Biol. 8, 557-569]. They based their model on the error models used in analytical chemistry, such as in gas chromatography and mass spectrometry. The model they give for general concentration-based assays is:

y = α + β μ exp(η) + ε

Where y is the response, μ is the concentration, and η and ε are normally distributed random variables with mean 0 and standard deviations σ(η) and σ(ε) respectively.

Consider how the terms ε and η contribute to the experimental data: the additive error ε will be the more significant component when concentrations are close to zero, whereas the multiplicative error η will be more significant at high concentrations.
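
A small simulation makes this concrete; the parameter values are invented purely for illustration. At low concentration the spread of y is dominated by the additive error ε, and at high concentration by the multiplicative error η:

set.seed(1)
alpha <- 50; s.eta <- 0.1; s.eps <- 20   # assumed, illustrative values
sim <- function(mu, n = 10000)
    alpha + mu * exp(rnorm(n, 0, s.eta)) + rnorm(n, 0, s.eps)
sd(sim(1))       # ~20: essentially all additive error
sd(sim(10000))   # ~1000: essentially all multiplicative error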

The concentrations are then calculated by using a calibration curve to determine β, but in the case of microarrays, where there are two-channel arrays or where the experiment is determining differences in expression, there is no need for an absolute calibration and so β can be ignored.

This gives a three-parameter model:

y = α + μ exp(η) + ε

The parameters can then be estimated either by the use of technical replicates or by the introduction of negative controls. The error ε becomes the B term of the general microarray model.

Ideker et al.2000


Ideker et al. (2000) combined α and μ [Testing for Differentially-Expressed Genes by Maximum-Likelihood Analysis of Microarray Data, T. Ideker, V. Thorsson, A.F. Siegel and L. Hood, J. Comp. Biol. 7, 805-817] by writing the model in the form:

y = μ + μ ε + δ

Where μ is the mean intensity and ε and δ are the multiplicative and additive errors respectively. In this form the μ terms are combined, which is equivalent to dropping the α term from Rocke and Durbin. This gives the most commonly used form of the error model.

The multiplicative error ε can then be decomposed into different components.

Tuesday 12 October 2010

Quantile normalisation

Quantile normalisation is very important as it has been widely applied to high-throughput techniques, not only to microarrays and transcriptomics but also to proteomics. The most cited paper is:

Bolstad BM, Irizarry RA, Astrand M, and Speed T. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2) 185-193.

This paper describes quantile normalisation and uses spike-in data to evaluate the effects of the normalisation process. If normalisation has not distorted the data in any way, then the graph of spiked-in concentration vs level of expression (on the log scale) should have a gradient of 1. The linear model that is fitted is:

log2(E)= β0 + β1 log2(c) + ε

Note that this is a linear model on the logarithmic scale, so the errors in the untransformed data are not additive but multiplicative.
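
As a minimal sketch of this fit, with simulated spike-in data (conc and expr are hypothetical), a slope near 1 indicates that normalisation has not distorted the concentration-response relationship:

conc <- rep(2^(0:5), each = 3)                    # nominal concentrations
expr <- conc * exp(rnorm(length(conc), 0, 0.2))   # simulated multiplicative noise
fit <- lm(log2(expr) ~ log2(conc))
coef(fit)   # beta0 (intercept) and beta1 (slope, expected to be near 1)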

Tuesday 14 September 2010

Underlying Statistical Model

The intensity Y of a single probe is given by:

Y = B + αS

Where B is the background noise, composed of many different terms including optical effects and non-specific binding (this is important, as it means B will be non-systematic and so approximately normally distributed, by the central limit theorem). α is a gain factor and S is the specific binding. S will also be a random variable, to account for probe effects and measurement error, and these errors are assumed to be multiplicative:

log(S) = θ + φ + ε

The variability of B and α between arrays means that even if S is the same it will not appear to be the same. This can be demonstrated using spiked-in amounts of a gene. There is attenuation bias - low concentrations are more strongly affected by background. This can be seen from the spikein133 example in the SpikeInSubset package.


library("affy")
library("SpikeInSubset")
data(spikein133)
Index <- which(probeNames(spikein133)%in%colnames(pData(spikein133)))
pms <- pm(spikein133)[Index,]
genenames <- probeNames(spikein133)[Index]
nominal <- t(sapply(probeNames(spikein133)[Index],function(i) pData(spikein133)[,i]))
x <- as.vector(log2(nominal))
y <- as.vector(log2(pms))
avg <- tapply(y,x,mean,na.rm=TRUE)


This extracts the data subset we want to plot from SpikeInSubset and scales it to log2. It is then plotted with:

plot(jitter(x), y, las=1, pch=".",
    ylab="Log (Base 2) PM Intensity",
    xlab="Nominal Log (Base 2) Concentration")
lines(as.numeric(names(avg)), avg, lwd=3, col="green")

The additive-multiplicative model is supported by taking a large series of probe intensities at a given nominal concentration (1 picomolar) and comparing their quantiles against those of a normal distribution. The resulting QQ-plot lies mostly on the diagonal, confirming that the experimental intensities are approximately normally distributed (this could also have been checked with a KS test).

This is plotted with:

qqnorm(y[x==0], pch=".")
qqline(y[x==0], col="red")
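
The KS check mentioned above could be done, as a rough sketch, by standardising the intensities and comparing them to a standard normal (estimating the mean and standard deviation from the same data makes the test somewhat liberal):

ks.test(as.vector(scale(y[x == 0])), "pnorm")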

Monday 6 September 2010

Getting Bioconductor

The Bioconductor Homepage.

Once you have R installed:
source("http://bioconductor.org/biocLite.R")
biocLite()

To install other packages from Bioconductor use the format:
> biocLite("SpikeInSubset")

Sunday 5 September 2010

Where you should start

I should say before I begin this blog that it is my set of notes and reminders to myself for my own research into high-throughput biological analysis, and that most of the code and examples come from the excellent book:

Bioinformatics and Computational Biology Solutions Using R and Bioconductor by Robert Gentleman, Vincent Carey, Wolfgang Huber, Rafael Irizarry and Sandrine Dudoit.

This is an invaluable resource and you should buy it.

Friday 25 June 2010

Matlab Systems Biology Toolbox

COBRA Toolbox

Spatial models are important and so we need 3D reconstruction of pathways. Plasticity gives flexibility to genomes in adaptive terms.

Group publications

Sunday 7 February 2010

PC Building

Latex

Bioinformatics Tools

AmiGo
Gene Ontology
Biok - Biology Interactive Object Kit (Alternative Scripting to BioPerl)
Little b (Language for building mathematical models of complex systems)
BioRuby
TFBS - Perl modules for Transcription Factor Binding Sites
KEGG-SOAP (API for accessing KEGG)
R'MES - word finder
J.Craig Venter Institute Software
AceDB - Classic Bioinformatics Database System
Taverna Bioinformatics Workbench
NPZVisualizer (dynamics of coastal marine ecosystems)
TreeView
LAGAN (alignment toolkit)
Praline (sequence alignment server)
Biorecipes (Bioinformatics code in Darwin)
Gromacs (Molecular Dynamics)
GOR V (Secondary Structure Prediction)
Desmond Google User Group
EMBOSS
BioJava
CryptoDB (Cryptosporidium Genome Resource)
Correspondence Analysis of Codon Usage
CaBIO (Cancer Ontology)
BioMart
MDTools (NIH Resource)
Biobar (Firefox plugin for biological database browsing)
SCWRL (Protein Side-chain Conformations)
PlasmoDB
Sanger Institute Software
Kinemage
Networks at the Gerstein Lab
BioDAS
UCSF Dock
Autodock
Basis Software (University of Newcastle Systems Biology)
Affymetrix Support Software