Thursday, 25 November 2010

Dante-R

Dante-R is a program for normalising mass spectrometry data. It is downloaded as an executable installer from the Pacific Northwest National Laboratory.

Note that this is not an R zip file that you have to manage from within R - this is an executable. The one tricky point with installation is making sure that R is on your path. In my case on Windows 7 the path had not been amended to include R, so I had to add it manually.

To do this I added the following:
;C:\Program Files (x86)\R\R-2.10.1\bin

You get to the path from the Control Panel, under System > Advanced system settings > Environment Variables.

Once the path is added the Dante-R installer will then install all of the required R packages - in particular GTK support.

Dante-R has been used for unlabelled mass-spec data with large numbers of samples. In cases where there are fewer samples, as when processing labelled data, the methods might not be robust [private discussion with the author], so this has to be checked.

Monday, 25 October 2010

Annotation

One of the best features of Bioconductor is the built-in annotation of microarray probes. This makes it very easy to query a large number of databases to test out any potential biological explanations for a particular expression profile.

The following code snippet takes the top genes below a p-value threshold after correction for multiple testing (if Benjamini-Hochberg is used the threshold applies to the false discovery rate) and produces an HTML file that summarises the significant results.

library("KEGG.db")
library("GO.db")
library("annaffy")
library("XML")
library("annotate")
library("hgu133a.db")

afarms1<-topTable(fit2farms, coef=1, adjust="BH", n=100, p.value=0.01)
gnfarms <-as.character(afarms1$ID)
symfa1 <-getSYMBOL(gnfarms, "hgu133a")
symfa1
atest <- aafTableAnn(gnfarms, "hgu133a.db", aaf.handler())
saveHTML(atest, file="reportfarms1.html"
)
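
If the full set of annotation columns is too wide, the handler list returned by aaf.handler() can be subset before building the table. A minimal sketch, assuming the objects above (the column indices and output file name are illustrative):

# Restrict the report to a few annotation columns
anncols <- aaf.handler()[c(1:3, 6)]
atest2 <- aafTableAnn(gnfarms, "hgu133a.db", anncols)
saveHTML(atest2, file="reportfarms1_short.html")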

Sunday, 24 October 2010

Linear Models and Testing

Perhaps the best reason for using R to build complex statistical models is the strength of its linear model packages, which make building complex models reasonably straightforward, even when you have multiple groups. This is important following cluster analysis, where you are unlikely to end up with only two clusters.

The package for constructing linear models from microarray data is limma.

The following code constructs a model for three different clusters (the 0 cluster is used to label all the arrays that did not fall into one of the three identified clusters). First a design matrix is constructed, which defines the comparisons that will be carried out. The basic model is then fitted, and the pairwise comparisons between clusters are specified in a contrast matrix. The contrast model is then fitted and moderated using empirical Bayes, which is most effective at small sample sizes.

library(limma)
# Cluster membership as a factor; cluster 0 holds the unassigned arrays
f <- factor(f2LCgcrma1$Cluster, levels = c("0", "1", "2", "3"))
design <- model.matrix(~0 + f)
colnames(design) <- c("cl0", "cl1", "cl2", "cl3")
# Fit the basic linear model
fitgcrma <- lmFit(f2LCgcrma1, design)
# Pairwise comparisons between clusters 1, 2 and 3
contrast.matrix <- makeContrasts(cl3-cl2, cl3-cl1, cl2-cl1, levels=design)
fit2gcrma <- contrasts.fit(fitgcrma, contrast.matrix)
# Empirical Bayes moderation of the standard errors
fit2gcrma <- eBayes(fit2gcrma)
# Top-ranked probes for each contrast, BH-adjusted
tab1gcrma <- topTable(fit2gcrma, coef=1, adjust="BH")
tab2gcrma <- topTable(fit2gcrma, coef=2, adjust="BH")
tab3gcrma <- topTable(fit2gcrma, coef=3, adjust="BH")
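
To get a quick overview of how many probes each contrast flags as significant, limma's decideTests can be applied to the fitted model. A short sketch, assuming the fit above (the 0.01 cut-off is illustrative):

# Classify each probe as up, down or not significant per contrast
results <- decideTests(fit2gcrma, adjust.method="BH", p.value=0.01)
summary(results)
# Venn diagram of the overlaps between the three contrasts
vennDiagram(results)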

Saturday, 23 October 2010

Finding the Clusters

When is a cluster real, and when is it an artefact of the analysis or of noise in the data? You want to find consistency of the dendrograms between methods and also for different levels of clustering. Confidence in the clustering falls as there are more contradictions between methods and between filtering cut-offs. Here are some ideas about what you should be looking at (a sketch of points 3 and 4 follows the list):
  1. What is the effect of the different normalisation methods? - compare rma with gcrma for example
  2. What is the effect of the filter cut-off? - try filtered sets of 2000 and 500 genes
  3. Is the clustering affected by using different distance measures? - Manhattan vs Euclidean for example
  4. Is the clustering affected by the clustering algorithm? - agglomerative or divisive
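
As a sketch of points 3 and 4, the same filtered expression set can be clustered under two distance measures and with an agglomerative versus a divisive algorithm. This assumes the fLCgcrma1 object used in the posts below; diana comes from the cluster package.

library(cluster)
m <- log2(t(exprs(fLCgcrma1)))
par(mfrow = c(1, 3))
# Agglomerative clustering under two different distance measures
plot(hclust(dist(m, method="euclidean"), method="average"), main="Agglomerative, Euclidean")
plot(hclust(dist(m, method="manhattan"), method="average"), main="Agglomerative, Manhattan")
# Divisive clustering for comparison (which.plots=2 gives the dendrogram)
plot(diana(dist(m, method="euclidean")), which.plots=2, main="Divisive, Euclidean")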

Friday, 22 October 2010

Adding the Clustering to the Expression Data

The easiest way to add the clustering to the expression data is to edit a separate file which can be used to update the phenoData element of the filtered expression set. This will be needed for subsequent statistical testing, and it is easier to do it this way than to try to edit individual phenoData elements. The file is a delimited text file (here tab-separated, despite the .csv name) that can be created in any spreadsheet program. It is then read into R and attached to each of the normalised and filtered data sets.


# Read the cluster assignments (tab-separated) and attach them
# as the phenoData of the filtered expression set
pd <- read.AnnotatedDataFrame("pheno_cluster.csv", sep="\t", quote = "\"'")
phenoData(fLCgcrma1) <- pd
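
A minimal sketch of what such a file might contain, with tab-separated columns where the first column holds the array names (the sample names here are hypothetical; the Cluster column matches the factor used for the linear models above):

sampleName	Cluster
cellline01.CEL	1
cellline02.CEL	3
cellline03.CEL	0

For the phenoData assignment to validate, the row names in the file must match the sample names of the expression set.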

Thursday, 21 October 2010

Clustering the Arrays

If you are looking at biological variation you may want to cluster the arrays. When analysing a study of 54 tissue culture arrays from lung cancer cell lines, the arrays were clustered after filtering, using Euclidean distance matrices on the log2 expression levels.


# Euclidean distances between arrays on the log2 scale
dgcrma1 <- dist(log2(t(exprs(fLCgcrma1))), method="euclidean")
# Agglomerative clustering with average linkage
hcgcrma1 <- hclust(dgcrma1, method = "average")
plot(hcgcrma1)


The cluster dendrograms from the different normalisation methods can then be compared and a consensus devised. In this case RMA performed slightly worse than FARMS and GCRMA, as the clusters conserved between the other two methods showed some variation in RMA. The factor that makes the biggest difference to the ordering is the number of genes used in the clustering, so the dendrograms are very sensitive to the filtering step.
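
A sketch of such a comparison, plotting the three dendrograms side by side. The fLCrma1 and fLCfarms1 objects are assumed to be the RMA- and FARMS-normalised counterparts of fLCgcrma1 (the names follow the pattern above but are assumptions here):

# Helper to cluster one expression set and plot its dendrogram
plotClust <- function(eset, title) {
    d <- dist(log2(t(exprs(eset))), method="euclidean")
    plot(hclust(d, method="average"), main=title)
}
par(mfrow = c(1, 3))
plotClust(fLCrma1, "RMA")
plotClust(fLCgcrma1, "GCRMA")
plotClust(fLCfarms1, "FARMS")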

Wednesday, 20 October 2010

Filtering the Genes

Carrying out a large number of comparisons across all genes is impractical because of the problems with false positives and the need for multiple-testing corrections. There is also the curse of dimensionality to consider if machine learning or clustering techniques are going to be used.

As the number of samples is much smaller than the number of genes, you have to filter the data before trying any statistical tests or machine learning. What you look for is gene variability: housekeeping genes should have the same expression levels in all cells and so show little variability. You also need the measure to be robust to outliers. The range (maximum expression in the dataset minus minimum expression in the dataset) is not robust on its own, so it is combined with a ratio measure. The script below combines the ratio and difference tests of gene expression variability and applies them using the genefilter package.


library(genefilter)

# Keep a probe only if max/min > r AND max - min > d
mmfilt1 <- function(r = 2, d = 2, na.rm = TRUE) {
    function(x) {
        minval <- min(x, na.rm = na.rm)
        maxval <- max(x, na.rm = na.rm)
        (maxval/minval > r) && (maxval - minval > d)
    }
}
mmfun1 <- mmfilt1()
ffun1 <- filterfun(mmfun1)
frma1 <- genefilter(eLCrma1, ffun1)
# Number of probes passing the filter
sum(frma1)


You need to adjust r and d until sum(frma1) gives you the number of genes you want. In this case I wanted about 1500 probes, as I am looking for genes that differ between multiple groups.
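
One way to script that adjustment is to loop over a small grid of r and d values and report how many probes survive each combination. A sketch assuming the objects above (the grid values are illustrative):

# Count surviving probes for a grid of ratio and difference cut-offs
for (r in c(1.5, 2, 3)) {
    for (d in c(1, 2, 3)) {
        f <- genefilter(eLCrma1, filterfun(mmfilt1(r, d)))
        cat("r =", r, "d =", d, "probes kept:", sum(f), "\n")
    }
}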

The output of genefilter is a logical vector: one entry per probe, TRUE if it is to be included and FALSE if it is not. This has to be applied to the original expression data set to get the filtered expression data.

# Subset the expression set to the probes that passed the filter
fLCrma1 <- LCrma1[frma1, ]