Data Deluge: Filtering the Genes

Carrying out a large number of comparisons for all genes is impossible because of the problems with false positives and the need for correction terms for multiple testing. There is also the curse of dimensionality to consider if machine learning or clustering techniques are going to be used.

As sample number is much smaller than gene number and thus gene number you have to filter the data before trying any statistical tests or machine learning. What you look for is gene variability. House keeping genes should have the same expression levels in all cells and will have no variability. You also need to have a robust measure to account for outliers, so the range: maximum expression in the dataset - minimum expression in the dataset, is not robust, so there also has to be a ratio measure. This is a script that combines both the ratio and difference tests of gene expression variability and applies them using the genefilter package.

mmfilt1 <- function(r = 2, d = 2, na.rm = TRUE) {
function(x) {
minval <- min(x, na.rm = na.rm)
maxval <- max(x, na.rm = na.rm)
(maxval/minval > r) && (maxval - minval > d)
}
}
mmfun1 <- mmfilt1()
ffun1 <- filterfun(mmfun1)
frma1 <- genefilter(eLCrma1, ffun1)
sum(frma1)

You need to adjust r and d until sum(output) gives you the number of genes you want. In this case I wanted it to be about 1500 probes as I am looking for genes that are different between multiple groups.

The output of this routine is a logical array. So this just lists the probes and gives TRUE if they are to be included and FALSE if they are not. So this has to be applied to the original expression data set to get the filtered expression data.

fLCrma <-LCrma[frma, ]

Data Deluge

Wednesday, 20 October 2010

Filtering the Genes

No comments:

Post a Comment