Tuesday, 14 September 2010

Underlying Statistical Model

The intensity y of a single probe is given by:

Y = B + αS

Where B is the background noise composed of many different terms including optical effects and non-specific binding effects (this is important as this means it will be non-systematic and so normally distributed according to the CLT). Alpha is a gain factor and S is the specific binding. S will also be a random variable to account for probe effects and measurement error, and these errors are assumed to be multiplicative.

log(S) = θ + φ + ε

The variability of B and α between arrays means that even if S is the same it will not appear to be the same. This can be demonstrated by the use of spiked in amounts of a gene. There is attenuation bias - low concentrations are more strongly affected by background. This can be seen from the example SpikeInSubset from the affy library.

Index <- which(probeNames(spikein133)%in%colnames(pData(spikein133)))
pms <- pm(spikein133)[Index,]
genenames <- probeNames(spikein133)[Index]
nominal <- t(sapply(probeNames(spikein133)[Index],function(i) pData(spikein133)[,i]))
x <- as.vector(log2(nominal))
y <- as.vector(log2(pms))
avg <- tapply(y,x,mean,na.rm=TRUE)

This extracts the data subset we want to plot from the SpikeInSubset and scales it to log2. This is then plotted with:

ylab="Log (Base 2) PM Intensity",
xlab="Nominal Log (Base 2) Concentration")

The additive-multiplicative model is confirmed by plotting a large series of probe intensities for a given nominal concentration (1 picomol), against a normally distributed error. The resulting QQ-plot lies mostly on the diagonal confirming that the experimental intensities are normally distributed (this could have been done with a KS-test).

This is plotted with:


Monday, 6 September 2010

Getting Bioconductor

The Bioconductor Homepage.

Once you have R installed:

To install other packages from Bioconductor use the format:
> biocLite("SpikeInSubset")

Sunday, 5 September 2010

Where you should start

I should say before I begin this Blog that this is my set of notes and reminders for myself for my own research into high throughput biological analysis and that most of the code and examples come from the excellent book:

Bioinformatics and Computational Biology Solutions Using R and Bioconductor by Rober Gentleman, Vincent Carey, Wolfang Huber, Rafael Irizarry and Sandrine Dudoit.

This is an invaluable resource and you should buy it.