Thursday, 5 May 2011

GC Content Analysis of miRBase

I want to create a dataset for supervised learning based on miRBase, so I need to know its statistical properties. Having used seqinr to import miRBase into R, I can now carry out some analysis. GC content is a good place to start: many of the sequences I am familiar with have a high AT content, so an AT bias might affect the learning process.

There are 19724 sequences in the database; I could have used length(mature) instead of hard-coding the number. Note that out must be initialised (for example as an empty vector) before the loop runs.

> out <- c()
> for (x in 1:19724) out <- c(out, GC(mature[[x]]))
> hist(out)
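The loop can also be written without a hard-coded count or a pre-initialised vector. The following is only a sketch on toy data: toy_seqs, mature_toy and gc_fraction are names made up for illustration, and it assumes each entry of mature is a character vector with one base per element. The helper divides by the full sequence length, which matters because miRBase mature sequences are written with U rather than T, so it may be worth checking how GC() handles u.

```r
# Toy stand-in for `mature`: a named list of character vectors, one base per element.
# (Assumption: this mirrors the structure of the imported miRBase data.)
toy_seqs <- c(s1 = "ggcc", s2 = "aauu", s3 = "gcau")
mature_toy <- lapply(toy_seqs, function(s) strsplit(s, "")[[1]])

# Hypothetical helper: fraction of g/c bases over the full sequence length.
gc_fraction <- function(bases) {
  bases <- tolower(bases)
  sum(bases %in% c("g", "c")) / length(bases)
}

# vapply builds the result vector itself, so there is no empty `out`
# to initialise and no sequence count to hard-code.
out <- vapply(mature_toy, gc_fraction, numeric(1))
print(out)
```

On the real data the equivalent would be something like out <- sapply(mature, GC) followed by hist(out).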



I am happy with this: the histogram looks approximately normally distributed, so I do not need to take any special care over GC/AT content when building my training sets.
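Judging normality from a histogram is a visual call, so a quick numerical check could back it up. This is a rough sketch rather than what I actually ran: out_sim is simulated data standing in for out, and its mean and standard deviation are arbitrary values chosen for illustration, not measurements from miRBase.

```r
set.seed(1)

# Simulated stand-in for `out` (assumption: arbitrary mean and sd, not real results).
out_sim <- rnorm(19724, mean = 0.5, sd = 0.1)

# shapiro.test() accepts at most 5000 values, so test a random subsample.
sub <- sample(out_sim, 5000)
res <- shapiro.test(sub)
print(res$p.value)

# Correlation between theoretical and observed quantiles of a normal Q-Q plot;
# values close to 1 are consistent with a normal distribution.
qq <- qqnorm(out_sim, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```

With the real vector, replacing out_sim with out and calling qqnorm(out) followed by qqline(out) would give the plot to compare against the histogram.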
