Friday 4 December 2015

Unit-testing using Bioconductor MArrayLM objects as input

I have a range of analysed microarray datasets (stored as limma::MArrayLM objects in an R list, fits) and want to make summaries for specific genes across all of the datasets. The hierarchical modelling for the cross-dataset analysis will be done with JAGS.
I'm currently setting up the data to pass to JAGS; the resulting data.frame will look like:

dataset : id.idx : probe.coef
1       : 1      : 0.234  
1       : 1      : 0.345      # multiple probes for each gene 
1       : 1      : 0.789      #   within a dataset  
1       : 2      : -0.1
2       : 1      : 0.33       # A given gene can have different
2       : 1      : 0.79       #   numbers of probes between datasets
2       : 2      : 0.3
2       : 2      : -0.5 
....

Datasets are numbered from 1 to D and genes from 1 to G; only genes that are present in every dataset are considered.
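For orientation, here is a minimal sketch of the kind of hierarchical model those indices are destined for. The model structure below is entirely hypothetical (the real cross-dataset model isn't specified in this post); it just shows how dataset, id.idx and probe.coef would be used as indices and data in JAGS:

model.string <- "
model {
    for(i in 1:N){
        # each probe-level coefficient is a noisy observation of a
        # dataset-specific gene effect
        probe.coef[i] ~ dnorm(gene.effect[dataset[i], id.idx[i]], tau.probe)
        }
    for(d in 1:D){
        for(g in 1:G){
            # dataset-specific effects are shrunk towards a gene-level mean
            gene.effect[d, g] ~ dnorm(mu[g], tau.gene)
            }
        }
    for(g in 1:G){
        mu[g] ~ dnorm(0, 1.0E-4)
        }
    tau.probe ~ dgamma(0.001, 0.001)
    tau.gene  ~ dgamma(0.001, 0.001)
    }
"
# compiled with, eg, rjags::jags.model(textConnection(model.string), data = jags.data.list)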

The MArrayLM objects, fits[[1]] to fits[[D]], each have a $coefficients table (from which probe.coef is pulled) and an entry in $genes giving the Entrez gene id for each row (these ids are mapped into 1..G so JAGS can use them as an index).

A function to pull out the probe coefficients and map the Entrez ids to gene indices should be pretty straightforward to write, but I couldn't quite work out how to set up tests for it (I use RUnit for testing). To do this I'd need to be able to create MArrayLM objects containing specific values, preferably without creating an ExpressionSet and passing it through lmFit, for the sake of speed.

--- 

A relatively simple way to construct MArrayLM objects (that contain all the structures needed for conversion into the format above) turned out to be:
fit <- new(Class = 'MArrayLM')
fit$coefficients <- some.matrix  # probes x contrasts matrix
fit$genes <- some.dframe         # data.frame with the gene-id column
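For example, a minimal mock that reproduces the dataset-1 rows of the table above (the contrast name 'trt.vs.ctl' and the Entrez ids are made up for illustration):

library(limma)

mock.fit <- new(Class = 'MArrayLM')
# probes x contrasts matrix of coefficients
mock.fit$coefficients <- matrix(
    c(0.234, 0.345, 0.789, -0.1),
    ncol     = 1,
    dimnames = list(NULL, 'trt.vs.ctl')
    )
# three probes for gene '1000', one probe for gene '2000'
mock.fit$genes <- data.frame(
    entrez.id        = c('1000', '1000', '1000', '2000'),
    stringsAsFactors = FALSE
    )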

---

Aside from the various exception-handling steps, my method to convert the MArrayLM objects into a data.frame for input into JAGS is as follows:

get_multi_dataset_coefficients <- function(
        fits,
        contrast.name = NULL,
        id.column = 'entrez.id',
        ids,
        warn = TRUE
        ){
    # ... exception handling omitted ... #
    filter.fits <- Filter(
        function(fit) {
            contrast.name %in% colnames(fit$coefficients)},
        fits
        )
    # ... #

    filter.fits <- lapply(filter.fits, function(fit){
        if(length(setdiff(ids, fit$genes[[id.column]])) > 0){
            stop("Some requested 'ids' are absent from at least one dataset")
            }
        fit <- fit[fit$genes[[id.column]] %in% ids, ]
        fit
        })
    # ... #
    jags.frames <- Map(
        function(fit, idx){
            data.frame(
                dataset    = idx,
                id.idx     = match(fit$genes[[id.column]], ids),
                probe.coef = fit$coefficients[, contrast.name]
                )
            },
        filter.fits,
        seq_along(filter.fits)
        )
    jags.data <- do.call('rbind', jags.frames)
    rownames(jags.data) <- NULL
    jags.data
    }
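
And a sketch of an RUnit test for the above, built on the mock object constructed earlier (the expected values are the dataset-1 rows of the table at the top):

library(RUnit)

test_get_multi_dataset_coefficients <- function(){
    observed <- get_multi_dataset_coefficients(
        fits          = list(mock.fit),
        contrast.name = 'trt.vs.ctl',
        ids           = c('1000', '2000')
        )
    # one row per probe, in dataset order
    checkEquals(observed$dataset,    rep(1, 4))
    checkEquals(observed$id.idx,     c(1, 1, 1, 2))
    checkEquals(observed$probe.coef, c(0.234, 0.345, 0.789, -0.1))
    }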


Tuesday 3 November 2015

Setting up LyX/knitr/R for Windows

New job at Glasgow doing leukaemia research.

Currently having to use Windows for bioinformatics, which is always a pleasure.

I like to keep a running labbook in LyX, so that results and graphs from R can be imported automatically.

Installed R (v3.2.2, x86_64-w64-mingw32; on 3/11/2015) and a few packages:
install.packages('ggplot2')
source('https://bioconductor.org/biocLite.R')
biocLite()
biocLite('ggbio')
install.packages('knitr')

Downloaded the LyX 2.1.4 bundle (which installs MiKTeX at the same time) from ftp://ftp.lyx.org/pub/lyx/bin/2.1.4/LyX-2.1.4-Bundle-4.exe

In LyX, I added the path to the directory containing 'Rscript.exe' for the above version of R to 'Preferences > Paths > PATH prefix' and reconfigured LyX.
On reopening LyX, I was able to add the 'Rnw (knitr)' module to a blank document and compile it to PDF (the document contained no knitr markup).
But the drop-down menu for choosing a paragraph environment is supposed to contain a 'Sweave/Chunk' option, and this was absent in LyX 2.1.4.

I installed LyX 2.0.8 using the installer ftp://ftp.lyx.org/pub/lyx/bin/2.0.8/LyX-2.0.8-Installer-1.exe (since MiKTeX had already been installed).

On setting the Rscript path and adding the Rnw (knitr) module to a LyX document, I was now able to add Sweave chunks to my document; a minimal chunk is sketched below.
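For reference, this is roughly what a chunk looks like in the exported .Rnw file (the chunk label and options here are arbitrary):

<<density-sketch, echo=TRUE, fig.height=4>>=
library(ggplot2)
# a throwaway plot, just to check that knitr runs R and embeds figures
ggplot(data.frame(x = rnorm(100)), aes(x = x)) + geom_density()
@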

# hopefully the last in the current run of 'boring software installation blogs'

Wednesday 19 August 2015

Prepare py3 for probabilistic programming

A Python 3 environment for running code from Probabilistic Programming.
Note that the latter is written in py2.7, but I want to update the code to py3.4.

apt-get install apache2-dev
conda create --name phack python=3.4 \
  ipython ipython-notebook matplotlib \
  numpy pymc pyzmq scipy tornado jinja2

# praw and wsgiref not available via anaconda (the latter because the py3.4 version is called mod_wsgi)
source activate phack
pip install praw
pip install mod_wsgi # may be some differences with py2.7/wsgiref

# ipython check that they're all installed:
In [1]: import matplotlib
In [2]: import numpy
In [3]: import pymc
In [4]: import zmq
In [5]: import scipy
In [6]: import tornado
In [7]: import wsgiref
In [8]: import praw
In [9]: import jinja2
# nb, import 'zmq' and 'wsgiref' rather than 'pyzmq' and 'mod_wsgi'

Friday 7 August 2015

Installing anaconda-python

Anaconda provides a Python distribution that includes a range of libraries for machine learning, statistics and data processing.

I installed Anaconda and various Python libraries onto a new Mint 17 install.
Downloaded the installer from the Anaconda downloads page and installed using
bash Anaconda-2.3.0-Linux-x86_64.sh

Installed into ~/anaconda and appended ~/anaconda/bin to the PATH in ~/.bashrc.

Updated anaconda:
conda update conda

The updated packages were:
    conda:      3.14.1-py27_0 --> 3.15.1-py27_0
    conda-env:  2.2.3-py27_0  --> 2.3.0-py27_0 
    pip:        7.0.3-py27_0  --> 7.1.0-py27_0 
    setuptools: 17.1.1-py27_0 --> 18.0.1-py27_0

Created a python3 environment
conda create --name py3 python=3.4
source activate py3
Forgot to install pandas etc. with python 3.4, so the environment needed rebuilding:
conda remove --name py3 --all # can't do this while py3 is active

source deactivate
conda remove --name py3 --all
conda create --name py3 python=3.4 numpy pandas ipython
source activate py3

The last bit wasn't strictly necessary: I can add other libraries from Anaconda into py3 later, as here for scipy:
conda install --name py3 scipy

Wednesday 5 August 2015

biolearnr intro

Bioinformatics would be a seriously lame duck without statistics.

This is my blog about statistics, machine learning, data mining and all those sorts of things. My background is in cancer genomics, network analysis and cell signalling. I'm currently working through a series of MOOCs that are relevant to this blog (many more interesting courses and resources are mentioned on the open source data science masters' website and machinelearningmastery):
- Ian Witten's introductory WEKA course
- The Caltech/JPL data analytics course
- Bill Howe's intro to data science

Also, I'm always working through statistics and machine learning books. At the moment it's Bishop's PRML, Gelman's Bayesian Data Analysis and Casella & Berger's Statistical Inference.

All the above, and a variety of bioinformatics papers, will probably end up on here. Unless it ends up like biographr, where I wrote an intro and then nothing for 8 months.