Friday, 4 December 2015

Unit-testing using bioconductor MArrayLM as input

I have a range of analysed microarray datasets (stored as limma:MArrayLM objects in an R list, fits) and want to make summaries for specific genes across all of the datasets. The hierarchical modelling for the cross-dataset analysis will be done with jags.
I'm currently setting up the dataset to pass to jags; the resulting data.frame will look like:

dataset : id.idx : probe.coef
1       : 1      : 0.234  
1       : 1      : 0.345      # multiple probes for each gene 
1       : 1      : 0.789      #   within a dataset  
1       : 2      : -0.1
2       : 1      : 0.33       # A given gene can have different
2       : 1      : 0.79       #   numbers of probes between datasets
2       : 2      : 0.3
2       : 2      : -0.5 
....

Datasets are numbered from 1 to D, Genes are numbered from 1 to G; only genes that are present in every dataset are considered.

The MArrayLM objects, fits[[1]] to fits[[D]], each have a $coefficients table (from where the probeCoef is pulled out) and an entry in $genes giving the entrez.gene id for each row (these are mapped into 1..G so jags can use them as an index).

A function to pull out the probe coefficients and map the entrez-ids to gene indices should be pretty straightforward to write but I can't quite work out how to set up and test it (but use RUnit for testing). To do this I'd need to be able to create MArrayLM objects containing specific values (preferably without creating an eset and passing this through lmFit, for the sake of speed)

--- 

A relatively simple way to construct MArrayLM objects (that contain all the relevant structures for conversion into the format above) was found using
lm <- new(Class = 'MArrayLM')
lm$coefficients <- some.matrix
lm$genes <- some.dframe

---

Aside from the various exception handling steps, my method to convert the MArrayLM objects into a dataframe for input into jags is as follows:

get_multi_dataset_coefficients <- function(
        fits,
        contrast.name = NULL,
        id.column = 'entrez.id',
        ids,
        warn = TRUE
        ){
    # ... exception handling omitted ... #
    filter.fits <- Filter(
        function(fit) {
            contrast.name %in% colnames(fit$coefficients)},
        fits
        )
    # ... #

    filter.fits <- lapply(filter.fits, function(fit){
        if(length(setdiff(ids, fit$genes[[id.column]])) > 0){
            stop("Some requested 'ids' are absent from at least one dataset")
            }
        fit <- fit[fit$genes[[id.column]] %in% ids, ]
        fit
        }
    # ... #
    jags.frames <- Map(
        function(fit, idx){
            data.frame(
                dataset    = idx,
                id.idx     = match(fit$genes[[id.column]], ids),
                probe.coef = fit$coefficients[, contrast.name]
                )
            },
        filter.fits,
        1:length(filter.fits)
        )
    jags.data <- do.call('rbind', jags.frames)
    rownames(jags.data) <- NULL
    jags.data
    }