Monday 22 May 2017

BFX clinic: Getting up to date microarray probe annotations

Downloading microarray datasets from NCBI/EBI databases


ArrayExpress and the Gene Expression Omnibus (GEO) have made thousands of microarray experiments publicly available. These databases are most easily accessed in R using the Bioconductor packages ArrayExpress and GEOquery, respectively (note that the ArrayExpress package is only really suited to downloading Affymetrix datasets). Datasets are simply downloaded using:

# From array-express
library(ArrayExpress)
library(oligo)
AEset <- ArrayExpress("E-GEOD-65552")
AEset.norm <-  oligo::rma(AEset) # prepares an ExpressionSet

# From GEO:
library(GEOquery)
GEOset <- getGEO("GSE65552") # returns a list of 'ExpressionSet's
GEOset <- GEOset[[1]]        # pulls out the first ExpressionSet

I've deliberately downloaded the same dataset from both databases. You can specify which directory the datasets should be downloaded to, among various other tweaks - see the man pages for the two packages.

The getGEO function downloads a pre-normalised version of the dataset, whereas the ArrayExpress function downloads the raw .CEL files, which we then normalised using RMA. Although they derive from the same experiment, the two ExpressionSets are actually quite different:

> dim(AEset.norm)
Features  Samples 
   23332        4 
> dim(GEOset)
Features  Samples 
 1236087        4

There are different numbers of features because the arrays have been summarised at different levels (that's a story for another day), but this indicates that it's sometimes better to start from the raw data, so that you know how the arrays have been processed, when downloading from GEO or ArrayExpress (raw data is provided by both databases for most datasets).
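One advantage of starting from the raw .CEL files is that you control the summarisation level yourself. A minimal sketch, assuming AEset is the ExonFeatureSet read in above (the target values are those offered by oligo for exon arrays):

# Summarise at the transcript-cluster ('gene') level - oligo's default for exon arrays
AEset.gene <- oligo::rma(AEset, target = "core")

# ... or summarise at the individual probeset ('exon') level instead
AEset.exon <- oligo::rma(AEset, target = "probeset")

# The two differ substantially in their number of features
dim(AEset.gene)
dim(AEset.exon)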

Annotation of the probes / probesets

Another slight difference can be seen in the featureData slots of the ExpressionSets - viz, the ArrayExpress-derived dataset doesn't have annotation data for the features, whereas the GEO dataset does:

> featureData(AEset.norm)
An object of class 'AnnotatedDataFrame': none

> featureData(GEOset)
An object of class 'AnnotatedDataFrame'
  featureNames: 4304920 4304921 ... 5617593 (1236087 total)
  varLabels: ID GB_LIST ... probeset_type (16 total)
  varMetadata: Column Description labelDescription

You could obtain annotations for the datasets from a range of sources:
- Using getGEO("GPLxxxx"), where GPLxxxx is the GPL id of the array that you're using (this is done automatically if you pull your dataset from GEO, hence GEOset already having some annotations present); however, this may not provide the annotation columns that you want, and the mappings may not be up to date (the AnnotGPL entries are pretty good though)
- Using the Bioconductor annotation packages
- Using the biomaRt package (this is a bit long-winded if you have an annotation package available though; a rough sketch is given below)
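For completeness, here's roughly what the biomaRt route looks like. Treat this as a sketch only: the Ensembl attribute name for this exon array ("affy_moex_1_0_st_v1") is my assumption and should be checked against listAttributes() before use.

library(biomaRt)

# Connect to the mouse gene dataset on Ensembl
mart <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")

# listAttributes(mart) shows the available fields; the array-specific
# attribute name used below is assumed, so check it exists first
bm.annots <- getBM(
  attributes = c("affy_moex_1_0_st_v1", "ensembl_gene_id", "external_gene_name"),
  filters    = "affy_moex_1_0_st_v1",
  values     = rownames(AEset.norm),
  mart       = mart
  )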

Annotation packages:

The array used in the above study was "A-AFFY-98 - Affymetrix GeneChip Mouse Exon 1.0 ST Array [MoEx-1_0-st-v1]", according to ArrayExpress. This is a pretty well-established Affymetrix array. If we search the Bioconductor annotation data page for Affymetrix mouse packages, we find "moex10stprobeset.db" and "moex10sttranscriptcluster.db". We install both of these:

source("https://bioconductor.org/biocLite.R")
biocLite("moex10sttranscriptcluster.db")
biocLite("moex10stprobeset.db")

These are Bioconductor AnnotationDbi packages, so they respect the AnnotationDbi API - see here. That means we can obtain the accession numbers (keys) in the two databases as follows:

library(moex10sttranscriptcluster.db)
library(moex10stprobeset.db)

# Each returns tens of thousands of probeset ids, so just peek at the first few
head(keys(moex10sttranscriptcluster.db))
head(keys(moex10stprobeset.db))

Similarly, we can get the accession numbers for each probeset in a given ExpressionSet from its rownames.
Let's check the overlap between the probeset ids in the microarray datasets and those in the annotation databases:

library(magrittr)

# For the ArrayExpress obtained dataset
> (rownames(AEset.norm)) %in% keys(moex10sttranscriptcluster.db) %>%
+     summary
   Mode   FALSE    TRUE    NA's 
logical    1324   22008       0 
> (rownames(AEset.norm)) %in% keys(moex10stprobeset.db) %>%
+     summary
   Mode   FALSE    TRUE    NA's 
logical   16755    6577       0 


# For the GEOquery-obtained dataset
> (rownames(GEOset)) %in% keys(moex10sttranscriptcluster.db) %>%
+     summary
   Mode   FALSE    TRUE    NA's 
logical 1230789    5298       0 
> (rownames(GEOset)) %in% keys(moex10stprobeset.db) %>%
+     summary
   Mode   FALSE    TRUE    NA's 
logical      75 1236012       0 

For the ArrayExpress dataset, almost all of the probeset ids are present in the transcript-cluster database, whereas for the GEOquery derived dataset, most of the probeset ids are present in the probeset database.

The moex***.db databases contain info for the following fields:

> columns(moex10sttranscriptcluster.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GO"           "GOALL"        "IPI"          "MGI"          "ONTOLOGY"    
[16] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROBEID"     
[21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"

So if we want to add gene-level annotations (ENSEMBL, ENTREZID and SYMBOL) to the probesets for the ArrayExpress dataset, we can first obtain them as follows (note that for these array AnnotationDbi objects, PROBEID is the primary key):

ae.annots <- AnnotationDbi::select(
  x       = moex10sttranscriptcluster.db,
  keys    = rownames(AEset.norm),
  columns = c("PROBEID", "ENSEMBL", "ENTREZID", "SYMBOL"),
  keytype = "PROBEID"
  )

We note that there is at least one row in the annotations for each probe-id in the array dataset, but that some probes have multiple gene-level annotations:
dim(ae.annots)
# [1] 27099     4

all(rownames(AEset.norm) %in% ae.annots$PROBEID)
# TRUE

dup.ids <- ae.annots$PROBEID[duplicated(ae.annots$PROBEID)] %>% 
  unique %>%
  sort

# An example probe that has multiple gene-level mappings
ae.annots[ae.annots$PROBEID == dup.ids[1], ]
#      PROBEID            ENSEMBL ENTREZID SYMBOL
# 6579 6747309 ENSMUSG00000033813    21399  Tcea1
# 6580 6747309 ENSMUSG00000025903    18777 Lypla1

We want to build a featureData object to add to the ArrayExpress-derived ExpressionSet. This must have one row for each and every probeset that is mentioned in the ExpressionSet, and the order of the probes must match between the ExpressionSet and the annotation data frame. To do this, we could pass a function to AnnotationDbi::mapIds (similar to select) that tells it to collapse any multiply-mapped values into a single value, but this only works on a single column at a time.

collapser <- function(x){
  x %>% unique %>% sort %>% paste(collapse = "|")
  }
# Example:
collapser(ae.annots[ae.annots$PROBEID == dup.ids[1], "SYMBOL"])
# [1] "Lypla1|Tcea1"

# Doing this for each required 'column' is a bit long-winded
#ae.annots <- AnnotationDbi::mapIds(
#  x       = moex10sttranscriptcluster.db,
#  keys    = rownames(AEset.norm),
#  column  = "ENSEMBL",
#  keytype = "PROBEID",
#  multiVals = collapser
#  )

Alternatively, we can collapse multiply-mapping values using dplyr (which you should totally learn to use, see the transformations cheat sheet here).

# Redefinition of ae.annots
library(dplyr)

ae.annots <- AnnotationDbi::select(
  x       = moex10sttranscriptcluster.db,
  keys    = rownames(AEset.norm),
  columns = c("PROBEID", "ENSEMBL", "ENTREZID", "SYMBOL"),
  keytype = "PROBEID"
  ) %>%
  group_by(PROBEID) %>%
  summarise_each(funs(collapser)) %>%
  ungroup

> ae.annots %>% tail
Source: local data frame [6 x 4]

  PROBEID  ENSEMBL                         ENTREZID          SYMBOL
    (chr)  (chr)                            (chr)            (chr)
1 7023069  ENSMUSG00000056673              20592             Kdm5d
2 7023072  ENSMUSG00000069049              26908             Eif2s3y
3 7023079  ENSMUSG00000069045              26900             Ddx3y
4 7023099  ENSMUSG00000069036              21674             Sry
5 7023132  ENSMUSG00000023452|ENSMUSG00... 236604|320951|... Pisd|Pisd-ps1|...



# Magically, the order of the probe ids in ae.annots matches that in AEset.norm
all(ae.annots$PROBEID == rownames(AEset.norm))
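That said, it's safer not to rely on magic: if the orders ever disagree, you can explicitly reorder the annotation table to match the ExpressionSet. A minimal sketch:

# Reorder ae.annots so that its rows line up with the features of AEset.norm
ae.annots <- ae.annots[match(rownames(AEset.norm), ae.annots$PROBEID), ]
stopifnot(all(ae.annots$PROBEID == rownames(AEset.norm)))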

So all we need to do now is add the annotation data to the featureData of AEset.norm. Note that the featureData slot requires an AnnotatedDataFrame as input.

fd <- new("AnnotatedDataFrame",
  data = data.frame(ae.annots[, -1], stringsAsFactors = FALSE)
  )
rownames(fd) <- ae.annots$PROBEID
featureData(AEset.norm) <- fd

Thus, we've imported & annotated an ArrayExpress-derived dataset using Bioconductor.
If you need to add further annotations, you just have to add extra columns to the featureData of the relevant ExpressionSet.
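For example, to add gene names as an extra column later on, something like the following should work (a sketch that reuses the collapser function from above and assumes moex10sttranscriptcluster.db is still loaded):

# Append a GENENAME column to the existing featureData, collapsing any
# multiply-mapped values in the same way as before
fData(AEset.norm)$GENENAME <- AnnotationDbi::mapIds(
  x         = moex10sttranscriptcluster.db,
  keys      = rownames(AEset.norm),
  column    = "GENENAME",
  keytype   = "PROBEID",
  multiVals = collapser
  )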

Tuesday 16 May 2017

Project organisation

How do you organise your projects? I'm sure no-one but me cares how I organise mine. Here, I'm trying to put down my rules and my reasons for them, so that when I start breaking my rules I can send myself to the corner of naughty bioinformaticians. Have a look at Noble's "A Quick Guide to Organizing Computational Biology Projects" if you want the same kind of info, but from someone with a lot more clout.

There are a few motivating factors behind my choice of approach:

1) It should be quick to set up a new project;

2) Each project should have a consistent structure;

3) A build tool should be used to track dependencies amongst code / data / results, so that updates to a given file trickle down to any dependent files (without forcing nondependent files to be remade)
[hence a Snakefile for use in snakemake is used at the head of each project];

4) I should be able to git-clone one of my projects from my personal bitbucket ...
[all configs, docs, scripts should be present and under version control];

5) ... and be able to set it up and run it on most linux boxes (given the availability of the input data, and modulo some configuration, eg, to link to that data) ...
[filepaths etc. should be hardware-independent within the project, non-standard shells and the like should be avoided];

6) ... and generate all results / reports for that project automatically ...
[related: any results quoted in reports should be dynamically updateable];

7) ... and it should only take a few lines of command line code to go from step [4] to step [6];

8) Each project should run in a self-specified, version-explicit environment
[so we use conda, and the related anaconda & bioconda projects, to manage all external libraries / packages / programs that are used];

9) If multiple projects need to use a particular script / library that I've written, each project should keep an internal copy of that script
[so that project-generic code can be updated without breaking a working project];

10) Big projects should be allowed to contain loosely-coupled subprojects to keep everything neat
[but the subprojects should respect the principles above];

[TODO: add in what you forgot: projects shouldn't tread on each other's toes; the value of packaging; whither exploratory analysis; the volatility of external data; how to limit code duplication; issues with data used by multiple projects?]

A typical project template looks like this:

<JOBNAME>/
    - conf/
        - check_these_dirs.txt
        - copy_these_files.txt
        - job_specific_vars.sh
        - make_these_links.txt
        - make_these_subdirs.txt
        - touch_these_files.txt
        - <JOB_SPECIFIC_CONFIG_FILES>
    - data/
        - ext/
            - <READ_ONLY:EXTERNAL_DATA>
        - int/
            - <READ_ONLY:INTERNAL_DATA>
        - job/
            - <READ/WRITE_TO_THIS_DIR>
    - doc/
        - figure/
        - notebook.[Rmd|lyx]
        - <various_throwaway_ideas>.ipynb
    - lib/
        - <PKGNAME>/ # R package DESCRIPTION/R/tests etc
        - <PKGNAME>.tar.gz
        - Makefile
        - conf/
            - include_into_rpackage.txt
        - global_rfuncs/
            - R/
                - <COPIES_OF_MULTIPROJECT_RFILES>
            - tests/testthat/
                - <... AND TESTS>
        - local_rfuncs/
            - R/
                - <PROJECT_SPECIFIC_RFILES>
            - tests/testthat/
                - <... AND TESTS>
        - setup.DESCRIPTION.R
    - requirements.txt # OR environment.yaml
    - results/
    - scripts/
        - check_env.sh
        - setup_dirs.sh
        - setup_libs.sh
        - setup.sh
        - <VARIOUS_SCRIPTS>
    - Snakefile
    - <OPTIONAL:subprojects/>
    - temp/
    - TODO.txt

Initialisation:
The default contents of ./conf/ and ./scripts are added by a script new_project.sh

The script setup.sh defines the rest of the project-specific directory structure and builds a project-specific R package, based on the contents of the files in ./conf/ and on the scripts setup_dirs.sh and setup_libs.sh. I modify the config scripts/files to specify the name of the job, the conda environment in which the job is to run, and the (optional) job-specific R package; I also specify other things there, like whether an R kernel is required and which files should be copied/linked/touched before running the project.

Environment set up:
While adding code to a project you'll need to import external libraries / programs. For the sake of reproducibility, you should keep track of the versions of the resources you are using. The easiest, and most portable, way to do this is to use conda to install any external dependencies of your code. I make a new environment for each main project that I'm working on (`conda create --name some_project`) and use `conda install -c <repos> <package>` to add extra stuff to it. Having done this, always export a requirements.txt file (or environment.yaml, see here) containing the currently-installed conda packages for the project (`conda list --explicit --name some_project > requirements.txt`) and keep this under version control. Then, if you need to run the same project on a different computer, you should be able to create an identical environment using `conda create --name some_project --file requirements.txt`.

Code:
Code for the project gets put into either scripts/ or lib/.

./scripts mostly contains python / shell / R scripts, and snakemake recipe scripts that are either imported into, or called as subworkflows by, the main project Snakefile.

./lib contains any R function files, class definitions or pipelines that are used in the project. Some of these R files are used by several projects - these are copied (not linked) from a separate repository into ./lib/global_rfuncs. I have to ensure that any given project will still run in the future: if the global R files for a project were linked to the repository versions, any update to those versions might break the future compilation of this project. R functions (etc.) that are written specifically for this project are put into ./lib/local_rfuncs. I use a makefile to build and install the project-specific R package. Having an R package for each project means the R scripts I put in ./scripts are more lightweight than if they contained loads of function definitions.
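(The Makefile itself isn't shown here; purely for illustration, and assuming devtools is available in the project's conda environment and the package lives at lib/myjobpkg - a made-up name - the build-and-install step amounts to something like the following.)

library(devtools)

# Build a source tarball for the project-specific R package and install it
# into the active R library; "lib/myjobpkg" is a hypothetical path
pkg.tar <- devtools::build("lib/myjobpkg")
install.packages(pkg.tar, repos = NULL, type = "source")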

I don't need an equivalent package / library for python functions because I write so few python scripts and the script vs function isolation provided by `if __name__ == '__main__'` means I can easily use functions from one python script inside another.

Unlike in the Noble paper above, I don't put separate src/ and bin/ directories in projects because I don't compile jack.

Data:
There are two main ways to pass project-specific data into a project: via filepaths in ./conf or in ./data. Minor routes exist: through hard-coding information into project-specific scripts, notebooks or the R package for the project, or (for non-project-specific data) by importing bioconductor packages.

If I want to be able to modify a datafile `by hand` without hard-coding the info in a script/library, the file has to go in ./conf - not in ./data. For example, I recently did a meta-analysis of a lot of GEO/ArrayExpress datasets. Having checked the relevant papers/abstracts, not all of the datasets identified by my automated searches were of value to the meta-analysis, so I encoded my inclusion/exclusion criteria for these datasets in a yaml file in ./conf (because the info may have been used by multiple docs/scripts, it had to live somewhere other than ./scripts or ./lib).

All datasets generated by the current project (and most data used by the current project) are accessed via ./data, although ./data may contain links to files/dirs elsewhere on the file system. This means I can use relative subdir filepaths throughout the project; indeed, I never use full paths within the scripts/docs/libs for the project (except during setup.sh, if a full path is mentioned in one of the config files). That is, any script/makefile/notebook used in a project uses filepaths relative to the main working directory for the project (ie, the dir where the Snakefile is sitting) and none of them use filepaths that ../../../..... off into the wilderness (nonetheless, I do allow subproject Snakefiles to have dependencies on other, parallel, subprojects). The only way to use files that are external to the current working directory and its subdirs is either to copy them into one of the project directories, or to link to them from ./data/ext (for datasets downloaded from elsewhere) or ./data/int (for datasets generated by our lab, or for results generated by another one of my projects). Any datasets generated while running a project are put into ./data/job - so a project shouldn't write to ./data/int, ./data/ext or ./conf.

Docs:
I used to write project notebooks in lyx (with knitr and the like) but I've started writing them all in Rmarkdown, and might move over to bookdown eventually (cos it looks like I can stitch several subproject notebooks together more easily in bookdown, but I (or someone) need to put bookdown on anaconda first). Reasons for switching - Rmarkdown and .Rmd files seem better at these things (though some of that may be down to my own ignorance):

- Including dynamic results within paragraphs;
- Automated building of pdfs (and other formats);
- Working-directory bewilderments;
- Readability in version control;
- Reproducibility and portability (lyx is a bit bloated for conda).
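The first of those points is the one I lean on most: with .Rmd you can drop computed values straight into a sentence, so the numbers quoted in a report can't drift out of sync with the analysis. A tiny, purely illustrative fragment (the object names are borrowed from the annotation post above):

The array contains `r nrow(AEset.norm)` transcript clusters, of which
`r sum(fData(AEset.norm)$SYMBOL != "")` map to at least one gene symbol.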

Running the whole damn thing:
- Clone the project
- Move into the project's main directory
- # Possibly modify config files in ./conf
- Create conda env
- Activate conda env
- Run ./scripts/setup.sh to set up the dir structure and data links, and to build/install the R package into the conda env
- Run snakemake


Now I change my mind and rewrite the whole thing...

Friday 5 May 2017

Design of experiments course

While working toward some stats exams, I recently found the Penn State Design of Experiments course (STAT 503), for which all of the course notes (though not the exercises) are openly available.