Downloading microarray datasets from NCBI/EBI databases
ArrayExpress and the Gene-Expression Omnibus have made thousands of microarray experiments available. These databases are most easily accessed in R using bioconductor::ArrayExpress and bioconductor::GEOquery, respectively. The ArrayExpress package is only really suited to downloading Affymetrix datasets. Datasets are simply downloaded using:
# From array-express
library(ArrayExpress)
library(oligo)
AEset <- ArrayExpress("E-GEOD-65552")
AEset.norm <- oligo::rma(AEset) # prepares an ExpressionSet
# From GEO:
library(GEOquery)
GEOset <- getGEO("GSE65552") # returns a list of 'ExpressionSet's
GEOset <- GEOset[[1]] # pulls out the first ExpressionSet
I've deliberately attempted to download the same dataset. You can specify what directory the datasets should be downloaded to, and various other tweaks - see the manpages for the packages.
The getGEO function downloads a pre-normalised version of the dataset, the ArrayExpress function downloads the raw .CEL files, and we've normalised them using RMA. They're the same dataset but they're actually quite different:
> dim(AEset.norm)
Features Samples
23332 4
> dim(GEOset)
Features Samples
1236087 4
There's different numbers of features because the arrays have been summarised at different levels (that's a story for another day), but indicates that sometimes it's better to start from the raw data and know how you've processed the arrays if you've downloading stuff from GEO or ArrayExpress (raw data is provided by both databases for most datasets).
Annotation of the probes / probesets
Another slight difference can be seen in the featureData slots of the ExpressionSets - viz, the ArrayExpress-derived datasets don't have annotation data for the features, whereas the GEO datasets do:> featureData(AEset.norm)
An object of class 'AnnotatedDataFrame': none
> featureData(GEOset)
An object of class 'AnnotatedDataFrame'
featureNames: 4304920 4304921 ... 5617593 (1236087
total)
varLabels: ID GB_LIST ... probeset_type (16 total)
varMetadata: Column Description labelDescription
You could obtain annotations for the datasets from a range of sources
- Using getGEO("GPLxxxx") for whatever the GPL id is for the array that you're using (this is done automatically if you pull your dataset from GEO, hence GEOset already having some annotations present); however, this may not provide the annotation columns that you want, and the mappings may not be up to date (the AnnotGPL entries are pretty good though)
- Using the bioconductor annotation packages
- Using the biomaRt package (this is a bit long-winded if you have an annotation package available though)
Annotation packages:
The array used in the above study was "A-AFFY-98 - Affymetrix GeneChip Mouse Exon 1.0 ST Array [MoEx-1_0-st-v1]" according to ArrayExpress. This is a pretty well established affymetrix array. If we search in the bioconductor annotation_data page for affymetrix mouse datasets, we find packages "moex10stprobeset.db", and "moex10sttranscriptcluster.db". We install both of these:source("https://bioconductor.org/biocLite.R")
biocLite("moex10sttranscriptcluster.db")
biocLite("moex10stprobeset.db")
These are biocnoductor annotationDbi packages, as such they respect the AnnotationDbi API - see here. That means, we can obtain the accession numbers in the two databases as follows:
# keys(moex10sttranscriptcluster.db)
# keys(moex10stprobeset.db)
Similarly, we can get the accession numbers for each probeset in a given ExpressionSet from it's rownames.
Let's check the overlap between the probeset ids in the microarray datasets and these annotation databases:
library(magrittr)
# For the ArrayExpress obtained dataset
> (rownames(AEset.norm)) %in% keys(moex10sttranscriptcluster.db) %>%
+ summary
Mode FALSE TRUE NA's
logical 1324 22008 0
> (rownames(AEset.norm)) %in% keys(moex10stprobeset.db) %>%
+ summary
Mode FALSE TRUE NA's
logical 16755 6577 0
# For the GEOquery-obtained dataset
> (rownames(GEOset)) %in% keys(moex10sttranscriptcluster.db) %>%
+ summary
Mode FALSE TRUE NA's
logical 1230789 5298 0
> (rownames(GEOset)) %in% keys(moex10stprobeset.db) %>%
+ summary
Mode FALSE TRUE NA's
logical 75 1236012 0
For the ArrayExpress dataset, almost all of the probeset ids are present in the transcript-cluster database, whereas for the GEOquery derived dataset, most of the probeset ids are present in the probeset database.
The moex***.db databases contain info for the following fields:
columns(moex10sttranscriptcluster.db)
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GO" "GOALL" "IPI" "MGI" "ONTOLOGY"
[16] "ONTOLOGYALL" "PATH" "PFAM" "PMID" "PROBEID"
[21] "PROSITE" "REFSEQ" "SYMBOL" "UNIGENE" "UNIPROT"
So if we want to add gene-level annotations (ENSEMBL, ENTREZID and SYMBOL) to the probesets for the ArrayExpress dataset, we can first obtain them as follows (note that for these array annotationDBI objects, PROBEID is the primary key):
ae.annots <- AnnotationDbi::select(
x = moex10sttranscriptcluster.db,
keys = rownames(AEset.norm),
columns = c("PROBEID", "ENSEMBL", "ENTREZID", "SYMBOL"),
keytype = "PROBEID"
)
We note that there is at least one row in the annotations for each probe-id in the array dataset, but that some probes have multiple gene-level annotations:
dim(ae.annots)
# [1] 27099 4
all(rownames(AEset.norm) %in% ae.annots$PROBEID)
# TRUE
dup.ids <- ae.annots$PROBEID[duplicated(ae.annots$PROBEID)] %>%
unique %>%
sort
# An example probe that has multiple gene-level mappings
ae.annots[ae.annots$PROBEID == dup.ids[1], ]
# PROBEID ENSEMBL ENTREZID SYMBOL
# 6579 6747309 ENSMUSG00000033813 21399 Tcea1
# 6580 6747309 ENSMUSG00000025903 18777 Lypla1
We want to build a featureData object to add to the ArrayExpress-derived ExpressionSet. This must have one row for each and every probeset that is mentioned in the ExpressionSet; and the order of the probes must match between the ExpressionSet and the Annotation dataframe. To do this, we could pass a function to AnnotationDbi::mapIds (similar to select) that tells it to collapse any multiply mapped values into a single value, but this only works on a single column at a time.
collapser <- function(x){
x %>% unique %>% sort %>% paste(collapse = "|")
}
# Example:
collapser(ae.annots[ae.annots$PROBEID == dup.ids[1], "SYMBOL"])
[1] "Lypla1|Tcea1"
# Doing this for each required 'column' is a bit long-winded
#ae.annots <- AnnotationDbi::mapIds(
# x = moex10sttranscriptcluster.db,
# keys = rownames(AEset.norm),
# column = "ENSEMBL",
# keytype = "PROBEID",
# multiVals = collapser
# )
Alternatively, we can collapse multiply-mapping values using dplyr (which you should totally learn to use, see the transformations cheat sheet here).
# Redefinition of ae.annots
ae.annots <- AnnotationDbi::select(
x = moex10sttranscriptcluster.db,
keys = rownames(AEset.norm),
columns = c("PROBEID", "ENSEMBL", "ENTREZID", "SYMBOL"),
keytype = "PROBEID"
) %>%
group_by(PROBEID) %>%
summarise_each(funs(collapser)) %>%
ungroup
> ae.annots %>% tail
Source: local data frame [6 x 4]
PROBEID ENSEMBL ENTREZID SYMBOL
(chr) (chr) (chr) (chr)
1 7023069 ENSMUSG00000056673 20592 Kdm5d
2 7023072 ENSMUSG00000069049 26908 Eif2s3y
3 7023079 ENSMUSG00000069045 26900 Ddx3y
4 7023099 ENSMUSG00000069036 21674 Sry
5 7023132 ENSMUSG00000023452|ENSMUSG00... 236604|320951|... Pisd|Pisd-ps1|...
# Magically, the order of the probe ids in ae.annots matches that in AEset.norm
all(ae.annots$PROBEID == rownames(AEset.norm))
So, all we need to do now, is add the annotation data to the featureData of AEset.norm. Note that the featureData requires an AnnotatedDataFrame as input.
fd <- new("AnnotatedDataFrame",
data = data.frame(ae.annots[, -1], stringsAsFactors = FALSE)
)
rownames(fd) <- ae.annots$PROBEID
featureData(AEset.norm) <- fd
Thus, we've imported & annotated an ArrayExpress-derived dataset using bioconductor.
If you need to add further annotations, you just have to add extra columns to the featureData of the relevant ExpressionSet