Monday, 22 May 2017

BFX clinic: Getting up to date microarray probe annotations

Downloading microarray datasets from NCBI/EBI databases


ArrayExpress and the Gene-Expression Omnibus have made thousands of microarray experiments available. These databases are most easily accessed in R using bioconductor::ArrayExpress and bioconductor::GEOquery, respectively. The ArrayExpress package is only really suited to downloading Affymetrix datasets. Datasets are simply downloaded using:

# From array-express
library(ArrayExpress)
library(oligo)
AEset <- ArrayExpress("E-GEOD-65552")
AEset.norm <-  oligo::rma(AEset) # prepares an ExpressionSet

# From GEO:
library(GEOquery)
GEOset <- getGEO("GSE65552") # returns a list of 'ExpressionSet's
GEOset <- GEOset[[1]]        # pulls out the first ExpressionSet

I've deliberately attempted to download the same dataset. You can specify what directory the datasets should be downloaded to, and various other tweaks - see the manpages for the packages.

The getGEO function downloads a pre-normalised version of the dataset, the ArrayExpress function downloads the raw .CEL files, and we've normalised them using RMA. They're the same dataset but they're actually quite different:

> dim(AEset.norm)
Features  Samples 
   23332        4 
> dim(GEOset)
Features  Samples 
 1236087        4

There's different numbers of features because the arrays have been summarised at different levels (that's a story for another day), but indicates that sometimes it's better to start from the raw data and know how you've processed the arrays if you've downloading stuff from GEO or ArrayExpress (raw data is provided by both databases for most datasets).

Annotation of the probes / probesets

Another slight difference can be seen in the featureData slots of the ExpressionSets - viz, the ArrayExpress-derived datasets don't have annotation data for the features, whereas the GEO datasets do:

> featureData(AEset.norm)
An object of class 'AnnotatedDataFrame': none

> featureData(GEOset)
An object of class 'AnnotatedDataFrame'
  featureNames: 4304920 4304921 ... 5617593 (1236087
    total)
  varLabels: ID GB_LIST ... probeset_type (16 total)

  varMetadata: Column Description labelDescription

You could obtain annotations for the datasets from a range of sources
- Using getGEO("GPLxxxx") for whatever the GPL id is for the array that you're using (this is done automatically if you pull your dataset from GEO, hence GEOset already having some annotations present); however, this may not provide the annotation columns that you want, and the mappings may not be up to date (the AnnotGPL entries are pretty good though)
- Using the bioconductor annotation packages
- Using the biomaRt package (this is a bit long-winded if you have an annotation package available though)

Annotation packages:

The array used in the above study was "A-AFFY-98 - Affymetrix GeneChip Mouse Exon 1.0 ST Array [MoEx-1_0-st-v1]" according to ArrayExpress. This is a pretty well established affymetrix array. If we search in the bioconductor annotation_data page for affymetrix mouse datasets, we find packages "moex10stprobeset.db", and "moex10sttranscriptcluster.db". We install both of these:

source("https://bioconductor.org/biocLite.R")
biocLite("moex10sttranscriptcluster.db")
biocLite("moex10stprobeset.db")

These are biocnoductor annotationDbi packages, as such they respect the AnnotationDbi API - see here. That means, we can obtain the accession numbers in the two databases as follows:

# keys(moex10sttranscriptcluster.db)
# keys(moex10stprobeset.db)

Similarly, we can get the accession numbers for each probeset in a given ExpressionSet from it's rownames. 
Let's check the overlap between the probeset ids in the microarray datasets and these annotation databases:

library(magrittr)

# For the ArrayExpress obtained dataset
> (rownames(AEset.norm)) %in% keys(moex10sttranscriptcluster.db) %>%
+     summary
   Mode   FALSE    TRUE    NA's 
logical    1324   22008       0 
> (rownames(AEset.norm)) %in% keys(moex10stprobeset.db) %>%
+     summary
   Mode   FALSE    TRUE    NA's 
logical   16755    6577       0 


# For the GEOquery-obtained dataset
> (rownames(GEOset)) %in% keys(moex10sttranscriptcluster.db) %>%
+     summary
   Mode   FALSE    TRUE    NA's 
logical 1230789    5298       0 
> (rownames(GEOset)) %in% keys(moex10stprobeset.db) %>%
+     summary
   Mode   FALSE    TRUE    NA's 
logical      75 1236012       0 

For the ArrayExpress dataset, almost all of the probeset ids are present in the transcript-cluster database, whereas for the GEOquery derived dataset, most of the probeset ids are present in the probeset database.

The moex***.db databases contain info for the following fields:
columns(moex10sttranscriptcluster.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GO"           "GOALL"        "IPI"          "MGI"          "ONTOLOGY"    
[16] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROBEID"     

[21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"

So if we want to add gene-level annotations (ENSEMBL, ENTREZID and SYMBOL) to the probesets for the ArrayExpress dataset, we can first obtain them as follows (note that for these array annotationDBI objects, PROBEID is the primary key):

ae.annots <- AnnotationDbi::select(
  x       = moex10sttranscriptcluster.db,
  keys    = rownames(AEset.norm),
  columns = c("PROBEID", "ENSEMBL", "ENTREZID", "SYMBOL"),
  keytype = "PROBEID"
  )

We note that there is at least one row in the annotations for each probe-id in the array dataset, but that some probes have multiple gene-level annotations:
dim(ae.annots)
# [1] 27099     4

all(rownames(AEset.norm) %in% ae.annots$PROBEID)
# TRUE

dup.ids <- ae.annots$PROBEID[duplicated(ae.annots$PROBEID)] %>% 
  unique %>%
  sort

# An example probe that has multiple gene-level mappings
ae.annots[ae.annots$PROBEID == dup.ids[1], ]
#      PROBEID            ENSEMBL ENTREZID SYMBOL
# 6579 6747309 ENSMUSG00000033813    21399  Tcea1
# 6580 6747309 ENSMUSG00000025903    18777 Lypla1

We want to build a featureData object to add to the ArrayExpress-derived ExpressionSet. This must have one row for each and every probeset that is mentioned in the ExpressionSet; and the order of the probes must match between the ExpressionSet and the Annotation dataframe. To do this, we could pass a function to  AnnotationDbi::mapIds (similar to select) that tells it to collapse any multiply mapped values into a single value, but this only works on a single column at a time.

collapser <- function(x){
  x %>% unique %>% sort %>% paste(collapse = "|")
  }
# Example:
collapser(ae.annots[ae.annots$PROBEID == dup.ids[1], "SYMBOL"])

[1] "Lypla1|Tcea1"

# Doing this for each required 'column' is a bit long-winded
#ae.annots <- AnnotationDbi::mapIds(
#  x       = moex10sttranscriptcluster.db,
#  keys    = rownames(AEset.norm),
#  column  = "ENSEMBL",
#  keytype = "PROBEID",
#  multiVals = collapser
#  )

Alternatively, we can collapse multiply-mapping values using dplyr (which you should totally learn to use, see the transformations cheat sheet here).

# Redefinition of ae.annots
ae.annots <- AnnotationDbi::select(
  x       = moex10sttranscriptcluster.db,
  keys    = rownames(AEset.norm),
  columns = c("PROBEID", "ENSEMBL", "ENTREZID", "SYMBOL"),
  keytype = "PROBEID"
  ) %>%
  group_by(PROBEID) %>%
  summarise_each(funs(collapser)) %>%
  ungroup

> ae.annots %>% tail
Source: local data frame [6 x 4]

  PROBEID  ENSEMBL                         ENTREZID          SYMBOL
    (chr)  (chr)                            (chr)            (chr)
1 7023069  ENSMUSG00000056673              20592             Kdm5d
2 7023072  ENSMUSG00000069049              26908             Eif2s3y
3 7023079  ENSMUSG00000069045              26900             Ddx3y
4 7023099  ENSMUSG00000069036              21674             Sry
5 7023132  ENSMUSG00000023452|ENSMUSG00... 236604|320951|... Pisd|Pisd-ps1|...



# Magically, the order of the probe ids in ae.annots matches that in AEset.norm
all(ae.annots$PROBEID == rownames(AEset.norm))

So, all we need to do now, is add the annotation data to the featureData of AEset.norm. Note that the featureData requires an AnnotatedDataFrame as input.

fd <- new("AnnotatedDataFrame",
  data = data.frame(ae.annots[, -1], stringsAsFactors = FALSE)
  )
rownames(fd) <- ae.annots$PROBEID
featureData(AEset.norm) <- fd

Thus, we've imported & annotated an ArrayExpress-derived dataset using bioconductor.
If you need to add further annotations, you just have to add extra columns to the featureData of the relevant ExpressionSet

No comments:

Post a Comment