Friday 15 December 2017

upsetR plots

Venn diagrams blow. 
Multiset Venn diagrams both blow and suck:

  • Don't make them;
  • don't make me interpret them;
  • and don't try to put them in your presentations, because you'll get lost.



UpSetR provides a way to do the multi-set comparison thing without looking horrific.

We sample a few sets from the letters `b-z`:

library("UpSetR")

set.seed(1)

bucket_names <- paste0("set", 1:6)

buckets <- Map(
  # each bucket is a random sample of between 1 and 25 letters drawn from b-z
  function(x){
    bucket_size <- sample(1:25, 1)
    sample(letters[-1], bucket_size, replace = FALSE)
    },
  bucket_names
  )

lapply(buckets, sort)
$set1
[1] "f" "k" "n" "o" "t" "v" "x"

$set2
 [1] "c" "d" "f" "h" "i" "j" "k" "m" "n" "o" "q" "r" "s" "w" "y" "z"

$set3
 [1] "b" "e" "i" "k" "l" "m" "p" "v" "x" "y"

$set4
 [1] "b" "c" "d" "f" "g" "i" "k" "l" "n" "o" "p" "q" "s" "t" "u" "v" "w" "x" "y"
[20] "z"

$set5
 [1] "c" "f" "h" "j" "k" "n" "q" "r" "s" "t" "v" "w" "y"

$set6
 [1] "b" "c" "d" "e" "f" "g" "i" "j" "k" "l" "m" "n" "p" "q" "r" "s" "t" "u" "w"
[20] "x" "z"

The function `upset` takes a dataframe as input. But I've just defined my sets as a list of vectors. To convert these into a dataframe, use `UpSetR::fromList`:

fromList(buckets)
   set1 set2 set3 set4 set5 set6
1     1    1    1    1    1    1
2     1    1    0    1    0    0
3     1    0    1    1    1    0
4     1    1    0    1    1    1
5     1    0    0    1    1    1
6     1    0    1    1    0    1
7     1    1    0    1    1    1
8     0    1    0    1    1    1
9     0    1    1    1    1    0
10    0    1    0    1    1    1
11    0    1    0    0    1    1
12    0    1    0    1    1    1
13    0    1    0    0    1    1
14    0    1    0    0    1    0
15    0    1    1    0    0    1
16    0    1    0    1    0    1
17    0    1    1    1    0    1
18    0    1    0    1    0    1
19    0    1    0    1    1    1
20    0    0    1    1    0    1
21    0    0    1    1    0    1
22    0    0    1    0    0    1
23    0    0    1    1    0    1
24    0    0    0    1    0    1
25    0    0    0    1    0    1

The letter 'a' is absent from every one of these sets. You could add a `universe` entry or an `unobserved` entry to the set-list passed to `fromList` if that's what you're into.
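For example (a minimal sketch; the entry names are just the ones suggested above):

# a full universe (a-z), or just the letters that were never observed
buckets_plus <- c(buckets, list(universe = letters))
# buckets_plus <- c(buckets, list(unobserved = "a"))
upset(fromList(buckets_plus), nsets = length(buckets_plus))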

Otherwise, to generate an upset plot do the following:

upset(
  fromList(buckets)

  )

This gives the following plot:


We note a couple of things:

  • `set1` is absent from the image, although it contains 7 elements;
  • the elements of a given set are partitioned between the bars: eg, `set3` contains 10 elements, which are split across the seven columns that include `set3` (the numbers above those columns add up to 10), so each element of `set3` is counted in exactly one column
    • that means, although the intersection between `set3` and `set6` is of size 8, there is only one element in the "set3 and set6" intersection column (the remaining 7 are partitioned across the other columns that contain both `set3` and `set6`) - see the check below
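A quick check of those numbers, using the sets as listed above:

# total overlap between set3 and set6
length(intersect(buckets$set3, buckets$set6))
# [1] 8

# elements in both set3 and set6 but in none of the other sets
# (this is what the "set3 and set6" intersection column counts)
setdiff(
  intersect(buckets$set3, buckets$set6),
  unlist(buckets[c("set1", "set2", "set4", "set5")])
)
# [1] "e"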

To ensure that all the sets are displayed in the figure, modify the `nsets` argument:
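That is, mirroring the calls below:

upset(
  fromList(buckets), nsets = length(buckets)
  )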



You can order the bars by size (`order.by = "freq"`) or by the number of sets that contributed to the intersection (`order.by = "degree"`) or by both:

# order by freq
upset(
  fromList(buckets), nsets = length(buckets), order.by = "freq"

)




# order by degree and then by freq (not shown)
upset(
  fromList(buckets), order.by = c("freq", "degree"),
  nsets = length(buckets)
  )
# order by freq and then by degree (not shown)
upset(
  fromList(buckets), order.by = c("degree", "freq"),
  nsets = length(buckets)

  )

I found the latter a bit weird, in that `order.by = c("degree", "freq")` sorts by increasing frequency, whereas `order.by = "freq"` sorts by decreasing frequency.

That'll do

Wednesday 22 November 2017

snakemake: using input/output values in params

It says in the changelog for snakemake that from snakemake v3.10.2 you can use entries in the output directive when computing params. Similarly you can use input values. How and why would you do this?

I'm currently using snakemake 4.0.0.

The readthedocs gives this example for the syntax:

rule:
    input:
        ...
    params:
        prefix=lambda wildcards, output: output[0][:-4]
    output:
        "somedir/{sample}.csv"
    shell:
        "somecommand -o {params.prefix}"

Let's try to do that in a less brusque manner.

I'm modifying a snakemake workflow for RNA-Seq alignment and analysis at the moment. When I first used this workflow, each of my sequencing samples had been sequenced on each of six lanes of an NGS machine. I align the data for each sample/lane separately and then merge the data together using picard. This allows me to indicate the lane of origin for any given read-pair within the aligned dataset.

To merge the data I use this snakemake rule:

rule merge_bams:
    message: """
             --- Merges different lanes into a single bam
             """
    input: get_rg_bams 
    output: bam = "{prefix}.merged.bam"
    params: input_formatted = format_bampaths_for_merging
    shell: """
        picard MergeSamFiles \
            {params.input_formatted} \
            O={output} \
            SORT_ORDER=coordinate \
            ASSUME_SORTED=true
           """

Here, the function `get_rg_bams` determines the names of all lane-specific files based on the value of `wildcards["prefix"]`. The function `format_bampaths_for_merging` converts the lane-specific filenames (it doesn't assume a fixed number of input files) into a single string of the form "I=input1.bam I=input2.bam ...".
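Neither helper is shown here; a minimal sketch of what they might look like (the per-lane bam naming is an assumption for illustration):

# purely illustrative lane layout and file naming
LANES = ["L001", "L002", "L003", "L004", "L005", "L006"]

def get_rg_bams(wildcards):
    # one read-group-tagged bam per lane for this sample prefix
    return ["{}.{}.bam".format(wildcards["prefix"], lane) for lane in LANES]

def format_bampaths_for_merging(wildcards):
    # "I=a.bam I=b.bam ...", however many lane-level bams there are
    return " ".join("I={}".format(bam) for bam in get_rg_bams(wildcards))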

We've resequenced a subset of these samples, but this time we sequenced each sample on a single lane.

For the new dataset, there's no real reason to call MergeSamFiles, since only a single input file is available for any given sequencing sample (there's seq-data from only a single lane) - we might as well just rename the input file. To do this, I could check whether a single file is passed as the input and rename it, and otherwise run MergeSamFiles.

I don't like to write too much shell code in these rules. So I'd rather determine how many files are input and pass that in as a parameter.

Toy code. I wrote a toy snakefile, and added three text files (file1, file2, file3) in its directory. It makes two files, "a.txt" and "b.txt", by i) concatenating file1 and file2 into a.txt; ii) (trivially) concatenating file3 into b.txt.

# test1.smk
config_dict = {"a": ["file1", "file2"], "b": ["file3"]}

def get_input(wildcards):
    return config_dict[wildcards["prefix"]]

rule all:
    input: ["a.txt", "b.txt"]
            
rule catter:                                   
    message: """ --- concatenate all input --- """
    input:  get_input
    output: "{prefix}.txt"
    shell:  """cat {input} > {output}"""                                                               

To reflect the issue in my workflow, I only want to `cat` when more than one file is provided as input to `catter` (ie, when making a.txt) - when a single input file is provided I want to copy that file into the output.

# test2.smk
config_dict = {"a": ["file1", "file2"], "b": ["file3"]}

def get_input(wildcards):
    return config_dict[wildcards["prefix"]]

rule all:
    input: ["a.txt", "b.txt"]
            
rule catter:
    message: """ --- concatenate all input --- """
    input:  get_input
    output: "{prefix}.txt"
    shell:  """
        input_ary=({input})
        if [[ ${{#input_ary[@]}} -eq 1 ]];
        then
            cp {input} {output}
        else
            cat {input} > {output}
        fi
        """

That works fine. The only issue (for me) is the definition of input_ary and the determination of its length within the shell command. I'd rather precompute the number of input files in Python. The syntax for this should be of the form
  params: param = lambda wildcards, output, input: do_something

Just as stated in the docs (for the use of output in defining params), you can do this as follows:

# test3.smk
# ... see the above for the set up ...
rule catter:
    message: """ --- concatenate all input --- """
    input:  get_input
    output: "{prefix}.txt"
    params: n_input = lambda wildcards, input: len(input)
    shell:  """
        if [[ {params.n_input} -eq 1 ]];
        then
            cp {input} {output}
        else
            cat {input} > {output}
        fi

        """

To my eyes, this version looks a lot better than the one where input_ary was defined in bash, since more of the code is run within Python/snakemake (and the bash equivalents for list/array manipulations are a bit opaque).

In the definition of n_input, I could have used any of the following (params can be built from any of input / output / threads / resources, and the order in which these appear in the lambda's arguments doesn't matter, so long as they come after `wildcards`):

params: n_input = lambda wildcards, input: len(input)
params: n_input = lambda wildcards, input, output: len(input)
params: n_input = lambda wildcards, output, input: len(input)
params: n_input = lambda wildcards, threads, resources, output, input: len(input)

# and if you've something more involved you could call it like this:
def complex_function(input): return do_something(input)
... 
params: some_param = lambda wildcards, input: complex_function(input)

But, I couldn't use the following:
params: n_input = lambda input: len(input)

... since (as stated here) the first argument to functions called within snakemake rules should always be `wildcards`.


Thursday 29 June 2017

Helpful bash posts

2018-03-17

To show a one-line summary of the recent commits (--oneline), the branch and tag names that point at them (--decorate=auto) and an ASCII graph of the branch structure (--graph):

git log --decorate=auto --oneline --graph

Note that --decorate=auto is the default option from git v2.13. For earlier versions (as on my current WSL) you can add the equivalent to your git config:

git config log.decorate auto

2018-03-02
Don't extract the whole of that archive!

If you just want to list the contents:

tar --list -jf some_archive.tar.bz2
tar -jtf some_archive.tar.bz2

If you just want to extract a specific file from the archive (this stores my_filename relative to the current directory, making any subdirs that are required):
tar -xjf my_archive.tar.bz2 my_filename

If you want to extract the contents of a specific file, eg, to pipe the data into another command:
tar --to-stdout -xjf my_archive.tar.bz2 my_filename

Thanks to you at nixCraft for the listing code

2017-07-06
The command ":" is an odd one. Why have a NULL command? grep returns with a non-zero error code if it finds no matching values in the input. So in a pipeline like the following:

VAR1=$( do_something file.1 | command2 | grep "some_value" - )

The pipeline would return a non-zero exit code if some_value wasn't found in the output of command2. If you're running with set -o pipefail or set -e enabled, then that final grep will kill your script, even if you're happy for VAR1 to end up empty (eg, because you want to do further tests on VAR1). To prevent grep from killing your script you can use

 VAR1=$( do_something file.1 | command2 | grep "some_value" - ) || true
or
VAR1=$( do_something file.1 | command2 | grep "some_value" - ) || :

and the pipeline doesn't return a non-zero exit code. I think it's more common to use true than ':' here though.

2017-07-04
Needed to make a backup of an external drive, so that I could repartition the drive. To check that everything's copied over correctly, the md5sums of all the copied files can be compared. This is done recursively, so md5sums are computed for files in each subdirectory as well. From here:

find . -type f -exec md5sum {} \; > .checks.md5
grep -v "\.\/\.checks\.md5" .checks.md5 > checks.md5
rm .checks.md5

Then to check all the file-copying has worked properly:

md5sum -c checks.md5

2017-07-03
Anyone else working with some big big files? To print out the amount of storage used by a given directory, use du:

du -h <directory>

This shows how much storage each subdir is using  (-h: in human readable form). To sum this all up and thus compute total disk-usage in and below <directory>:

du -sh <directory>

2017-07-02
sed and awk are fine, but perl provides me a bit more control over regexes. I use this if I need to pass an absolute path to a program that doesn't understand "~" or I want to compare the full paths of two files.

expand_tilde()
{
  echo "$1" | perl -lne 'BEGIN{$H=$ENV{"HOME"}}; s/~/$H/; print'
}

expand_tilde ~/temp/abc
# /home/me/temp/abc

2017-07-01
Find out what the different config files (/etc/profile, ~/.bashrc, ~/.bash_profile) do and when they are called. See here.

2017-06-30
I always used to use backticks for command substitution and only recently learned about using parentheses for this:

$(command) == `command`

Importantly, you can readily nest commands with the former:

$(command $(other_command))

See here for further info (especially about the interplay between quoting and substitution).

2017-06-29
Useful bash tips at shell-fu; for example, there's no need to open vim interactively to remove trailing whitespace:

vim -c "1,$ s/ \+$//g | wq" <filename>

2017-06-28
A variety of helpful bash things that I found on the web.

cp file{1,2}.txt == cp file1.txt file2.txt

See here

Wednesday 21 June 2017

Working directory in R markdown

Discussed elsewhere, I organise my bioinformatics projects like this:

./jobs/
    - <jobname>/
        - conf/
        - data/
        - doc/
            - notebook.Rmd
            - throwaway_script.ipynb
        - lib/
        - scripts/
        - Snakefile

Where the top level snakemake script controls the running of all scripts and the compiling of all documents. My labbooks are stored as R-markdown documents and get compiled to pdfs by the packages rmarkdown and knitr. Despite RStudio's appeal and my spending nigh on all of my time writing R packages, scripts and notebooks, I'm still working in vim.

When working on the project <jobname>, my working directory is ./jobs/<jobname> and, in the simple case when a given project has no subprojects, this working directory shouldn't be changed by any of the scripts or documents. knitr's a bit of a bugger though.

To compile a single *.Rmd notebook, I have the following snakemake recipe. It converts doc/some_file.Rmd into doc/some_file.pdf or doc/some_file.docx depending on the required output filename.

rule compile_markdown:
    input: script = "doc/{doc_name}.Rmd"
    output: "doc/{doc_name}.{ext,pdf|docx}"
    params: wd = working_dir
    run:
        R("""
            library(knitr)
            library(rmarkdown)
            library(tools)
            opts_knit$set(root.dir = "{params.wd}")
            doctype <- list(pdf  = "pdf_document",
                            docx = "word_document"
                            )[["{wildcards.ext}"]]
            rmd.script <- "{input.script}"
            render(rmd.script,
                   output_format = doctype,
                   output_file   = "{output}",
                   output_dir    = "doc",
                   quiet = TRUE)
        """)


Why is that opts_knit$set(root.dir = ...) line in there?

Assume I'm sitting in the working directory "~/HERE" on the command line.

Let's write a simple R markdown script (doc/temp.Rmd) that just prints out the current working directory:

---
title: "temp.Rmd"
output:
    - pdf_document
---
```{r}
print(getwd())
```

... and then render that into a pdf:
~/HERE> Rscript -e "library(rmarkdown); render('doc/temp.Rmd')"

This prints out the working directory as "~/HERE/doc" (where the script is stored) rather than the directory "~/HERE", where I called Rscript from.

Note that if I put a similar .R script in ./doc that prints out the current working directory, this doesn't happen:

# temp.R
print(getwd())
# end of temp.R
~/HERE> Rscript ./doc/temp.R
# [1] "/home/russh/HERE"

This indicates that the R working directory is ~/HERE, the same as the calling directory (as does Rscript -e "source('doc/temp.R')").

There are a few reasons why I don't like the working directory being changed within a script. When you want to read some data from a file into a script that you're running, you can either i) write the filepath for your data within the script, or ii) provide the script with the filepath as an argument.

Let's suppose you had two different scripts that access the same file, but where one of those scripts changes the working directory. Then:

In case i:  you'd have to write the filepath differently for the two scripts. Here, if I wanted to access ./data/my_counts.tsv from within ./doc/temp.R and ./doc/temp.Rmd, I'd have to write the filepath as "./data/my_counts.tsv" within the former, and "../data/my_counts.tsv" within the latter.

In case ii: you'd have to similarly mangle the filepaths. A script that changes the working directory should be provided filepaths relative to the working directory chosen by that script, so you have to think as if you're in that directory; or use absolute paths (NO!!!!!!).

I know it seems trivial, and described as above it only seems like a mild inefficiency to have to write different scripts in slightly different ways. And I know it's all just personal preference: and so in that light, please don't change the working directory.

Others are somewhat more forceful - see the here_here package (though they're discussing a very different issue) for the delightful statement: "If the first line of your #rstats script is setwd("C:\Users\jenny\path\that\only\I\have"), I will come into your lab and SET YOUR COMPUTER ON FIRE." I hope they don't mind, because I'm about to change the working directory back to what it should have been all along... (IMO).

Can we use setwd() to change directory in R-markdown? Write another script:

---
title: "temp2.Rmd"
output:
    - pdf_document
---
```{r}
# we already know this is ./doc
print(getwd())
setwd("..")
print(getwd())
```
```{r}
# surely setting the working directory has done its job
print(getwd())
```
~/HERE> Rscript -e "library(rmarkdown); render('doc/temp2.Rmd')"

The pdf that results from rendering this from the command line prints the equivalent of:
# Before using setwd()
~/HERE/doc
# After using setwd()
~/HERE
# In the second R block:
~/HERE/doc

So setwd() can change the working directory from ./doc back to `my` working directory, but this change doesn't persist between the different blocks of code in an R-markdown document. (We also get the warning message:
In in_dir(input_dir(), evaluate(code, envir = env, new_device = FALSE,  :  You changed the working directory to <...HERE> (probably via setwd()). It will be restored to <...HERE>/doc. See the Note section in ?knitr::knit)

So that's not how to change dirs within R-markdown.

How to do that is shown on Yihui Xie's website for knitr.

So if we write the script:
---
title: "temp3.Rmd"
output:
    - pdf_document
---
```{r}
library(knitr)
print(getwd())
opts_knit$set(root.dir = "..")
print(getwd())
```
```{r}
print(getwd())
```
~/HERE> Rscript -e "library(rmarkdown); render('doc/temp3.Rmd')"

This time it runs and prints the following filepaths:
# Before using opts_knit$set:
~/HERE/doc
# After using opts_knit$set, but within the same block:
~/HERE/doc
# In the second R block:
~/HERE

This has set the working directory to the required path, and it's done it without us hard-coding the required working directory in the script (so at least I won't get my computer set on fire). But the change in the wd only kicks in for blocks that are subsequent to the opts_knit call.

So, we can set the knitr root directory inside an R-markdown script, but we can also set it from outside the R-markdown script:
# - knitr has to be loaded before opts_knit can be used.
# - dollar-signs have to be escaped in bash.
Rscript -e "library(rmarkdown); library(knitr); opts_knit\$set(root.dir = getwd()); render('doc/temp3.Rmd')"

# OR

Rscript -e "library(rmarkdown); render('doc/temp3.Rmd', knit_root_dir = getwd())"

Now, all the filepaths printed within the pdf are the same (they are all ~/HERE): the call to change the root.dir within the .Rmd file has no effect, and all code within the .Rmd file runs as if it's in my working directory.

Now we'll try something a little bit more complicated. You can include child documents within a given *.Rmd file:
---
title: "temp_master.Rmd"
output:
    - pdf_document
---
Calling child document (path relative to the calling process and knit_root_dir):
```{r, child = "doc/temp.Rmd"}
```
In master document:
```{r}
print(getwd())
```
~/HERE> Rscript -e "library(rmarkdown); render('doc/temp_master.Rmd', knit_root_dir = getwd())"

This fails. knitr can't find the file ./doc/temp.Rmd (although it definitely exists relative to the dir that I started R in, which is the same directory that I set knit_root_dir to).

So we rewrite as doc/temp_master2.Rmd:
---
title: "temp_master2.Rmd"
output:
    - pdf_document
---
Calling child document (path relative to the parent document):
```{r, child = "temp.Rmd"}
```
In master document:
```{r}
print(getwd())
```
~/HERE> Rscript -e "library(rmarkdown); render('doc/temp_master2.Rmd', knit_root_dir = getwd())"

This works fine. In both the parent and the child document, the value of getwd() is ~/HERE, so knit_root_dir seems to be passed through from parent to child. But, the child document has to be specified relative to the parent document.

So what have I learned:

  • Never setwd() in an R-markdown file - it just gets changed back to the original working directory
  • If you want your R-markdown files to compile as if ran from a specific directory, set the knitr root-directory using opts_knit$set(root.dir = "some.dir") or using the knit_root_dir argument to rmarkdown::render
  • You can set the knitr root directory from either inside an R-markdown file (put it in the first block), or in an R script that calls rmarkdown::render
    • but only set it once, 
    • and it's probably best to set it from outside the R-markdown file (if you might subsequently want to include that file as a child of another file)
  • If you do include child documents, be aware that the filepath you put into your parent document should be written relative to the parent document, not relative to the knit_root_dir

Aside from that last point, write your paths relative to the root directory for your project...

---

Friday 9 June 2017

R formatting conventions

[I changed my conventions; disregard this, future Russ]

When it comes to naming things, we've all got our own preferences. Here, boringly, are my preferences for naming things in R (partly based on Google and Wickham; but I also found this interesting). You can call it a style guide, if you like, but I've very little by way of style. To me it's just a bunch of conventions that I try to stick to.

I use a different naming format for functions (underscore-separated) than for non-functions (dot-separated). I really don't care that dot-separation would confuse Java or Python, because these are my conventions for R.

An exception to the function_name underscore-convention is when I'm making functions that are of a particular type: I put the type as a dot-separated prefix. Eg, if I'm writing a function that returns a function, I call it 'builder.more_specific_name'; if it's a function that reads or writes files, I call it 'io.what_does_it_do'; etc.
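To make that concrete (these names are invented for illustration):

# non-functions: dot-separated
gene.counts <- c(tp53 = 12, brca2 = 4)

# functions: underscore-separated
normalise_counts <- function(counts) counts / sum(counts)

# typed functions: dot-separated type prefix, then an underscore-separated name
io.read_counts <- function(path) read.delim(path)
builder.scaler <- function(k) function(x) k * x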

I try to push volatile side-effecting or state-dependent shizz (IO, changes to options, printing, plotting, setting/using RNG streams; but not stop/message/warning and the like) out to the sides. All variables used by a function are passed in as arguments or defined therein, so that modulo stop (etc), most functions are pure.

Use at most 80 characters per line

Wherever possible, use <- instead of =.
woop <- TRUE   # good
woops = FALSE  # bad

Never use <<- 
... and never recommend a course that teaches how to use <<- to anyone
... and never, ever, mention <<- on your blog.
Similarly: attach, ->

Try not to overwrite the names of existing objects (you can check using exists("possible.name")):

j5.df <- data.frame(abc = 123)
df <- data.frame(abc = 1:3)   # see stats::df
keep <- which(j5.df$abc == 1) # keep is not currently defined
drop <- which(df$abc == 2)    # but drop is: see base::drop

Use 2-spaces to indent:

my_func <- function(arg1){
  # some code
  }

If there's more than one argument to a function you are defining, put each arg on a separate line and offset these by an extra 2 spaces:

my_fancy_function <- function(
    arg1,
    another.arg = NULL
  ){
  if(missing(arg1)){
    stop("ARGGGGG!")
    }
  body("goes", "here", arg1, another.arg)
  }

Use the magrittr pipe to connect several functions together (and write the functions on separate lines):
my.results <- abc.def %>%
  function1(some.other.arg) %>%
  some_other_fn

All the rest, rather briefly:

job.specific.package.name

ClassName ## and also ClassName == ConstructorName

object.name

function_name

<function_type>.function_name

<function_type>.function_name.<name_of_pipeline>

.non_exported_function

pipeline.name_of_pipeline

Re the last thing, my pipelines aren't just strings of functions, they're strings of functions that log intermediate results (summary tables, summary statistics, plots etc; this is all done functionally, so you can't log base-R graphics or any other side-effect dependent stuff) and return both a final-processed dataset and the specified intermediate results. I'm sure I'll write about that some other time (when it's less hacked together).

Neat, related tools: formatR (thanks to Lovelace), lintr, ...

I noted that formatR is able to identify and fix some examples of poor coding style in R; however, it didn't seem particularly pretty to me:
- There was no way to specify that line widths should be at most 80 characters (its line-width argument only splits lines once they exceed a given width);
- It wrapped function calls / function definitions into as dense a space as possible.

lintr is also able to identify examples of poor coding style in R. Some of my choices don't fit its default checkers though - notably dot-separation in variable/function names - but I still like to have two forms of punctuation available. I'm going to write up a specific lintr script to call from my git pre-commit hook (see the sketch below).

Both are available via conda.
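As for that pre-commit script: a minimal sketch of the sort of thing I have in mind (not the finished hook; choosing and configuring the linters - eg via a .lintr file - is omitted):

# pre_commit_lint.R
# usage: Rscript pre_commit_lint.R file1.R file2.R ...
# exits non-zero if lintr finds anything, so git aborts the commit
status <- 0
for (f in commandArgs(trailingOnly = TRUE)) {
  lints <- lintr::lint(f)
  if (length(lints) > 0) {
    print(lints)
    status <- 1
  }
}
quit(status = status)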

Monday 22 May 2017

BFX clinic: Getting up to date microarray probe annotations

Downloading microarray datasets from NCBI/EBI databases


ArrayExpress and the Gene-Expression Omnibus have made thousands of microarray experiments available. These databases are most easily accessed in R using bioconductor::ArrayExpress and bioconductor::GEOquery, respectively. The ArrayExpress package is only really suited to downloading Affymetrix datasets. Datasets are simply downloaded using:

# From array-express
library(ArrayExpress)
library(oligo)
AEset <- ArrayExpress("E-GEOD-65552")
AEset.norm <-  oligo::rma(AEset) # prepares an ExpressionSet

# From GEO:
library(GEOquery)
GEOset <- getGEO("GSE65552") # returns a list of 'ExpressionSet's
GEOset <- GEOset[[1]]        # pulls out the first ExpressionSet

I've deliberately attempted to download the same dataset. You can specify what directory the datasets should be downloaded to, and various other tweaks - see the manpages for the packages.

The getGEO function downloads a pre-normalised version of the dataset, whereas the ArrayExpress function downloads the raw .CEL files, which we've then normalised using RMA. They're nominally the same dataset but they're actually quite different:

> dim(AEset.norm)
Features  Samples 
   23332        4 
> dim(GEOset)
Features  Samples 
 1236087        4

There are different numbers of features because the arrays have been summarised at different levels (that's a story for another day), but this indicates that, if you're downloading stuff from GEO or ArrayExpress, it's sometimes better to start from the raw data and know how you've processed the arrays yourself (raw data is provided by both databases for most datasets).

Annotation of the probes / probesets

Another slight difference can be seen in the featureData slots of the ExpressionSets - viz, the ArrayExpress-derived datasets don't have annotation data for the features, whereas the GEO datasets do:

> featureData(AEset.norm)
An object of class 'AnnotatedDataFrame': none

> featureData(GEOset)
An object of class 'AnnotatedDataFrame'
  featureNames: 4304920 4304921 ... 5617593 (1236087
    total)
  varLabels: ID GB_LIST ... probeset_type (16 total)
  varMetadata: Column Description labelDescription

You could obtain annotations for the datasets from a range of sources:
- Using getGEO("GPLxxxx") for whatever the GPL id is for the array that you're using (this is done automatically if you pull your dataset from GEO, hence GEOset already having some annotations present); however, this may not provide the annotation columns that you want, and the mappings may not be up to date (the AnnotGPL entries are pretty good though)
- Using the bioconductor annotation packages
- Using the biomaRt package (this is a bit long-winded if you have an annotation package available though)

Annotation packages:

The array used in the above study was "A-AFFY-98 - Affymetrix GeneChip Mouse Exon 1.0 ST Array [MoEx-1_0-st-v1]" according to ArrayExpress. This is a pretty well established affymetrix array. If we search in the bioconductor annotation_data page for affymetrix mouse datasets, we find packages "moex10stprobeset.db", and "moex10sttranscriptcluster.db". We install both of these:

source("https://bioconductor.org/biocLite.R")
biocLite("moex10sttranscriptcluster.db")
biocLite("moex10stprobeset.db")

These are bioconductor AnnotationDbi packages and, as such, they respect the AnnotationDbi API - see here. That means we can obtain the accession numbers in the two databases as follows:

# keys(moex10sttranscriptcluster.db)
# keys(moex10stprobeset.db)

Similarly, we can get the accession numbers for each probeset in a given ExpressionSet from its rownames.
Let's check the overlap between the probeset ids in the microarray datasets and these annotation databases:

library(magrittr)
library(moex10stprobeset.db)
library(moex10sttranscriptcluster.db)

# For the ArrayExpress obtained dataset
> (rownames(AEset.norm)) %in% keys(moex10sttranscriptcluster.db) %>%
+     summary
   Mode   FALSE    TRUE    NA's 
logical    1324   22008       0 
> (rownames(AEset.norm)) %in% keys(moex10stprobeset.db) %>%
+     summary
   Mode   FALSE    TRUE    NA's 
logical   16755    6577       0 


# For the GEOquery-obtained dataset
> (rownames(GEOset)) %in% keys(moex10sttranscriptcluster.db) %>%
+     summary
   Mode   FALSE    TRUE    NA's 
logical 1230789    5298       0 
> (rownames(GEOset)) %in% keys(moex10stprobeset.db) %>%
+     summary
   Mode   FALSE    TRUE    NA's 
logical      75 1236012       0 

For the ArrayExpress dataset, almost all of the probeset ids are present in the transcript-cluster database, whereas for the GEOquery derived dataset, most of the probeset ids are present in the probeset database.

The moex***.db databases contain info for the following fields:
columns(moex10sttranscriptcluster.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GO"           "GOALL"        "IPI"          "MGI"          "ONTOLOGY"    
[16] "ONTOLOGYALL"  "PATH"         "PFAM"         "PMID"         "PROBEID"     

[21] "PROSITE"      "REFSEQ"       "SYMBOL"       "UNIGENE"      "UNIPROT"

So if we want to add gene-level annotations (ENSEMBL, ENTREZID and SYMBOL) to the probesets for the ArrayExpress dataset, we can first obtain them as follows (note that for these array annotationDBI objects, PROBEID is the primary key):

ae.annots <- AnnotationDbi::select(
  x       = moex10sttranscriptcluster.db,
  keys    = rownames(AEset.norm),
  columns = c("PROBEID", "ENSEMBL", "ENTREZID", "SYMBOL"),
  keytype = "PROBEID"
  )

We note that there is at least one row in the annotations for each probe-id in the array dataset, but that some probes have multiple gene-level annotations:
dim(ae.annots)
# [1] 27099     4

all(rownames(AEset.norm) %in% ae.annots$PROBEID)
# TRUE

dup.ids <- ae.annots$PROBEID[duplicated(ae.annots$PROBEID)] %>% 
  unique %>%
  sort

# An example probe that has multiple gene-level mappings
ae.annots[ae.annots$PROBEID == dup.ids[1], ]
#      PROBEID            ENSEMBL ENTREZID SYMBOL
# 6579 6747309 ENSMUSG00000033813    21399  Tcea1
# 6580 6747309 ENSMUSG00000025903    18777 Lypla1

We want to build a featureData object to add to the ArrayExpress-derived ExpressionSet. This must have one row for each and every probeset that is mentioned in the ExpressionSet; and the order of the probes must match between the ExpressionSet and the Annotation dataframe. To do this, we could pass a function to  AnnotationDbi::mapIds (similar to select) that tells it to collapse any multiply mapped values into a single value, but this only works on a single column at a time.

collapser <- function(x){
  x %>% unique %>% sort %>% paste(collapse = "|")
  }
# Example:
collapser(ae.annots[ae.annots$PROBEID == dup.ids[1], "SYMBOL"])

[1] "Lypla1|Tcea1"

# Doing this for each required 'column' is a bit long-winded
#ae.annots <- AnnotationDbi::mapIds(
#  x       = moex10sttranscriptcluster.db,
#  keys    = rownames(AEset.norm),
#  column  = "ENSEMBL",
#  keytype = "PROBEID",
#  multiVals = collapser
#  )

Alternatively, we can collapse multiply-mapping values using dplyr (which you should totally learn to use; see the data transformation cheat sheet here).

# Redefinition of ae.annots (group_by / summarise_each come from dplyr)
library(dplyr)
ae.annots <- AnnotationDbi::select(
  x       = moex10sttranscriptcluster.db,
  keys    = rownames(AEset.norm),
  columns = c("PROBEID", "ENSEMBL", "ENTREZID", "SYMBOL"),
  keytype = "PROBEID"
  ) %>%
  group_by(PROBEID) %>%
  summarise_each(funs(collapser)) %>%
  ungroup

> ae.annots %>% tail
Source: local data frame [6 x 4]

  PROBEID  ENSEMBL                         ENTREZID          SYMBOL
    (chr)  (chr)                            (chr)            (chr)
1 7023069  ENSMUSG00000056673              20592             Kdm5d
2 7023072  ENSMUSG00000069049              26908             Eif2s3y
3 7023079  ENSMUSG00000069045              26900             Ddx3y
4 7023099  ENSMUSG00000069036              21674             Sry
5 7023132  ENSMUSG00000023452|ENSMUSG00... 236604|320951|... Pisd|Pisd-ps1|...



# Magically, the order of the probe ids in ae.annots matches that in AEset.norm
all(ae.annots$PROBEID == rownames(AEset.norm))

So all we need to do now is add the annotation data to the featureData of AEset.norm. Note that featureData requires an AnnotatedDataFrame as input.

fd <- new("AnnotatedDataFrame",
  data = data.frame(ae.annots[, -1], stringsAsFactors = FALSE)
  )
rownames(fd) <- ae.annots$PROBEID
featureData(AEset.norm) <- fd

Thus, we've imported & annotated an ArrayExpress-derived dataset using bioconductor.
If you need to add further annotations, you just have to add extra columns to the featureData of the relevant ExpressionSet.
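For example, extra columns can be added through Biobase's fData accessor (the column name here is just for illustration):

library(Biobase)

# flag the probesets that mapped to more than one gene symbol
fData(AEset.norm)$MULTI_GENE <- grepl("|", fData(AEset.norm)$SYMBOL, fixed = TRUE)
head(fData(AEset.norm))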