Tuesday 20 February 2018

a few of my favouRite things

2018-05-04

Matrix::Matrix

See here and here

Matrix provides sparse-matrix objects to R. So if you're making matrices that are mostly zero use `Matrix` not `matrix`.

I recently used Matrix while trying to work out the overlap sizes between a few hundred different sets of genes. The genesets were represented as a list of vectors of gene-ids; each vector being a single geneset. My initial code `map`ped over the collection of genesets twice, to pull out each pair of genesets, and then compared the intersection sizes of the two genesets.

THIS TOOK FOREVER!

So, instead for the G genes and the S genesets, I made a sparse G x S binary matrix, where the i,j'th entry indicated whether gene i was present in geneset j. In graph theory language, this is a biadjacency matrix over a bipartite graph, wherein the edges represent the presence of a gene in a geneset and there is a node for each gene and each geneset.

Let M be that matrix. We can construct it as follows:

make_biadjacency_from_list <- function(
    genesets,
    universe = NULL
  ) {

  if (is.null(universe)) {
    universe <- genesets %>%
      purrr::reduce(union) %>%
      sort()

    }

  incidence <- Matrix::Matrix(
    0,
    nrow = length(universe),
    ncol = length(genesets),
    sparse = TRUE
  ) %>%
    magrittr::set_rownames(universe)

  for (j in seq_along(genesets)) {
    genes <- genesets[[j]]
    rows <- which(universe %in% genes)
    incidence[rows, j] <- 1
  }
  incidence
}

M <- make_biadjacency_from_list(my_gene_sets, my_gene_universe)

That runs in seconds. Then the geneset overlap sizes can be pulled out from t(M) %*% M since the i,j entry of this matrix is the number of genes present in both geneset i and geneset j.

get_overlap_counts <- function(
    biadj
  ) {
  # determine the overlap counts by taking the inner product
  # - note that codegree is a Matrix provided `biadj` is one
  codegree <- t(biadj) %*% biadj

  # the diagonal elements define the number of genes in each
  # geneset
  set_sizes <- diag(codegree)
    
  # identify indices for pairs of genesets with non-zero overlap,
  # refer to them as set1 and set2
  overlapping_sets <- 
    which(as.matrix(codegree) != 0, arr.ind = TRUE) %>%
    as_data_frame() %>%
    set_colnames(c("set1", "set2")) %>%
    filter(set1 != set2)

  # add the geneset overlaps and the sizes of both set1 and set2
  # - note the pattern for vectorised extraction from a Matrix
  # - codegree[set1, set2] would return a subMatrix not a vector
  overlapping_sets %>%
    mutate(
      set1_size = set_sizes[set1],
      set2_size = set_sizes[set2],
      set_overlap = codegree[cbind(set1, set2)]
    ) %>%
    mutate_all(as.integer)
}

Then I ran my fisher tests using the geneset sizes and overlap sizes returned by get_overlap_counts(M).

Thank you, sparseness.


2018-05-02

tidyr::nest

See here and here

In dplyr, I often want to group-by some columns, apply a function to the subtables defined by grouping, and then dissolve away the group-by.
The function applied to the subtables may return

- a single row for each group: in which case you'd use dplyr::summarise() or summarize()
- a row for each row in the subtable: in which case there may be an appropriate window function
- any number of rows for each subtable

In the latter case, you want to use tidyr::nest()

Suppose I've got a data-frame with gene-expression data for three different tissues. The columns are "tissue", "gene_id" and "expression". For each tissue, I want to find the top-10 most highly expressed genes (according to the value of expression).

df <- tibble(
  tissue = rep(LETTERS[1:3], each = 26),
  gene_id = rep(letters, 3),
  expression = rnorm(78)
)

The following function would extract the top-10s for a given tissue if there was only one tissue in the dataframe

get_top10 <- function(.df){
  arrange(.df, desc(expression)) %>%
    head(10)
}

I could `split` the data-frame into a list based on the "tissue" values, order the subtables by expression, take the head of those subtables and join them back up. Something like

df %>% 
  split(f = df$tissue) %>% 
  map_df(get_top10)

Which works perfectly fine.

Or, I could use nest

df %>%
  group_by(tissue) %>%
  nest() %>%
  mutate(data = map(data, get_top10)) %>%
  unnest()

The value of nest() over split() is that you can group_by multiple different columns at the same time and you don't have to do that ugly df$my_column stuff.


2018-03-29

knit_expand

- If anyone reading this is aware of a way to use knit_expand to include a dynamic number of figures in the same way as the following (for tables), please comment -

Rmarkdown is great. It's lightweight and it's versatile; compared with .lyx and jupyter-notebooks it doesn't surround your code with tag-crap and it's consequently easier on git (and you don't get an equivalent of `pushed-a-compiled-ipynb-to-bitbucket` hell).

But sometimes you need to know little tricks.

You can only print a single result from a given R code-block. And sometimes you want to print out the same `kind` of result multiple times.  For example, you might have ran 8 analyses and want to print out a table for each analysis.

Pffft, just write my_tables[[1]], my_tables[[2]], .... my_tables[[8]] in 8 different blocks - DONE!.... ?

That's some ugly stuff going on right there. What if you add another analysis. What if you want to subfilter each table. What if you want to do XXX that means you'll have to touch each of those 8 lines of code or add an extra identical line of code?

Use `knit_expand` inside a map and then knit the result - this formats your code ready for printing . Each call to knit_expand will convert your code to a knittable form.

```{r}
out <- purrr::map_chr(
  my_tables,
  function(tt){
    TT <- modify_my_table(tt)
    knit_expand(text = "{{TT}}")
  }
)
```

`out` now contains a few different tables. It's contents will be printed when the current .Rmd document is compiled if you put the following where you want the tables outputting:

`r paste(knit(text = out), collapse = "\n")`

So you don't have to write a separate block for each table.


2018-03-26

map an extraction

I can't believe I only just found this out, considering I've been using `purrr` for a while. I often have to map over a list of identically structured elements, and extract a subcomponent of those elements.

Suppose you want to extract the 'y' element of each element of the following list into a separate list

# input:
my_list <- list(
  a = list(x = 123, y = 234),
  b = list(x = 567, y = 789)
)
# intended output:
# list(a = 234, b = 789)

Here's how I'd have done it:
purrr::map(my_list, function(z) extract2(z, "y"))
# or
purrr::map(my_list, function(z) z[["y"]])

Then you end up doing this stuff quite often, and so you write a shortcut, but it's ugly because you can't easily partialise `extract2` or `[[`

map_extract2 <- function(xs, i){
  map(xs, function(x) extract2(x, i))
}

ys <- map_extract2(my_list, "y")

Then you think: `this is stupid, surely there's already a shortcut for this!`

There is, and its' dead simple.

ys <- purrr::map(my_list, "y")

*Bangs head*


2018-03-02

styler / lintr

Your code makes me wince; so does my right-brain's code. Get it pretty, stat!.

My R code now looks like this (as of 2018):
ClassName
object_name / column_name
verb_prefixed_function_name
.non_exported_function_name

Regarding stuff like

  • '<-' vs '=' (walk with the idiomatic former);
  • column/parenthesis alignment (pfft);
  • double- vs single-quotes (arbitrarily choose the former);
  • superfluosity of whitespace (yes around args, no at the end of lines); 
  • keeping your code in a scannable page width
... I do care about this sort of stuff but not while I'm writing.

  1. `styler` catches some of these things and automatically alters them to an approximation of tidyR style
  2. and some things that you wouldn't trust automated tools to alter can be indicated by `lintr`. 
I mostly write within vim and `lintr` can be set up to run automatically every time you save and I have a shell script to call `styler` on the current file.

When I'm writing in Rstudio, styler can be ran from the `Addins` and lintr from the console (lintr works great for packages in Rstudio, but is a bit of a chore to set up which linters to use for a general analysis project vs in vim, IMO).

I've also a script that fixes my roxygen comments for questionable and finicky reasons.

And all three tools are being pushed into a git hook that runs over all R/Rmd files prior to commit.

2018-02-21

ggplot2::`%+%`

Although it's mostly related to the objects-first languages that I rarely use, I've been reading Refactoring by Martin Fowler. In this book they defined a number of `code smells` - signals of underlying filth in your programs. Deep within that dognest lies "Duplicated Code" - a smell that is far-from restricted to OO programming. So I wrote a little function to find duplicated code in my R scripts (because I couldn't find a good package to do that - if you know of one pleeease comment) and let it run.

BANG! The main source of duplicated code in my scripts was ggplot2 calls of the form:

data %>% filter %>% define_aesthetics + define_geoms + format

Sometimes the first half of the call was duplicated, and sometimes the latter.

If you find yourself writing the same

define_aesthetics + define_geoms + format

steps over and over, use ggplot2::%+% instead.

my_df <- data_frame(a = 1:6, b = rep(letters[1:2], each = 3))
my_df %>% filter(b == "a") %>% ggplot(aes(x = a)) + geom_density()
my_df %>% filter(b == "b") %>% ggplot(aes(x = a)) + geom_density()

Rethink that duplicated code:

# define the graph for the first subset
p <- my_df %>% 
     filter(b == "a") %>% ggplot(aes(x = a)) + geom_density()

# plot the data for the first subset:
p
# plot the data for the second subset by replacing the data used by ggplot2
p %+% filter(my_df, b == "b")

[Admittedly that was a simple example, and has better solutions using faceting]


2018-02-20

gtools::mixedsort

Converting strings that contain numbers into factors and wondering why G10 comes between G1 and G2? That's because '10' is alphabetically-lower than '2'. To fix this, you can either specify your factor levels explicitly, or you can let `mixedsort` do it for you.

# Que?
paste0("G", 1:10) %>% factor()
# [1] G1 G2 G3 G4 G5 G6 G7 G8 G9 G10
# Levels: G1 G10 G2 G3 G4 G5 G6 G7 G8 G9

# But check this:

library(gtools)
paste0("G", 1:10) %>% factor(levels = mixedsort(.))
# [1] G1 G2 G3 ... G10
# Levels: G1 G2 G3 G4 G5 G6 G7 G8 G9 G10

I tend to use a wrapped function to do this.

factor_mixedsort <- function(x){
  factor(x, levels = gtools::mixedsort(unique(x)))
  }

so that I can use a single function call when I'm making plots etc


META
Some things are so unequivocally in my favouRite things, and are so widely known, that I haven't written about them:
%>%
roxygen2
rmarkdown
testthat


TODO
- here (again)
filter_at & friends
- stringr::str_interp (vs paste / sprintf)
- suppressPackageStartupMessages / suppressMessages
- the `eval` flag in Rmarkdown chunks
- tidyr::unite


ASIDES
Here are some other peoples favourite things:
- some simple rstudio tips


THE CONTINUALLY UPDATED LIST OF THINGS I WANT TO LEARN
Here are some things that might one day make it into my favourite things:
- Hmisc::escapeRegex
- brms
- future
- patchwork (vs cowplot / arrangeGrob)
- tailr
- tabr
- TBC

Thursday 15 February 2018

Regression lines in ggplot

Based on a question at today's code-club, here's how you plot a scatter plot with an overlain regression line in ggplot2:

Set up some data and import the necessaries:

library(tibble)
library(ggplot2)

set.seed(1)

x <- rnorm(30)
y <- rnorm(30, x)
my_df <- data_frame(x = x, y = y)

Here, x and y are sampled from a normal distribution and the expected slope of the regression line of y on x is 1.

A scatter plot of y against x is as follows:

gg <- ggplot(my_df, aes(x = x, y = y)) +
  geom_point()

gg




If you're happy for R to compute the appropriate regression line for you, you can use the following to overlay the line:

gg + geom_smooth(method = "lm", se = FALSE)




Otherwise, you can compute the parameters of any line that you want to include, and then overlay them using geom_abline.

To do this, we first calculate the regression coefficients for y against x

fit <- lm(y ~ x)
coef(fit)
# (Intercept)         x
#    0.129...   1.04...

.. and then overlay the associated linegraph (we've used col = #3366FF to match the defaults for GeomSmooth)

gg +
  geom_abline(
    intercept = coef(fit)["(Intercept)"],
    slope = coef(fit)["x"],
    col = "#3366FF"
    )



You can pick your own slopes / intercepts.