Friday, 2 December 2016

Moments of a weighted sum of distributions

Let \(X\) be a random variable with density \(g(x) = \sum_{i=1}^{N}\lambda_{i}f_{i}(x)\), where each \(f_{i}\) is the probability density of a random variable \(X_{i}\) (with mean \(\mu_{i}\) and variance \(\sigma_{i}^{2}\)), \(\lambda_{i}\geq0\) and \(\sum_{i=1}^{N}\lambda_{i}=1\).

We determine the mean, \(\mathbb{E}[X]=\mu\) and variance, \(\mathbb{V}[X]=\sigma^2\), of \(X\) in terms of those of the \(X_{i}\).

We have
\[ \mu =
   \mathbb{E}[X] =
   \int dx\,x\,g(x) =
   \sum_{i=1}^{N}\lambda_{i}\int dx\,x\,f_{i}(x) =
   \sum_{i=1}^{N}\lambda_{i}\mu_{i}
   \]
by linearity of the integral. (Note that \(X\) is drawn from the mixture density \(g\); it is not the weighted sum \(\sum_{i}\lambda_{i}X_{i}\) of the \(X_{i}\), although the two happen to share the same mean.)

Expanding the square gives
\[
  \sigma^2 = \mathbb{V}[X]=\mathbb{E}[(X-\mu)^{2}]=\mathbb{E}[X^{2}]-\mu^{2},
\]
but that on its own isn't much help, because we haven't yet determined \( \mathbb{E}[X^2] \).

Alternatively, working directly with the density,
\[
  \sigma^{2}
  = \int dx\,(x-\mu)^{2}g(x)
  = \int dx\,(x-\mu)^{2}\sum_{i}\lambda_{i}f_{i}(x)
  = \sum_{i}\lambda_{i}\int dx\,(x-\mu)^{2}f_{i}(x)
\]
\[
  = \sum_{i}\lambda_{i}\int dx\,f_{i}(x)(x-\mu_{i}+\mu_{i}-\mu)^{2}
\]
\[
   = \sum_{i}\lambda_{i}\int dx\,f_{i}(x)\left[(x-\mu_{i})^{2}+2(x-\mu_{i})(\mu_{i}-\mu)+(\mu_{i}-\mu)^{2}\right]
\]

\[
  =\sum_{i}\lambda_{i}\left[\int dx\,f_{i}(x)(x-\mu_{i})^{2}+2(\mu_{i}-\mu)\int dx\,f_{i}(x)(x-\mu_{i})+(\mu_{i}-\mu)^{2}\int dx\,f_{i}(x)\right]
\]

\[
  = \sum_{i=1}^{N}\lambda_{i}\left(\mathbb{E}[(X_{i}-\mu_{i})^{2}]+2(\mu_{i}-\mu)(\mathbb{E}[X_{i}]-\mu_{i})+(\mu_{i}-\mu)^{2}\right)
\]

\[
  = \sum_{i=1}^{N}\lambda_{i}\left(\mathbb{V}[X_{i}]+(\mu_{i}-\mu)^{2}\right)
\]
\[
  \geq \sum_{i} \lambda_{i} \mathbb{V}[X_{i}],
\]
since \(\int dx\,f_{i}(x)=1\), \(\mathbb{E}[X_{i}]-\mu_{i}=0\) (so the cross term vanishes), and each \((\mu_{i}-\mu)^{2}\geq0\).
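
As a quick numerical check of the two formulas, here's a short simulation sketch (the weights and normal components are arbitrary choices, just for illustration):

import numpy as np

# Sketch: check the mixture mean/variance formulas by simulation.
rng = np.random.default_rng(0)
lam  = np.array([0.2, 0.5, 0.3])   # mixture weights (sum to 1)
mu_i = np.array([-1.0, 0.0, 2.0])  # component means
sd_i = np.array([0.5, 1.0, 2.0])   # component standard deviations

# Sample from the mixture: pick a component, then draw from it.
n = 1_000_000
comp = rng.choice(len(lam), size=n, p=lam)
x = rng.normal(mu_i[comp], sd_i[comp])

mu_formula  = np.sum(lam * mu_i)                                  # = 0.4
var_formula = np.sum(lam * (sd_i**2 + (mu_i - mu_formula)**2))    # = 2.99
print(x.mean(), mu_formula)   # both ~0.4
print(x.var(), var_formula)   # both ~2.99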


Wednesday, 2 November 2016

New ubuntu setup

Ubuntu 16.10 dual-booted with Windows 10 on a new 64-bit Dell Precision (25/10/2016); notes on setting up a similar system on my home laptop (single-boot Ubuntu 16.04, 32-bit) are also included.

--------------------------
A)
Apps:
Set up my Thunderbird account to point to uni emails
Installed chromium from ubuntu software
Removed firefox / amazon from the taskbar
TODO: install thunderbird::calendar plugin (? Lightning plugin - see here)
Printing set up as described here

B)
Added 'tools' to home dir
cd 
mkdir tools

Basic command line tools and .bashrc setup:

Installed vim
sudo apt-get install vim ### installs vim / vim-runtime

vim was then set as the default editor:
sudo update-alternatives --config editor # selected vim.basic from the options

Changed the default 'alias ll="ls -alF"' to 'alias ll="ls -l"' in .bashrc

Changed default PS1 definition (\u@\t instead of \u@\h):

... 
# For colour prompt (eg, standard terminal)
PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\t\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$ '
...
# For non-colour prompt (eg, in tmux)
PS1='${debian_chroot:+($debian_chroot)}\u@\t:\w\$ '
...

Already installed:
awk / sed / perl5.22 / python2.7.12 / python3.5.2 / make4.1 / curl 7.50.1 / wget 1.18 / ssh / scp / gzip (etc) / grep / sqlite3

To be installed:
java / sbt / git / tmux

sudo apt-get install git  ### git2.9.3 also git-man / liberror-perl
git config --global user.name "*** **"
git config --global user.email "***_***@***.**"

sudo apt-get install tmux # installed tmux 2.2
See here for info about tmux

Installation of java
sudo apt-get install openjdk-8-jdk

Installation of sbt
echo "deb https://dl.bintray.com/sbt/debian /" |\
  sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 \
                 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
sudo apt-get update
sudo apt-get install sbt
sudo apt-get install scala #  so I can run scala .jars at CLI

Installed IntelliJ (downloaded from the website) with the Scala and Vim plugins, within ~/tools/intellij/...

sudo apt-get install lsb-core

Installed gnome-tweak-tool so that I could disable the caps-lock
sudo apt-get install gnome-tweak-tool
# disabled caps-lock

Installed icedtea plugin so that I can use java webstart on IGV etc
sudo apt-get install icedtea-8-plugin

C)
cd ~/tools
touch README.sh # code to set up bfx tools added here

BFX Workflow tools:

Currently don't have R / conda / lyx / jupyter / snakemake / Mendeley installed, nor any of the standard NGS analysis kit (picard / tabix / samtools / hisat / bwa etc...)
Can be installed from bioconda: gnu-parallel

First install miniconda (py3.5 version):
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86.sh # used on 32 bit
bash Miniconda3-latest-Linux-x86*.sh
# accepted license; installed to /home/<me>/tools/miniconda3; added path to miniconda3 to ~/.bashrc

Reopened terminal
conda list # (conda 4.2.12, conda-env 2.6.0, ...)
conda update conda
conda list # (conda 4.2.12, conda-env 2.6.0, ..., python 3.5.2, ...)
Added channels for R, conda-forge, bioconda, biobuilds (a bunch of biofx tools) and my own channel. Did this in the wrong order to begin with - which meant that the channel 'r' was searched last and my own channel was searched first:
# conda config --add channels r
# conda config --add channels conda-forge
# conda config --add channels bioconda
# conda config --add channels biobuilds
# conda config --add channels russh

I prefer the idea of searching r, conda-forge and bioconda before biobuilds and my own channel (note that conda-forge, defaults, r and bioconda are added in the order specified in the bioconda docs):

conda config --add channels russh
conda config --add channels biobuilds
conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels r
conda config --add channels bioconda
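
To double-check the resulting search order, a short Python sketch (assumes PyYAML is available; conda searches channels listed earlier in ~/.condarc first):

import os
import yaml

# Print the channel search order produced by the `conda config --add channels` calls above.
with open(os.path.expanduser("~/.condarc")) as fh:
    condarc = yaml.safe_load(fh)

for priority, channel in enumerate(condarc.get("channels", []), start=1):
    print(priority, channel)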

Set up a global environment for general / exploratory work: requires snakemake, R, bioconductor. To make/use snakemake environment.yaml files, we need conda-env.
conda install anaconda-client 

conda install -c bioconda \
                 snakemake 
# (installs several other packages)

TODO: Split these packages into globally required packages and work-package specific requirements
Couldn't install snakemake on my 32bit home computer using the same approach
# pip install snakemake ## used on 32bit


# Installed base R and a few important packages
# - Note that the most up-to-date R-base is not appropriate (15/11/2016) since despite the availability of libcairo2-dev, etc, it installs with capabilities('cairo') == FALSE, and hence can't draw transparencies in ggplot (for example; see  here). 

conda install -c r \
  r-base="3.3.1 1"

conda install -c r \
  r-devtools  \
  r-gdata     \
  r-gridextra \
  r-roxygen2  \
  r-testthat  \
  rpy2 \
  r-essentials

conda install -c r rstudio
# checked that capabilities('cairo') == TRUE after each step

conda install -c bioconda \
  bioconductor-biobase \
  bioconductor-geoquery \
  bioconductor-limma \
  bioconductor-org.hs.eg.db \
  bioconductor-org.mm.eg.db \
  bioconductor-s4vectors

# All the above bioconda installs are absent from conda for 32bit (TODO - add 32bit version of these to my conda channel)

conda install -c bioconda \
  samtools \
  picard \
  htslib

conda install -c biobuilds \
  tabix 

# picard was installed from bioconda, and installs java-jdk-8.0.92 from bioconda also (note this version of picard is a commandline tool, rather than a .jar)

# Some workpackage-specific libraries
# TODO - workpackage-specific conda channel
conda install -c bioconda \
  r-ggally \
  r-gplots \
  bioconductor-pcamethods

To run snakemake.remote.HTTP we require python packages 'boto', 'moto', 'filechunkio', 'pysftp', 'dropbox', 'requests', 'ftputil'
conda install boto \
              dropbox \
              filechunkio \
              ftputil \
              moto \
              pysftp \
              requests

Installs the above, and updates `requests` to a version with an identical version number.
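
As a quick sanity check (just a sketch), the following confirms those packages are importable from the environment:

import importlib

# The module list comes from the snakemake.remote.HTTP requirements above.
for module in ["boto", "dropbox", "filechunkio", "ftputil", "moto", "pysftp", "requests"]:
    try:
        importlib.import_module(module)
        print(module, "OK")
    except ImportError as err:
        print(module, "MISSING:", err)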

My current workpackage uses 'doBy' and 'moments' from CRAN; these aren't available through conda. Hence I'm going to try to get them loaded onto anaconda cloud (as explained here). First I need to install conda-build

conda install anaconda-build \
              conda-build

conda skeleton cran --recursive moments
conda build r-moments
anaconda upload \
  ${MINICONDA}/conda-bld/linux-64/r-moments-0.14-r3.3.1_0.tar.bz2

This uploaded r-moments to my (russH) anaconda account, from which it could be installed into my miniconda R:

conda install -c russh r-moments

However, I was not able to upload doBy using the same mechanism - the conda skeleton ... call worked fine (it urged me to install CacheControl, which I did),
conda skeleton cran --recursive doBy
# warning =>
rm -R r-doby
conda install cachecontrol
conda skeleton cran --recursive doBy
# error => 
# "In order to use the FileCache you must have lockfile installed."
conda install lockfile
rm -R r-doby
conda skeleton cran --recursive doBy
conda build r-doby
#  had it worked, this would have installed r-base "3.3.1_5" and a bunch of other packages without giving me a chance to prevent it - thankfully I didn't lose 'cairo' again. And the error message was:
+ source <CONDA>/bin/activate <CONDA>/conda-bld/r-doby_1479296072493/_b_env_placehold_pl
+ mv DESCRIPTION DESCRIPTION.old
+ grep -v '^Priority: ' DESCRIPTION.old
+ <CONDA>/conda-bld/r-doby_1479296072493/_b_env_placehold_pl/bin/R CMD INSTALL --build .
<CONDA>/conda-bld/r-doby_1479296072493/_b_env_placehold_pl/bin/R: 12: [: Linux: unexpected operator
<CONDA>/conda-bld/r-doby_1479296072493/_b_env_placehold_pl/lib/R/bin/R: 12: [: Linux: unexpected operator
* installing to library ‘<CONDA>/conda-bld/r-doby_1479296072493/_b_env_placehold_pl/lib/R/library’
Error: error reading file '<CONDA>/conda-bld/r-doby_1479296072493/work/doBy/DESCRIPTION'
Command failed: /bin/bash -x -e <CONDA>/conda-bld/r-doby_1479296072493/work/doBy/conda_build.sh

... and there's a bizarre directory called _b_env_placehold_placehold_<ad_infinitum> in the r-doby_1477.../ directory
<I rewrote my work package to use lapply(split(), f()) instead of doBy and dplyr::arrange_(df, colname) instead of doBy::orderBy>

The version of GEOquery that I downloaded from bioconda was 2.38.4. This caused problems in one of my scripts because it failed to download GPL annotation files from NCBI: GEOquery 2.38 has 'http' URLs hardcoded, and these now fail because NCBI has moved to https. The latest version of GEOquery (2.40 as of 17/11/2016) includes the 'https' URLs, but is not yet available on bioconda. Having little experience with anaconda/bioconda, I tried to get GEOquery uploaded into my bit of anaconda cloud using anaconda upload. Some initial errors (near-identical to those for r-doBy) were observed, but these seem to result from filenames that are too long.
i) downloaded and unzipped the bioconda github repository:
cd ~/tools
wget https://github.com/bioconda/bioconda-recipes/archive/master.zip
unzip master.zip
ii) setup a conda environment for building/uploading/installing bioconductor packages
cd bioconda-recipes-master/scripts/bioconductor/
conda create --name bioconductor-recipes --file requirements.txt
source activate bioconductor-recipes
iii) tried setting up the build recipe for GEOquery:
./bioconductor_skeleton.py --recipes ../../recipes GEOquery
# failed - lack of pyaml in requirements.txt
conda install pyaml
./bioconductor_skeleton.py --recipes ../../recipes GEOquery
# did nothing, since bioconductor-geoquery was already in ../../recipes
./bioconductor_skeleton.py --recipes ../../recipes GEOquery --force
# finally made the recipe for geoquery
iv) Build geoquery
cd ../../recipes
conda build bioconductor-geoquery
# failed with bizarre placehold_placehold_... directory put into ~/tools/miniconda3/conda-bld (as above for doBy)
Searched the conda bugs on github; the source of this error may be that the build path gets too long while building the package. Therefore specified a shorter dirname for the conda build output:
conda build bioconductor-geoquery --croot ~/temp
Package builds fine. 
v) Therefore uploaded to anaconda-cloud:
cd ~/temp
anaconda upload ./linux-64/bioconductor-geoquery-2.40.0-r3.3.1_0.tar.bz2

source activate <...>
conda install bioconductor-geoquery=2.40

SWOOP!  - It's now available in my work package.
So I repeated the process for:
  • ArrayExpress;
  • ArrayExpress' dependency bioconductor-oligo;
  • oligo's dependencies affxparser, oligoClasses and r-ff;
since none of these are available on bioconda / anaconda's main channels.

conda config --add channels terradue
conda install -c terradue r-ff # installs r-ff and r-bit
# note that I couldn't build r-ff on my ubuntu box
# terradue must be an available channel, so that oligoClasses installs

source activate bioconductor-recipes

cd ~/tools/bioconda-recipes-master/scripts/bioconductor/
./bioconductor_skeleton.py --recipes ../../recipes affxparser
cd ../../recipes
conda build bioconductor-affxparser --croot ~/temp
cd ~/temp
anaconda upload ./linux-64/bioconductor-affxparser

cd ~/tools/bioconda-recipes-master/scripts/bioconductor/
./bioconductor_skeleton.py --recipes ../../recipes oligoClasses
cd ../../recipes
conda build bioconductor-oligoclasses --croot ~/temp
cd ~/temp
anaconda upload ./linux-64/bioconductor-oligoclasses

cd ~/tools/bioconda-recipes-master/scripts/bioconductor/
./bioconductor_skeleton.py --recipes ../../recipes oligo
cd ../../recipes
conda build bioconductor-oligo --croot ~/temp -c russh -c terradue
cd ~/temp
anaconda upload ./linux-64/bioconductor-oligo-1.38.0-r3.3.1_0.tar.bz2

cd ~/tools/bioconda-recipes-master/scripts/bioconductor/
./bioconductor_skeleton.py --recipes ../../recipes ArrayExpress
cd ../../recipes
conda build bioconductor-arrayexpress --croot ~/temp -c russh -c terradue
cd ~/temp
anaconda upload ./linux-64/bioconductor-arrayexpress-1.34.0-r3.3.1_0.tar.bz2

jupyter install:
conda install jupyter
# r-irkernel was already installed

lyx (2.2.0-2build1) was installed from Ubuntu Software (the most recent version is 2.2.2, so the UbS version should be OK). However, although I could open lyx documents and convert basic lyx docs into pdf, I was not able to convert lyx documents that contained any R code (despite adding the knitr module and adding the Rscript path). Therefore, I removed lyx and installed it using apt-get - this also didn't compile knitr projects. So I tried downloading and building lyx myself; this required automake, zlib and libqt4:
sudo apt-get install automake \
                     zlib1g-dev \
                     libqt4-dev

wget ftp://ftp.lyx.org/pub/lyx/stable/2.2.x/lyx-2.2.2.tar.gz
tar -xzvf lyx*.tar.gz
cd lyx-2.2.2/
./autogen.sh
./configure
make
sudo make install

I noted that the newly installed lyx was able to compile knitr-containing chunks, but only when lyx was opened from the command line (rather than via the icon in Ubuntu's 'search your computer' launcher). I'm sure there's some option that would let the latter compile knitr lyx-notebooks, but I haven't found it yet. Another thing of note is that in lyx-2.2.2, knitr chunks can be written directly into tex-code blocks (Ctrl-L) or using "Insert" > "Custom Insets" > "Chunk" (which is a bit mouse-heavy for my tastes), rather than using the knitr-'Chunk' option in lyx's paragraph-type drop-list.

I couldn't find any info regarding when the knitr-lyx integration changed though. Nonetheless, using Ctrl-L makes writing R code quicker in lyx (and my existing lyx 2.0 notebooks still compile regardless of the change).

TODO: Notes on setting up work-package specific options for lyx notebooks

<IGNORE - Using r-base 3.3.1_1 eliminated the need for this code:>
Fonts weren't correct for use in jupyter/IRKernel. For example, on putting "hist(rnorm(100))" into a jupyter/IRkernel page on chrome, I received the error "X11 font -adobe-helvetica-%s-%s---%d-------*, face 1 at size 6 could not be loaded" - suggesting that jupyter/R was using x11 rather than cairo to work with graphics and that some of the relevant x11 fonts were missing. I initially installed R::cairo into my conda environment, but this didn't fix the error (indeed, capabilities('cairo') remained FALSE within jupyter). So I installed a range of x11 fonts libraries, rebooted the computer and this particular error disappeared.
sudo apt-get install \
  x11-xfs-utils \
  xfonts-base \
  xfonts-scalable \
  xfonts-75dpi \
  xfonts-100dpi \
  xfsprogs \
  xfslibs-dev

Although this fixed the issue with fonts, other issues were subsequently highlighted: eg, when using ggplot with transparencies, the error 'Warning message: In grid.Call.graphics(L_polygon, x$x, x$y, index) : semi-transparency is not supported on this device: reported only once per page' was seen, and a series of blank scatter plots resulted (that is, the axes / background were all visible but the data points were missing from the plot). Therefore, I installed libcairo2-dev
sudo apt-get install libcairo2-dev
# installs a few dependencies as well

D)
Disk setup:
The computer has a couple of separate hard drives, one of which (OS) is a solid-state drive (~500 GB, split between Windows and Ubuntu) and the other (DATA) is for storing data / workpackages etc (~2 TB). My workpackages are kept in a folder called jobs_llr on DATA, to which I link from ${HOME}/jobs_llr. External datasets are stored in ext_data on DATA and internal datasets are stored in int_data on DATA. If files in the latter two directories are required by a work package, a soft link from jobs_llr/job_name/data/<sub_dir(s)>/<file_name> should be made.
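
For example, a soft link for a single external file could be set up like this (a sketch only - the dataset and job names here are hypothetical, not from my setup):

import os

# Hypothetical example of the linking convention described above.
source = "/media/<me>/DATA/ext_data/some_dataset/counts.tsv"           # hypothetical file
job_data_dir = os.path.expanduser("~/jobs_llr/some_job/data/some_dataset")

os.makedirs(job_data_dir, exist_ok=True)
link_path = os.path.join(job_data_dir, "counts.tsv")
if not os.path.exists(link_path):
    os.symlink(source, link_path)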

## Added to cron using crontab -e to ensure any internal raw data is backed up locally:
## Note this makes an int_data directory within /My_Passport/pog/
# (check &) back up all raw data from /media/DATA/int_data to my external HDD
#   at 5-past-midnight on Tuesday of each week
# -r = recursive
# -t = keep modification times
5 0 * * 2 /usr/bin/rsync -rt /media/<me>/DATA/int_data /media/<me>/My_Passport/pog

E)
Clone my analysis repositories:
git clone https://***@bitbucket.org/***/***



Tuesday, 31 May 2016

qsub and snakemake

I've previously written about using snakemake to develop bioinformatics workflows. Recently I've started running more and more batch analysis stuff on our university's cluster: these jobs have to be submitted using qsub. Snakemake provides some integration with qsub, according to the snakemake docs, but I've never called qsub in this setting.

My cluster set-up:
I have a miniconda set up on the cluster with python3 / R / Rscript / snakemake etc installed (eg, ~/miniconda3/envs/bfx/bin). I use jags a lot, but run this from ~/tools/bin (via R).

############################################
# ./scripts/test_script.R:
get.poisson <- function(seed = NULL){
  if(! is.null(seed)){set.seed(seed)}
  rpois(100, 1)
  }

args <- commandArgs(trailingOnly = TRUE)
seed <- NULL
if(length(args) >= 1){
  seed <- scan(args[1], what = integer())
  }
if(length(args) >= 2){
  out.path <- file.path(args[2])
  } else {
  out.path <- stdout()
  }
vals <- get.poisson(seed)
write(x = vals, file = out.path, sep = "\t")
#############################################

Using snakemake to call a simple R script:
###############################
## ./Snakefile
import os

rule all:
    input: os.path.join("data", "output.txt")

rule input_to_output:
    input:  seed   = os.path.join("data", "seed.txt"),
            script = os.path.join("scripts", "test_script.R")
    output: os.path.join("data", "output.txt")
    shell: """ Rscript {input.script} {input.seed} {output} """

rule define_seed:
    output: os.path.join("data", "seed.txt")
    shell: "echo 1 > {output}"

# or alternatively if the script is written to be sourced
# run: R""" # do something with {input.seed} 
#           source('{input.script}') 
#           # do something with {output}
#           """

Using qsub to call the same script (assuming 'data/seed.txt' has been set up) isn't so straightforward - 'Rscript' can't be submitted directly, since it isn't an ASCII shell script (which is what qsub expects) - you have to set up a shell script that contains the following:
## ./scripts/rscript_runner.sh
Rscript scripts/test_script.R \
        data/seed.txt \
        data/output.txt

.. and then call it as follows:
RH> qsub -d ${PWD} \
     -l walltime=00:01:00,cput=00:01:00,nodes=1:onegpc,os=centos6 \
     scripts/rscript_runner.sh
     
#############
Alternatively, you could set up an Rscript-calling shell script that takes named arguments from qsub:
## ./scripts/rscript_runner2.sh
## .. do some checks ..
Rscript "${script_name}" ${script_args}

and then do the following on the command line
RH> qsub -d ${PWD} \
    -v script_name="scripts/test_script.R",script_args="data/seed.txt data/output.txt" \
    -l walltime=00:01:00,cput=00:01:00,nodes=1:onegpc,os=centos6 \
    scripts/rscript_runner2.sh
    
##############
These two methods make me a bit uneasy, though. I use many more shell commands than just Rscript in an average workflow and don't want to have to write a massive shell script that calls qsub for each step, passing a shell script wrapper each time.

Can I therefore get snakemake to pass shell commands to the cluster using qsub?
Supposedly yes, because it says this in the help for snakemake:
"--cluster CMD, -c CMD
                       Execute snakemake rules with the given submit command,
                        e.g. qsub. Snakemake compiles jobs into scripts that
                        are submitted to the cluster with the given command,
                        once all input files for a particular job are present.
                        The submit command can be decorated to make it aware
                        of certain job properties (input, output, params,
                        wildcards, log, threads and dependencies (see the
                        argument below)), e.g.: $ snakemake --cluster 'qsub
                        -pe threaded {threads}'."

So, let's try the following (note that there are a couple of shell steps in the above Snakefile: defining the seed, and calling the R-script - so this should qsub into the cluster twice):

RH> rm data/*
RH> snakemake \
  -s Snakefile \
  --cluster "qsub -d ${PWD} -l walltime=00:01:00,cput=00:01:00,nodes=1:onegpc,os=centos6" \
  --jobs=1

It actually worked straight off. You have to specify --jobs=<N> to prevent snakemake from sending an unending stream of jobs to the cluster. My working directory has the following files in it:
 snakejob.define_seed.0.sh.e310371 
 snakejob.define_seed.0.sh.o310371 
 snakejob.input_to_output.1.sh.e310372 
 snakejob.input_to_output.1.sh.o310372

The prefix 'snakejob.<recipe_name>.<N>.sh' is the name of the random script that snakemake passed to qsub.

The '-d ${PWD}' argument is used to ensure the correct working directory when I qsub into the cluster. Within the snakemake --cluster argument, it isn't really required: I suspect snakemake passes full pathnames around within the random scripts.

######################
Next, how do I specify that different recipes will require different lengths of time to run within the snakefile?....
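
One possibility (a sketch I haven't tried yet, based on the decorated submit command described in the help text above) is to give each rule a walltime in its params and reference it in the --cluster string; note that every rule submitted this way would then need a walltime param:

## Sketch: a modified input_to_output rule with its own run-time request
rule input_to_output:
    input:  seed   = os.path.join("data", "seed.txt"),
            script = os.path.join("scripts", "test_script.R")
    output: os.path.join("data", "output.txt")
    params: walltime = "00:05:00"   # per-rule run-time request (assumed format)
    shell: """ Rscript {input.script} {input.seed} {output} """

## and at the command line (untested):
# snakemake -s Snakefile --jobs=1 \
#   --cluster "qsub -d ${PWD} -l walltime={params.walltime},cput={params.walltime},nodes=1:onegpc,os=centos6"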


Thursday, 5 May 2016

Snakemake stuff for code-club presentation

Code-Club presentation for 6th May 2016

Snakemake for managing bioinformatics workflows

I'm currently reanalysing some of the ENCODE epigenetics datasets. I'm partly doing this to set up a ChIP-Seq workflow for use on our own data, and partly because of some misgivings I have with the processed ENCODE data: they don't readily give me any info regarding the quality of the reads / alignments; they aligned against hg19 (and I'm using GRCh38/1KG); the peak calling strategy for a given target varies between labs; and in some cases, it looks like the control reads are way shorter than the experimental reads, which should monkey about with the comparative peak calling... and partly cos I'm a bit data nosy.

So, various ENCODE .fastq files are being downloaded, trimmed and aligned against GRCh38/1KG, QC summaries are being made, peaks are being called, coverage is being summarised. Eventually there will be biology, there will be statistics, there will be science. But now, now there is just workflow. Thankfully, there is Snakemake so that's relatively painless.

Snakemake is a tool for organising your workflows. Similarly to make, snakemake allows you to specify a set of output files and a series of recipes that allow you to construct those files from some input files. If you rerun snakemake, it shouldn't remake a file F if that file has already been created (unless you force it to, or the script/input files from which F was made are updated subsequent to F's creation).

You need Python3 because of some multiprocessing stuff that's absent in Python2, but I only use Python2 in extremis. And you should read the tutorial and the docs, since this is neither tutorial nor reference: it's just some notes about how I got snakemake to do what I wanted it to.

------------------------------------

Order of play:

Basic stuff:

A) Specify your inputs: config.yaml
B) Specify your outputs: rule all
C) Map your inputs to your outputs: rule *
D) Run your workflow: snakemake -s <snakefile> [-n] -p [-j N]
E) Look at your workflow DAG: snakemake ... --dag | dot

Advanced stuff:

A) Remote files
B) Shadow rules
C) Functions as input
D) Cluster work

--------------------------------------

Basic stuff

A) Specify your inputs: config.yaml

Configuration files for snakemake can either be specified in JSON or YAML format. They both specify key:value pairs. Since adding comments to JSON is seemingly frowned upon, I use YAML (make sure you have PyYAML installed). For my encode analysis, the config file contains some pointers to file positions (eg, for the indexed genome I use during alignment) and indicates which contigs the analysis should be restricted to. More importantly, I specify 

  • which ENCODE ChIP-Seq samples I'm interested in, 
  • whether they are controls or experimental samples (and for the latter, which controls they should be compared against) and 
  • various info about the samples: 
    • what length should they be trimmed to, 
    • to which protein target do they correspond, 
    • which MACS algorithm should I use in calling ChIP-Seq peaks.

The config file can be specified within your snakemake script using
configfile: os.path.join(".", "some_dir", "config.yaml")
and can be overridden using 
snakemake <other args> --configfile <config.yaml>  
at the command line. Similarly, you can override specific key/values in the config file using
snakemake <other args> --config CONFIG_VAR=SOME_VALUE

Example config.yaml:

---
# Note that '^---', above,  and '^...', below, are YAML syntax
REMAKE: 1

CONTIGS: chr1,chr2,chr3,chr4,chr5,chr6,chr7,and,so,on

samples:
- ENCFF000BXX
- ENCFF000BYG

controls:
- ENCFF000BWK

# H3K4me1 from Bernstein lab:
ENCFF000BXX:
  "chip_or_control" : "chipseq"
  "geo_id"          : "GSM733692"
  "control"         : "ENCFF000BWK"
  "trimmed_rlen"    : 36
  "original_rlen"   : 36
  "target"          : "H3K4me1"
  "peak_type"       : ["broadPeak"]

# etc

# Control from Bernstein lab
ENCFF000BWK:
  'chip_or_control' : 'control'
  'geo_id'          : 'GSM733780'
  'control'         : None
  'trimmed_rlen'    : 36
  'original_rlen'   : 36
  'target'          : 'control'
...
So this gives the ENCODE and GEO ids of a couple of samples and the corresponding control from the K562 ChIP-Seq data, and indicates the control / experimental status etc.

B) Specify your outputs: rule all

Within the snakemake script, the key-value pairs that were read in from your config file are stored in the python dict 'config'. I use the sample-specific information in config to determine which files should be constructed / which comparisons should be performed for each sample (or chipseq target).
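
For example (a minimal sketch, using the keys from the example config.yaml above), the sample metadata is just dictionary access once the config file has been read:

# Inside the Snakefile, after the configfile: line.
# Pull out the ChIP-Seq samples and map each to its control.
CHIPSEQ_SAMPLES = [s for s in config["samples"]
                   if config[s]["chip_or_control"] == "chipseq"]
CONTROL_OF = {s: config[s]["control"] for s in CHIPSEQ_SAMPLES}
# eg, CONTROL_OF["ENCFF000BXX"] == "ENCFF000BWK"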

The steps in the ChIP-Seq workflow can be split up according to the level of dependence between different samples as follows

  • independence between all samples (eg, aligning reads against GRCh38 - read alignment for sample X is independent of the alignment for sample Y)
  • dependence between specific sample pairs (eg, peak calling between chipseq and control samples)
  • dependence between many samples (eg, summarising the regions covered by multiple different H3K4me1 samples)
To tell snakemake which files I want to make, I use the following
rule all:
    input: REQD_FILES

For this to work, I predefine a list of required files using python syntax and the config dictionary mentioned above. For example, the following adds some 'fastqc' extensions to the name of each downloaded fastq and adds the fastqc output to the list of required files.

ALIGN_FQ_TEMPLATE = os.path.join(
  ALIGN_DIR,  "{enc}",  "{enc}.fastq.gz")
FASTQ_QC_TEMPLATE = ALIGN_FQ_TEMPLATE + "{ext}"
FASTQ_QCS = [FASTQ_QC_TEMPLATE.format(
             enc = enc,
             ext = ext
             )
             for enc in config["samples"]
             for ext in ['_fastqc.html', '_fastqc.zip']
             ]
REQD_FILES = REQD_FILES + FASTQ_QCS
# eg, "ALIGN_DIR/ENCFF000...BWK/ENCFF000...BWK.fastq.gz_fastqc.html"
#     "ALIGN_DIR/ENCFF000...BWK/ENCFF000...BWK.fastq.gz_fastqc.zip"

C) Map your inputs to your outputs: rule *

This 'rule all' syntax, above, is pretty standard for snakemake. Most recipes you'll define will have the form
rule RULE_NAME:
    input:  in1='some', in2='file' # or input: ['some', 'file']
    params: param1='not_necessary'
    output: out1='some.out', out2='another.out'
    shell: """ cat {input.in1} > {output.out1};
               cat {input.in2} | do {params.param1} {config[genome]} - | something -o {output.out2}
               """
You can use 'shell', 'run' or 'script' to call command line functions, python commands or python/R scripts respectively.

Simple example - copy an external fastq to a local position:
rule remote_fastq_to_current:
    params: hyperlink = "https://www.encodeproject.org/files/{enc}/@@download/{enc}.fastq.gz"
    output: temp(ALIGN_FQ_TEMPLATE)
    shell:  "wget {params.hyperlink} -O {output}"


Here, the 'temp()' directive indicates the fastq is to be deleted once all dependents have finished (so you don't have to wait til you've processed 100 files before you start deleting them).

The placeholders '{input.in1}' etc. are expanded in the same way that '{abc}.def'.format(abc='BCA') is in base Python. This can cause some confusion, for example,

  • if you want to use "{.}..".format() in the recipe definition (split up the string); or
  • if you want to use shell commands like "... ${some_command} ${some_file} {input.in1}" (use double braces - see the sketch below)
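
A minimal sketch of that second gotcha (hypothetical rule, just for illustration): snakemake formats the shell string, so literal shell braces have to be doubled.

rule braces_example:
    input:  "data/in.txt"
    output: "data/out.txt"
    shell:  """ SOME_VAR=$(wc -l < {input})
                echo "lines: ${{SOME_VAR}}" > {output} """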

Wildcards are another useful widget within the recipe definitions. Here {prefix} and {rlen} are inferred from the filenames passed to this rule by snakemake (resp. the file-prefix and the read length that I want after trimming).
rule fastq_trimmer:
    input:  "{prefix}.fastq.gz"
    output: temp( "{prefix}.{rlen,\d+}nt.trimmed.fastq.gz")
    shell:  """cat {input} | gunzip - |\
               fastx_trimmer -z -f1 -l{wildcards.rlen} > {output}"""


Note 

  • that {rlen,\d+} stipulates numeric matches for the read-length wildcard (the regex is defined in the output filename; and I think any such stipulation on the regex should be present in every one of the output files that matches - like output: out1='{prefix}{m,\d+}.txt', out2='{prefix}{m,\d+}.html')
  • that to use wildcards in the shell recipe you have to use {wildcards.NAME} whereas to use wildcards in input, output or params you only use {NAME}
  • Multiple output and input filenames are allowed for any given rule, thankfully (this, and the greater flexibility in filenaming conventions is a huge improvement over make for workflow building)
Integrate several files into one output
I wanted to compare a control vs an experimental sample using MACS and using a QC script called phantompeakqualtools. The .bam files for the two samples have very similar names. I have three separate filename templates that ultimately look like this (where enc and ctrl are the encode identifiers for the experimental and control samples, resp.):
# EXPT_SAMPLE..  ALIGN_DIR + "/{enc}/{enc}.{rlen}nt.bam"
# CTRL_SAMPLE..  ALIGN_DIR + "/{ctrl}/{ctrl}.{rlen}nt.bam"
# EXPT_VS_CTRL.. ALIGN_DIR + "/{enc}/{enc}_vs_{ctrl}.{rlen,\d+}nt.phantompeak.qc"

rule bam_to_phantom_qc:
    input:
        chipseq_bam = EXPT_SAMPLE_TEMPLATE + ".{rlen}nt" + ".bam",
        control_bam = CTRL_SAMPLE_TEMPLATE + ".{rlen}nt" + ".bam"
    output:
        qc = EXPT_VS_CTRL_TEMPLATE + ".{rlen,\d+}nt" + ".phantompeak.qc"
    shell:
        """ Rscript ./tools/source/phantompeakqualtools/run_spp.R \
                -c={input.chipseq_bam} \
                -i={input.control_bam} \
                -filtchr='random|_|HLA|EBV' \
                -out={output.qc} """

D) Run your workflow: snakemake -s <snakefile> [-n] -p [-j N]

Snakemake assumes all file locations are specified relative to the current working directory. If you have a snakemake script called 'Snakefile' in your current working directory you can run it using  snakemake. I don't do this. I put all my scripts in './scripts/' and call the relevant snakemake script using 'snakemake -s scripts/snake.somescriptname'. This does mean the results will be different if I run from the top-level of my job directory than if I run from in 'scripts' (or elsewhere), but does mean I should be able to clone the scripts/data for a job onto a different computer and run identically.

At this point, I typically like to know about --debug, --verbose, -p and -n because the script's not going to work first-time is it.

Snakemake can run independent rules in parallel, and will use up to N threads if you provide the '-jN' argument (default 1). Further, you can specify how many threads each specific rule requires by doing the following (note that if N<4, then at most N threads will be passed to this rule)

rule rulename:
    input: 'input.file'
    output: 'output.file'
    threads: 4
    shell: "some call that uses threads={threads}"

E) Look at your workflow DAG: snakemake ... --dag | dot -Tpng > <dag.png>

This is quite neat, but can quickly become unreadable if you're analysing loads of samples. Snakemake outputs a directed acyclic graph (if snakemake can't convert your script into a DAG of dependencies between files, you're screwed) that can be viewed with Graphviz (there's also a D3.js output thingme, but I've never used it)

---------------------------------------

A) Remote files

In copying the external .fastq file to a local folder, I used wget. But I originally wanted to use one of the recent functions that snakemake provides for making a local copy of remote files.

from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP=HTTPRemoteProvider()
...
rule download_data:
    input: HTTP.remote("www.somewhere.com/something.ext")
    output: "something_else.txt"
    shell: "cat {input} | manipulate | me > {output}"

The remote file is saved relative to the current directory, is manipulated by the shell command, and then is deleted once it's been used.

B) Shadow rules

I thought this was a weird idea, but it's turned out to be quite useful. When you're running a snakemake rule that has a
shadow: "shallow" 
directive, snakemake makes a mirror of the current directory structure and works within that. Then, at the end of running the rule, its output files are copied back to the main directory structure.
You can use this to prevent two calls to a snakemake-called script from writing to the same filename. For example, there are two main MACS ChIP-Seq calling algorithms: narrowPeaks and broadPeaks. Calling macs with either of these choices outputs an R script called '*_model.r', where * is some prefix. So if you're running MACS in both broad and narrow ways, one will overwrite the script that was written by the other (this is a somewhat contrived example, as I could have just chosen a more appropriate prefix). Then running this R script outputs a pdf file *_model.pdf.

By running the two models separately in shadow directories, I can prevent one from overwriting the script/figure that was generated by the other. For example, I can move the intermediate figure to a more appropriate filename before it is copied back into the main directory.

rule macs2_rscript_to_pdf:
    input:   "{prefix}_model.{peak_type}.r"
    params: intermediate = "{prefix}_model.pdf"
    output:  "{prefix}_model.{peak_type}.pdf"
    shadow: 'shallow'
    shell:   "Rscript {input}; mv {params.intermediate} {output}"


C) Functions as input

Snakemake rules look like python functions (with a funny syntax), but they aren't. They take filenames and parameters as input, call shell functions or python/R scripts, save results to the file system. They don't really return a value, and they almost always have side-effects. 

It looks like most of the things that behave like python variables are computed before a snakemake rule is even called. So you can't pass a dynamically computed value into a snakemake rule. But you can set up the wildcards generated in a call to a snakemake rule so that they can be used in a python function.

As an example, my ChipSeq analysis used ENCODE identifiers throughout the workflow. In a final bit of the processing, I want to merge together those samples that relate to the same ChIP-Seq target: for example, I want to pull out all files related to H3K4me1 and identify how frequently each position was called as part of a significant region by MACS across the different samples. How do I map the biological target to the set of samples where that target was assessed? I set up a function that returns the filenames for all samples that relate to that target:

def chipseq_target_to_filelist(target_name, file_class):
    """ Pull the samples that correspond to the target_name out from config environment
       then construct the filename for the file_class using the sample-specific
       data in config."""

    assert file_class in ['broadPeak', 'narrowPeak']
    samples = [s for s in config['samples']
                 if config[s]['chip_or_control'] == "chipseq"
                 and config[s]['target'] == target_name
                 ]
    assert len(samples) > 0

    if file_class == 'narrowPeak' or file_class == 'broadPeak':
        # Template should look like:
        # ALIGN_DIR + "/{enc}/{enc}_vs_{ctrl}.{rlen}nt_peaks.[broad|narrow]Peak"
        file_template = EXPT_VS_CTRL_TEMPLATE + ".{rlen}nt_peaks." + file_class
        file_list = [file_template.format(
                           enc = s,
                           ctrl = config[s]['control'],
                           rlen = config[s]['trimmed_rlen'])
                       for s in samples
                       ]
        return file_list
    # shouldn't be able to get to here
    return []



def chipseq_target_to_broadPeak_files(wildcards):
    files = chipseq_target_to_filelist(wildcards.target, 'broadPeak')
    return files


rule target_region_counter__broadPeaks:
    input:  chipseq_target_to_broadPeak_files
    output: "./data/targets/{target}/{target}.regioncount.broadPeak.bed.gz"
    shell:  "multiIntersectBed -i {input} | bgzip --stdout /dev/stdin > {output}"


Note that your functions can take only a single argument (the wildcards object) if they are used as the input to a snakemake rule. The wildcards for the latter rule contain the current target's name; chipseq_target_to_filelist then identifies all samples for that target and returns the relevant filenames.

D) Cluster work

I don't have much experience with this yet, but snakemake has methods for interacting with qsub that look pretty powerful. TODOOOO...

---------------------------------------