Tuesday 31 May 2016

qsub and snakemake

I've previously written about using snakemake to develop bioinformatics workflows. Recently I've started running more and more batch analysis stuff on our university's cluster: these jobs have to be submitted using qsub. Snakemake provides some integration with qsub, according to the snakemake docs, but I've never called qsub in this setting.

My cluster set-up:
I have a miniconda set up on the cluster with python3 / R / Rscript / snakemake etc installed (eg, ~/miniconda3/envs/bfx/bin). I use jags a lot, but run this from ~/tools/bin (via R).

############################################
# ./scripts/test_script.R:
get.poisson <- function(seed = NULL){
  if(! is.null(seed)){set.seed(seed)}
  rpois(100, 1)
  }

args <- commandArgs(trailingOnly = TRUE)
seed <- NULL
if(length(args) >= 1){
  seed <- scan(args[1], what = integer())
  }
if(length(args) >= 2){
  out.path <- file.path(args[2])
  } else {
  out.path <- stdout()
  }
vals <- get.poisson(seed)
write(x = vals, file = out.path, sep = "\t")
#############################################

Using snakemake to call a simple R script:
###############################
## ./Snakefile
import os

rule all:
    input: os.path.join("data", "output.txt")

rule input_to_output:
    input:  seed   = os.path.join("data", "seed.txt"),
            script = os.path.join("scripts", "test_script.R")
    output: os.path.join("data", "output.txt")
    shell: """ Rscript {input.script} {input.seed} {output} """

rule define_seed:
    output: os.path.join("data", "seed.txt")
    shell: "echo 1 > {output}"

# or alternatively if the script is written to be sourced
# run: R""" # do something with {input.seed} 
#           source('{input.script}') 
#           # do something with {output}
#           """

Using qsub to call the same script (assuming 'data/seed.txt' has been set up) isn't so straightforward - 'Rscript' can't be called, since it isn't an ascii shell script - you have to set up a shell script that contains the following:
## ./scripts/script_runner.sh
Rscript scripts/test_script.R \
        data/seed.txt \
        data/output.txt

.. and then call it as follows:
RH> qsub -d ${PWD} \
     -l walltime=00:01:00,cput=00:01:00,nodes=1:onegpc,os=centos6 \
     scripts/rscript_runner.sh
     
#############
Alternatively, you could set up an Rscript-calling shell script that takes named arguments from qsub:
## ./scripts/rscript_runner2.sh
## .. do some checks ..
Rscript "${script_name}" ${script_args}

and then do the following on the command line
RH> qsub -d ${pwd} \
    -v script_name="scripts/test_script.R",script_args="data/seed.txt data/output.txt" \
    -l walltime=00:01:00,cput=00:01:00,nodes=1:onegpc,os=centos6 \
    scripts/rscript_runner2.sh
    
##############
These two methods make me a bit uneasy, though. I use many more shell commands than just Rscript in an average workflow and don't want to have to write a massive shell script that calls qsub for each step, passing a shell script wrapper each time.

Can I therefore get snakemake to pass shell commands to the cluster using qsub?
Supposedly yes, because it says this in the help for snakemake:
"--cluster CMD, -c CMD
                       Execute snakemake rules with the given submit command,
                        e.g. qsub. Snakemake compiles jobs into scripts that
                        are submitted to the cluster with the given command,
                        once all input files for a particular job are present.
                        The submit command can be decorated to make it aware
                        of certain job properties (input, output, params,
                        wildcards, log, threads and dependencies (see the
                        argument below)), e.g.: $ snakemake --cluster 'qsub
                        -pe threaded {threads}'."

So, let's try the following (note that there are a couple of shell steps in the above Snakefile: defining the seed, and calling the R-script - so this should qsub into the cluster twice):

RH> rm data/*
RH> snakemake \
  -s Snakefile \
  --cluster "qsub -d ${PWD} -l walltime=00:01:00,cput=00:01:00,nodes=1:onegpc,os=centos6" \
  --jobs=1

It actually worked straight off. You have to specify --jobs=<N> to prevent snakemake from sending an unending stream of jobs to the cluster. My working directory has the following files in it:
 snakejob.define_seed.0.sh.e310371 
 snakejob.define_seed.0.sh.o310371 
 snakejob.input_to_output.1.sh.e310372 
 snakejob.input_to_output.1.sh.o310372

The prefix 'snakejob.<recipe_name>.<N>.sh' is the name of the random script that snakemake passed to qsub.

The '-d ${PWD}' argument is used to ensure the correct working directory when I qsub into the cluster. Within the snakemake --cluster argument, it isn't really required: I suspect snakemake passes full pathnames around within the random scripts.

######################
Next, how do I specify that different recipes will require different lengths of time to run within the snakefile?....


No comments:

Post a Comment