Wednesday 22 November 2017

snakemake: using input/output values in params

It says in the changelog for snakemake that from snakemake v3.10.2 you can use entries in the output directive when computing params. Similarly you can use input values. How and why would you do this?

I'm currently using snakemake 4.0.0.

The readthedocs gives this example for the syntax:

rule:
    input:
        ...
    params:
        prefix=lambda wildcards, output: output[0][:-4]
    output:
        "somedir/{sample}.csv"
    shell:
        "somecommand -o {params.prefix}"

Let's try and do that in a less brusque manner.

I'm modifying a snakemake workflow for RNA-Seq alignment and analysis at the moment. When I first used this workflow, each of my sequencing samples had been sequenced on each of six lanes of a NGS machine. I align the data for each sample/lane separately and then merge the data together using picard. This allows me to indicate the lane-of-origin for any given readpair within the aligned dataset.

To merge the data I use this snakemake rule:

rule merge_bams:
    message: """
             --- Merges different lanes into a single bam
             """
    input: get_rg_bams 
    output: bam = "{prefix}.merged.bam"
    params: input_formatted = format_bampaths_for_merging
    shell: """
        picard MergeSamFiles \
            {params.input_formatted} \
            O={output} \
            SORT_ORDER=coordinate \
            ASSUME_SORTED=true
           """

Here, the function `get_rg_bams` determines the names of all lane-specific files based on the value of `wildcards["prefix"]`. The function `format_bampaths_for_merging` converts the lane-specific files (it doesn't assume a fixed number of input files) into a single string of the form "I=input_1.bam I=input2.bam ...".

We've resequenced a subset of these samples, but this time we sequenced each sample on a single lane.

For the new dataset, there's no real reason to call MergeSamFiles since only a single input-file is available for any given sequencing sample (since there's seq-data from only a single lane) - we might as well just rename the input file. To do this, I could check if a single file is passed as the input, and rename it, and otherwise run MergeSameFiles.

I don't like to write too much shell code in these rules. So I'd rather determine how many files are input and pass that in as a parameter.

Toy code. I wrote a toy snakefile, and added three text files (file1, file2, file3) in its directory. It makes two files "a.txt" and "b.txt" by i) concatentating file1 and file2 into a.txt; ii) (trivially) concatenating file3 into b.txt.

# test1.smk
config_dict = {"a": ["file1", "file2"], "b": ["file3"]}

def get_input(wildcards):
    return config_dict[wildcards["prefix"]]
                                                                    rule all:
    input: ["a.txt", "b.txt"] 
            
rule catter:                                   
    message: """ --- concatenate all input --- """
    input:  get_input
    output: "{prefix}.txt"
    shell:  """cat {input} > {output}"""                                                               

To reflect the issue in my workflow, I only want to `cat` when more than one file is provided as input to `catter` (ie, when making a.txt) - when a single input file is provided I want to copy that file into the output.

# test2.smk
config_dict = {"a": ["file1", "file2"], "b": ["file3"]}

def get_input(wildcards):
    return config_dict[wildcards["prefix"]]
                                                                  rule all:
    input: ["a.txt", "b.txt"] 
            
rule catter:
    message: """ --- concatenate all input --- """
    input:  get_input
    output: "{prefix}.txt"
    shell:  """
        input_ary=({input})
        if [[ ${{#input_ary[@]}} -eq 1 ]];
        then
            cp {input} {output}
        else
            cat {input} > {output}
        fi
        """

That works fine. The only issue (for me) being the definition of input_ary and the determination of the length of that array within the shell command. I'd rather be able to precompute the number of files that are passed in using python. The syntax for this should be of the form 
  params: param = lambda wildcards, output, input: do_something

Just as stated in the docs (for the use of output in defining params), you can do this as follows

# test3.smk
# ... see the above for the set up ...
rule catter:
    message: """ --- concatenate all input --- """
    input:  get_input
    output: "{prefix}.txt"
    params: n_input = lambda wildcards, input: len(input)
    shell:  """
        if [[ {params.n_input} -eq 1 ]];
        then
            cp {input} {output}
        else
            cat {input} > {output}
        fi

        """

To my eyes, this version looks a lot better than the one where input_ary was defined by bash, since more code is ran within python/snakemake (and the bash equivalents for list/array manipulations are a bit opaque).

In the definition of n_input, I could have used any of the following (since params can be built on any of input / output / threads / resources and if any of these are used in the lambda-arguments the ordering of them doesn't matter, so long as they're mentioned after `wildcards`):

params: n_input = lambda wildcards, input: len(input)
params: n_input = lambda wildcards, input, output: len(input)
params: n_input = lambda wildcards, output, input: len(input)
params: n_input = lambda wildcards, threads, resources, output, input: len(input)

# and if you've something more involved you could call it like this:
def complex_function(input): return do_something(input)
... 
params: some_param = lambda wildcards, input: complex_function(input)

But, I couldn't use the following:
params: n_input = lambda input: len(input)

... since (as stated here) the first input to functions called within snakemake rules should always be 'wildcards'.