Tuesday, 16 May 2017

Project organisation

How do you organise your projects? I'm sure no-one but me cares how I organise mine. Here, I'm trying to put down my rules and my reasons for them, so that when I start breaking my rules I can send myself to the corner of naughty bioinformaticians. Have a look at this if you want the same kind of info, but from someone with a lot more clout.

There are a few motivating factors behind my choice of approach:

1) It should be quick to set up a new project;

2) Each project should have a consistent structure;

3) A build tool should be used to track dependencies amongst code / data / results, so that updates to a given file trickle down to any dependent files (without forcing nondependent files to be remade)
[hence a Snakefile for use in snakemake is used at the head of each project];

4) I should be able to git-clone one of my projects from my personal bitbucket ...
[all configs, docs, scripts should be present and under version control];

5) ... and be able to setup and run it on most linux boxes (given the availability of the input data and modulo some configurations, eg, to link to that data)  ...
[filepaths etc. should be hardware-independent within the project, non-standard shells and the like should be avoided];

6) ... and generate all results / reports for that project automatically ...
[related: any results quoted in reports should be dynamically updateable];

7) ... and it should only take a few lines of command line code to go from step [4] to step [6];

8) Each project should run in a self-specified, version-explicit, environment
[so we use conda and related projects anaconda & bioconda to manage all external libraries / packages / programs that are used];

9) If multiple projects need to use a particular script / library that I've written, each project should keep an internal copy of that script
[so that project-generic code can be updated without breaking a working project];

10) Big projects should be allowed to contain loosely-coupled subprojects to keep everything neat
[but the subprojects should respect the principles above];

[TODO: add in what you forgot: projects shouldn't tread on each other's toes; the value of packaging; whither exploratory analysis; the volatility of external data; how to limit code duplication; issues with data used by multiple projects?]

A typical project template looks like this:

    - conf/
        - check_these_dirs.txt
        - copy_these_files.txt
        - job_specific_vars.sh
        - make_these_links.txt
        - make_these_subdirs.txt
        - touch_these_files.txt
    - data/
        - ext/
        - int/
        - job/
          - <READ/WRITE_TO_THIS_DIR>
    - doc/
        - figure/
        - notebook.[Rmd|lyx]
        - <various_throwaway_ideas>.ipynb
    - lib/
        - <PKGNAME>/ # R package DESCRIPTION/R/tests etc
        - <PKGNAME>.tar.gz
        - Makefile
        - conf/
            - include_into_rpackage.txt
        - global_rfuncs/
            - R/
            - tests/testthat/
              - <... AND TESTS>
        - local_rfuncs/
            - R/
            - tests/testthat/
              - <... AND TESTS>
        - setup.DESCRIPTION.R
    - requirements.txt # OR environment.yaml
    - results/
    - scripts/
        - check_env.sh
        - setup_dirs.sh
        - setup_libs.sh
        - setup.sh
    - Snakefile
    - <OPTIONAL:subprojects/>
    - temp/
    - TODO.txt

The default contents of ./conf/ and ./scripts are added by a script new_project.sh

The script setup.sh defines all the rest of the project-specific directory structure and builds a project-specific R package based on the contents of the files in ./conf/ and the scripts setup_dirs.sh and setup_libs.sh. I modify the config scripts/files to specify the names of the job, the conda-environment in which the job is to run, and the optional job-specific R package and I specify other things, like whether an R kernel is required, which files should be copied/linked/touched before running the project.

Environment set up:
While adding code to a project you'll need to import external libraries / programs. For the sake of reproducibility you should keep track of the versions of the resources you are using. The easiest, and most portable, way to do this is by using conda to install any external dependencies of your code. I make a new environment for each main project that I'm working on 
(`conda create --name some_project`)
 and use 
(`conda install -c <repos> <package>`)
to add extra stuff to it. Having done this, always output a requirements.txt file (or environment.yaml, see here) containing the currently-installed conda stuff for the project 
(`conda list --explicit --name some_project > requirements.txt`)
and keep this under version control. Then if you need to run the same project on a different computer, you should be able to create an identical environment using
(`conda create --name some_project --file-spec requirements.txt`).

Code for the project gets put into either scripts/ or lib/.

./scripts mostly contains python / shell / R scripts and snakemake-recipe-scripts that are either imported into, or called as subworkflows, by the main project Snakefile.

./lib contains any R function files, class definitions or pipelines that are used in the project. Some of these R files are used by several projects - these are copied (not linked) from a separate repository into ./lib/global_rfuncs: I have to ensure that in the future, any given project will still run - if the global R files for a project were linked to the repository versions, any update to the repository versions may break the future compilation of this project. R functions (etc) that are written specifically for this project are put into ./lib/local_rfuncs. I use a makefile to build and install the project-specific R package. Having an R package for each project means the R-scripts I put in ./scripts are more lightweight than if they contained loads of function definitions.

I don't need an equivalent package / library for python functions because I write so few python scripts and the script vs function isolation provided by `if __name__ == '__main__'` means I can easily use functions from one python script inside another.

Unlike in the Noble paper above, I don't put separate src/ and bin/ directories in projects because I don't compile jack.

There are two main ways to pass project-specific data into a project: via filepaths in ./conf or in ./data. Minor routes exist: through hard-coding information into project-specific scripts, notebooks or the R package for the project, or (for non-project-specific data) by importing bioconductor packages.

If I want to be able to modify a datafile `by hand` without hard-coding the info in a script/library, the file has to go in ./conf - not in ./data. For example, I recently did a meta-analysis of a lot of GEO/ArrayExpress datasets - having checked the relevant papers/abstracts, not all of the datasets identified by my automated searches were of value to the meta-analysis, I encoded my inclusion/exclusion criteria for these datasets in a yaml file in ./conf. (because the info may have been used by multiple docs/scripts, it had to be somewhere other than ./scripts or ./lib).

All datasets generated by the current project (and most data used by the current project) is accessed via ./data, although ./data may contain links to files/dirs elsewhere on the file system. This means I can use relative subdir filepaths throughout the project, and indeed, I never use full paths within the scripts/docs/libs for the project (except during setup.sh if a full-path is mentioned in one of the config files). That is, any script/makefile/notebook used in a project uses filepaths relative to the main working directory for the project (ie, the dir where Snakefile is sitting) and none of them use filepaths that ../../../..... off into the wilderness (nonetheless, I do allow subproject Snakefiles to have dependencies on other parallel subprojects). The only way to use files that are external to the current working directory and it's subdirs, is to either copy them into one of the project directories or link to them from ./data/ext (for datasets downloaded from elsewhere) or ./data/int (for datasets generated by our lab, or for results generated by another one of my projects). Any datasets generated during running a project are put into ./data/job - so a project shouldn't write to ./data/int, ./data/ext or ./conf.

I used to write project notebooks in lyx (with knitr and the like) but I've started writing it all in Rmarkdown and might move over to bookdown eventually (cos it looks like I can stitch several subproject notebooks together more easily in bookdown, but I (or someone) need to put bookdown on anaconda first). Reasons for switching: Rmarkdown and .Rmd files seem better at these things (some may be due to my own ignorance however):

Including dynamic results within paragraphs;
Automated building of pdfs (and other formats);
Working directory bewilderments
Readability in version control
Reproducibility and portability (lyx is a bit bloated for conda)

Running the whole damn thing:
- Clone the project
- Move into the project's main directory
- # Possibly modify config files in ./conf
- Create conda env
- Activate conda env
- Run ./scripts/setup.sh to setup all dir structure and data links and to build/install the R package into the conda envcompilation
- Run snakemake

Now I change my mind and rewrite the whole thing...

No comments:

Post a Comment