Thursday, 26 January 2017

Git hooks, notebooks and beyond

URGGH! I've been migrating from using knitr in lyx to using jupyter notebooks to write up R-based data analyses (although to be honest I think I'll still use knitr to write up formal reports). With knitr, the file containing the write-up and code was distinct from the file containing the compiled results. In jupyter, when you execute your code, the results are stored in the same file in which you write your code.

I only keep the free text and code sections under version control - that is, I only keep those bits of an analysis that have been written by me under version control - not the executed sections of the analysis (output pdfs / figures / tables etc; since these bits take up much more space and change much more rapidly than the code responsible for generating them). This means that if I've been working on a jupyter notebook and want to push my new work into git/bitbucket, I have to clear all the executed output from my notebook, save the file, then git add -> commit -> push. [Then recompile the notebook and do more writing]

Trouble is, before you add/commit a jupyter notebook to version control it's very easy to either forget to clear all output from the notebook or forget to save the notebook after you've cleared the output. Either way, you might end up committing locally (which is annoying but easily reset), or even worse, pushing an executed notebook onto the distant shores of bitbucket (after which resetting / reverting to an earlier unexecuted notebook is a ballache).

Enter git hooks.

I'd never used these before, but git hooks are scripts that git runs at particular points in its workflow; the ones I care about here can check various properties of your files before allowing you to successfully commit / push them into version control (there is no hook that fires on git add). Examples of these scripts are provided in the .git/hooks/ directory of a git repository. They run at different stages of the git workflow, as indicated by their names (pre-commit, pre-push etc.). You have to drop the .sample extension from their default filenames before they are activated: e.g., change pre-commit.sample to pre-commit.
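
As a rough sketch of that activation step, the snippet below works in a throwaway repository rather than a real one. The names repo and hooks are just illustrative variables, and the fallback branch is an assumption to cover git installs that ship no sample hooks:

```shell
# Make a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
git init -q "$repo"
hooks="$repo/.git/hooks"

# Activate the pre-commit hook by copying the sample without its extension.
# (Assumption: if this git install ships no samples, use a do-nothing hook.)
if [ -f "$hooks/pre-commit.sample" ]; then
  cp "$hooks/pre-commit.sample" "$hooks/pre-commit"
else
  mkdir -p "$hooks"
  printf '#!/bin/sh\nexit 0\n' > "$hooks/pre-commit"
fi

# git only runs hooks that are executable.
chmod +x "$hooks/pre-commit"
ls -l "$hooks/pre-commit"
```

In a real repository you would run the cp and chmod lines from the top of the working tree, with hooks set to .git/hooks.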

Suppose I've staged a modified jupyter notebook using git add. I wanted to ensure that it doesn't contain any executed cells prior to committing it to my local git repository. So I modified the pre-commit hook. The default version of pre-commit.sample has a #!/bin/sh shebang line, but you can rewrite a hook in any scripting language you want, so long as the file can be executed at the command line. The example code starts:

if git rev-parse --verify HEAD >/dev/null 2>&1
then
        against=HEAD
else
        # Initial commit: diff against an empty tree object
        against=4b825dc642cb6eb9a060e54bf8d69288fbee4904
fi

This sets the script up to compare your staged files against HEAD (if you've previously made a commit) or against an empty tree (if this is the very first commit in the repository).
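
The long hexadecimal string is not arbitrary: it is the id git gives to a tree with no files in it. One way to check this, assuming a git that uses the default SHA-1 object format, is to hash empty input as a tree object (the variable name empty_tree is only for illustration):

```shell
# Ask git for the object id of a tree with no entries.
# /dev/null supplies empty content; -t tree says to hash it as a tree object.
empty_tree=$(git hash-object -t tree /dev/null)
echo "$empty_tree"
```

With SHA-1 objects this should print the same id that the sample hook hard-codes.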

In a jupyter notebook, the JSON for a non-executed cell contains the line
   "execution_count": null,
whereas an executed cell contains a line of the form
   "execution_count": <some integer>,
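
To see the difference between the two forms, here is a small sketch that writes a made-up two-cell fragment of notebook JSON to a temporary file and counts the lines carrying an integer execution count. The fragment is hand-written for illustration and is not a complete, valid notebook:

```shell
# Write a tiny two-cell fragment of notebook JSON to a temporary file:
# the first cell is unexecuted (null), the second has been run (3).
nb=$(mktemp)
cat > "$nb" <<'EOF'
{
 "cells": [
  {"cell_type": "code",
   "execution_count": null,
   "outputs": [], "source": ["x <- 1"]},
  {"cell_type": "code",
   "execution_count": 3,
   "outputs": [], "source": ["x + 1"]}
 ]
}
EOF

# Count the lines that carry an integer execution count.
executed=$(grep -c '"execution_count": [0-9]' "$nb")
echo "executed cells: $executed"
rm -f "$nb"
```

Only the second cell matches, so the count is 1; the null entry is ignored because the pattern requires a digit.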

To identify and throw an error when an executed notebook is passed to git commit, we use the following code in the hookfile pre-commit:
# When committing a jupyter notebook, ensure that no executed cells are present
#   in the committed file. Only added lines (those starting with '+') are
#   checked, so that a commit which clears previously committed output
#   is not rejected because of the lines it removes:
if git diff --cached "$against" |
     grep -q '^+.*"execution_count": [0-9]'
then
  cat <<\EOF
Error: Attempt to commit a jupyter notebook that contains executed cells
EOF
  exit 1
fi
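
One way to sanity-check a hook like this is to try it in a throwaway repository. The sketch below is a test rig of my own assumptions, not part of the hook itself: the repository, the identity settings, the file demo.ipynb and the two result variables are all made up, and the hook it installs is a cut-down version that only looks at added lines of the staged diff.

```shell
# Throwaway repository; the identity settings only apply inside it.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1
git config user.name "Hook Tester"
git config user.email "tester@example.com"

# Install a cut-down pre-commit hook that rejects added lines
# containing an integer execution count.
mkdir -p .git/hooks
cat > .git/hooks/pre-commit <<'EOF'
#!/bin/sh
if git diff --cached | grep -q '^+.*"execution_count": [0-9]'
then
  echo "Error: Attempt to commit a jupyter notebook that contains executed cells" >&2
  exit 1
fi
EOF
chmod +x .git/hooks/pre-commit

# Stage a fake notebook fragment with an executed cell and try to commit.
printf '{\n   "execution_count": 2,\n   "outputs": []\n}\n' > demo.ipynb
git add demo.ipynb
if git commit -q -m "executed notebook" 2>/dev/null; then
  executed_result=accepted
else
  executed_result=rejected
fi

# Clear the count, re-stage, and try again.
printf '{\n   "execution_count": null,\n   "outputs": []\n}\n' > demo.ipynb
git add demo.ipynb
if git commit -q -m "cleared notebook" 2>/dev/null; then
  cleared_result=accepted
else
  cleared_result=rejected
fi

echo "executed: $executed_result, cleared: $cleared_result"
```

The first commit should be refused by the hook and the second should go through, so the final line reads "executed: rejected, cleared: accepted".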

Now I get an error, and the commit refuses to complete, whenever an executed cell is found in any of the staged files.
Hopefully, I'll never upload another huge notebook to bitbucket.
