Thursday, 29 June 2017

Helpful bash posts

2018-03-17

To show a précis of the recent commits (--oneline), the branch and tag names that point at them (--decorate=auto) and an ASCII graph of the branch structure (--graph):

git log --decorate=auto --oneline --graph

Note that --decorate=auto is the default from git v2.13. For earlier versions (as on my current WSL) you can set the equivalent option in your git config:

git config log.decorate auto
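A minimal sketch of what this looks like, using a throwaway repo (the commit messages and user identity below are made up for illustration):

```shell
# Build a throwaway repo, just to have something to log:
tmpdir=$(mktemp -d)
cd "$tmpdir"
git init -q .
git -c user.name=me -c user.email=me@example.com \
    commit -q --allow-empty -m "first commit"
git -c user.name=me -c user.email=me@example.com \
    commit -q --allow-empty -m "second commit"
# One line per commit, with graph markers down the left; when run in a
# terminal, the current branch tip is also decorated (eg "(HEAD -> master)"):
git log --decorate=auto --oneline --graph
```

Note that auto means "decorate when writing to a terminal", so the decorations disappear if you pipe the output elsewhere.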

2018-03-02
Don't extract the whole of that archive!

If you just want to list the contents:

tar --list -jf some_archive.tar.bz2
tar -jtf some_archive.tar.bz2

If you just want to extract a specific file from the archive (this stores my_filename relative to the current directory, making any subdirs that are required):
tar -xjf my_archive.tar.bz2 my_filename

If you want to extract the contents of a specific file, eg, to pipe the data into another command:
tar --to-stdout -xjf my_archive.tar.bz2 my_filename
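Putting those together, here's a round trip that you can sanity-check (my_archive / my_filename are placeholder names, as above):

```shell
# Make a tiny archive, then list it and stream one member out of it:
tmpdir=$(mktemp -d)
cd "$tmpdir"
mkdir -p sub
echo "hello" > sub/my_filename
tar -cjf my_archive.tar.bz2 sub/my_filename

tar -jtf my_archive.tar.bz2
# sub/my_filename
tar --to-stdout -xjf my_archive.tar.bz2 sub/my_filename
# hello
```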

Thanks to you at nixCraft for the listing code

2017-07-06
The command ":" is an odd one. Why have a null command? grep returns a non-zero exit code if it finds no matching lines in its input. So in a pipeline like the following:

VAR1=$( do_something file.1 | command2 | grep "some_value" - )

The pipeline would return a non-zero exit code if some_value wasn't found in the output of command2. If you're running with set -o pipefail or set -e enabled, then that final grep will kill your script, even if you're happy for VAR1 to receive an empty string (eg, because you want to do further tests on VAR1 afterwards). To prevent grep from killing your script you can use

VAR1=$( do_something file.1 | command2 | grep "some_value" - ) || true
or
VAR1=$( do_something file.1 | command2 | grep "some_value" - ) || :

and the pipeline doesn't return with a non-zero error code. I think it's more common to use true than to use ':' here though.
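A self-contained sketch of the rescue (the input here is junk data, chosen so that grep finds nothing):

```shell
set -e   # any failing command kills the script
# grep exits 1 because nothing matches "some_value"; without the
# '|| true' rescue, set -e would stop the script on this line:
VAR1=$(printf 'aaa\nbbb\n' | grep "some_value" -) || true
echo "still alive; VAR1='${VAR1}'"
# still alive; VAR1=''
```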

2017-07-04
Needed to make a backup of an external drive, so that I could repartition the drive. To check that everything's copied over correctly, the md5sums of all the copied files can be checked. This is done recursively, so md5sums are computed for the files in each subdirectory as well. From here:

find . -type f -exec md5sum {} \; > .checks.md5
grep -v "\.\/\.checks\.md5" .checks.md5 > checks.md5
rm .checks.md5

Then to check all the file-copying has worked properly:

md5sum -c checks.md5
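The whole workflow, run on a toy directory tree (the file names here are invented):

```shell
tmpdir=$(mktemp -d)
cd "$tmpdir"
mkdir -p sub
echo "data1" > file_a
echo "data2" > sub/file_b

# Checksum everything, then drop the checksum file's own entry:
find . -type f -exec md5sum {} \; > .checks.md5
grep -v "\.\/\.checks\.md5" .checks.md5 > checks.md5
rm .checks.md5

md5sum -c checks.md5   # prints "<name>: OK" for each file
```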

2017-07-03
Anyone else working with some big big files? To print out the amount of storage used by a given directory, use du:

du -h <directory>

This shows how much storage each subdir is using (-h: in human-readable form). To sum this all up, and thus compute the total disk usage in and below <directory>:

du -sh <directory>
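To find the big offenders, you can limit the depth and sort by size. Note that --max-depth and sort -h are GNU options, so this is Linux-flavoured; the directory names below are invented for the demo:

```shell
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/big" "$tmpdir/small"
head -c 200000 /dev/zero > "$tmpdir/big/blob"
head -c 100    /dev/zero > "$tmpdir/small/blip"
# One line per immediate subdir, biggest at the bottom:
du -h --max-depth=1 "$tmpdir" | sort -h
```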

2017-07-02
sed and awk are fine, but perl provides me a bit more control over regexes. I use this if I need to pass an absolute path to a program that doesn't understand "~" or I want to compare the full paths of two files.

expand_tilde()
{
  echo "$1" | perl -lne 'BEGIN{$H=$ENV{"HOME"}}; s/^~/$H/; print'
}

expand_tilde ~/temp/abc
# /home/me/temp/abc
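If you'd rather not shell out to perl, bash's parameter expansion can do the same leading-tilde replacement (bash-only, not POSIX sh; note the argument must be quoted, otherwise the shell expands the ~ before the function ever sees it):

```shell
expand_tilde_bash()
{
  # Replace a '~' only if it's the first character of the argument:
  echo "${1/#\~/$HOME}"
}

expand_tilde_bash '~/temp/abc'
# /home/me/temp/abc (ie, $HOME/temp/abc)
```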

2017-07-01
Find out what the different config files (/etc/profile, ~/.bashrc, ~/.bash_profile) do and when they are called. See here.

2017-06-30
I always used to use backticks for command substitution and only recently learned about using parentheses for this:

$(command) == `command`

Importantly, you can readily nest commands with the former:

$(command $(other_command))

See here for further info (especially about the interplay between quoting and substitution).

2017-06-29
Useful bash tips at shell-fu, for example, stripping trailing whitespace without opening an interactive vim session:

vim -c "1,$ s/ \+$//g | wq" <filename>
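sed can do the same clean-up without vim at all. This uses GNU sed's in-place flag (BSD sed wants -i ''); scratch.txt is a throwaway file name:

```shell
# A file with trailing spaces and a trailing tab:
printf 'keep me   \nalso me\t\n' > scratch.txt
# Delete any run of whitespace at the end of each line, in place:
sed -i 's/[[:space:]]*$//' scratch.txt
```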

2017-06-28
A variety of helpful bash things that I found on the web.

cp file{1,2}.txt == cp file1.txt file2.txt

See here
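Brace expansion goes further than cp: the shell expands the braces before the command runs, so it works anywhere ('demo_dir' below is a made-up name):

```shell
echo file{1,2}.txt
# file1.txt file2.txt
echo img_{01..03}.png
# img_01.png img_02.png img_03.png
mkdir -p demo_dir/{src,doc,data}   # three subdirs in one call
```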

Wednesday, 21 June 2017

Working directory in R markdown

Discussed elsewhere, I organise my bioinformatics projects like this:

./jobs/
    - <jobname>/
        - conf/
        - data/
        - doc/
            - notebook.Rmd
            - throwaway_script.ipynb
        - lib/
        - scripts/
        - Snakefile

where the top-level snakemake script controls the running of all scripts and the compiling of all documents. My labbooks are stored as R-markdown documents and get compiled to pdfs by the packages rmarkdown and knitr. Despite RStudio's appeal, and despite spending nigh on all of my time writing R packages, scripts and notebooks, I'm still working in vim.

When working on the project <jobname>, my working directory is ./jobs/<jobname> and, in the simple case when a given project has no subprojects, this working directory shouldn't be changed by any of the scripts or documents. knitr's a bit of a bugger though.

To compile a single *.Rmd notebook, I have the following snakemake recipe. It converts doc/some_file.Rmd into doc/some_file.pdf or doc/some_file.docx depending on the required output filename.

rule compile_markdown:
    input: script = "doc/{doc_name}.Rmd"
    output: "doc/{doc_name}.{ext,pdf|docx}"
    params: wd = working_dir
    run:
        R("""
            library(knitr)
            library(rmarkdown)
            library(tools)
            opts_knit$set(root.dir = "{params.wd}")
            doctype <- list(pdf  = "pdf_document",
                            docx = "word_document"
                            )[["{wildcards.ext}"]]
            rmd.script <- "{input.script}"
            render(rmd.script,
                   output_format = doctype,
                   output_file   = "{output}",
                   output_dir    = "doc",
                   quiet = TRUE)
        """)


Why is that opts_knit$set(root.dir = ...) line in there?

Assume I'm sitting in the working directory "~/HERE" on the command line.

Let's write a simple R markdown script (doc/temp.Rmd; the block below is the entire file) that just prints out the current working directory:

---
title: "temp.Rmd"
output:
    - pdf_document
---
```{r}
print(getwd())
```

... and then render that into a pdf:
~/HERE> Rscript -e "library(rmarkdown); render('doc/temp.Rmd')"

This prints out the working directory as "~/HERE/doc" (where the script is stored) rather than the directory "~/HERE", where I called Rscript from.

Note that if I put a similar .R script in ./doc, that prints out the current working directory, this doesn't happen:

# temp.R
print(getwd())
# end of temp.R
~/HERE> Rscript ./doc/temp.R
# [1] "/home/russh/HERE"

This indicates that the R-working directory is HERE (the same as the calling directory; as does Rscript -e "source('doc/temp.R')").

There are a few reasons why I don't like the working directory being changed within a script. When you want a script to read some data from a file, you can either i) write the filepath for your data within the script, or ii) provide the script with the filepath as an argument.

Let's suppose you had two different scripts that access the same file, but where one of those scripts changes the working directory. Then:

In case i:  you'd have to write the filepath differently for the two scripts. Here, if I wanted to access ./data/my_counts.tsv from within ./doc/temp.R and ./doc/temp.Rmd, I'd have to write the filepath as "./data/my_counts.tsv" within the former, and "../data/my_counts.tsv" within the latter.

In case ii: you'd have to similarly mangle the filepaths. A script that changes the working directory should be provided filepaths relative to the working directory chosen by that script, so you have to think as if you're in that directory; or use absolute paths (NO!!!!!!).

I know it seems trivial, and described as above it only seems like a mild inefficiency to have to write different scripts in slightly different ways. And I know it's all just personal preference: and so in that light, please don't change the working directory.

Others are somewhat more forceful - see the here_here package (though they're discussing a very different issue) for the delightful statement: "If the first line of your #rstats script is setwd("C:\Users\jenny\path\that\only\I\have"), I will come into your lab and SET YOUR COMPUTER ON FIRE.". I hope they don't mind, because I'm about to change the working directory back to what it should have been all along... (IMO).

Can we use setwd() to change directory in R-markdown? Write another script:

---
title: "temp2.Rmd"
output:
    - pdf_document
---
```{r}
# we already know this is ./doc
print(getwd())
setwd("..")
print(getwd())
```
```{r}
# surely setting the working directory has done its job
print(getwd())
```
~/HERE> Rscript -e "library(rmarkdown); render('doc/temp2.Rmd')"

The pdf that results from rendering this from the command line prints the equivalent of:
# Before using setwd()
~/HERE/doc
# After using setwd()
~/HERE
# In the second R block:
~/HERE/doc

So setwd() can change the working directory from ./doc back to `my` working directory, but this change doesn't persist between the different blocks of code in an R-markdown document. (We also get the warning message:
In in_dir(input_dir(), evaluate(code, envir = env, new_device = FALSE, : You changed the working directory to <...HERE> (probably via setwd()). It will be restored to <...HERE>/doc. See the Note section in ?knitr::knit.)

So that's not how to change dirs within R-markdown.

How to do that is shown on Yihui Xie's website for knitr.

So if we write the script:
---
title: "temp3.Rmd"
output:
    - pdf_document
---
```{r}
library(knitr)
print(getwd())
opts_knit$set(root.dir = "..")
print(getwd())
```
```{r}
print(getwd())
```
~/HERE> Rscript -e "library(rmarkdown); render('doc/temp3.Rmd')"

This time it runs and prints the following filepaths:
# Before using opts_knit$set:
~/HERE/doc
# After using opts_knit$set, but within the same block:
~/HERE/doc
# In the second R block:
~/HERE

This has set the working directory to the required path, and it's done it without us hard-coding the required working directory in the script (so at least I won't get my computer set on fire). But the change in the wd only kicks in for blocks that are subsequent to the opts_knit call.

So, we can set the knitr root directory inside an R-markdown script, but we can also set it from outside the R-markdown script:
# - knitr has to be loaded before opts_knit can be used.
# - dollar-signs have to be escaped in bash.
Rscript -e "library(rmarkdown); library(knitr); opts_knit\$set(root.dir = getwd()); render('doc/temp3.Rmd')"

# OR

Rscript -e "library(rmarkdown); render('doc/temp3.Rmd', knit_root_dir = getwd())"

Now, all the filepaths printed within the pdf are the same (they are all ~/HERE): the call to change the root.dir within the .Rmd file has no effect, and all code within the .Rmd file runs as if it's in my working directory.

Now we'll try something a little bit more complicated. You can include child documents within a given *.Rmd file:
---
title: "temp_master.Rmd"
output:
    - pdf_document
---
Calling child document (path relative to the calling process and knit_root_dir):
```{r, child = "doc/temp.Rmd"}
```
In master document:
```{r}
print(getwd())
```
~/HERE> Rscript -e "library(rmarkdown); render('doc/temp_master.Rmd', knit_root_dir = getwd())"

This fails: knitr can't find the file ./doc/temp.Rmd (although it definitely exists relative to the directory that I started R in, which is the same directory that I set knit_root_dir to be).

So we rewrite as doc/temp_master2.Rmd:
---
title: "temp_master2.Rmd"
output:
    - pdf_document
---
Calling child document (path relative to the parent document):
```{r, child = "temp.Rmd"}
```
In master document:
```{r}
print(getwd())
```
~/HERE> Rscript -e "library(rmarkdown); render('doc/temp_master2.Rmd', knit_root_dir = getwd())"

This works fine. In both the parent and the child document, the value of getwd() is ~/HERE, so knit_root_dir seems to be passed through from parent to child. But, the child document has to be specified relative to the parent document.

So what have I learned:

  • Never setwd() in an R-markdown file - knitr just restores its own working directory before the next code block
  • If you want your R-markdown files to compile as if ran from a specific directory, set the knitr root-directory using opts_knit$set(root.dir = "some.dir") or using the knit_root_dir argument to rmarkdown::render
  • You can set the knitr root directory from either inside an R-markdown file (put it in the first block), or in an R script that calls rmarkdown::render
    • but only set it once, 
    • and it's probably best to set it from outside the R-markdown file (if you might subsequently want to include that file as a child of another file)
  • If you do include child documents, be aware that the filepath you put into your parent document should be written relative to the parent document, not relative to the knit_root_dir

Aside from that last point, write your paths relative to the root directory for your project...

---

Friday, 9 June 2017

R formatting conventions

[I changed my conventions; disregard this, future Russ]

When it comes to naming things, we've all got our own preferences. Here, boringly, are my preferences for naming things in R (it's partly based on Google and Wickham; but I also found this interesting). You can call it a style guide, if you like, but I've very little by way of style. To me it's just a bunch of conventions that I try to stick to.

I use a different naming format for functions (underscore-separated) than for non-functions (dot-separated). I really don't care that dot-separation would confuse java or python, because these are my conventions for R. 

An exception to the function_name underscore-convention is when I'm making functions of a particular type: I put the type as a dot-separated prefix. Eg, if I'm writing a function that returns a function, I call it 'builder.more_specific_name'; if it's a function that reads or writes files, I call it 'io.what_does_it_do'; etc.

I try to push volatile side-effecting or state-dependent shizz (IO, changes to options, printing, plotting, setting/using RNG streams; but not stop/message/warning and the like) out to the sides. All variables used by a function are passed in as arguments or defined therein, so that modulo stop (etc), most functions are pure.

Use at most 80 characters per line

Wherever possible, use <- instead of =.
woop <- TRUE  # good
woops = FALSE  # bad

Never use <<- 
... and never recommend a course that teaches how to use <<- to anyone
... and never, ever, mention <<- on your blog.
Similarly: attach, ->

Try not to overwrite the names of existing objects (you can check using exists("possible.name")):

j5.df <- data.frame(abc = 123)
df <- data.frame(abc = 1:3)   # see stats::df
keep <- which(j5.df$abc == 1) # keep is not currently defined
drop <- which(df$abc == 2)    # but drop is: see base::drop

Use 2-spaces to indent:

my_func <- function(arg1){
  # some code
  }

If there's more than one argument to a function you are defining, put each arg on a separate line and offset these by an extra 2 spaces:

my_fancy_function <- function(
    arg1,
    another.arg = NULL
  ){
  if(missing(arg1)){
    stop("ARGGGGG!")
    }
  body("goes", "here", arg1, another.arg)
  }

Use the magrittr pipe to connect several functions together (and write the functions on separate lines):
my.results <- abc.def %>%
  function1(some.other.arg) %>%
  some_other_fn

All the rest, rather briefly:

job.specific.package.name

ClassName ## and also ClassName == ConstructorName

object.name

function_name

<function_type>.function_name

<function_type>.function_name.<name_of_pipeline>

.non_exported_function

pipeline.name_of_pipeline

Re the last thing, my pipelines aren't just strings of functions, they're strings of functions that log intermediate results (summary tables, summary statistics, plots etc; this is all done functionally, so you can't log base-R graphics or any other side-effect dependent stuff) and return both a final-processed dataset and the specified intermediate results. I'm sure I'll write about that some other time (when it's less hacked together).

Neat, related tools: formatR (thanks to Lovelace), lintr, ...

I noted that formatR is able to identify and fix some examples of poor coding style in R. However, it didn't seem particularly pretty to me:
- there was no way to specify that line widths should be at most 80 characters (its line-width option splits lines only after at least a given width);
- it wrapped function calls / function definitions into as dense a space as possible.

lintr is also able to identify examples of poor coding style in R. Some of my choices don't fit its default checkers though - notably, dot-separation in variable/function names - but I still like to have two forms of punctuation available. I'm going to write a specific lintr script to call from my git pre-commit hook.

Both are available via conda.