Wednesday 21 June 2017

Working directory in R markdown

As discussed elsewhere, I organise my bioinformatics projects like this:

./jobs/
    - <jobname>/
        - conf/
        - data/
        - doc/
            - notebook.Rmd
            - throwaway_script.ipynb
        - lib/
        - scripts/
        - Snakefile

The top-level snakemake script controls the running of all scripts and the compiling of all documents. My labbooks are stored as R-markdown documents and get compiled to pdfs by the packages rmarkdown and knitr. Despite RStudio's appeal, and despite spending nigh on all of my time writing R packages, scripts and notebooks, I'm still working in vim.

When working on the project <jobname>, my working directory is ./jobs/<jobname> and, in the simple case when a given project has no subprojects, this working directory shouldn't be changed by any of the scripts or documents. knitr's a bit of a bugger though.

To compile a single *.Rmd notebook, I have the following snakemake recipe. It converts doc/some_file.Rmd into doc/some_file.pdf or doc/some_file.docx depending on the required output filename.

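# R() runs a block of R code from within a rule (it needs rpy2 installed).
from snakemake.utils import R

# working_dir is assumed to be set elsewhere in the Snakefile, holding the
# project directory that snakemake is run from (./jobs/<jobname>).
rule compile_markdown:
    input: script = "doc/{doc_name}.Rmd"
    output: "doc/{doc_name}.{ext,pdf|docx}"
    params: wd = working_dir
    run:
        R("""
            library(knitr)
            library(rmarkdown)
            library(tools)
            # Run every chunk from the project directory rather than from ./doc
            opts_knit$set(root.dir = "{params.wd}")
            # Pick the rmarkdown output format from the requested extension
            doctype <- list(pdf  = "pdf_document",
                            docx = "word_document"
                            )[["{wildcards.ext}"]]
            rmd.script <- "{input.script}"
            render(rmd.script,
                   output_format = doctype,
                   output_file   = "{output}",
                   output_dir    = "doc",
                   quiet = TRUE)
        """)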

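The extension-to-format lookup in that recipe can also be written as a small standalone R function, which is handy if you want to call render() without snakemake. This is only a sketch: output_format_for is a name I've made up for illustration, and it covers just the two formats used in the rule.

```r
# Map an output filename to an rmarkdown format name, as the rule does with
# its {ext} wildcard. `output_format_for` is a hypothetical helper.
output_format_for <- function(output_file) {
  formats <- c(pdf = "pdf_document", docx = "word_document")
  ext <- tools::file_ext(output_file)
  if (!ext %in% names(formats)) {
    stop("Unsupported output extension: '", ext, "'")
  }
  formats[[ext]]
}

print(output_format_for("doc/some_file.pdf"))
print(output_format_for("doc/some_file.docx"))
```

Anything other than .pdf or .docx stops with an error, which mirrors the way the wildcard constraint in the rule only accepts those two extensions.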

Why is that opts_knit$set(root.dir = ...) line in there?

Assume I'm sitting in the working directory "~/HERE" on the command line.

Let's write a simple R markdown script (doc/temp.Rmd) that just prints out the current working directory:

---
title: "temp.Rmd"
output:
    - pdf_document
---
```{r}
print(getwd())
```

... and then render that into a pdf:
~/HERE> Rscript -e "library(rmarkdown); render('doc/temp.Rmd')"

This prints out the working directory as "~/HERE/doc" (where the script is stored) rather than the directory "~/HERE", where I called Rscript from.

Note that this doesn't happen if I put a similar .R script, one that also prints out the current working directory, in ./doc:

# temp.R
print(getwd())
# end of temp.R
~/HERE> Rscript ./doc/temp.R
# [1] "/home/russh/HERE"

This indicates that the R working directory is ~/HERE, the same as the calling directory. Running Rscript -e "source('doc/temp.R')" gives the same result.

There are a few reasons that I don't like the working directory being changed within a script. When you want to read some data from a file into a script that you're running, you can either i) write the filepath for your data within the script, or ii) provide the script with the filepath as an argument.

Let's suppose you had two different scripts that access the same file, but where one of those scripts changes the working directory. Then:

In case i:  you'd have to write the filepath differently for the two scripts. Here, if I wanted to access ./data/my_counts.tsv from within ./doc/temp.R and ./doc/temp.Rmd, I'd have to write the filepath as "./data/my_counts.tsv" within the former, and "../data/my_counts.tsv" within the latter.

In case ii: you'd have to similarly mangle the filepaths. A script that changes the working directory should be provided filepaths relative to the working directory chosen by that script, so you have to think as if you're in that directory; or use absolute paths (NO!!!!!!).
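
For what it's worth, here is roughly what the well-behaved version of case ii looks like: a script that never changes directory, so both its argument and its default are read relative to wherever it was called from. It's a minimal sketch; input_path is a hypothetical helper, and data/my_counts.tsv is just the example file from above (assumed to be tab-separated).

```r
# Hypothetical helper: take the data file from the first command-line
# argument, falling back to a path relative to the project root.
input_path <- function(args = commandArgs(trailingOnly = TRUE),
                       default = file.path("data", "my_counts.tsv")) {
  if (length(args) >= 1) args[[1]] else default
}

counts_path <- input_path()
message("Looking for '", counts_path, "' from ", getwd())

# Only read the file if it is there; the path is never rewritten.
if (file.exists(counts_path)) {
  counts <- read.delim(counts_path)
  str(counts)
}
```

Saved as, say, scripts/read_counts.R (again, a made-up name), it would be called from the project root as Rscript scripts/read_counts.R data/my_counts.tsv.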

I know it seems trivial, and described as above it only seems like a mild inefficiency to have to write different scripts in slightly different ways. And I know it's all just personal preference: and so in that light, please don't change the working directory.

Others are somewhat more forceful: see the here_here package (though they're discussing a very different issue) for the delightful statement, "If the first line of your #rstats script is setwd("C:\Users\jenny\path\that\only\I\have"), I will come into your lab and SET YOUR COMPUTER ON FIRE." I hope they don't mind, because I'm about to change the working directory back to what it should have been all along... (IMO).

Can we use setwd() to change directory in R-markdown? Write another script:

---
title: "temp2.Rmd"
output:
    - pdf_document
---
```{r}
# we already know this is ./doc
print(getwd())
setwd("..")
print(getwd())
```
```{r}
# surely setting the working directory has done its job
print(getwd())
```
~/HERE> Rscript -e "library(rmarkdown); render('doc/temp2.Rmd')"

The pdf that results from rendering this from the command line prints the equivalent of:
# Before using setwd()
~/HERE/doc
# After using setwd()
~/HERE
# In the second R block:
~/HERE/doc

So setwd() can change the working directory from ./doc back to my working directory, but this change doesn't persist between the different blocks of code in an R-markdown document. We also get the warning message:
In in_dir(input_dir(), evaluate(code, envir = env, new_device = FALSE,  :  You changed the working directory to <...HERE> (probably via setwd()). It will be restored to <...HERE>/doc. See the Note section in ?knitr::knit
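
That reset is what you'd expect if each block is evaluated inside a helper that puts the working directory back afterwards, which is what the in_dir() in the warning suggests. The following base-R sketch mimics that behaviour in a scratch directory. It is my own stand-in (run_in_dir is a made-up name), not knitr's actual code.

```r
# A stand-in for the kind of helper named in the warning: run some code in
# `dir`, then restore the previous working directory however the code left it.
# This is an illustration, not knitr's actual implementation.
run_in_dir <- function(dir, code) {
  old <- setwd(dir)
  on.exit(setwd(old), add = TRUE)
  force(code)  # `code` is evaluated lazily, so it runs after setwd(dir)
}

doc_dir <- file.path(tempdir(), "doc")  # scratch stand-in for ./doc
dir.create(doc_dir, showWarnings = FALSE)
start <- getwd()

# "Block" 1 moves up a directory; "block" 2 starts back in doc_dir regardless.
chunk1 <- run_in_dir(doc_dir, { setwd(".."); getwd() })
chunk2 <- run_in_dir(doc_dir, getwd())

print(chunk1)                     # the parent of doc_dir
print(chunk2)                     # doc_dir again
print(identical(getwd(), start))  # the caller's directory is untouched
```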

So that's not how to change dirs within R-markdown.

How to do that is shown on Yihui Xie's website for knitr.

So if we write the script:
---
title: "temp3.Rmd"
output:
    - pdf_document
---
```{r}
library(knitr)
print(getwd())
opts_knit$set(root.dir = "..")
print(getwd())
```
```{r}
print(getwd())
```
~/HERE> Rscript -e "library(rmarkdown); render('doc/temp3.Rmd')"

This time it runs and prints the following filepaths:
# Before using opts_knit$set:
~/HERE/doc
# After using opts_knit$set, but within the same block:
~/HERE/doc
# In the second R block:
~/HERE

This has set the working directory to the required path, and it's done it without us hard-coding the required working directory in the script (so at least I won't get my computer set on fire). But the change in the wd only kicks in for blocks that are subsequent to the opts_knit call.

So, we can set the knitr root directory inside an R-markdown script, but we can also set it from outside the R-markdown script:
# - knitr has to be loaded before opts_knit can be used.
# - dollar-signs have to be escaped in bash.
Rscript -e "library(rmarkdown); library(knitr); opts_knit\$set(root.dir = getwd()); render('doc/temp3.Rmd')"

# OR

Rscript -e "library(rmarkdown); render('doc/temp3.Rmd', knit_root_dir = getwd())"

Now, all the filepaths printed within the pdf are the same (they are all ~/HERE): the call to change the root.dir within the .Rmd file has no effect, and all code within the .Rmd file runs as if it's in my working directory.
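
If you'd rather not retype that one-liner, the second form wraps up neatly in a function. A minimal sketch, assuming rmarkdown is installed; render_from_root is a hypothetical name, and the only rmarkdown call it makes is render() with the knit_root_dir argument used above.

```r
# Hypothetical wrapper: render a notebook so that every block runs from `root`
# (by default, the directory R was started in) rather than from the
# notebook's own directory.
render_from_root <- function(rmd, root = getwd()) {
  if (!file.exists(rmd)) {
    stop("Cannot find '", rmd, "' from ", getwd())
  }
  rmarkdown::render(rmd, knit_root_dir = normalizePath(root), quiet = TRUE)
}

# Only attempt the render when the example notebook is actually present.
if (file.exists("doc/temp3.Rmd")) {
  render_from_root("doc/temp3.Rmd")
}
```

The file check comes first so that a mistyped path fails with a message naming the directory you are actually in.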

Now we'll try something a little bit more complicated. You can include child documents within a given *.Rmd file:
---
title: "temp_master.Rmd"
output:
    - pdf_document
---
Calling child document (path relative to the calling process and knit_root_dir):
```{r, child = "doc/temp.Rmd"}
```
In master document:
```{r}
print(getwd())
```
~/HERE> Rscript -e "library(rmarkdown); render('doc/temp_master.Rmd', knit_root_dir = getwd())"

This fails. knitr can't find the file ./doc/temp.Rmd (although it definitely exists, relative to the dir that I started R in, the same directory that I set knit_root_dir to be).

So we rewrite as doc/temp_master2.Rmd:
---
title: "temp_master2.Rmd"
output:
    - pdf_document
---
Calling child document (path relative to the parent document):
```{r, child = "temp.Rmd"}
```
In master document:
```{r}
print(getwd())
```
~/HERE> Rscript -e "library(rmarkdown); render('doc/temp_master2.Rmd', knit_root_dir = getwd())"

This works fine. In both the parent and the child document, the value of getwd() is ~/HERE, so knit_root_dir seems to be passed through from parent to child. But, the child document has to be specified relative to the parent document.
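
Since I think of every path as relative to the project root, it can help to convert a root-relative child path into the parent-relative form before writing it into the child option. Here's a rough sketch of that conversion. path_relative_to_parent is a made-up helper, and it assumes plain relative paths with forward slashes and no ".." components.

```r
# Hypothetical helper (not part of knitr or rmarkdown): take `child`, given
# relative to the project root, and express it relative to the directory
# that holds `parent`.
path_relative_to_parent <- function(parent, child) {
  parent_dirs <- strsplit(dirname(parent), "/", fixed = TRUE)[[1]]
  child_parts <- strsplit(child, "/", fixed = TRUE)[[1]]
  parent_dirs <- parent_dirs[parent_dirs != "."]
  child_parts <- child_parts[child_parts != "."]

  # Count the leading directories that the two paths share.
  n_max <- min(length(parent_dirs), length(child_parts) - 1)
  n_common <- 0
  while (n_common < n_max &&
         parent_dirs[n_common + 1] == child_parts[n_common + 1]) {
    n_common <- n_common + 1
  }

  # Climb out of the unshared parent directories, then descend to the child.
  ups   <- rep("..", length(parent_dirs) - n_common)
  downs <- tail(child_parts, length(child_parts) - n_common)
  paste(c(ups, downs), collapse = "/")
}

print(path_relative_to_parent("doc/temp_master2.Rmd", "doc/temp.Rmd"))
print(path_relative_to_parent("doc/temp_master2.Rmd", "lib/shared.Rmd"))
```

The first call gives the "temp.Rmd" used in temp_master2.Rmd; lib/shared.Rmd in the second call is an invented example of a child that lives outside ./doc.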

So what have I learned:

  • Never setwd() in an R-markdown file - it just gets changed back to the original working directory
  • If you want your R-markdown files to compile as if run from a specific directory, set the knitr root-directory using opts_knit$set(root.dir = "some.dir") or using the knit_root_dir argument to rmarkdown::render
  • You can set the knitr root directory from either inside an R-markdown file (put it in the first block), or in an R script that calls rmarkdown::render
    • but only set it once, 
    • and it's probably best to set it from outside the R-markdown file (if you might subsequently want to include that file as a child of another file)
  • If you do include child documents, be aware that the filepath you put into your parent document should be written relative to the parent document, not relative to the knit_root_dir

Aside from that last point, write your paths relative to the root directory for your project...
