Creating a Reproducible Example

Authors: Colin Gillespie & Jack Walton

Published: May 31, 2022

tags: r, python, reprex, reticulate, docker

Maintaining training materials

Over the last few years, we have increased both the number and the types of training courses we offer. In addition to our usual R courses on {dplyr} and {shiny}, we also offer training on Docker, Python, Stan, TensorFlow, and others.

As the number of courses we offer increased, so did the maintenance burden of our associated training materials (lecture notes, slides, exercises, and more). To ease this burden, and to assist in ensuring that our training materials build consistently, we developed an R package called {jrNotes2}. Amongst other things, this package ensures that all courses:

have identical “template files”: .gitlab-ci.yml, .gitignore, Makefiles, index.Rmd, …;
have the same directory structure, and
pass a set of quality-assurance checks.

To make a change to course content, a team member must push their suggestions to a branch on GitLab. This action launches a CI job, which runs a Docker container that performs a set of checks. The templated .gitlab-ci.yml file ensures that every course undergoes the same build process and quality-assurance checks. If the content passes these checks, and an eligible approver approves the changes, then the changes are merged into the main branch.

Cartoon showing arrows from Data scientist to GitLab to Docker container to Continuous Integration

This means course content in a main branch should never fail our checks. Well, not quite…

Why we can’t freeze all dependencies

When teaching a course, we want to teach with the exact same packages an attendee would get via an install.packages() or pip install command. This means we must always use the latest versions of packages available on CRAN and PyPI. However, always using the latest available packages has its dangers: a change to a package used by a course can suddenly cause our teaching materials to begin failing our build checks.

To try to pre-empt package changes breaking our training materials, we use scheduled CI runs. That is, at regular intervals a CI job automatically runs our tests and checks against a course’s training materials. If a course’s materials fail these checks, we are notified via a message in a Slack channel. Around early January, we started getting notifications about our Introduction to Python course:

Screenshot of slack notification showing the failed pipeline, where failed job is notes-build.

The problem

Unfortunately, the traceback given by the CI wasn’t the most enlightening.

Strangely, the course materials

built successfully on Colin’s laptop;
failed to build on Jack’s laptop, and
failed to build on the CI runner.

As far as we could see, everything appeared roughly the same on all three systems: all three ran the same operating system and the same R version, and used the same package versions.
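One way to make a "same package versions" comparison concrete on the Python side is to snapshot the installed distributions on each machine and diff the results. The following is a minimal sketch using only the standard library; the function names and the example versions are hypothetical and are not taken from our build tooling.

```python
from importlib import metadata


def snapshot_versions():
    """Return {distribution name: version} for the current Python environment."""
    versions = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:  # skip broken installs that carry no name
            versions[name.lower()] = dist.version
    return versions


def diff_versions(left, right):
    """Return {name: (left_version, right_version)} for every mismatch.

    A package missing on one side is reported with None on that side.
    """
    mismatches = {}
    for name in sorted(set(left) | set(right)):
        if left.get(name) != right.get(name):
            mismatches[name] = (left.get(name), right.get(name))
    return mismatches


# Hypothetical snapshots from two machines; the versions are made up.
laptop = {"numpy": "1.22.0", "matplotlib": "3.5.1"}
runner = {"numpy": "1.22.1", "matplotlib": "3.5.1", "pillow": "9.0.0"}
print(diff_versions(laptop, runner))
# {'numpy': ('1.22.0', '1.22.1'), 'pillow': (None, '9.0.0')}
```

An empty result only says that the Python package versions agree. It says nothing about the system libraries those packages link against.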

Whilst we could reproduce the error in a Docker container, the error was difficult to debug because

the container used a large number of internal Jumping Rivers R packages;
the materials build process involved a set of non-trivial Rmd files, and
the error wasn’t encountered until around eight minutes into the build and test process.

In short, whilst we had a reproducible example of the error, it was only reproducible by a Jumping Rivers employee, and it was far from a minimal example.

Simplifying the problem

To make progress, we had to simplify the Docker container. We asked ourselves the following questions:

Can we remove all unnecessary files, such as presentation slides? Yes.
Can we simplify the course notes? Yes: we were able to find a single Python code chunk that caused the issue.
Can we remove all of our custom Rmd styling? Yes: a simpler Rmd file with the same chunk gave the same error.
Can we reproduce the issue without R Markdown? Yes: a simple R script can reproduce the same error.
Does the Dockerfile need to be complex? No: we could remove most of the unnecessary Python, Debian and R-related packages.
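The questions above all follow the same pattern: remove one thing, rebuild, and keep the removal only if the error is still there. We did this by hand, but the loop can be sketched in a few lines. The component labels and the still_fails stand-in below are hypothetical; in practice the check would be "rebuild the container and see whether it still errors".

```python
def minimise(components, still_fails):
    """Greedily drop components while the failure still reproduces.

    components  -- list of labels for the parts of the failing setup
    still_fails -- callable taking a list of components and returning True
                   if the error is still reproduced with only those parts
    """
    kept = list(components)
    for part in list(components):
        candidate = [c for c in kept if c != part]
        if still_fails(candidate):
            kept = candidate  # this part was not needed to trigger the error
    return kept


# Hypothetical labels, loosely following the questions above.
setup = [
    "slides", "full notes", "Rmd styling", "R Markdown",
    "extra system packages", "reticulate", "matplotlib chunk",
]


def still_fails(parts):
    # Stand-in check: pretend the error needs exactly these two parts.
    return {"reticulate", "matplotlib chunk"} <= set(parts)


print(minimise(setup, still_fails))
# ['reticulate', 'matplotlib chunk']
```

A single greedy pass like this is not guaranteed to find the smallest possible example, but it usually gets close enough to share with others.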

A minimal reproducible example

After all of our simplifications, we arrived at a minimal reproducible example with the Dockerfile:

FROM rocker/r-ver:latest
RUN apt update && apt install -y python3 python3-dev python3-venv
RUN install2.r --error reticulate
COPY test.R /root/

and associated R script:

reticulate::virtualenv_create(
  envname = "./venv",
  packages = "matplotlib"
)
reticulate::use_virtualenv("./venv")
reticulate::py_run_string("import matplotlib.pyplot as plt; plt.plot([1, 2, 3], [1, 2, 3])")

By simplifying the problem, we were now in a position to ask for help from others.

As this appeared to be a bug (it used to work, but now it doesn’t), we raised an issue against the {reticulate} repository.

A (partial) solution

Soon after posting we received a response from one of the {reticulate} developers. Their response revealed that matplotlib was nothing but an innocent bystander in our issue, and that the real culprits were the incompatible BLAS (Basic Linear Algebra Subprograms) libraries being used by R and numpy!
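On Linux, one way to see which BLAS-like shared libraries are mapped into a process is to read /proc/self/maps. Because {reticulate} embeds Python inside the R process, running a check like the sketch below from within that session (after numpy has been imported) lists libraries loaded by both R and Python. The name pattern is an assumption and is not exhaustive, and seeing two different libraries does not by itself prove a conflict.

```python
import re
from pathlib import Path

# Name fragments that commonly appear in BLAS/LAPACK shared libraries.
# This list is an assumption and is not exhaustive.
BLAS_PATTERN = re.compile(r"(blas|lapack|mkl|atlas)", re.IGNORECASE)


def blas_like_libraries(maps_text):
    """Return sorted, de-duplicated shared-object paths from a
    /proc/<pid>/maps dump whose file name looks like a BLAS/LAPACK library."""
    found = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (optional).
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no backing file
        path = fields[5].strip()
        name = path.rsplit("/", 1)[-1]
        if ".so" in name and BLAS_PATTERN.search(name):
            found.add(path)
    return sorted(found)


def loaded_blas_like_libraries():
    """Inspect the current process (Linux only; empty list elsewhere)."""
    maps = Path("/proc/self/maps")
    return blas_like_libraries(maps.read_text()) if maps.exists() else []


print(loaded_blas_like_libraries())
```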

The suggested solution was to compile the numpy package from source within Docker. However, compiling numpy at container runtime added around three minutes to the CI checks every time they ran. As such, we opted to build the numpy package from source at image build-time, effectively caching the package build and avoiding re-compiling numpy every time our build tests ran against our training materials.
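With pip, a source build can be requested through the --no-binary option, which tells pip not to use pre-built wheels for the named package. The sketch below only assembles and prints such a command; the helper name is hypothetical, and it is an assumption that a command of this shape would sit in an image build step rather than being the exact line in our Dockerfile.

```python
import sys


def numpy_source_install_command(python=sys.executable):
    """Build a pip command that installs numpy from its source
    distribution instead of a pre-built wheel."""
    return [
        python, "-m", "pip", "install",
        "--no-binary", "numpy",  # forbid wheels for numpy only
        "numpy",
    ]


# Print rather than run: the real install needs network access and a
# compiler toolchain, so it belongs in the image build, not in this sketch.
print(" ".join(numpy_source_install_command()))
```

Running the install once while the image is built means the compiled package is stored in an image layer, so later CI runs reuse it instead of compiling again.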

Although compiling numpy from source did fix our issue, it is currently more of a workaround than a long-term solution. Hopefully, a future change to the BLAS libraries used by the rocker image series or by numpy will allow the two to be friends again. Here’s to hoping!

Take-aways

Using scheduled CI jobs allowed us to catch this issue early, and gave us plenty of time to fix it before the next time the course ran.
Having CI in place ensured we had an (internally) reproducible example, as the CI is based on a Docker container.
In order to get help, it was crucial to simplify the problem.
Debugging is hard, and it’s okay to ask for help!

References

https://github.com/rstudio/reticulate/issues/1133