Last week I was sitting in a meeting with some colleagues looking at metrics from our Azure DevOps pipelines. It looked like a significant number of our image builds in several of our projects were regularly taking upwards of 90 minutes to complete. I found this perplexing, so I opened a build log at random to see if I could immediately identify anything obviously suspicious.

I noticed there was one single step in the Dockerfile which took 23 minutes:

RUN usermod -a -G 0 appuser \
    && chgrp -R 0 . \
    && chmod -R g+rwX .

This does three things:

  1. Add the user appuser to the root (or 0) group in the container.
  2. Change the group of all files in the current directory to root (or 0), recursively.
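  3. Add group read and write permissions to all files in the current directory, recursively, and add group execute permission to directories and to files that are already executable by someone (that is what the capital X means).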
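The capital X in the last step is easy to misread, so here is a small sketch of what g+rwX does to different kinds of entries. It assumes GNU coreutils (for stat -c), and perms_after_grwX is a made-up helper name for this illustration, not something from our Dockerfile.

```shell
#!/bin/sh
# Sketch: show the mode an entry ends up with after `chmod g+rwX`.
# Assumes GNU coreutils (stat -c '%A'). perms_after_grwX is a hypothetical helper.

# perms_after_grwX TYPE MODE
#   TYPE is "file" or "dir", MODE is the starting octal mode.
perms_after_grwX() {
    tmp=$(mktemp -d)
    target="$tmp/entry"
    if [ "$1" = dir ]; then
        mkdir "$target"
    else
        : > "$target"
    fi
    chmod "$2" "$target"      # starting permissions
    chmod g+rwX "$target"     # the operation under test
    stat -c '%A' "$target"
    rm -rf "$tmp"
}

perms_after_grwX file 0600   # prints -rw-rw---- (no execute bit added)
perms_after_grwX file 0700   # prints -rwxrwx--- (already executable by owner)
perms_after_grwX dir  0700   # prints drwxrwx--- (directories always get x)
```

So plain data files only gain group read and write, while scripts and directories also gain the group execute bit.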

These steps are required because the container image is intended to run on our Red Hat OpenShift platform which, by default, runs all containers with a randomised UID and the GID 0. This means that in order to properly run on OpenShift, all containerised software must have its files owned by the root group and appropriately set readable, writable and executable by that group as well, so that the software doesn’t crash with a permissions error.
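To see how an image behaves under this policy without deploying it, one option is to start it locally with an arbitrary non-root UID and GID 0. The following is only a sketch: run_like_openshift and CONTAINER_CLI are names invented for this example, and the UID 12345 is an arbitrary stand-in for whatever UID OpenShift would pick.

```shell
#!/bin/sh
# Sketch: run an image with an arbitrary non-root UID and GID 0,
# roughly imitating how OpenShift starts containers by default.
# run_like_openshift and CONTAINER_CLI are hypothetical names;
# the UID 12345 is an arbitrary example value.

# run_like_openshift IMAGE [COMMAND...]
run_like_openshift() {
    image=$1
    shift
    # --user UID:GID overrides whatever USER the image declares.
    "${CONTAINER_CLI:-podman}" run --rm --user 12345:0 "$image" "$@"
}

# Usage (the image name is a placeholder):
#   run_like_openshift our/image:example id
#   CONTAINER_CLI=docker run_like_openshift our/image:example ls -ld /app
```

If the application starts and can write where it needs to under these conditions, the group ownership and permissions are probably in order.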

I don’t think this approach is particularly egregious, but it is still an inefficient way of achieving the goal. Recursively running chgrp and chmod on every file at the end of the Dockerfile can take a very long time. This particular image is a Node.js frontend application, so it has a lot of files:

$ podman run -ti --rm ghcr.io/city-of-helsinki/our/image:example /bin/sh -c "find node_modules | wc -l"
79802

The node_modules directory has about 80,000 entries (find counts directories as well as files), a non-trivial number to change groups and permissions on, I would say.

The real culprit is of course Azure DevOps, which for some reason provides very slow storage for our builds, perhaps even intentionally, in order to curtail crypto mining or other illicit uses. In an environment with reasonable storage speeds this should not matter in the slightest.

I cannot fault the developers too much for writing the Dockerfile the way they did, because they were just following Red Hat’s guidelines.

Testing on my computer with podman, changing the group and permissions on all of node_modules takes just a few seconds because the image is stored on nice and fast NVMe storage.
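For anyone who wants to get a rough feel for this on their own storage without pulling our image, here is a synthetic sketch. It is not the test I ran: it builds a flat tree of empty files and times a recursive chgrp plus chmod over it. The helper names make_tree and time_recursive_fix are invented for this example, the file count is a parameter, and chgrp uses the caller's own primary group so that no root access is needed.

```shell
#!/bin/sh
# Sketch: synthetic benchmark for a recursive chgrp + chmod.
# make_tree and time_recursive_fix are hypothetical helper names.

# make_tree DIR COUNT: create COUNT empty files named 1..COUNT in DIR.
make_tree() {
    mkdir -p "$1"
    (cd "$1" && seq 1 "$2" | xargs touch)
}

# time_recursive_fix DIR: print elapsed whole seconds for both passes.
time_recursive_fix() {
    start=$(date +%s)
    chgrp -R "$(id -g)" "$1"   # own primary group, so root is not required
    chmod -R g+rwX "$1"
    end=$(date +%s)
    echo $((end - start))
}

tree=$(mktemp -d)
make_tree "$tree/node_modules" 10000
echo "files:   $(find "$tree/node_modules" -type f | wc -l | tr -d ' ')"
echo "seconds: $(time_recursive_fix "$tree/node_modules")"
rm -rf "$tree"
```

The resolution is whole seconds only, so on fast local storage this will most likely print 0 or 1. Raising the count towards 80000 gets closer to the size of our node_modules.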

So how do we optimise this? Let’s just make sure the files have the proper group and permissions to begin with at file creation time.

We modified the Dockerfile so it now essentially looks like this:

FROM our_node_base_image:14-slim

RUN useradd -r -g 0 appuser   # important bit
USER appuser

WORKDIR /app
COPY --chown=appuser:root package.json yarn.lock and other stuff ./

RUN # steps to build the app

EXPOSE 8080
CMD ["start", "the", "app"]

Now appuser’s default group is 0, and the default umask is 0022 (I checked), which means that any files the user creates have the ownership appuser:root and permissions u=rw,g=r,o=r by default.
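The effect of the umask can be checked outside a container as well. Below is a minimal sketch, assuming GNU coreutils for stat -c; mode_under_umask is a made-up helper name. It creates one file and one directory under a given umask and prints their modes.

```shell
#!/bin/sh
# Sketch: show the permissions new files and directories receive under a umask.
# Assumes GNU coreutils (stat -c '%A'). mode_under_umask is a hypothetical helper.

# mode_under_umask UMASK: prints the file mode, then the directory mode.
mode_under_umask() {
    (
        # Subshell, so the umask change does not leak into the caller.
        umask "$1"
        tmp=$(mktemp -d)
        : > "$tmp/file"
        mkdir "$tmp/dir"
        stat -c '%A' "$tmp/file" "$tmp/dir"
        rm -rf "$tmp"
    )
}

mode_under_umask 0022
# prints:
#   -rw-r--r--
#   drwxr-xr-x

mode_under_umask 0002
# prints:
#   -rw-rw-r--
#   drwxrwxr-x
```

The second call is only there for comparison: with a umask of 0002 the group would get write access from the start, but we have not tried that in our image.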

At the end of the Dockerfile we still have to chmod the files which the root group needs to execute and write to, but these are easy to identify and they amount to a much smaller set than all of node_modules as before.
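As an illustration of that last step, the sketch below grants group write access to only a couple of directories and leaves node_modules alone. The directory names .cache and logs are hypothetical stand-ins, not the actual paths from our application, and grant_group_write is a made-up helper. It assumes GNU coreutils for stat -c.

```shell
#!/bin/sh
# Sketch: grant group write access only to the paths that need it.
# grant_group_write is a hypothetical helper; .cache and logs are
# hypothetical examples of directories an app writes to at runtime.

# grant_group_write PATH...: recursive g+rwX on the given paths only.
grant_group_write() {
    chmod -R g+rwX "$@"
}

app_dir=$(mktemp -d)   # stand-in for /app in the image
mkdir -p "$app_dir/node_modules/some-pkg" "$app_dir/.cache" "$app_dir/logs"
chmod -R 0755 "$app_dir"   # what umask 0022 would have produced

grant_group_write "$app_dir/.cache" "$app_dir/logs"

(cd "$app_dir" && stat -c '%A %n' .cache logs node_modules)
# prints:
#   drwxrwxr-x .cache
#   drwxrwxr-x logs
#   drwxr-xr-x node_modules
rm -rf "$app_dir"
```

In a Dockerfile the equivalent would be a single RUN chmod over that short list of paths, which touches a handful of entries instead of tens of thousands.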

We thought about omitting the appuser altogether, because OpenShift is going to override it anyway with a random UID, but we decided against it because inevitably someone will have to run the image in vanilla Kubernetes or docker-compose in local development, which do not perform the same UID mangling.

The pull request for these changes has not gone through yet, so I don’t have nice comparison metrics to show off here, but nevertheless I want to post this now that I have the motivation. Maybe I will amend this post later once I have nice numbers…