"adjectivism.org" in large block letters

Docker, private Git repos, and a terrible hack


Of all the “great idea, terrible execution” tools out there, Docker has to be one of the best. So good, in fact, that it inspired not just this rant, but the entire idea of adding articles to my otherwise-empty personal website. I’m not yet using better tools, so I’m stuck bending over backwards to get Dockerfile-based container images to behave.

Let’s set the stage: I have a Python application. Call it myapp. Its environment is managed with Poetry. It depends on a Python library mylib (which happens to be managed with Poetry as well, but that’s irrelevant). Both of these codebases live in private Git repositories. I already have Poetry doing the thing, namely, installing all the recursively-resolved dependencies and locking them to specific versions for repeatably creating environments for my application to run in. This all works great because so far I’m using tools that are functional. What I have looks something like this, where the git dependency is somewhere private.

pyproject.toml
[tool.poetry]
name = "myapp"
version = "0.1.0"

[tool.poetry.dependencies]
mylib = { git = "https://github.com/myorg/mylib", branch = "main" }

Unfortunately, I need to run this thing in a container. The production deployment already runs in one and I’d like to give developers a consistent environment to work in so they don’t need to wrestle with the lingering dependency hell that Poetry isn’t capable of solving. Someone before me declared bankruptcy on this and – I assume with a mixture of dejection and unbridled rage – set this container up so it would start with myapp baked into the container image but mylib fetched and installed at startup time. Yes, every startup time. It goes without saying that this is horrible. Installing what should be precomputed dependencies at runtime is horrible. Requiring the running container to have credentials to the private Git forge for no reason other than fetching dependencies that shouldn’t be installed at runtime in the first place is unforgivable. This must be fixed. How hard could installing these packages during a Docker build be?

Borderline. Fucking. Impossible.

This private dependency is a real problem because it needs authentication. When installing on a developer’s laptop, I can just take for granted that they have whatever setup is necessary for Git to work. If we’re using HTTPS to reach the repository, their credential storage will just work. If we’re using SSH to reach it, they either have their key loaded into their SSH agent or have some SSH client configuration with a static key. If I’ve specified an HTTPS URL in pyproject.toml and they only have SSH working, they can use an insteadOf to fix it. Whatever. I don’t care. We can make it work without any pain. But here are the options for getting Git to talk to these repositories while building a Docker container image:

  • Use SSH to reach the forge. Use Docker’s bespoke ssh mount type to grab the SSH agent socket for the layer that needs the credentials (RUN --mount=type=ssh /install.sh). Hope that nobody is using an unusual SSH_AUTH_SOCK because that ain’t gonna work. Tell all the developers who want to use HTTPS to go fuck themselves.
  • Use HTTPS to reach the forge. Tell everyone to hardcode their extremely sensitive credentials into a .gitconfig. Make sure that file is in the Docker build context, because it can’t access stuff outside of it. Bind mount this awful file in the RUN layer that needs the credentials (RUN --mount=type=bind,source=./.gitconfig,target=/root/.gitconfig /install.sh). Hope that Docker isn’t leaking the entire context into the image somehow. Tell all the developers who want to use SSH to go fuck themselves.
  • Try to guess which method they’re using and choose a different stage in the Dockerfile based on that, with all the problems of both the previous approaches – oh wait, Docker can’t do conditionals.

So that’s not happening.

At this point, just as I’m picking up my toys to go home, something occurs to me. Can’t pip install from a .zip file? How do I build one of those?

% pip --help

Usage:
  pip <command> [options]

Commands:
  install                     Install packages.
  download                    Download packages.

Sweet. But I’m using Poetry… How do I…

% poetry export --help

Description:
  Exports the lock file to alternative formats.

Usage:
  export [options]

Options:
  -f, --format=FORMAT        Format to export to. Currently, only constraints.txt and requirements.txt are supported. [default: "requirements.txt"]

And can I still…

% pip download --help

Usage:
  pip download [options] <requirement specifier> [package-index-options] ...
  pip download [options] -r <requirements file> [package-index-options] ...
  pip download [options] <vcs project url> ...

Poetry can tell pip what to download! I already know how to mount files from Docker’s build context so they can be seen by a layer. I already have a Makefile that calls docker build. I already have a shell script I run in a single layer during the container image build. All that’s left is to make sure Poetry actually exports the references to Git correctly.

% poetry export | grep mylib
mylib @ git+https://github.com/myorg/mylib@main ; python_version >= "3.9" and python_version < "4.0"

Shit. This would almost work. It already solved and locked the Git ref but it doesn’t want to tell pip about it. If I use this requirement specifier, I’ll always get main. But how does Poetry always do the right thing? This must be in the lock file?

poetry.lock
[[package]]
name = "mylib"
version = "0.1.0"
description = ""
optional = false
python-versions = "^3.9"
files = []
develop = false

[package.source]
type = "git"
url = "https://github.com/myorg/mylib"
reference = "main"
resolved_reference = "1e8d978c2988ee805efe9c12abcc6ee56917984e"

It’s right there! This resolved_reference thing seems totally undocumented, and there certainly isn’t a way to ask Poetry nicely for it. Fuck it.

It’s a UNIX system. I know this!

I’m trying to be portable and we’re not on Python 3.11 yet with its built-in TOML parser. That’s fine – it’s just text, right? How hard can this be?

extract-poetry-ref.awk
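# Print the locked commit for the package named by -v pkg=NAME.
# Uses [[:space:]] rather than \s, which is a GNU awk extension.
BEGIN { FS = "="; found = ""; }

# Inside a [[package]] table, remember the most recent package name.
/\[\[package\]\]/,/^[[:space:]]*$/ {
    if($1 ~ /^[[:space:]]*name[[:space:]]*$/) {
        sub(/^[[:space:]]*"/, "", $2);
        sub(/"[[:space:]]*$/, "", $2);
        found = $2;
    }
}

# Inside the [package.source] table that follows, print its resolved_reference.
/\[package\.source\]/,/^[[:space:]]*$/ {
    if(found == pkg && $1 ~ /^[[:space:]]*resolved_reference[[:space:]]*$/) {
        sub(/^[[:space:]]*"/, "", $2);
        sub(/"[[:space:]]*$/, "", $2);
        print $2;
    }
}

% awk -v pkg=mylib -f extract-poetry-ref.awk poetry.lock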
1e8d978c2988ee805efe9c12abcc6ee56917984e

Okay, that time it was actually easy.
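
With the commit in hand, fixing that too-loose requirement specifier is a string substitution: swap whatever follows the last @ in the Git URL for the locked SHA. Here is a rough sketch of that step on its own. The function names pin_requirement and is_commit_sha are made up for illustration, and the 40-character check assumes a SHA-1 repository.

```shell
#!/bin/sh
# is_commit_sha REF
# Succeeds only if REF looks like a full SHA-1 commit id (assumption: the
# repository uses SHA-1, so 40 lowercase hex characters).
is_commit_sha() {
    printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{40}$'
}

# pin_requirement REQ REF
# Replace the revision after the last "@" of the git+ URL in REQ with REF,
# leaving the package name and any environment marker untouched.
pin_requirement() {
    is_commit_sha "$2" || return 1
    printf '%s\n' "$1" | sed -e "s|^\([^ ]* @ git+[^ ]*\)@[^@ ]*|\1@$2|"
}
```

Fed the poetry export line from earlier and the commit printed by the awk script, this should hand back the same specifier with @main replaced by the SHA, which pip download can then fetch at exactly the locked revision. Refusing anything that isn't a full commit id is a cheap guard against the text-scraping going wrong.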

This is basically everything I need. The plan is going to be to let Poetry install everything it can since that seems safest. I’ll only use this pip download hack for the packages Poetry can’t fetch itself from inside the container image build. So:

  • Identify the Git dependencies and download them
  • Run the docker build with the downloaded files available
  • Install from the mounted files

There’s one more little trick. When installing directly from a Git repository, pip (and therefore Poetry) writes some metadata to keep track of which ref was cloned. Poetry needs this to understand that the locked dependency is already satisfied. pip won’t write this metadata when installing from a .zip file for obvious reasons, and without it, Poetry will just try to install the dependency again and we will have gotten nowhere. I just need to make it look like these dependencies were installed in one shot rather than this stupid two-stage process.
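
For the record, the metadata in question is a direct_url.json file in the package’s .dist-info directory, in the format described by PEP 610. Something like the following would produce one for a Git dependency. Treat it as a sketch: write_direct_url is a hypothetical helper, and it assumes none of the values contain quotes or backslashes, since nothing here escapes JSON.

```shell
#!/bin/sh
# write_direct_url URL REQUESTED_REVISION COMMIT_ID
# Print a PEP 610 direct_url.json document for a Git checkout on stdout.
# Assumption: the arguments contain no characters that need JSON escaping.
write_direct_url() {
    printf '{"url": "%s", "vcs_info": {"vcs": "git", "requested_revision": "%s", "commit_id": "%s"}}\n' \
        "$1" "$2" "$3"
}
```

Called with the repository URL, the branch from pyproject.toml, and the resolved_reference from the lock file, it prints the JSON that needs to end up next to the installed package.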

Serenity now

The complete hack looks like this.

The make target I use as an entrypoint to this disaster asks Poetry for the dependencies and ignores everything but the Git ones. After parsing the requirement specifier from Poetry for each, pip can cache the archive of the repository, and I’ll write its little metadata file where I can get to it later.

Makefile
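.PHONY: build-container
build-container: export DOCKER_BUILDKIT=1
# For each Git requirement in the lock file: pin it to the locked commit,
# download it as an archive, and save the metadata pip would have written.
# grep -F keeps the "+" literal; with GNU grep, 'git\+' matches any line containing "git".
# Reading line by line handles more than one Git dependency, and pip gets
# /dev/null on stdin so an SSH transport can't swallow the remaining lines.
build-container:
	mkdir -p .pip-cache/
	set -e; \
	poetry --no-ansi export -f requirements.txt | grep -F 'git+' | \
	while IFS= read -r req; \
	do \
	pkg=$$(echo "$$req" | cut -d' ' -f1); \
	url=$$(echo "$$req" | cut -d' ' -f3 | sed -e 's/@[^@]*$$//;s/^git+//'); \
	branch=$$(echo "$$req" | cut -d' ' -f3 | sed -e 's/^.*@//'); \
	ref=$$(awk -v pkg="$$pkg" -f extract-poetry-ref.awk poetry.lock); \
	req=$$(echo "$$req" | sed -e "s|@$$branch|@$$ref|"); \
	version=$$(poetry --no-ansi show "$$pkg" | grep -E '^[[:space:]]+version[[:space:]]+:' | awk '{ print $$3; }'); \
	pip download --no-deps -d .pip-cache/ --exists-action w "$$req" </dev/null; \
	echo '{"url": "'"$$url"'", "vcs_info": {"vcs": "git", "requested_revision": "'"$$branch"'", "commit_id": "'"$$ref"'"}}' > .pip-cache/$${pkg}-$${version}-direct_url.json; \
	done
	docker build . -t myapp:latest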

The Makefile refers to this awk script (same as the one referenced above):

extract-poetry-ref.awk
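# Print the locked commit for the package named by -v pkg=NAME.
# Uses [[:space:]] rather than \s, which is a GNU awk extension.
BEGIN { FS = "="; found = ""; }

# Inside a [[package]] table, remember the most recent package name.
/\[\[package\]\]/,/^[[:space:]]*$/ {
    if($1 ~ /^[[:space:]]*name[[:space:]]*$/) {
        sub(/^[[:space:]]*"/, "", $2);
        sub(/"[[:space:]]*$/, "", $2);
        found = $2;
    }
}

# Inside the [package.source] table that follows, print its resolved_reference.
/\[package\.source\]/,/^[[:space:]]*$/ {
    if(found == pkg && $1 ~ /^[[:space:]]*resolved_reference[[:space:]]*$/) {
        sub(/^[[:space:]]*"/, "", $2);
        sub(/"[[:space:]]*$/, "", $2);
        print $2;
    }
}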

The Dockerfile doesn’t look particularly suspicious since it’s what’s being worked around in the first place. Just mount the .pip-cache directory where it’s needed.

Dockerfile
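# syntax=docker/dockerfile:1
FROM python:3.9-slim

# bake in our own code
COPY . /myapp
WORKDIR /myapp

# shell out for the rest of this mess; the bind mount makes the archives
# downloaded by the Makefile visible at /tmp/pip-cache (requires BuildKit)
RUN --mount=type=bind,source=./.pip-cache,target=/tmp/pip-cache /myapp/install.sh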

And the install script picks up where pip download left off, but has to bootstrap itself from scratch using Poetry’s lock file.

install.sh
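#!/bin/sh
# Stop at the first failure instead of building a half-installed image.
set -e

pip --no-cache-dir install poetry==1.5.1
poetry config virtualenvs.create false

# Install each Git dependency from its pre-downloaded archive, then drop in the
# direct_url.json that pip would have written for a real Git install.
# grep -F keeps the "+" in 'git+' literal.
for pkg in $(poetry --no-ansi export -f requirements.txt | grep -F 'git+' | cut -d' ' -f1)
do
    version=$(poetry --no-ansi show "$pkg" | awk '/^[[:space:]]+version[[:space:]]*:/ { print $3; }')
    pip install --no-deps "/tmp/pip-cache/${pkg}-${version}.zip"
    path=$(pip show "${pkg}" | awk '/^Location:/ { print $2; }')
    cp "/tmp/pip-cache/${pkg}-${version}-direct_url.json" "${path}/${pkg}-${version}.dist-info/direct_url.json"
done

# Poetry now sees the Git dependencies as satisfied and installs the rest.
poetry install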

There, now we’re able to build a container image from our own repository with the application actually installed when the container starts. As in – you know – the obvious, almost singular use case Docker was created for.
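
If you want some reassurance that the metadata trick took, pip freeze reports a package carrying a Git direct_url.json as name @ git+URL@commit rather than name==version. A small sanity check along those lines, as a sketch only: check_pinned is a hypothetical helper that reads pip freeze output on stdin.

```shell
#!/bin/sh
# check_pinned PKG COMMIT
# Read `pip freeze` output on stdin and succeed only if PKG is listed as a
# Git direct reference pinned to COMMIT.
check_pinned() {
    grep -q "^$1 @ git+.*@$2"
}
```

Assuming the image built by the Makefile has no entrypoint getting in the way, something like docker run --rm myapp:latest pip freeze piped into check_pinned mylib 1e8d978c2988ee805efe9c12abcc6ee56917984e should exit successfully when the hack worked.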