Today I’d like to do another small case study, this time on the design of Maven. For those not aware, Maven is a build tool for Java projects, and familiarity with it is not necessary to benefit from today’s post. I just hope you’ve used some sort of build tool! Build tools come in a wide variety of styles, and I’m going to use Maven as an example while thinking about their design. Partly because it gets some things exactly right, partly because it gets some things slightly wrong, and partly because some strange things sometimes get said about it. So let’s break it down.

The first step of design is: pick the right problem

You’ve probably used something like make, and the core abstraction in make is to produce files from other files. Source files depend on header files, object files depend on source files, executables depend on object files, and so on. Makefiles are the kind of design you come up with when you’re concerned just with getting something done.
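That file-from-file model can be sketched in a few lines. Here’s a hypothetical Makefile for a small C program (all names made up), where each rule says “this file is produced from those files”:

```make
# The executable is produced from object files...
prog: main.o util.o
	cc -o prog main.o util.o

# ...and each object file from a source file and the headers it includes.
main.o: main.c util.h
	cc -c main.c

util.o: util.c util.h
	cc -c util.c
```

The whole build is expressed as a dependency graph over files on disk; `make` rebuilds a file only when something it depends on is newer.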

And that’s the trouble with them. The most obvious thing in the world, when you’re just trying to build an executable, is to think that the problem to be solved is building an executable. This can make us short-sighted.

One of the key things Maven gets absolutely right is that it takes on the problem of building an artifact. Instead of being primarily concerned with compiling files, Maven describes how to construct myapp-1.2.3.jar. Merely compiling an executable is too small a scope. It discards many of the actually hard problems that still need solutions, and so the tool stops short of actually helping you out. Those problems get dumped on you.

By changing our perspective, and picking the right problem to solve, we can come up with radically different solutions. If the goal is to build myprog, you’re done at an executable. If the goal is to build myprog_1.0.0_amd64.deb (to envision a build tool for Debian packages), you need to do so much more. We need to know what other files get distributed with the executable. There’s a host of metadata, including versioning information. And there are dependencies.
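Maven captures that bigger goal up front: a project declares the coordinates of the artifact it produces. A minimal sketch of a pom.xml (all names hypothetical):

```xml
<project>
  <modelVersion>4.0.0</modelVersion>
  <!-- These coordinates name the artifact being built: myapp-1.2.3.jar -->
  <groupId>com.example</groupId>
  <artifactId>myapp</artifactId>
  <version>1.2.3</version>
  <packaging>jar</packaging>
</project>
```

Note that nothing here says how to compile anything; the subject of the file is the artifact itself.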

And then there are the things we can now do, now that all of this is within the scope of the problem.

Dependency management is the hard part

Prior to Maven, the usual state of the art for Java applications was to commit third-party dependencies as jars in SVN. This is not that different from the state of the art for Go applications today, except there we commit the dependencies’ sources instead of binary artifacts (e.g. here’s Kubernetes’s vendor directory).

This technically works, but it comes with several major drawbacks (besides our distaste for committing outside projects into our repo). In particular, it doesn’t help downstream projects much. Transitive dependencies need to be handled, somehow. (Which is why Go has tools like dep.)

Dependency information needs to be communicated, and these dependencies themselves need to be obtainable. With a design like make or ant, these problems are out of scope, but with a design like Maven’s, it’s no wonder that Maven Central is a major part of the design. When Maven is told to build a project, it first grabs all the project’s dependencies, automatically.
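Declaring a dependency is correspondingly simple; Maven resolves it, and its transitive dependencies, from a repository like Maven Central. A sketch, using a real library as the example:

```xml
<dependencies>
  <dependency>
    <!-- Maven fetches this jar (and whatever it depends on)
         from Maven Central automatically at build time. -->
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.4</version>
  </dependency>
</dependencies>
```

The same coordinates that name the artifacts we produce also name the artifacts we consume, which is what makes the whole scheme hang together.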

By changing the focus to producing artifacts, we end up with a build tool whose primary concern is obtaining artifacts and producing artifacts, with a side thing of maybe doing a little compiling. As a result, we end up more directly confronting and solving the problems people actually face. The compiling problem gets so reduced in importance that it’s generally handled purely by convention. A Maven project configuration can merely say “yeah, this here is foo-1.2.3.jar” and the contents of that jar are usually left implicit in the project structure.

Artifacts, dependencies, and metadata

Compared to other modern package managers, one major difference is that Maven packages specify exact versions of dependencies. In contrast, what you see more often these days (with Ruby Gems, Python packages, Rust’s Cargo) is a separation of this into two different pieces: first, some metadata that specifies acceptable version ranges for dependencies, and second, a “lock file” that specifies exact versions to use. Generally, the lock file exists only for applications (the “leaves” of the dependency graph), and only the metadata is used from libraries obtained via the package manager.

These days, it’s pretty well accepted that lock files are a better approach. But that approach is not without its downsides. When a Maven package is published together with its exact dependencies, you can take that as an endorsement that it works. When you grab a dependency, you get the same artifact that was published, built against the same artifacts its publisher used, which are the same artifacts that everyone else using that version gets as well. There’s no disconnect. You don’t end up with one set of versions while upstream has a slightly different set and other users see something else. When something goes subtly wrong for someone, there’s no “works for me!” mystery to untangle.

Many tools now commonly use some sort of semver convention by default. As a result, the publishers of an artifact and each different user could end up with slightly different versions between them. Lock files are a partial solution—they ensure that, for a given application, everyone is using the same versions of dependencies. But the semver convention means that a third-party can make an accidental mistake with a breaking change and an inappropriate version number, and other people suffer the consequences pretty much automatically.

Diamonds are a developer’s worst enemy

When two dependencies require conflicting versions of a common transitive dependency, you either have a conflict that needs solving, or you have duplication.

Dependency management is a problem that needs solving for every artifact. The more dependencies, the more it matters and the harder it gets. There’s a pervasive, dismissive assumption that this stuff is easy, but consider the sheer variety of different ways we have of talking about “dependency hell.”

The approach Maven takes is to require one version for every package involved in a build. Immediately, we should know what to expect from this: sometimes, when we depend on two packages, we’ll get an error about a conflict. These two packages have conflicting versioned dependencies on a common third package, and we have to somehow deal with this. (Maven takes us to the middle case of the diagram above.)

There’s really only three options for dealing with this in Maven:

  1. Use different versions of those packages, ones that have compatible dependencies, if such versions exist.
  2. Use an override in our application, to insist on a single version. Then we take on the responsibility of testing to make sure we haven’t screwed things up for those dependencies.
  3. Separate things into different projects. Dependencies only need to be compatible project by project (there’s no “system wide” conflicts of any kind), so if they’re different projects they can happily depend on different versions.
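Option 2 is typically done with a `dependencyManagement` section in the application’s pom.xml, which pins the version used everywhere in the build. A sketch (the conflict over Guava here is a made-up example):

```xml
<dependencyManagement>
  <dependencies>
    <dependency>
      <!-- Force a single version of this transitive dependency,
           overriding whatever our dependencies asked for.
           We now own the burden of testing that this works. -->
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>19.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```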

Alternatives exist, as the diagram I drew above suggests. OSGi, for instance, allows multiple versions of a library to co-exist. Although allowing multiple versions of a library to load solves the version conflict problem at package management time, it does potentially create other problems. Despite Java’s type safety, it’s still a frequent occurrence to find casts in typical Java code, and this means having two versions of the same library can create runtime crash bugs. Duplication is not without potential costs when those two libraries were supposed to be compatible.

Likewise, if those libraries do, in fact, have to be the same version and must be compatible, duplication doesn’t help you at all. You still have a problem to be solved by modifying your dependencies, somehow.

When a dependency has effects on its users, we can end up getting different artifacts even for the same version of a library, when it is built against different versions.

Worse still, you can end up with transitive effects of dependencies on upstream artifacts. Consider a bug fix in a static inline function in a C library’s header file. Just recompiling the library doesn’t fix the problem; you need to recompile all the upstream users of that library. This means that the same artifact—“A” version 1.0—compiled against two different dependencies can result in two meaningfully different artifacts. So it’s not necessarily enough to just swap one artifact for another. This is a frequent source of “Cabal hell” for Haskell users: the compiler can do cross-module inlining, causing upstream artifacts to be invasively affected by choices of versions of their dependencies.
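To make the C example concrete, here’s a hypothetical library header (names invented for illustration):

```c
/* util.h -- shipped as part of a hypothetical libutil's headers */
static inline int clamp(int x) {
    /* Suppose the bug fix happens in this body. Because a static
       inline function is compiled directly into every caller,
       rebuilding libutil alone changes nothing for existing users;
       every downstream object file that included util.h must be
       recompiled to pick up the fix. */
    return x > 100 ? 100 : x;
}
```

The fix lives in the users’ object files, not in the library’s, which is exactly why swapping in a rebuilt library artifact isn’t enough.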

One approach to help solve this problem ends up going back towards the Maven approach. The typical Linux distribution package management model involves creation of a curated set of (mostly) compatible packages. This certainly helps avoid the diamond problem. It also means that any time a change like this happens, you’re supposed to be able to count on upstream to bump minor versions of reverse-dependencies and push those, too.

This problem is also solvable by distributing sources rather than compiled artifacts. Rust’s Cargo, Go’s various dependency tools, and NPM all take this approach. You can read some interesting things here by looking at Cargo’s distinction between public and private dependencies, or on a recent (yesterday, ha!) versioning proposal for Go packages.

Dependencies are a huge problem, and I’ll have some more to say next week.

“Declarative” isn’t especially required

Getting back to looking at Maven, one of the oversold bits of praise it gets (IMO) is for being “declarative.” To understand where these people are coming from, we can look back at what most were using before Maven: Ant. Ant is pretty much just a Makefile in XML format. But unlike Makefiles (where there’s an interesting dependency graph getting constructed and solved, which is arguably somewhat declarative), a typical Ant build.xml was little more than a script.

As a result, Ant was little more than a scripting language with a few added tools for compiling Java files and building jars. Then it left every user to go figure out how to put these tools together to build their project. So we got every project re-implementing (generally quite badly) the same stuff, all in slightly different ways.
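A typical build.xml read like a sequence of steps rather than a description of a project. A hypothetical, stripped-down example:

```xml
<!-- Imperative in spirit: "make a directory, compile, then jar it up",
     wired together by hand, slightly differently in every project. -->
<project name="myapp" default="jar">
  <target name="compile">
    <mkdir dir="build/classes"/>
    <javac srcdir="src" destdir="build/classes"/>
  </target>
  <target name="jar" depends="compile">
    <jar destfile="myapp.jar" basedir="build/classes"/>
  </target>
</project>
```

Compare this with the Maven pom, where you state what the project is and the steps are supplied by the tool.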

What Maven really offers here isn’t so much “declarativeness” as abstraction: a single, primary, pretty well-designed abstraction for building a Java project, with a lot of sensible defaults. Adoption of Maven was, for most users, a mass abandonment of bespoke, artisanal, fragile build scripts for a common convention that just worked.

The declarative nature isn’t the main advantage here: if Ant had anything like a standard library with re-usable parts, it might have had most of the same advantages. (Indeed, if I recall correctly, very early versions of Maven started out as a collection of re-usable Ant scripts. But don’t quote me on that.) But the declarativeness is a forcing function, much like laziness in Haskell. Because your pom.xml is declarative, you have to build well-engineered re-usable plug-ins for interpreting it. So the declarativeness is nice because it forces us to create abstractions, but the abstractions are what we really benefit from.

Coming soon

Today’s essay was supposed to be something simple to prep for my next case study on (spoiler alert) containers. It kinda exploded, and I had to do some work to cut its scope down to size. What we have here with dependency management is a problem we don’t really have a good solution to, not yet anyway.

It seems clear to me that we do need the capability to tolerate multiple versions simultaneously. A package manager is a tool to help programmers solve problems, and when the problem is two independent components with conflicting dependencies, well… that’s the problem to be solved. A great many cases where these conflicts happen are spurious, and the code does not actually require the same version of those packages. “The problem” in these cases is the package manager being intolerant of reality.

But so far, most package managers that tolerate multiple versions also explode in complexity. Perhaps that’s acceptable to an extent: any time we upgrade versions of dependencies, we expect that things might break. Going from a single version to multiple versions of a dependency, then having things explode because they aren’t compatible is just another way things can break. But… there’s still a lot of merit in reducing the opportunities for things to go wrong. Just because problems can be expected there doesn’t mean we should be happy to create more of them. Just because we can tolerate multiple versions of a package doesn’t mean we should expect this to be the default.

Finding the right balance is a design problem that’s still open.