Your Version Control and Build Systems Don't Scale
Let's spend a minute and talk about the performance of a tool you use every day: your build system. For this purpose, they're all the same, so let's assume you use GNU Make. For a no-op build, what is Make's running time? You know, that O(N) stuff.
If you said O(Files), you're right. Every time you type make and press enter, it scans every file in the project, looking for out-of-date sources. That doesn't sound so bad. Linear algorithms are good, right?
Oh, but projects grow over time, typically in proportion to the number of engineers (even if some are net negative). Since projects grow:
O(Files) = O(Engineers * Time)
Assuming you're successful, your team is probably growing too:
O(Engineers) = O(Time)
Applying some substitutions:
O(Make) = O(Files) = O(Engineers * Time) = O(Time^2)
A no-op make is quadratic in the duration of the project?!
Exhibit A, Mozilla:
    Chad@laptop /c/xulrunner/mozilla-central/obj-xulrunner
    $ time make
    rm -f -rf ./dist/sdk
    [tons of output snipped]
    make[1]: Leaving directory `/c/xulrunner/mozilla-central/obj-xulrunner'
    real    21m31.526s
    user    4m13.292s
    sys     9m15.066s
21.5 MINUTES! This is a no-op build! I changed nothing from the last build!
Mozilla's build is an especially egregious example, but it's not uncommon for nontrivial Make, SCons, or Jam no-op builds to exceed 15 seconds. Every time you run SCons, for example, it:
- loads Python,
- imports a bunch of Python code,
- creates an entire dependency graph in memory,
- scans all files for changes,
- determines what to build,
- builds it,
- and throws all of those intermediate results away.
Am I the only person who thinks this situation is insane? We're spending all of our time improving constant factors rather than addressing fundamentally quadratic build algorithms.
No-op builds should be O(1) and instantaneous, and most other builds should be O(WhateverChanged). What if I told you this is possible?
Introducing IBB: I/O-Bound Build
Dusty, one of the best engineers I know, once told me that C compilers are now fast enough that they're I/O-bound, not CPU-bound as you'd expect. This was the inspiration for a build system dependency engine architecture:
- A long-running server process that contains a complete dependency graph between files, including the commands required to update targets from sources.
- An ultra-tiny C program that communicates build requests to the build server.
- A background thread that watches for filesystem updates (via ReadDirectoryChangesW or libfam) and updates the dependency graph in memory.
I have built a prototype of ibb's architecture, and its code is available on my GitHub.
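To make the shape of this concrete, here is a minimal sketch of the request path in Python. It is not ibb's actual protocol or code; the port number and names are invented for illustration. A long-running server process owns the dependency graph and answers build requests over a local socket, and the client does nothing but forward a request and print the reply, which is why it can be an ultra-tiny C program in practice.

    # Hypothetical sketch of an ibb-style request path -- not ibb's real protocol.
    # The server process would own the in-memory dependency graph; the client's
    # only job is to forward a build request and print the response.
    import socket
    import socketserver
    import sys

    PORT = 4567  # arbitrary port chosen for this sketch


    class BuildRequestHandler(socketserver.StreamRequestHandler):
        def handle(self):
            target = self.rfile.readline().decode().strip()
            # A real server would consult its dependency graph here and rebuild
            # only the targets whose inputs the watcher thread has marked dirty.
            self.wfile.write(("up to date: %s\n" % (target or "all")).encode())


    def serve():
        with socketserver.TCPServer(("127.0.0.1", PORT), BuildRequestHandler) as srv:
            srv.serve_forever()


    def request_build(target=""):
        # This is the part that can be an "ultra-tiny C program": connect,
        # send the target name, print whatever the server says.
        with socket.create_connection(("127.0.0.1", PORT)) as conn:
            conn.sendall((target + "\n").encode())
            print(conn.makefile().read(), end="")


    if __name__ == "__main__":
        serve() if "--serve" in sys.argv else request_build()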
In ibb, a no-op build takes constant time, no matter the size of the project. I tested with Noel Llopis's build system stress test, and no-op builds take merely 50 milliseconds. Take that, everyone else!
A Faster Grep
Let's say you wanted to run a regular expression across your entire codebase. On Windows, you'll spend all of your time in kernel calls, even if the files are in cache. However, remember that modern memory bandwidths are measured in GB/s...
ibb's file search implementation can run a regular expression across 200 MB of PHP and JavaScript in 100 milliseconds. It copies file data into memory on the first run and rapidly scans across that memory on future runs. It could just as easily mmap and use the OS's disk cache to avoid kernel calls.
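A rough sketch of that idea in Python (illustrative only; the file extensions and cache policy are my assumptions, and a real implementation would invalidate entries from filesystem change notifications instead of re-walking the tree):

    # Sketch of a resident code-search cache: pay the disk cost once, then scan RAM.
    import os
    import re

    _cache = {}  # path -> file contents as bytes


    def _load_tree(root):
        # The first call does the slow part; later calls only pick up new files.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if name.endswith((".php", ".js")):
                    path = os.path.join(dirpath, name)
                    if path not in _cache:
                        with open(path, "rb") as f:
                            _cache[path] = f.read()


    def search(root, pattern):
        """Yield (path, line_number, line) for every cached line matching pattern."""
        _load_tree(root)
        regex = re.compile(pattern.encode())
        for path, data in _cache.items():
            for lineno, line in enumerate(data.splitlines(), 1):
                if regex.search(line):
                    yield path, lineno, line.decode(errors="replace")


    if __name__ == "__main__":
        for path, lineno, line in search(".", r"TODO"):
            print("%s:%d: %s" % (path, lineno, line))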
Minimal Test Runner
Carl Masak's inspiring, addictive TDD harness is also possible with ibb. If you can determine which unit tests depend on which source files (via an __import__ hook in Python, say), you can run the appropriate tests immediately after saving a source file.
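For example, here is one way to collect that mapping in Python. This is only a sketch of the __import__-hook idea, not code from Carl's harness or from ibb, and the test module name is a made-up placeholder:

    # Sketch: record which modules each test pulls in by wrapping __import__.
    # With this mapping, a file watcher can rerun only the tests that depend
    # on a module that just changed.
    import builtins
    import importlib

    test_deps = {}  # test module name -> set of modules it imported


    def import_with_tracking(test_name):
        real_import = builtins.__import__

        def tracking_import(name, *args, **kwargs):
            test_deps.setdefault(test_name, set()).add(name)
            return real_import(name, *args, **kwargs)

        builtins.__import__ = tracking_import
        try:
            return importlib.import_module(test_name)
        finally:
            builtins.__import__ = real_import


    if __name__ == "__main__":
        import_with_tracking("test_example")  # "test_example" is a placeholder
        print(test_deps)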
Version Control
I've watched both git and Subversion fall apart as a codebase grows. git is so faaast... until you import a 20 GB repository. Then it's O(Files), just like everything else.
    $ time git status
    # On branch master
    nothing to commit (working directory clean)
    real    0m9.603s
    user    0m0.858s
    sys     0m7.378s
10 seconds to see what files I've modified. Ouch.
Imagine if ibb could keep track of which files have changed...
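Here is a sketch of what that could look like, using the third-party watchdog package purely as a portable stand-in for ReadDirectoryChangesW / libfam. This is not how git, Subversion, or ibb actually work; it only illustrates the idea that "what changed?" can be answered from memory.

    # Sketch: fold filesystem notifications into a set of modified paths so that
    # answering "what changed?" is O(changes) instead of an O(files) tree walk.
    # Requires the third-party `watchdog` package (pip install watchdog).
    import threading
    import time

    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    _changed = set()
    _lock = threading.Lock()


    class _Recorder(FileSystemEventHandler):
        def on_any_event(self, event):
            if not event.is_directory:
                with _lock:
                    _changed.add(event.src_path)


    def watch(root="."):
        observer = Observer()
        observer.schedule(_Recorder(), root, recursive=True)
        observer.start()
        return observer


    def status():
        """Report modified paths from memory -- no stat calls, no tree walk."""
        with _lock:
            return sorted(_changed)


    if __name__ == "__main__":
        watch()
        while True:
            time.sleep(5)
            print("modified:", status())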
What This Means For You
Running tests, grepping code, and building binaries from source are equivalent problems. ibb's architecture supports lightning-fast implementations of all of the above.
By converting an O(Time^2) build, test runner, or version control system to O(1), we eliminate hours of costly engineer time and encourage good habits like committing often and frequently running tests, no matter the age of the project. Frankly, instant feedback feels great.
It's crazy to me that this architecture isn't standard practice across development tools.
Take ibb's concepts or code, integrate them with your engineering processes, and improve the quality of your team's life today.
See discussion on Hacker News (2010) and /r/programming.
Wouldn't the actual fix be encapsulating and modularizing the code base? It seems bizarre to me that you would ever even expect to scale a single development pipeline linearly with the number of developers; the coordination cost must be extreme.
I know that Google uses a single Perforce repo and a monolithic build for most C++ services, but the rest of us have the opportunity to move on.
Actually, it feels like this problem should be solved at the OS layer.
Increasingly, I and other sysadmin types are having to solve problems around directory scans, for things like backups or whatever else. With the size of modern file trees, this is getting really out of hand.
The solutions I've generally seen (and this includes IBB) are to take this tracking out of the hands of the OS and handle it manually: have a daemon track file changes and maintain a list, or keep a manifest file.
The better solution here is to ask our operating systems to have the ability to respond quickly to the question "Which of these files changed in the last 24 hours?"
We've seen some stumbling towards this with things like FAM. What would be really nice is if, rather than having to have processes constantly open to listen to FAM, you could have FAM itself launch processes to respond to FS changes. (As an example, launching a quick process to say "This file changed, add it to this list for backups later.")
Wouldn't help for the grep thing, but running grep once loads things into memory for me, so. shrug
You are dead right.
I once heard a guy from Ford talking about their C++ build (this was in the early days of C++ as a mainstream language, before Java) taking a week!
And at the start of my programming career (at Decca Navigator, writing assembler for the TANS Tactical Air Navigation System computer, which had a serial 16-bit CPU with 15-microsecond add time), we used to run the assembler (written in FORTRAN by a member of staff) overnight on the Comshare time-sharing system (cheaper at night, about 100 pounds sterling - in 1974 money, for me more than two weeks' pay - to assemble 16k instructions), download the object code at 30 characters/second onto paper tape, load into core store (the paper tape loader would stop at any invalid character or checksum error, and we would hand-patch the paper tape).

For the next two weeks we'd debug in binary, marking up changes in the assembler listing and putting equivalent patches in the code in binary (jump out of the code where the problem was to an empty bit of core, execute the corrected instructions there, then jump back). Patches were recorded in reporter's notebooks, so if anything bad happened to the core store we could load the tape again and reapply the patches by hand.

Eventually we would need a clean slate, so we'd go back to the terminal (a dot-matrix printing terminal, no screen), edit the changes into the assembler source files, and schedule another overnight assembler run. The first day with the results of a new assembly might be spent checking that all the things we'd fixed the previous time round were still fixed - and correcting accidental discrepancies between the corrections we'd made in the assembler source and the binary patches we'd been working with. Happy days!
I've bored many people with this story face to face; this is the first time I've written about it on the Web.
There is a lesson here, though, which is that people can adapt to working with tools that run at enormously different speeds. We knew the instruction set, in octal as well as in assembler mnemonics, the core stores were non-volatile, and the effect of a patch in the core store was immediate. But we had to have the readable, commented assembler source as the definitive version, so we ended up with cycles of working with the source and then working with the binary.
Mercurial comes with an inotify extension that can dramatically improve status by running a small server in the background (automatically run on the first status, stays in memory for a configurable timeout).
Yes, this makes a lot of sense. I am entirely in favor of it.
Can it be taken farther? Eclipse auto-builds on save, and JUnit Max auto-runs the tests on save. But in a few years we'll have twice as many cores at our disposal, and then twice again. What can we do to take advantage of that?
Instead of every save, can we kick off the chain every time the code is plausibly compilable? Can we speculate on what the next few edits will be and have the build running before the coding is done? Can we go beyond automated unit testing to automated load testing, search for refactoring opportunities, impact analysis, profiling, or even UI evaluation?
Make was invented in 1977. Build tools created today will be in their prime when make hits 40. The amount of computing power it is economical to give a developer is absurdly greater than it was then. Your analysis here makes me think that ibb is just the start of what we could get if we take a fresh look at our common tools.
Mozilla probably still uses the recursive makefile build system.
That's the problem.
It's just as bad as calculating Fibonacci numbers with the naive recursive algorithm.
I think we are due for a better build system. This seems like a great proof of concept. Keeping the compiler around between builds also makes sense (see the GCC compile server discussion, which never went anywhere as far as I can tell: http://tromey.com/blog/?p=407).
Build system wish list:
- Higher level than Make (I want to say "build this library" or "build this executable" and have it do dependency analysis for me).
The problem, as I see it, is that there is little commercial value in build systems, so there is little incentive for others to really do a good job of this. Everyone hacks their own system to satisfy their own needs, then moves on.
Evan
The ReadDirectoryChangesW / libfam approach is not nearly as "free" as it seems. If you do lots of global operations on the directory tree, you will pay for the dependency checking overhead each and every time you modify files, instead of just when you run git status, svn up, make, scons, etc. Depending on what your workflow is, this could still be a significant win, but it could also be a huge loss.
When svn and make are watching the same directory for update notifications, anytime svn needs to update a large portion of the tree, make is going to slow down Subversion by handling the incoming file and directory change notifications. The same thing happens when a compile writes intermediate files all over the tree (unless you compile to an out-of-tree location). So, on the flip side, you are trading the O(1) cost of updating a file for an O(Tools) cost, where every tool watching the tree must process the notification.
Another downside is that this only works when the daemon is able to stay running from build to build. If you need to reboot between tests, have security policies that log you out of the machine, or are tight on RAM and need to kill the listener daemon, you will quickly be back to the original slow build times.
There are other ways to solve this that are not quite O(1) on the no-op build and VCS status side, but also do not have O(Tools) daemons running in the background consuming memory and CPU with each update notification. One example would be for file modification times to trickle up the directory chain so that individual file stat operations could be culled.
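To make the trickle-up idea concrete, here is a small sketch of the traversal it would enable. Note that ordinary filesystems do not propagate mtimes to ancestor directories when file contents change, so this illustrates the proposal rather than something that works today.

    # Sketch of the mtime-trickling idea: if every file change bumped the mtime of
    # its ancestor directories, an incremental scan could skip unchanged subtrees.
    import os


    def changed_since(root, last_scan_time):
        """Yield files modified after last_scan_time, pruning whole directories."""
        stack = [root]
        while stack:
            directory = stack.pop()
            if os.stat(directory).st_mtime <= last_scan_time:
                continue  # under the propagation assumption, nothing below changed
            with os.scandir(directory) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.stat().st_mtime > last_scan_time:
                        yield entry.path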
Welcome to clearcase.
Prebake is a build system that does this as well. It's alpha software.
From http://code.google.com/p/prebake/wiki/Goals
Build Time is Independent of Source Repository Size
When a build system has to stat every file in the repo, build is necessarily O(|project|). By hooking into the filesystem to get updates on files that change, prebake avoids this O(|project|) cost; if you change one file, your build time depends only on the number of files that depend on that file and the time required to rebuild them.
William: Absolutely. I intend to spike a continuous-autobuild feature to ibb that will compile and run tests every time you save. I think it will be amazing.
Brandon: You're absolutely right. I will discuss that in my next post.
Mike Samuel: Thanks for the heads up! I will take a deeper look at prebake.
Sorry, everyone, about the comments. I wasn't getting moderation e-mails so I didn't notice them until the other day. >_<
Are you expecting anyone to take your O(N^2) analysis seriously?
The general rule of thumb is that productivity is sub-linear in the growth of code base complexity. You could argue that as complexity rises, it pressures the project toward smaller source files on average, but I haven't witnessed this, except in projects that were out of control.
More accurately, the total number of file tests (in a conventional build system) is O(N^2) over the life of the project, where N is the final project size.
Even this makes a radical assumption that a large project isn't partitioned into 100 shared libraries, where each shared library is built independently during active development.
You seem to have a master build fetish. In a distributed project, this is often the first build each new participant performs, which seems to cause heart palpitations in a certain type of engineer who likes to perform wild extrapolations on the basis step using strangely fabricated denominators.
You also neglected to take into account the negative exponential term due to rising system performance. True, this is linked to the nearly constant physical size of the drive read head over the past two decades. However, if you spin around 180 degrees and examine the future, we're on the precipice of the largest negative step function in file IO since the Rubik's cube was invented: the transition to SSD. What kind of astrobucks does it take to develop a source code tree that won't comfortably fit on a cheap SSD drive three years from now? At which point we're back to negative exponential scaling in file stat times.
This will become even more clear as file systems further adapt to exploit SSD (or miracle RAM) performance scaling.
There are many applications that could benefit from a derived authority with different performance characteristics than the authoritative backing store. Amortized O(1) is often possible with space overhead linear in the number of objects tracked (which generally implies a RAM based data structure).
On the flip side, you really have to put a lot of faith in the derived authority layer, it adds complexity, and it's a bad, bad day when you discover your derived authority hasn't been telling the truth since the last patch Monday.
Seems like a strange point in history to mount an architectural campaign on the observation that HDD head mechanisms don't scale.
Allan: I never said anything about SSD vs. HDD. SSDs don't change the fundamental algorithmic complexity of building software: you're still going to do O(files) stats, each stat going through a kernel call.
FWIW, I run SSDs on all of my systems, and I still have the problems I described in the article.
Your ibb saves not only I/O ops but also CPU cycles, so the name is misleading.
There are so many ways to go after this build problem. The way I go after it is to take a monolithic source tree and break it into many source trees, as granular as possible. All of a sudden, builds run faster. But that is really just the start. A monolithic source tree impacts the entire life cycle: branching, testing, installation. A developer should be able to check out a module and build, test, and install it. The build system should retrieve all dependencies using an approach similar to what Maven does, but for native code. Not only will developer builds run faster; so will all other types of builds.