Code Reviews: Follow the Data

This is a mirror of the corresponding post at the IMVU Engineering Blog.

After years of reviewing other people's code, I'd like to share a couple practices that improve the effectiveness of code reviews.

Why Review Code?

First, why review code at all? There are a few reasons:

Catching bugs
Enforce stylistic consistency with the rest of the code
Finding opportunities to share code across systems
Helping new engineers spin up with the team and project
Helping API authors see actual problems that API consumers face
Maintain the health of the system overall

It seems most people reach for one of two standard techniques when reviewing code; they either review diffs as they arrive or they review every change in one large chunk. We've used both techniques, but I find there's a more effective way to spend code review time.

Reviewing Diffs

It seems the default code review technique for most people is to sign up for commit emails or proposed patches and read every diff as it goes by. This has some benefits - it's nice to see when someone is working in an area or that a library had a deficiency needing correction. However, on a large application, diffs become blinding. When all you see is streams of diffs, you can lose sight of how the application's architecture is evolving.

I've ended up in situations where every diff looked perfectly reasonable but, when examining the application at a higher level, key system invariants had been broken.

In addition, diffs tends to emphasize small, less important details over more important integration and design risks. I've noticed that, when people only review diffs, they tend to point out things like whitespace style, how iteration over arrays is expressed, and the names of local variables. In some codebases these are important, but higher-level structure is often much more important over the life of the code.

Reviewing diffs can also result in wasted work. Perhaps someone is iterating towards a solution. The code reviewer may waste time reviewing code that its author is intending to rework anyway.

Reviewing Everything

Less often, I've seen another code review approach similar to reviewing diffs, but on entire bodies of work at a time. This approach can work, but it's often mindnumbing. See, there are two types of code: core, foundational code and leaf code. Foundational code sits beneath and between everything else in the application. It's important that it be correct, extensible, and maintainable. Leaf code is the specific functionality a feature needs. It is likely to be used in a single place and may never be touched again. Leaf code is not that interesting, so most of your code review energy should go towards the core pieces. Reviewing all the code in a project or user story mixes the leaf code in with the foundational code and makes it harder see exactly what's going on.

I think there's a better way to run code reviews. It's not as boring, tends to catch important changes to core systems, and is fairly efficient in terms of time spent.

Follow the Data

My preferred technique for reviewing code is to trace data as it flows through the system. This should be done after a meaningful, but not TOO large, body of work. You want about as much code as you can review in an hour: perhaps more than a user story, but less than an entire feature. Start with a single piece of data, say, some text entered on a website form. Then, trace that data all the way through the system to the output. This includes any network protocols, transformation functions, text encoding, decoding, storage in databases, caching, and escaping.

Following data through the code makes its high-level structure apparent. After all, code only exists to transform data. You may even notice scenarios where two transformations can be folded into one. Or perhaps eliminated entirely -- sometimes abstraction adds no value at all.

This style of code review frequently covers code that wasn't written by the person or team that initiated the code review. But that's okay! It helps people get a bigger picture, and if the goal is to maintain overall system health, new code and existing shouldn't be treated differently.

It's also perfectly fine for the code review not to cover every new function. You'll likely hit most of them while tracing the data (otherwise the functions would be dead code) but it's better to emphasize the main flows. Once the code's high-level structure is apparent, it's usually clear which functions are more important than others.

After experimenting with various code review techniques, this approach has been the most effective and reliable over time. Make sure code reviews are somewhat frequent, however. After completion of every "project" or "story" or "module" or whatever, sit down for an hour with the code's authors and appropriate tech leads and review the code. If the code review takes longer than an hour, people become too fatigued to add value.

Handling Code Review Follow-Up Tasks

At IMVU in particular, as we're reviewing the code, someone will rapidly take notes into a shared document. The purpose of the review meeting is to review the code, not discuss the appropriate follow-up actions. It's important not to interrupt the flow of the review meeting with a drawn-out discussion about what to do about one particular issue.

After the meeting, the team leads should categorize follow-up tasks into one of three categories:

Do it right now. Usually tiny tweaks, for example: rename a function, call a different API, delete some commented-out code, move a function to a different file.
Do it at the top of the next iteration. This is for small or medium-sized tasks that are worth doing. Examples: fix a bug, rework an API a bit, change an important but not-yet-ubiquitous file format.
Would be nice someday. Delete these tasks. Don't put them in a backlog. Maybe mention them to a research or infrastructure team. Example: "It would be great if our job scheduling system could specify dependencies declaratively."

Nothing should float around on an amorphous backlog. If they are important, they'll come up again. Plus, it's very tempting to say "We'll get to it" but you never will, and even if you have time, nobody will have context. So either get it done right away or be honest with yourself and consciously drop it.

Now go and review some code! :)