First, the moral: Unit tests are good. But reliable design is better.

Even if you have to deal with short-term pain. Even if you haven't figured out all of the edge cases.

Let me back up. I love automated tests. I've been test-driving code at IMVU since I started. We buy new engineers a copy of Test-Driven Development: By Example. Whenever there is a bug, we write tests to make sure it never happens again.

After years of working this way, seeing projects succeed and fail, I'd like to refine my perspective. Let me share a story.

IMVU was originally a bolt-on addition to AOL Instant Messenger. Two IMVU clients communicated with each other by manipulating AOL IM's UI and scanning the window for new text messages, much like a screen reader would. This architecture had two consequences that spread through our entire codebase:

1) The messaging layer was inherently unreliable. AOL IM chat windows could be manipulated by the user or other programs. Thus, our chat protocol was built around eventual consistency.

2) We could not depend on an authoritative source of truth. Since text-over-IM was peer-to-peer, no client had a true view into where all of the avatars were sitting or who currently owned the room.

Thus, in 2008, long after we'd dropped support for integration with third-party IM clients and replaced it with an authoritative state database, we continued to have severe state consistency bugs. The client's architecture still behaved as though the chat protocol were unreliable and state were peer-to-peer instead of authoritative.

To address these bugs, we wrote copious test coverage. These were deep tests: start up a local Apache and MySQL instance, connect a couple of ClientApp Python processes to them, have one invite another to chat, and assert that their scene graphs were consistent. We have dozens of these tests, for all kinds of edge cases. And we thought we'd fixed the bugs for good...

But the bugs returned. These tests are still running and passing, mind you, but differences in timing and sequencing result in the same state consistency issues we saw in 2008. It's clear that test coverage is not sufficient to prevent these types of bugs.
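
To make that failure mode concrete, here is a minimal sketch in Python. It is a toy model, not IMVU's client code: `NaiveClient`, `deliver`, and the seat messages are hypothetical names invented for illustration. The client simply trusts arrival order. A test that delivers messages in one fixed order passes, yet enumerating every delivery order shows that a client can end up with a different final state.

```python
from itertools import permutations


class NaiveClient:
    """Hypothetical client that trusts arrival order (no authoritative state)."""

    def __init__(self):
        self.seats = {}  # avatar name -> seat name

    def apply(self, message):
        avatar, seat = message
        self.seats[avatar] = seat  # whichever message arrives last wins


def deliver(messages):
    """Feed messages to a fresh client and return its final seat map."""
    client = NaiveClient()
    for message in messages:
        client.apply(message)
    return client.seats


# Alice moves twice; this is the order in which the messages were sent.
SENT = [("alice", "sofa"), ("alice", "chair")]


def test_same_order_is_consistent():
    # The only scenario this test exercises: both clients see the same order.
    assert deliver(SENT) == deliver(SENT)


def divergent_orderings(messages):
    """Return every delivery order whose final state differs from the sent order."""
    expected = deliver(messages)
    return [list(order) for order in permutations(messages)
            if deliver(order) != expected]


test_same_order_is_consistent()       # passes
print(divergent_orderings(SENT))      # non-empty: reordering changes the outcome
```

The test is green, but it only samples one point in the space of possible timings. The design itself permits the inconsistent outcome.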

So what's the other ingredient in reliable software? I argue that, in agile software development, correct-by-design systems are underemphasized.

Doesn't Test-Driven Development guide me to build correct-by-design systems?

TDD prescribes a "red, green, refactor" rhythm, where you write a failing test, do the simplest work to make it pass, and then refactor the code so it's high quality. TDD helps you reach the "I haven't seen it fail" stage, by verifying that yes, your code can pass these tests. But just because you've written some tests doesn't mean your code will always work.

So there's another level of reliability: "I have considered the ways it can fail, but I can't think of any." This statement is stronger, assuming you're sufficiently imaginative. Even so, you won't think of everything, especially if you're working at the edge of your human capacity (as you should be).

Nonetheless, thoughtfulness is better than nothing. I recommend adding a fourth step to your TDD rhythm: "Red, green, refactor, what else could go wrong?" In that fourth step, you deeply examine the code and think of additional tests to write.
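
As a rough illustration of that fourth step, consider a small parser for a seat-change message. The `avatar:seat` format and the names `parse_seat_message` and `rejects` are assumptions made up for this sketch, not part of any real protocol. The first assertion is the test that drove the implementation. The rest came from staring at the code and asking what else could go wrong.

```python
def parse_seat_message(text):
    """Parse a hypothetical 'avatar:seat' message, rejecting malformed input."""
    avatar, sep, seat = text.partition(":")
    avatar, seat = avatar.strip(), seat.strip()
    if not sep or not avatar or not seat or ":" in seat:
        raise ValueError(f"malformed seat message: {text!r}")
    return avatar, seat


def rejects(text):
    """Return True if the parser refuses the input with ValueError."""
    try:
        parse_seat_message(text)
    except ValueError:
        return True
    return False


# Red, green: the test that drove the first implementation.
assert parse_seat_message("alice:sofa") == ("alice", "sofa")

# Fourth step: cases found by asking "what else could go wrong?"
for bad in ["", "alice", ":sofa", "alice:", "alice:sofa:chair", "  :  "]:
    assert rejects(bad), bad

# Surrounding whitespace is tolerated rather than rejected.
assert parse_seat_message(" alice : sofa ") == ("alice", "sofa")
```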

The strongest level of software correctness is not about finding possible failure conditions; it's about proving that your system works for all inputs. Correctness proofs for non-trivial algorithms are too costly to apply to all of the code we write, but in a critical subsystem like chat state management, the time spent on a lightweight proof will easily pay for itself. Again, I'm not advocating that we always prove the correctness of our software, but we should at least generally be convinced of its correctness and investigate facts that indicate otherwise. TDD by itself is not enough.
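
Here is a sketch of what a correct-by-design approach with a lightweight argument might look like. It is a hypothetical model, not the design IMVU adopted. It assumes an authoritative server stamps every seat change with an increasing sequence number starting at 1, and that every update is eventually delivered at least once. `SequencedClient`, `final_state`, and `UPDATES` are invented for the example.

```python
from itertools import permutations


class SequencedClient:
    """Hypothetical client that keeps only the newest update per avatar."""

    def __init__(self):
        self.seats = {}     # avatar name -> seat name
        self.versions = {}  # avatar name -> highest sequence number applied

    def apply(self, update):
        seq, avatar, seat = update
        # Ignore anything older than (or equal to) what we already applied.
        if seq > self.versions.get(avatar, 0):
            self.versions[avatar] = seq
            self.seats[avatar] = seat


def final_state(updates):
    """Feed updates to a fresh client and return its final seat map."""
    client = SequencedClient()
    for update in updates:
        client.apply(update)
    return client.seats


# (sequence number, avatar, seat), as stamped by the assumed authoritative server.
UPDATES = [(1, "alice", "sofa"), (2, "alice", "chair"),
           (3, "bob", "sofa"), (4, "bob", "floor")]


def converges_under_all_orderings(updates):
    """Check that every delivery order produces the same final state."""
    states = {tuple(sorted(final_state(order).items()))
              for order in permutations(updates)}
    return len(states) == 1


print(converges_under_all_orderings(UPDATES))
```

The exhaustive check only covers this one small input, but the argument behind it is general: `apply` keeps the update with the maximum sequence number for each avatar, and taking a maximum is commutative, associative, and idempotent. Reordered or duplicated deliveries therefore cannot change the final state. That short piece of reasoning is the kind of lightweight proof a test suite cannot provide by sampling.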

OK, so the chat system is simply too flawed for us to test-drive or refactor our way out of the mess we got ourselves into. What can we do? The solution is especially tricky because, in situations like this, there are always features that depend on subtleties of the poor design. A rewrite would break those features, which is unacceptable, right? Even if breaking those features is acceptable to the company, there are political challenges. Imagine the look on your product owner's face when you announce, "Hey, I have a new architecture that will break your feature but provide no customer benefit yet."

The ancient saying "You can't make an omelette without breaking some eggs" applies directly here. Preserving 100% feature compatibility is less important than fixing deep flaws.

Why? High-order bits are hardest to change, but in the end, are all that matters. The low-order bits are easy to change, and any competent organization will fix the small things over time.

I can't help but recall the original iPhone. Everyone said "What?! No copy and paste?!" Indeed, the iPhone couldn't copy and paste until iPhone OS 3.0, roughly two years and two major OS releases later. Even so, the iPhone reshaped the mobile industry. Clearly 100% feature compatibility is not a requirement for success.

My attitude towards unit testing has changed. While I write, run, and love unit tests, I value correct-by-design subsystems even more. When it comes down to it, tests are low-order bits compared to code that just doesn't break.

For those curious, what are we doing about the chat system? I'll let Jon's GDC presentation speak for itself.