By mid-2006, we'd primarily focused on access violations and unhandled exceptions, the explosive application failures. After extensive effort, we got our client's crash rate down to about 2%, meaning roughly 2% of all sessions ended in a crash.* Still the customers cried "Fix the crashes!"
It turns out that when a customer says "crash" they mean "it stopped doing what I wanted", but engineers hear "the program threw an exception or caused an access violation". Thus, to the customer, crash can mean:
- the application was unresponsive for a period of time
- the UI failed to load, making the client unusable
- the application has been disconnected from the server
In short, any time the customer cannot make progress and it's not (perceived to be) their fault, the application has crashed.
OK, we've got our work cut out for us... Let's start by considering deadlocks and stalls.
First, some terminology: in computer science, a deadlock is a situation where two threads or processes are waiting for each other, so neither makes progress. That definition is a bit academic for our purposes. Let's redefine deadlock as any situation where the program becomes unresponsive for an unreasonable length of time. This definition includes livelock, slow operations without progress indication, and network (or disk!) I/O that blocks the program from responding to input.
It actually doesn't matter whether the program will eventually respond to input. People get impatient quickly. You've only got a few seconds to respond to the customer's commands.
Detecting Deadlocks in C++
The embedded programming world has a "watchdog timer" concept. Your program is responsible for periodically pinging the watchdog, and if it fails to do so for several seconds, the watchdog restarts your program and reports debugging information.
Implementing this in C++ is straightforward:
- Start a watchdog thread that wakes up every few seconds to check that the program is still responding to events.
- Add a heartbeat to your main event loop that frequently pings the watchdog.
- If the watchdog timer detects the program is unresponsive, record stack traces and log files, then report the failure.
IMVU's CallStack API allows us to grab the C++ call stack of an arbitrary thread, so, if the main thread is unresponsive, we report its current stack every couple of seconds. This is often all that's needed to find and fix the deadlock.
Detecting Deadlocks in Python
In Python, we can take the same approach as above:
- Start a watchdog thread.
- Ping the Python watchdog thread in your main loop.
- If the watchdog detects that you're unresponsive, record the main thread's Python stack (this time with sys._current_frames) and report it.
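The three steps above can be sketched roughly as follows. This is a minimal illustration, not IMVU's actual implementation: the Watchdog class, its timeout and check_interval parameters, and the report callback are all hypothetical names invented for this example. The only real APIs used are sys._current_frames, traceback.format_stack, and the standard threading and time modules.

```python
import sys
import threading
import time
import traceback


class Watchdog:
    """Hypothetical helper: reports the watched thread's stack when pings stop."""

    def __init__(self, timeout, report, check_interval=1.0):
        self._timeout = timeout                # seconds without a ping before reporting
        self._report = report                  # callable that receives the stack as a string
        self._check_interval = check_interval  # how often the watchdog thread wakes up
        # Assumption: the watchdog is constructed on the thread it should watch.
        self._watched_thread_id = threading.get_ident()
        self._last_ping = time.monotonic()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def ping(self):
        # Call this once per iteration of the main event loop.
        self._last_ping = time.monotonic()

    def _run(self):
        # Event.wait returns False on timeout and True once stop() has been called.
        while not self._stop.wait(self._check_interval):
            if time.monotonic() - self._last_ping > self._timeout:
                frame = sys._current_frames().get(self._watched_thread_id)
                if frame is not None:
                    # Keeps reporting on every wake-up while the stall continues.
                    self._report("".join(traceback.format_stack(frame)))
```

In a real client, report would write to a log or submit a crash report; here it is just whatever callable you pass in. Because the watchdog keeps reporting for as long as the stall lasts, you get a series of stack snapshots rather than a single one.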
Python's global interpreter lock (GIL) can throw a wrench in this plan. If one thread enters an infinite loop while keeping the GIL held (say, in a native extension), the watchdog thread will never wake and so cannot report a deadlock. In practice, this isn't a problem, because the C++ deadlock detector will notice and report a deadlock. Plus, most common deadlocks are caused by calls that release the GIL: file.read, and so on.
It might help to think of the Python deadlock detector as a fallback that, if successful, adds richer information to your deadlock reports. If it fails, whatever. The C++ deadlock detector is probably enough to diagnose and fix the problem.
What did we learn?
It turned out the IMVU client had several bugs where we blocked the main thread on the network, sometimes for up to 30 seconds. By that point, most users just clicked the close box [X] and terminated the process. Oops.
In addition, the deadlock detectors pointed out places where we were doing too much work in between message pumps. For example, loading some assets into the 3D scene might nominally take 200ms. On a computer with 256 MB of RAM, though, the system might start thrashing, so loading the same assets would take 5s and be reported as a "deadlock". The solution was to reduce the program's working set and bite off smaller chunks of work in between pumps.
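To make "smaller chunks of work in between pumps" concrete, here is one way it could look in Python. It's a sketch under assumptions, not the IMVU client's code: load_assets_in_chunks, run_event_loop, load_one, pump_messages, and budget_seconds are hypothetical names. The idea is simply to write the slow job as a generator that yields whenever it has used up its time budget, so the event loop gets to run again.

```python
import time


def load_assets_in_chunks(assets, load_one, budget_seconds=0.05):
    """Load assets one by one, yielding whenever the time budget is spent."""
    deadline = time.monotonic() + budget_seconds
    for asset in assets:
        load_one(asset)
        if time.monotonic() >= deadline:
            yield  # hand control back so the caller can pump messages
            deadline = time.monotonic() + budget_seconds


def run_event_loop(task, pump_messages):
    """Alternate between one slice of work and one message pump until done."""
    for _ in task:
        pump_messages()
    pump_messages()  # one last pump after the final slice of work
```

If pump_messages is also where the heartbeat pings the watchdog, a slow machine just takes more slices to finish the load instead of going silent for five seconds.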
I don't recall seeing many "computer science" deadlocks, but these watchdogs were invaluable in tracking down important failure conditions in the IMVU client.
Next time, we'll improve the accuracy of our crash metrics and answer the question "How do you know your metrics are valid?"
* Median session length is a more useful reliability metric. It's possible to fix crashes and see no change in your percentage of failed sessions, if fixing crashes simply causes sessions to become longer.
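A toy calculation shows how that can happen. The numbers below are made up purely for illustration (they are not IMVU data), and the summarize helper is a hypothetical name: two sessions used to crash early, and after the fix those same two sessions run much longer before hitting a different crash.

```python
from statistics import median


def summarize(sessions):
    """sessions: list of (length_in_minutes, ended_in_crash) pairs."""
    crash_rate = sum(1 for _, crashed in sessions if crashed) / len(sessions)
    median_length = median(length for length, _ in sessions)
    return crash_rate, median_length


# Invented example data: (minutes, crashed?)
before_fix = [(5, True), (10, True), (30, False), (45, False)]
after_fix = [(40, True), (35, True), (30, False), (45, False)]

before_rate, before_median = summarize(before_fix)
after_rate, after_median = summarize(after_fix)
```

In this made-up data the fraction of failed sessions is 50% both before and after, while the median session length goes from 20 minutes to 37.5 minutes.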