Reporting Crashes in IMVU: Structured Exceptions

Previously, we discussed the implementation of automated reporting of
unhandled C++ exceptions. However, if you’ve ever programmed in C++,
you know that C++ exceptions are not the only way your code can fail.
In fact, the most common failures probably aren’t C++ exceptions at
all. You know what I’m referring to: the dreaded access violation
(sometimes called segmentation fault).

Access Violation

How do we detect and report access violations? First, let’s talk
about what an access violation actually is.

Your processor has a mechanism for detecting loads and stores that
touch invalid memory addresses. When one happens, the processor raises
a fault, which Windows exposes to the program via Structured Exception
Handling (SEH). Matt Pietrek has written an excellent article on how
SEH works, including a description of C++ exceptions implemented
on top of SEH. The gist is that there is a linked list of handlers,
registered by the stack frames that can possibly handle the exception.
When an exception occurs, that list is walked, and the first entry
that claims it can handle the exception does so. If no entry claims
it, the program is halted and the familiar crash dialog box is
displayed to the user.
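
To make the "walk the list until someone claims it" idea concrete, here is a toy model in Python. It is only an illustration of the dispatch order, not how Windows implements SEH; the Frame class, the dispatch function, and the exception names are all made up for this sketch.

```python
# Toy model of SEH dispatch. This is NOT the Windows implementation,
# just the "walk the chain until a handler claims the exception" idea.

ACCESS_VIOLATION = "access violation"
DIVIDE_BY_ZERO = "divide by zero"


class Frame:
    """One entry in the chain: a name plus the exceptions it will handle."""

    def __init__(self, name, handles, outer=None):
        self.name = name
        self.handles = handles
        self.outer = outer  # link to the next (outer) entry, or None


def dispatch(innermost, exception):
    """Walk from the innermost entry outward and return the name of the
    first entry that claims the exception, or None if nobody does."""
    frame = innermost
    while frame is not None:
        if exception in frame.handles:
            return frame.name
        frame = frame.outer
    return None  # unhandled: this is where the crash dialog would appear


main_frame = Frame("main", {ACCESS_VIOLATION})
render_frame = Frame("render", {DIVIDE_BY_ZERO}, outer=main_frame)
```

In this model, a divide by zero raised under render_frame is claimed by "render", an access violation falls through to "main", and anything else reaches the end of the chain unclaimed.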

OK, so access violations can be detected with SEH. In fact, with the
same mechanism, we can detect all other types of structured
exceptions, including division by zero and stack overflow. What does
the code look like? It’s approximately:

bool handle_exception_impl_seh(function f) {
    __try {
        // This is the previously-described C++ exception handler.
        // For various reasons, they need to be in different functions.
        // C++ exceptions are implemented in terms of SEH, so the C++
        // exception handling must be deeper in the call stack than
        // the structured exception handling.
        return handle_exception_impl_cpp(f);
    }
    // catch all structured exceptions here
    __except (EXCEPTION_EXECUTE_HANDLER) {
        PyErr_SetString(PyExc_RuntimeError, "Structured exception in C++ function");
        return true; // an error occurred
    }
}

Note the __try and __except keywords. This means we’re using
structured exception handling, not C++ exception handling. The filter
expression in the __except statement evaluates to
EXCEPTION_EXECUTE_HANDLER, indicating that we always want to handle
structured exceptions. From the filter expression, you can optionally
use the GetExceptionCode and GetExceptionInformation intrinsics to
access information about the actual error.
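
For a sense of what GetExceptionCode returns, the sketch below lists a few well-known Windows exception codes and turns a raw code into a label that could go into a report. The code values are the standard ones from the Windows headers; the describe_exception_code helper is hypothetical and written in Python only to keep the examples in one language.

```python
# A few well-known Windows structured exception codes.
EXCEPTION_CODES = {
    0xC0000005: "EXCEPTION_ACCESS_VIOLATION",
    0xC0000094: "EXCEPTION_INT_DIVIDE_BY_ZERO",
    0xC00000FD: "EXCEPTION_STACK_OVERFLOW",
}


def describe_exception_code(code):
    """Hypothetical helper: turn a raw code into a report-friendly label."""
    code &= 0xFFFFFFFF  # codes are unsigned 32-bit values
    name = EXCEPTION_CODES.get(code, "unknown structured exception")
    return "%s (0x%08X)" % (name, code)
```

The mask matters because a code that has passed through a signed 32-bit integer shows up as a negative number: -1073741819 and 0xC0000005 are the same access violation.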

Now, suppose you write some code like:

Object* o = 0;
o->method(); // oops!

The error will be converted to a Python exception, and reported
with our existing mechanism. Good enough for now! However, there are
real problems with this approach. Can you think of them?

Soon, I’ll show the full implementation of the structured
exception handler.

Reporting Crashes in IMVU: Part II: C++ Exceptions

A year ago, I explained
how the IMVU client automatically reports unexpected Python exceptions
(crashes) to us. I intended that post to be the first of a long
series that covered all of the tricks we use to detect and report
abnormal situations. Clearly, my intentions have not played out yet,
so I am going to pick up that series by describing how we catch
exceptions that occur in our C++ code. Without further ado,

Reporting C++ Exceptions

As discussed earlier, IMVU’s error handling system can handle any
Python exception that bubbles out of the client’s main loop and
automatically report the failure back to us so that we can fix it for
the next release. However, our application is a
combination of Python and C++, so what happens if our C++ code has a
bug and raises an uncaught C++ exception, such as std::bad_alloc
or std::out_of_range?

Most of our C++ code is exposed to Python via the excellent
Boost.Python library, which automatically catches C++ exceptions at
the boundary and translates them to Python exceptions. The
translation layer looks something like this:

bool handle_errors(function fn) {
    try {
        fn();
        return false; // no error
    }
    catch (const std::runtime_error& e) {
        // raise RuntimeError into Python
        PyErr_SetString(PyExc_RuntimeError, e.what());
    }
    catch (const std::bad_alloc&) {
        // raise MemoryError into Python
        PyErr_SetString(PyExc_MemoryError, "out of memory");
    }
    catch (const std::exception& e) {
        // raise Exception into Python
        PyErr_SetString(PyExc_Exception, e.what());
    }
    catch (...) {
        PyErr_SetString(PyExc_Exception, "Unknown C++ exception");
    }
    return true;
}

Thus, any C++ exception that’s thrown by the C++ function is
caught by Boost.Python and reraised as the appropriate Python
exception, which will already be handled by the previously-discussed
crash reporting system.

Let’s take another look at the client’s main loop:

import sys

def mainLoop():
    while running:
        pumpWindowsMessages()
        updateAnimations()
        redrawWindows()

def main():
    try:
        mainLoop()
    except:
        # includes exception type, exception value, and python stack trace
        error_information = sys.exc_info()
        if OK == askUserForPermission():
            submitError(error_information)

If the C++ functions called from updateAnimations() or redrawWindows()
raise a C++ exception, it will be caught by the Python error-handling
code and reported to us the same way Python
exceptions are.
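
From the Python side, a translated C++ exception is indistinguishable from one raised in Python. The sketch below uses a plain Python function, update_animations, as a stand-in for a Boost.Python-wrapped C++ function; that function, its error message, and describe_failure are all assumptions made for illustration.

```python
import sys
import traceback


def update_animations():
    # Stand-in for a wrapped C++ function whose std::runtime_error
    # was translated into RuntimeError at the boundary.
    raise RuntimeError("skeleton has no root bone")


def describe_failure(step):
    """Run step; if it raises, return (type name, message, innermost
    Python function in the traceback). Return None on success."""
    try:
        step()
    except Exception:
        exc_type, exc_value, tb = sys.exc_info()
        innermost = traceback.extract_tb(tb)[-1]
        return exc_type.__name__, str(exc_value), innermost.name
    return None


failure = describe_failure(update_animations)
```

One caveat: because the stand-in is written in Python, it appears in the traceback. With a real extension function, the Python traceback would end at the Python line that made the call, since C++ frames do not show up in a Python stack trace.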

Great! But is this a complete solution to the problem? Exercise
for the reader: what else could go wrong here? (Hint: we use Visual
Studio 2005 and there was a bug in catch (...) that Microsoft fixed in
Visual Studio 2008…)

Reporting Crashes in IMVU: Catching Python Exceptions

For years now, I have been meaning to write a series of articles on the automated crash reporting system in the IMVU client. This first article will give a bit of background on the structure of the client and show how we handle Python exceptions.

At IMVU, we generally subscribe to the Fail Fast philosophy of handling errors: when the client encounters an unexpected error, we immediately crash the program and ask the user to submit a crash report. As part of the crash report, we send log files, stack traces, system information, and anything else that might help us debug the failure.
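
As a rough illustration of what such a report might bundle, here is a minimal sketch. The field names and the build_crash_report function are made up for this example and are not IMVU's actual report format.

```python
import platform
import sys
import traceback


def build_crash_report(exc_info, log_lines):
    """Bundle the pieces a developer would want when triaging a crash.
    Field names are illustrative, not an actual report format."""
    exc_type, exc_value, tb = exc_info
    return {
        "exception": "%s: %s" % (exc_type.__name__, exc_value),
        "stack_trace": traceback.format_exception(exc_type, exc_value, tb),
        "system": {
            "os": platform.platform(),
            "python": platform.python_version(),
        },
        "log_tail": log_lines[-50:],  # keep only the last 50 log lines
    }


try:
    {}["missing"]
except KeyError:
    report = build_crash_report(sys.exc_info(), ["client started", "loading room"])
```

Capturing the report at the point of failure, rather than reconstructing it later, is what makes failing fast useful: the stack trace and recent log lines describe the state at the moment the bug surfaced.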

You might wonder why we crash the program whenever anything goes wrong rather than trying to catch the error and continue running. Counterintuitively, crashing the program forces us to act on crashes and immediately exposes bugs that might trigger unwanted behavior or lost data down the road.

Now let’s talk a little bit about how the client is structured. The IMVU client is written primarily in Python, with time-critical components such as the 3D renderer written in C++. Since the client is a cross between a normal interactive Windows program and a real-time game, the main loop looks something like this:

def main():
    while running:
        pumpWindowsMessages() # for 1/30th of a second
        updateAnimations()
        redrawWindows()

This structure assumes that no exceptions bubble into or out of the main loop. Let’s imagine that updateAnimations() has a bug and occasionally raises an uncaught exception. If you run the client with a standard command-line python invocation, the program prints the exception and stack trace to the console window and exits. That’s all great, but our users don’t launch our client by invoking python from the command line: we use py2exe to build a standalone executable that users ultimately run. With an unmodified py2exe application, uncaught exceptions are still printed to sys.stderr (as above), but there is no console window to display the error. Instead, the py2exe bootstrap code registers a handler so that errors are logged to a file, and when the program shuts down, a dialog box shows something like “An error has been logged. Please see IMVUClient.exe.log.”
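
The log-to-a-file behavior can be approximated in a few lines with sys.excepthook. This is a rough analog for illustration only, not py2exe's actual bootstrap code, and install_log_file_hook is a made-up name.

```python
import sys
import traceback


def install_log_file_hook(log_path):
    """Send uncaught exceptions to a log file instead of the console.
    A rough analog of the behavior described above, not py2exe's code."""

    def hook(exc_type, exc_value, tb):
        # Append, so earlier errors from the same run are preserved.
        with open(log_path, "a") as log:
            traceback.print_exception(exc_type, exc_value, tb, file=log)

    sys.excepthook = hook
    return hook
```

After install_log_file_hook("IMVUClient.exe.log") is called, any exception that escapes the main script is written to that file. Nothing tells the user or the developers what happened, though, which is the gap the rest of this post addresses.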

From a crash reporting standpoint, this is not good enough. We can’t be asking our users to manually hunt down some log files on their hard drives and mail them to us. It’s just too much work – they will simply stop using our product. (Unfortunately, most of the software out there asks users to do exactly this!) We need a way for the client to automatically handle errors and prompt the users to submit the reports back to us. So let’s rejigger main() a bit:

import sys

def mainLoop():
    while running:
        pumpWindowsMessages()
        updateAnimations()
        redrawWindows()

def main():
    try:
        mainLoop()
    except:
        error_information = sys.exc_info()
        if OK == askUserForPermission():
            submitError(error_information)

This time, if a bug in updateAnimations() raises an exception, the top-level try: except: clause catches the error and handles it intelligently. In our current implementation, we post the error report to a Bugzilla instance, where we have built custom tools to analyze and prioritize the failures in the field.
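
Here is a self-contained version of that pattern that can be run outside the client. The permission prompt and the submission function are hypothetical stand-ins (ask_user_for_permission always says yes, and submit_error just appends to a list), and the exit codes are an addition for the sake of the example.

```python
import sys

OK = "ok"
submitted = []  # stands in for the remote bug tracker


def ask_user_for_permission():
    # Hypothetical stand-in for the dialog box; always says yes here.
    return OK


def submit_error(error_information):
    # Hypothetical stand-in for posting the report to the bug tracker.
    submitted.append(error_information)


def main(main_loop):
    try:
        main_loop()
    except Exception:
        # (exception type, exception value, traceback object)
        error_information = sys.exc_info()
        if ask_user_for_permission() == OK:
            submit_error(error_information)
        return 1  # nonzero exit code: the client did crash
    return 0


def buggy_loop():
    raise ValueError("bad animation frame")


exit_code = main(buggy_loop)
```

This sketch catches Exception rather than using a bare except, so SystemExit and KeyboardInterrupt still propagate. Whether a crash reporter should also intercept those is a judgment call.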

This is the gist of how the IMVU client automatically reports failures. The next post in this series will cover automatic detection of errors in our C++ libraries.