We're getting to the point in our development at IMVU where we're starting to have to fix some of the harder, more obscure, crash bugs. I spent the last week on one, and I wanted to write about what it was and the process we used to find it.

First, IMVU has a very good automated bug reporting system. Crashes, unclean shutdowns, and a couple of other types of errant behavior are automatically reported and entered into our bug database. Most crashes (read: exceptions) are reported via our normal crash handling logic, along with some basic system information and both Python and C++ stack traces. There are some scenarios, however, where our normal mechanisms do not get a chance to handle the error. In those cases, a "last-chance handler" writes the exception information to a file that gets submitted as part of the next report. The biggest omission from our automated reporting is deadlock detection.

Dusty discovered a deadlock in Python's logging module:

import logging
logger = logging.getLogger('logger')
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)

class C:
    def __str__(self):
        # This log call happens while the handler's lock is already held.
        logger.info('C.__str__')
        return 'C'

logger.info('test')
logger.info(C())  # deadlock: the same thread tries to re-acquire the handler lock

Logging from an object that's being logged tries to acquire a non-recursive lock twice on the same thread. First rule of avoiding deadlocks: never call back into user code while a lock is held. In this simple case, a recursive lock would have sufficed. We decided to simply disable the logging.Handler lock; after all, garbled log file output isn't so bad, and it's better than deadlocking.

def nullfunc(*args): pass
logging.Handler.acquire = nullfunc
logging.Handler.release = nullfunc
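As a minimal aside (the names here are mine, not from our codebase), this is the difference between the plain lock the logging module uses and the recursive lock that would have sufficed:

```python
import threading

# A plain Lock is not recursive: a second blocking acquire() from the
# same thread would hang forever, so we probe with a non-blocking attempt.
plain = threading.Lock()
plain.acquire()
second = plain.acquire(False)  # returns False: the same thread cannot re-enter
plain.release()

# An RLock tracks its owning thread, so the same thread may re-acquire it.
recursive = threading.RLock()
recursive.acquire()
recursive.acquire()  # succeeds instead of deadlocking
recursive.release()
recursive.release()
```

Had logging used an RLock, the nested logger.info call in the reproduction above would simply have re-entered the handler lock and carried on.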

Disabling the lock seemed to fix the problem. Except for the intermittent crash reports we started getting (last-chance crashes, to be specific) inside the CRT, via Python's file functions. The exception was a null-pointer access, and the stack looked like this:

PYTHON23!0x000baa00 -                         - PyFile_Type
PYTHON23!0x0003ac27 -                         - PyFile_SetEncoding
  MSVCRT!0x00030a06 -                         - clearerr
   ntdll!0x00001010 -                         - RtlEnterCriticalSection

Well, that's weird. PyFile_Type isn't a function... and PyFile_SetEncoding doesn't call clearerr, as far as I can tell. Yet a test program that calls clearerr(NULL) does indeed raise the same exception. So why would Python call clearerr on a null FILE* pointer?

It was around this time that I noticed the crashes happened during log rollover. We use Python's RotatingFileHandler to rotate through a fixed set of fixed-size log files. Things started to click, and after a few lucky Google searches, I found out that Python file objects (PyFileObject) are not thread-safe. There was some discussion on python-dev, but no actionable conclusion was reached. The long and short of it is that if two threads work on the same file at the same time, the results are undefined. More specifically, if one thread closes a file while another tries to write to it, you can get into a situation where f->f_fp is NULL and clearerr(NULL) raises an exception. (Ah ha!)
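For the curious: the Python layer does catch sequential misuse of a file object. Closing a file and then writing to it raises ValueError. Here's a quick sketch of that benign, single-threaded case (on a modern interpreter; the crash above is the concurrent case, where the close lands between the C code's check and its use of f->f_fp, and no Python-level guard can prevent that):

```python
import tempfile

# Sequential misuse is caught safely at the Python layer:
with tempfile.TemporaryFile(mode='w') as f:
    f.write('fine while open\n')

# The with-block has closed f; further writes raise ValueError.
try:
    f.write('boom')
    caught = False
except ValueError:
    caught = True  # "I/O operation on closed file"
```

With two threads and no lock, the writer can pass the "is this file open?" check just before the other thread closes it, leaving the writer operating on a dangling FILE* in C.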

But why were two threads working on the same file object at the same time? Oh, right! We disabled the logging synchronization earlier... so when one thread decides it's time to roll over, it starts closing files and doing its internal shuffling while another thread is still trying to write log statements. Oops! The "proper" fix involves reaching into the deadlock-prone logging module and replacing its non-recursive lock with the recursive equivalent.
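Here's a sketch of that fix, in the same monkey-patching spirit as the acquire/release patch above: replace Handler.createLock so every handler allocates a recursive lock. (This is a sketch, not a drop-in patch. Current versions of the logging module allocate an RLock themselves, so it only matters on interpreters whose createLock hands out a plain, non-recursive lock.)

```python
import logging
import threading

# Give every Handler a recursive lock so that a log call made while a
# record is being formatted (e.g. from a __str__ method) re-enters the
# handler lock cleanly instead of deadlocking.
def createLock(self):
    self.lock = threading.RLock()

logging.Handler.createLock = createLock

# Dusty's reproduction now completes instead of hanging:
logger = logging.getLogger('patched')
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)

class C:
    def __str__(self):
        logger.info('C.__str__')
        return 'C'

logger.info(C())
```

And because handlers hold their locks again, RotatingFileHandler's rollover is once more serialized against concurrent emit calls.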

Lessons learned: it's very hard to understand the consequences of changes to synchronization code. Also, without our automated crash reporting, we would never have found this problem.