We're getting to the point in our development at IMVU where we're starting to have to fix some of the harder, more obscure, crash bugs. I spent the last week on one, and I wanted to write about what it was and the process we used to find it.

First, IMVU has a very good automated bug reporting system. Crashes, unclean shutdowns, and a couple of other types of errant behavior are automatically reported and entered into our bug database. Most crashes (read: exceptions) are reported via our normal crash handling logic, along with some basic system information and both Python and C++ stack traces. There are some scenarios, however, where our normal mechanisms do not get a chance to handle the error. In those cases, a "last-chance handler" writes the exception information to a file that gets submitted as part of the next report. The biggest omission from our automated reporting is deadlock detection.

Dusty discovered a deadlock in Python's logging module:

import logging
logger = logging.getLogger('logger')
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)

class C:
    def __str__(self):
        # This log call happens while the handler's lock is already held.
        logger.info('C.__str__')
        return 'C'

logger.info('test')
logger.info(C())  # deadlock: the same thread tries to re-acquire the handler lock

Logging from an object that's being logged tries to acquire a non-recursive lock twice on the same thread. First rule of avoiding deadlocks: never call back into user code while a lock is held. In this simple case, a recursive lock would have sufficed. We decided to simply disable the logging.Handler lock; after all, garbled log file output isn't so bad, and it's better than deadlocking.

def nullfunc(*args): pass
logging.Handler.acquire = nullfunc
logging.Handler.release = nullfunc
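As a minimal aside (the names here are mine, not from our codebase), this is the difference between the plain lock the logging module uses and the recursive lock that would have sufficed:

```python
import threading

# A plain Lock is not recursive: a second blocking acquire() from the
# same thread would hang forever, so we probe with a non-blocking attempt.
plain = threading.Lock()
plain.acquire()
second = plain.acquire(False)  # returns False: the same thread cannot re-enter
plain.release()

# An RLock tracks its owning thread, so the same thread may re-acquire it.
recursive = threading.RLock()
recursive.acquire()
recursive.acquire()  # succeeds instead of deadlocking
recursive.release()
recursive.release()
```

Had logging used an RLock, the nested logger.info call in the reproduction above would simply have re-entered the handler lock and carried on.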

Disabling the lock seemed to fix the problem. Except for the intermittent crash reports we started getting (last-chance crashes, to be specific) inside the CRT, via Python's file functions. The exception was a null-pointer access, and the stack looked like this:

PYTHON23!0x000baa00 -                         - PyFile_Type
PYTHON23!0x0003ac27 -                         - PyFile_SetEncoding
  MSVCRT!0x00030a06 -                         - clearerr
   ntdll!0x00001010 -                         - RtlEnterCriticalSection

Well, that's weird. PyFile_Type isn't a function... and PyFile_SetEncoding doesn't call clearerr, as far as I can tell. Yet a test program that calls clearerr(NULL) does indeed raise the same exception. So why would Python call clearerr on a null FILE* pointer?

It was around this time that I noticed the crashes happened during log rollover. We use Python's RotatingFileHandler to rotate through a fixed set of fixed-size log files. Things started to click, and after a few lucky Google searches, I found out that Python file objects (PyFileObject) are not thread-safe. There was some discussion on python-dev, but no actionable conclusion was reached. The long and short of it is that if two threads work on the same file at the same time, the results are undefined. More specifically, if one thread closes a file while another tries to write to it, you can get into a situation where f->f_fp is NULL and clearerr(NULL) raises an exception. (Ah ha!)
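For the curious: the Python layer does catch sequential misuse of a file object. Closing a file and then writing to it raises ValueError. Here's a quick sketch of that benign, single-threaded case (on a modern interpreter; the crash above is the concurrent case, where the close lands between the C code's check and its use of f->f_fp, and no Python-level guard can prevent that):

```python
import tempfile

# Sequential misuse is caught safely at the Python layer:
with tempfile.TemporaryFile(mode='w') as f:
    f.write('fine while open\n')

# The with-block has closed f; further writes raise ValueError.
try:
    f.write('boom')
    caught = False
except ValueError:
    caught = True  # "I/O operation on closed file"
```

With two threads and no lock, the writer can pass the "is this file open?" check just before the other thread closes it, leaving the writer operating on a dangling FILE* in C.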

But why were two threads working on the same file object at the same time? Oh, right! We disabled the logging synchronization earlier... so when one thread decides it's time to roll over, it starts closing files and doing its internal shuffling while another thread is still trying to write log statements. Oops! The "proper" fix involves reaching into the deadlock-prone logging module and replacing its non-recursive lock with the recursive equivalent.
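Here's a sketch of that fix, in the same monkey-patching spirit as the acquire/release patch above: replace Handler.createLock so every handler allocates a recursive lock. (This is a sketch, not a drop-in patch. Current versions of the logging module allocate an RLock themselves, so it only matters on interpreters whose createLock hands out a plain, non-recursive lock.)

```python
import logging
import threading

# Give every Handler a recursive lock so that a log call made while a
# record is being formatted (e.g. from a __str__ method) re-enters the
# handler lock cleanly instead of deadlocking.
def createLock(self):
    self.lock = threading.RLock()

logging.Handler.createLock = createLock

# Dusty's reproduction now completes instead of hanging:
logger = logging.getLogger('patched')
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)

class C:
    def __str__(self):
        logger.info('C.__str__')
        return 'C'

logger.info(C())
```

And because handlers hold their locks again, RotatingFileHandler's rollover is once more serialized against concurrent emit calls.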

Lessons learned: it's very hard to understand the consequences of changes to synchronization code. Also, without our automated crash reporting, we would never have found this problem.