Tasks, object lifetimes, and implicit cancellation

Background: The IMVU client has a system we call the ‘task system’, which works a lot like lightweight threads or coroutines. You can schedule some work and get back a Future object, on which you can either wait or request the actual result (if it has materialized, that is).

I just read Joe Duffy’s IDisposable, finalization, and concurrency. The parallels between TPL and IMVU’s task system surprised me.

When we built the task system, there were two ways to cancel execution of a task: explicit and implicit. You could either call future.cancel() or just release your reference to the future and wait for it to be collected by the GC. In either case, the scheduler would notice and stop running your task. I’ve often wondered if support for implicit task cancellation was a good design.

After reading Joe’s article, I’m convinced. If you believe that a running task consumes resources just like memory does, and that implicit resource disposal (garbage collection) is the right default, then implicit cancellation is the only sane default. In fact, we recently removed support for explicit cancellation, because we had never really used it. (Some of Dusty’s code used it, but he said it was fine to take it out.)

We may have to implement explicit cancellation again someday, but now I’m happy to wait until there is a compelling use case.
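To make the idea concrete, here’s a minimal sketch of implicit cancellation, assuming a toy cooperative scheduler that steps generator-based tasks. The real task system is considerably more sophisticated, and all the names here are made up:

import weakref

class Future(object):
    def __init__(self):
        self.done = False

class Scheduler(object):
    # Tasks are generators, stepped once per pump().  The scheduler
    # holds only a *weak* reference to each task's Future, so when the
    # caller drops its Future, the task is implicitly cancelled.
    def __init__(self):
        self.tasks = []

    def schedule(self, task):
        future = Future()
        self.tasks.append((weakref.ref(future), task))
        return future

    def pump(self):
        still_running = []
        for future_ref, task in self.tasks:
            future = future_ref()
            if future is None:
                continue  # nobody wants the result anymore: cancelled
            try:
                task.next()
                still_running.append((future_ref, task))
            except StopIteration:
                future.done = True
        self.tasks = still_running

scheduler = Scheduler()

def work():
    for i in range(5):
        print 'working...', i
        yield

future = scheduler.schedule(work())
scheduler.pump()  # prints 'working... 0'
del future        # release our only reference...
scheduler.pump()  # ...and the scheduler quietly drops the task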

backup-sourceforge.py

Today I updated my backup-sourceforge.py script to reflect the new SourceForge backup policies, including support for Subversion repositories. You can download it on my scripts page.

It used to be a bash script, but the second you’re using commands like trap "error" ERR and set -e and cd "`dirname \"$0\"`", it’s worth rewriting in Python. Interestingly, the Python came out surprisingly terse and clear, unlike my past experiences rewriting shell scripts in Python. Also, it used to call into the adocman project (some Perl scripts provided by SourceForge to make backups easier), but it was such a pain to set up CPAN and install the required Perl packages that I decided to reimplement their xml_export tool in Python with urllib2. I managed to replace xml_export with fewer than 20 lines of code. Python really does have “batteries included”!
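For the curious, the urllib2 replacement boils down to something like this. The endpoint URL, group ID, credentials, and filename below are placeholders, not the real values:

import urllib2

# Placeholder endpoint and project ID:
url = 'https://sourceforge.net/export/xml_export2.php?group_id=12345'

password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, 'my_username', 'my_password')
opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(password_mgr))

f = open('project_export.xml', 'wb')
f.write(opener.open(url).read())
f.close()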

iPodExtract

This weekend I wrote a little program to extract music and movies off of your iPod, reconstituting an Artist/Album/Track directory tree based on the tags in the files. I lost my RAID 0 array a while back (due to a controller failure, not a drive failure as you might think), so I needed a way to restore my iTunes purchases. Also, I didn’t want to rerip all of my CDs. The program worked well for my purposes, so I thought I should whip up a GUI and release it.

Its name is iPodExtract. Check it out.

Also, I’m experimenting with Google ads.

Postmortem of a crash

We’re getting to the point in our development at IMVU where we’re starting to have to fix some of the harder, more obscure, crash bugs. I spent the last week on one, and I wanted to write about what it was and the process we used to find it.

First, IMVU has a very good automated bug reporting system. Crashes, unclean shutdowns, and a couple other types of errant behavior get automatically reported and entered into our bug database. Most crashes (read: exceptions) are reported via our normal crash handling logic, along with some basic system information and both Python and C++ stack traces. There are some scenarios, however, where our normal mechanisms do not get a chance to handle the error. In these cases, we’ve got a “last-chance handler” which writes the exception information to a file, which is then submitted as part of the next report. The biggest omission from our automated reporting is deadlock detection.
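The heart of a last-chance handler is only a few lines. Here’s a simplified sketch; the real one also records system information, and the filename is made up:

import sys
import traceback

def last_chance_handler(exc_type, exc_value, exc_tb):
    # Persist the exception so the next automated report can include it.
    f = open('last-chance-exception.txt', 'w')
    try:
        traceback.print_exception(exc_type, exc_value, exc_tb, None, f)
    finally:
        f.close()
    # Fall through to the default handler so the error is still printed.
    sys.__excepthook__(exc_type, exc_value, exc_tb)

sys.excepthook = last_chance_handler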

Dusty discovered a deadlock in Python’s logging module:

import logging
logger = logging.getLogger('logger')
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)

class C:
    def __str__(self):
        # This runs while the handler's I/O lock is already held, so the
        # nested logging call tries to acquire the same lock again.
        logger.info('C.__str__')
        return 'C'

logger.info('test')
logger.info(C())  # deadlock: formatting the record calls C.__str__

Logging from an object that’s being logged tries to acquire a non-recursive lock twice on the same thread. First rule of avoiding deadlocks: never call back into user code while a lock is held. In this simple case, a recursive lock would have sufficed. We decided to simply disable the logging.Handler lock; after all, garbled log file output isn’t so bad, and it’s better than deadlocking.

# Monkey-patch the handler's lock operations into no-ops:
def nullfunc(*args): pass
logging.Handler.acquire = nullfunc
logging.Handler.release = nullfunc

That seemed to fix the problem. Except for the intermittent crash reports we started getting (last-chance crashes, to be specific) from inside the CRT, via Python’s file functions. The exception was a null pointer access, and the stack looked like:

PYTHON23!0x000baa00 -                         - PyFile_Type
PYTHON23!0x0003ac27 -                         - PyFile_SetEncoding
  MSVCRT!0x00030a06 -                         - clearerr
   ntdll!0x00001010 -                         - RtlEnterCriticalSection

Well, that’s weird. PyFile_Type isn’t a function… and PyFile_SetEncoding doesn’t call clearerr, as far as I can tell. But a test program that calls clearerr(NULL) does indeed raise the same exception. So why would Python call clearerr on a null FILE* pointer?
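(If you’d rather not write the C test program yourself, a ctypes snippet reproduces it, assuming Windows; be warned that it crashes the interpreter by design:)

import ctypes
ctypes.cdll.msvcrt.clearerr(None)  # clearerr(NULL): access violation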

It was around this time that I noticed the crashes happened during log rollover. We use Python’s RotatingFileHandler to rotate through a fixed set of fixed-size log files. Things started to click in my mind, and after a few lucky Google searches, I found out that Python file objects (PyFileObject) are not thread-safe. There was some discussion on python-dev, but no actionable conclusion was reached. The long and short of it is that if you have two threads working on the same file at the same time, the results are undefined. More specifically, if one thread closes a file while another tries to write, you can get into a situation where f->f_fp is NULL, and clearerr(NULL) raises an exception. (Ah ha!)

But why are two threads working on a file object at the same time? Oh right! We disabled the logging synchronization earlier… so if one thread decides it’s time to rollover, it starts closing files and doing its internal shuffling, all while another thread is trying to output log statements. Oops! The “proper” fix involves reaching into the deadlock-prone logging module and replacing its nonrecursive lock with the recursive equivalent.
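That fix looks something like the following, assuming it runs before any handlers are constructed (handlers create their lock at construction time) and that the acquire/release no-ops above are removed:

import logging
import threading

def createLock(self):
    # Handler.createLock normally allocates a non-recursive lock; an
    # RLock lets the same thread re-enter the handler without deadlock.
    self.lock = threading.RLock()

logging.Handler.createLock = createLock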

Lesson learned: it’s very hard to understand the consequences of changes to synchronization code. Also, without our crash reporting system, we never would have found this problem.

Guess the output of this code: answers

My previous post showed a bit of code that demonstrates an actual bug I ran into this week. Here is the code again:

# Python
names = ['foo', 'bar', 'baz']
fns = [lambda: n for n in names]
print [f() for f in fns]

Here’s the output:

['baz', 'baz', 'baz']

Let’s try to figure out why. First, factor out the list comprehensions:

names = ['foo', 'bar', 'baz']
fns = []
for n in names:
    fns.append(lambda: n)
result = []
for f in fns:
    result.append(f())
print result

Same result. Good. Now let’s factor out the lambda (anonymous function) and give it an explicit name.

names = ['foo', 'bar', 'baz']
fns = []
for n in names:
    def fn(): return n
    fns.append(fn)
result = []
for f in fns:
    result.append(f())
print result

Still the same result. The reason this program outputs what it does should be getting clearer… Let’s think about what a for loop does again. “for N in LS” translates into something like “if LS has anything in it, assign N to the first item in LS and run the loop body; if there is another item in LS, assign N to that item and run the body again; repeat.” That assignment to N happens in the current local scope.
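Desugared by hand, the loop behaves roughly like this (an approximation; the real interpreter uses a dedicated bytecode, but the binding behavior is the same):

names = ['foo', 'bar', 'baz']
it = iter(names)
while 1:
    try:
        n = it.next()  # n is rebound in the enclosing scope each time
    except StopIteration:
        break
    # ...loop body runs here...
    print n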

Okay, getting closer… Inside a function, variable lookups start in the current local scope; if the lookup fails, it continues in the next outer scope, and so on, all the way up to the global scope. (If the name isn’t there either, you get a NameError.) This lets you create inner functions that use variables from outside, such as globals. The important thing to note is that these lookups happen only when you actually use a variable.

So let’s see what’s happening here. The program builds a list of functions that return a variable n. Since n is not in the functions’ local scopes, it is pulled from the outer scope (in this case, the globals). By the time the functions are called, however, n has changed value several times, and its most recent value, ‘baz’, is the one the functions return. Thus, all of the functions return ‘baz’. For performance reasons, you wouldn’t want the language to snapshot the value of every variable your functions access, but mentally, that’s what we picture happening when we write the code.
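Here is that late lookup in miniature:

x = 1
def f():
    return x  # looked up when f is *called*, not when it's defined

x = 2
print f()  # prints 2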

Is there a way to take a snapshot of variables we know will change, though? Default argument initializers let us do just that. Replace the program with the following and it does exactly what we want.

names = ['foo', 'bar', 'baz']
fns = []
for n in names:
    def fn(n=n): return n
    fns.append(fn)
result = []
for f in fns:
    result.append(f())
print result

And translating back into the original form with lambdas and list comprehensions…

names = ['foo', 'bar', 'baz']
fns = [lambda n=n: n for n in names]
print [f() for f in fns]

Tada! Lesson: functions create scopes (frames, in Python terminology); loops don’t. Therefore, if you create functions inside of loops where variables will change value, use default arguments to capture their values at function creation time.