Tasks and Futures and Concurrency in the IMVU Client

The IMVU client has a platform for developing asynchronous or long-running operations that we call the task system. The following describes why we built it. (Chris finally convinced me to post this information to the web.)

Most programs that we write can be split into two categories. There are short-lived programs like gzip that take command-line arguments and files as inputs and write output to other files or stdout.

There are also longer-lived programs that run indefinitely, handling events that happen to them. Think about MySQL or memcached: they take inputs from the network, time, and files, doing work and spitting output back onto the network or the file system.

The operating system kernel is one of those event-driven, long-lived programs. It schedules the execution of subsidiary short- and long-lived programs.

We can certainly map from one type to another. Take PHP, for example. It runs within the context of a long-running server process, but the programming model for a single page is a lot closer to gzip's: the program takes inputs (session data, request variables) and outputs HTML.

You can also think of a web service as a set of computers all running a bunch of short-lived processes that combine into one large program. PHP is the CPU, memcached is memory, the database is the file system, etc.

Desktop applications, including the IMVU client and other “rich internet clients”, are structured a lot like servers: they are long-lived programs that respond to events from the keyboard, mouse, time, the windowing system, and network requests… and interact with their users via graphics, sound, files, and the network.

The key notion here is that these are abstractions on top of an enormous state machine known as your processors and memory.

Now let’s talk a little bit about concepts and modeling. When you come into a code base, say, our client, you expect to see structures that look and behave like concepts you know: avatars, chat windows, buddy list, inventory, products from the catalog, etc. This is one reason OOP is so powerful – it’s an effective method for describing concepts and the way they interact.

Now let’s talk about a simpler concept. One so ingrained you probably don’t even think about it. Functions (in the imperative sense). Functions are an abstraction on top of the state machine that is your processor. We’re all very familiar with functions: “This operation takes X and Y as inputs, does steps A, B, and C, and returns Z.” Even if you throw in a few branches and loops and call some other functions, we’re good at following the logic. For example:

def _readNumber(self):
    isfloat = False
    result = self._next()
    peek = self._peek()
    while peek is not None and (peek.isdigit() or peek == "."):
        isfloat = isfloat or peek == "."
        result = result + self._next()
        peek = self._peek()
    try:
        if isfloat:
            return float(result)
        else:
            return int(result)
    except ValueError:
        raise ReadException("Not a valid JSON number: '%s'" % result)

This is a fairly elaborate function, but we have no problem following it. Still, it’s an abstraction on top of the processor and memory: it’s compiled by Python into bytecode, executed by the interpreter loop, which is written in C, which is compiled to x86 assembly, which is being converted by the processor into micro-ops… The point here is that we can build our own abstractions if they make us more effective at doing our jobs.

There is a concept in the IMVU client that may be unfamiliar to some of you: long-running processes. These are like functions, except that many can be running at the same time, and they can be blocked, waiting for results from the network or disk. Or perhaps just waiting for some time to elapse. Pseudocode for one of these processes might look like this:

def loadAvatar(chatId, userId):
    avatarInfo = getAvatarInfo(userId)
    # if that failed, show a dialog box and stop loading
    products = getCurrentOutfit(chatId, userId)
    for p in products:
        loadProduct(p) # this might take a while
    loadAvatarIntoScene(products)

In fact, these processes can be fairly easily implemented on top of threads, and many people do it that way, to varying degrees of success.
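For illustration, here is a minimal sketch of running such a process on threads. All the function names are hypothetical stand-ins, not IMVU's actual code, and the "network call" returns immediately where a real client would block:

```python
import threading

def fetch_avatar_info(user_id):
    # hypothetical stand-in for a blocking network call
    return {"user_id": user_id, "name": "avatar-%d" % user_id}

def load_avatar(user_id, results):
    # runs on its own thread; blocks wherever the real client
    # would wait on the network or disk
    results[user_id] = fetch_avatar_info(user_id)

results = {}
threads = [threading.Thread(target=load_avatar, args=(uid, results))
           for uid in (1, 2, 3)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # the caller must remember to join every thread
```

Each concurrent operation gets a full OS thread and stack, which is exactly where the disadvantages below come from.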

However, threads have several disadvantages:

  • Each thread (on CPython/Windows) requires 1 MB of memory for the stack (plus any kernel objects), limiting you to at most about 2,000 threads.
  • Because threads are scheduled by the operating system, they’re prone to nondeterminism and race conditions. In our case, threads are a large source of intermittent test failures and real bugs.
  • Threads in CPython don’t actually buy you any concurrency because of the global interpreter lock, so you might as well just run everything on the main thread.
  • Threads are GC roots and must be explicitly cleaned up.
  • You have to be extremely careful when interacting with the windowing system or any third-party system from threads. Many APIs are not thread safe.

Another way to manage these types of long-running operations is to build an explicit state machine and periodically pump it; on each pump, the machine checks its inputs and updates its state if it can. This is generally how interactive games work, because they can get away with using all of the CPU and running at a fixed frame rate. Much of the code in the IMVU client uses this model, but we’ve found it 1) hard to understand the interactions between states and 2) inefficient.

Most of the time you’re _not_ updating the state of a particular object, so you don’t need to check every single frame. For example, some of our most avid users have thousands of items in their inventory, and our old inventory UI walks that entire list 30 times per second, asking “does this one have any events to process? does this one have any events to process? does this one have any events to process?” This takes up to 10% of the CPU on the user’s computer without a 3D window open!

Finally, this approach generally makes the application feel sluggish: the latency between an event (say, a mouse click) being sent to the client and it being processed depends on the rate at which the application is pumped, rather than the client handling messages immediately as they come in.
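To make the cost of per-frame polling concrete, here is a minimal sketch of the pumped-state-machine model. The names (InventoryItem, pump) are hypothetical, not IMVU's actual code:

```python
class InventoryItem(object):
    def __init__(self, name):
        self.name = name
        self.pending_events = []

    def pump(self):
        # called every frame, whether or not anything happened
        handled = len(self.pending_events)
        self.pending_events = []
        return handled

items = [InventoryItem("item-%d" % i) for i in range(1000)]
items[42].pending_events.append("clicked")

# one frame of the main loop: walk the entire list even though
# only one item actually has work to do
handled = sum(item.pump() for item in items)
```

One click means one event handled, but a thousand pump calls, thirty times a second.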

A third technique is what you might call continuation-passing style, or “onComplete callbacks”. Basically, any function that can be asynchronous takes an onComplete callback, which it invokes (with the result or error) when the operation completes. The problem with this style is that it hearkens back to the days of goto, with spaghetti callbacks and non-linear control flow. Moreover, if you accidentally forget to call an onComplete, or call it twice, you’ve just introduced a very hard-to-find bug. We have had both of these types of bugs in our client, multiple times.
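A tiny sketch of the onComplete style (with hypothetical function names, and synchronous stand-ins for the asynchronous calls) shows how quickly the control flow starts to nest:

```python
def get_avatar_info(user_id, on_complete):
    # pretend this is asynchronous: invoke the callback with the result
    on_complete({"user_id": user_id})

def get_current_outfit(user_id, on_complete):
    on_complete(["hat", "shoes"])

def load_avatar(user_id, on_complete):
    # each step nests another callback; control flow no longer reads
    # top to bottom, and forgetting (or double-calling) any
    # on_complete silently breaks the chain
    def got_info(info):
        def got_outfit(products):
            on_complete((info, products))
        get_current_outfit(user_id, got_outfit)
    get_avatar_info(user_id, got_info)

result = []
load_avatar(7, result.append)
```

Two sequential steps already need two levels of nesting; add error handling and loops and the spaghetti compounds.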

We did not find any of the above approaches particularly compelling, so we built the task system on top of Python’s bidirectional generators. First, I’ll enumerate the advantages:

  • Tasks are a lightweight resource (sorta) – you can start as many as you’d like.
  • Tasks look like normal functions with ‘yield’ statements sprinkled in.
  • There’s no chance you’ll forget to call onComplete or call it twice.
  • Exceptions propagate automatically — with stack traces!
  • Tasks can be implicitly cancelled by releasing the reference to their result, so you won’t leak them, as is so easy to do with threads.
  • All tasks run interleaved on the main thread, so you don’t have to worry about most kinds of race conditions or calling into external libraries.
  • Tasks run as part of the Windows event loop, which means that popping up modal dialog boxes or modal menus does not block them from executing, keeping the application responsive.

Here’s what the hypothetical example from above would look like as a task:

def loadAvatar(chatId, userId):
    try:
        avatarInfo = yield getAvatarInfo(userId)
    except Exception:
        # show dialog box and stop loading
        return
    products = yield getCurrentOutfit(chatId, userId)
    for p in products:
        try:
            yield loadProduct(p)
        except Exception:
            pass # ignore products that fail to load?
    yield loadAvatarIntoScene(products)

At every ‘yield’, other tasks get a chance to run, and if that task does not return immediately, then the task is paused until the asynchronous call is done. getAvatarInfo, for example, will contact the server in order to fetch the avatar name. Tasks can yield on other task calls, suspending their execution until the subsidiary call returns a value or raises an exception. (Sound familiar?)
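This is not IMVU's actual scheduler, but the core mechanism can be sketched as a toy trampoline over generators (using Python 3 semantics, where `return` inside a generator carries a value out via StopIteration): yielding a sub-task runs it and sends its return value back in, and exceptions propagate into the caller via throw().

```python
def run(task):
    # minimal, synchronous trampoline: drive a generator-based task,
    # recursing into any sub-task it yields and sending the sub-task's
    # return value (or exception) back into the caller
    result = None
    error = None
    while True:
        try:
            if error is not None:
                exc, error = error, None
                yielded = task.throw(exc)   # deliver the failure at the yield
            else:
                yielded = task.send(result)
        except StopIteration as stop:
            return stop.value               # the task's return value
        if hasattr(yielded, "send"):        # a sub-task: run it to completion
            try:
                result = run(yielded)
            except Exception as e:
                result, error = None, e
        else:                               # a plain value: hand it right back
            result = yielded

# hypothetical leaf task; a real one would suspend on the network
def getAvatarInfo(userId):
    return {"name": "avatar-%d" % userId}
    yield  # unreachable, but makes this function a generator

def loadAvatar(userId):
    avatarInfo = yield getAvatarInfo(userId)
    return "loaded %s" % avatarInfo["name"]

result = run(loadAvatar(3))
```

A real scheduler would not recurse synchronously; it would park the task and resume it from the event loop when the sub-task's result arrives, which is what lets many tasks interleave.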

Tasks can start other tasks with the admittedly strange “yield Start(task)” syntax, which gives back a Future object. Think of futures as values that may not have materialized yet. You can later wait for the result of a future by yielding on it.

def printFuture(future):
    # this will throw an exception if the future
    # contains an error
    result = yield future
    print result

def printAvatarInfo(userId):
    avatarInfo = yield Start(getAvatarInfo(userId))
    print 'started fetching avatarInfo'
    yield printFuture(avatarInfo)

You can use futures to start several tasks in parallel and then wait on all of them, making efficient use of parallel resources. Some examples with real code might help. Unfortunately you can’t run them, as we have not (yet) open sourced the task system.
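Since we can't show the real code, here is a sketch of what a minimal Future might look like: a slot for a value (or error) that may not have materialized yet. The names and API are assumptions for illustration, not IMVU's actual implementation:

```python
class Future(object):
    # a value that may not have materialized yet
    _UNSET = object()

    def __init__(self):
        self._value = Future._UNSET
        self._error = None

    def set_result(self, value):
        self._value = value

    def set_error(self, exc):
        self._error = exc

    @property
    def ready(self):
        return self._error is not None or self._value is not Future._UNSET

    def result(self):
        if self._error is not None:
            raise self._error   # waiting on a failed future re-raises
        if self._value is Future._UNSET:
            raise RuntimeError("result has not materialized yet")
        return self._value

# fan out three "tasks", then gather all of their results;
# in the real system the scheduler fills futures in as tasks finish
futures = [Future() for _ in range(3)]
for i, f in enumerate(futures):
    f.set_result(i * 10)

total = sum(f.result() for f in futures)
```

In the task system, “yielding on a future” simply suspends the task until `ready` becomes true, so starting N tasks and then yielding on their futures overlaps all N operations.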

Today, much of the client’s code uses threads, polling, and onCompletes, but we’re slowly converting things to tasks, usually to fix bugs. I hope this means that the quality of the client from a user’s perspective will continue to improve and engineers will be able to work in the code more easily.

Tasks, object lifetimes, and implicit cancellation

Background: The IMVU client has a system that we call the ‘task system’ which works a lot like lightweight threads or coroutines. You can schedule up some work which gives you back a Future object on which you can either wait or request the actual result (if it has materialized, that is).

I just read Joe Duffy’s “IDisposable, finalization, and concurrency”. The parallels between TPL and IMVU’s task system surprised me.

When we built the task system, there were two ways to cancel execution of a task: explicit and implicit. You could either call future.cancel() or just release your reference to the future and wait for it to be collected by the GC. In either case, the scheduler would notice and stop running your task. I’ve often wondered if support for implicit task cancellation was a good design.

After reading Joe’s article, I’m convinced.  If you believe that resource utilization is also a resource, and that implicit resource disposal (garbage collection) is the right default, then implicit cancellation is the only sane default.  In fact, we recently removed support for explicit cancellation, because we haven’t even ‘really’ used it yet.  (Some of Dusty’s code used it, but he said it was fine to take it out.)
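A sketch of how implicit cancellation can work: the scheduler holds only a weak reference to each task's future, so when the caller drops the last strong reference, the future is collected (promptly under CPython's reference counting) and the scheduler culls the task. All names here are hypothetical, not IMVU's actual code:

```python
import weakref

class Future(object):
    pass  # stand-in for a real future object

class Scheduler(object):
    def __init__(self):
        self._tasks = []  # list of (weakref-to-future, generator)

    def start(self, task):
        future = Future()
        # hold the future only weakly: if the caller drops it,
        # the GC collects it and we stop running the task
        self._tasks.append((weakref.ref(future), task))
        return future

    def pump(self):
        live = []
        for fref, task in self._tasks:
            if fref() is None:
                task.close()      # future was collected: implicit cancel
                continue
            try:
                next(task)
                live.append((fref, task))
            except StopIteration:
                pass
        self._tasks = live

def ticker(log):
    while True:
        log.append("tick")
        yield

sched = Scheduler()
log = []
future = sched.start(ticker(log))
sched.pump()   # future is still referenced, so the task runs
del future     # release the only strong reference to the future
sched.pump()   # the task is implicitly cancelled; no more ticks
```

The promptness of `del future` taking effect relies on CPython's reference counting; under a purely tracing GC the cancellation would happen at some later collection.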

We may have to implement explicit cancellation again someday, but now I’m happy to wait until there is a compelling use case.