#IND and #QNaN with /fp:fast

The other day Timothy and I were optimizing some floating-point-intensive lighting code. Looking at the generated code, I realized we weren’t compiling with /fp:fast. Due to the wonky state of floating point on 32-bit x86, Visual C++ frequently stores temporary results of floating point calculations to the stack and then reloads them, for the sake of consistent results.

See, the problem is that the floating point registers on x86 are 80 bits wide, so if you compile “float x, y, z, w; w = (x + y) * z” as…

fld [x]  ; ST0 = x
fadd [y] ; ST0 = x + y
fmul [z] ; ST0 = (x + y) * z
fstp [w] ; w = (x + y) * z

… the temporary results are always stored in ST0 with 80 bits of precision. However, since floats only have 32 bits of precision, you can wind up with different results depending on compilers, optimization settings, register allocation, etc. We often had problems like this at VRAC. Some poor engineering student would send out a panicked e-mail at 9:00 p.m. asking why his program started producing different results in release mode than it did in debug mode.

Thus, Visual C++ takes a more cautious approach. By default, it stores float intermediates back to memory to truncate them to 32 bits of precision:

fld [x]
fadd [y]
fstp [temp] ; truncate precision
fld [temp]
fmul [z]
fstp [w]
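The difference is easy to demonstrate without a compiler. Here's a small Python sketch (mine, not from the original routine; the names and values are illustrative) that mimics the two sequences above by optionally rounding the intermediate sum to 32 bits:

```python
import struct

def to_f32(x):
    # Round a 64-bit Python float to 32-bit precision, the way
    # "fstp dword ptr [temp]" followed by "fld [temp]" would.
    return struct.unpack('f', struct.pack('f', x))[0]

x, y, z = float(2 ** 24), 1.0, 1.0

# Intermediate kept at higher precision (like the first sequence):
wide = (x + y) * z
# Intermediate truncated to 32 bits (like the second sequence):
narrow = to_f32(to_f32(x + y) * z)

print(wide)    # 16777217.0
print(narrow)  # 16777216.0 -- 2**24 + 1 is not representable as a float32
```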

Tiny differences in precision don’t matter in IMVU, so enabling /fp:fast saved 50-100 CPU cycles per vertex in our vertex lighting loop. However, with this option turned on, our automated tests started failing with crazy #IND and #QNAN errors!

After some investigation, we discovered that our 4×4 matrix inversion routine (which calculates several 2×2 and 3×3 determinants) was using all 8 floating point registers with /fp:fast enabled. The x87 registers are stored in a “stack”, where ST0 is the top of the stack and STi is the i’th entry. Load operations like fld, fld1, and fldz push entries onto the stack. Arithmetic operations like fadd and fmul combine the top of the stack with a value in memory, storing the result back on the stack.

But what if the x87 register stack overflows? In this case, an “indefinite” NAN is loaded instead of the value you requested, indicating that you have lost information. (The data at the bottom of the stack was lost.) Here’s an example:

fldz  ; ST0 = 0
fld1  ; ST0 = 1, ST1 = 0
fldpi ; ST0 = pi, ST1 = 1, ST2 = 0
fldz
fldz
fldz
fldz
fldz  ; ST0-4 = 0, ST5 = pi, ST6 = 1, ST7 = 0
fldz  ; ST0 = IND!

Whoops, there’s a bug in your code! You shouldn’t overflow the x87 register stack, so the processor has given you IND.
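If it helps to see the rule in one place, here's a toy Python model of that behavior (my own sketch; the real overflow semantics involve tag words and status flags that the model ignores):

```python
import math

class X87Stack:
    # Toy model: eight registers, and a load past eight entries
    # produces the "indefinite" NaN instead of the requested value.
    def __init__(self):
        self.regs = []            # index 0 is ST0, the top of the stack

    def fld(self, value):
        if len(self.regs) == 8:
            self.regs[0] = float('nan')   # overflow: ST0 becomes IND
        else:
            self.regs.insert(0, value)

st = X87Stack()
for _ in range(8):
    st.fld(0.0)                   # eight pushes fill the stack
st.fld(math.pi)                   # the ninth push overflows
print(math.isnan(st.regs[0]))     # True
```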

Indeed, this is what happened in our matrix inversion routine. But why?

Using a debugger, we determined that the x87 stack contained one value at the start of the function. Moreover, it contained a value at the start of the test! Something was fishy. Somebody was leaving the x87 stack dirty, and we needed to find out who.

void verify_x87_stack_empty() {
    unsigned z[8];
    __asm {
        fldz
        fldz
        fldz
        fldz
        fldz
        fldz
        fldz
        fldz
        fstp dword ptr [z+0x00]
        fstp dword ptr [z+0x04]
        fstp dword ptr [z+0x08]
        fstp dword ptr [z+0x0c]
        fstp dword ptr [z+0x10]
        fstp dword ptr [z+0x14]
        fstp dword ptr [z+0x18]
        fstp dword ptr [z+0x1c]
    }

    // Verify bit patterns. 0 = 0.0
    for (unsigned i = 0; i < 8; ++i) {
        CHECK_EQUAL(z[i], 0);
    }
}

The previous function, called before and after every test, discovered the culprit: we had a test that intentionally called printf() and frexp() with NaN values, which had the side effect of leaving the floating point stack in an unpredictable state.

Adding __asm emms to the end of the test fixed our problem: thereafter, /fp:fast worked wonderfully. Case closed.

10 Pitfalls of Dirty Code

(Disclaimer: These are my opinions and not the opinions of IMVU or its founders. I’m sure we all have different perspectives.)

A History of IMVU’s Development Process

IMVU was started with a particular philosophy: we don’t know what
customers will like, so let’s rapidly build a lot of different stuff
and throw away what doesn’t work. This was an effective approach to
discovering a business: use a sequence of product prototypes to get
early customer feedback. The first version of the 3D IMVU client took
about six months to build, and as the founders iterated towards a
compelling user experience, the user base grew monthly thereafter.

This development philosophy created a culture around rapid prototyping
of features, followed by testing them against large numbers of actual
customers. If a feature worked, we’d keep it. If it didn’t, we’d
trash it.

It would be hard to argue against this product development strategy,
in general. However, hindsight indicates we forgot to do something
important when developing IMVU: When the product changed, we did not
update the code to reflect the new product, leaving us with piles of
dirty code.

Dirty Code

What makes code dirty? Really, anything that prevents people from making changes to the system. Examples include code with:

  • unclear or too many responsibilities,
  • overly complicated or obscure control flow,
  • concepts that don’t map to the domain,
  • too many dependencies,
  • global state,
  • or duplicated logic.

In short, if you hire someone who’s clearly smart and they say “I
can’t make sense of this”, then you probably have a dirty code
problem.

You’ll sometimes hear the metaphor of technical debt. Technical debt
is a way to think about the cost of introducing dirty code that
you’ll need to maintain in the future. Knowingly introducing dirty
code lets you quickly test a hypothesis or learn some information, so
you’re not investing in code you won’t need. For code you will need,
however, technical debt compounds on itself and rapidly becomes more
expensive than the original cost of fixing it.

Taking on technical debt can be the right decision, but it’s important
to remember that you’re introducing work for someone down the road.
Moreover, only programmers have a good grasp on the true cost of
technical debt. The business will never decide to refactor over
pursuing the next shiny project (they don’t have enough information).
Your engineering organization must be empowered to do what it thinks
best for the business and technology platform. If the term “technical
debt” ever shows up in a strategic plan, your engineering team has
failed to communicate the true costs of their work.

We used the technical debt metaphor quite a bit at IMVU, and, in
hindsight, it’s obvious that we underestimated the long-term costs by
at least an order of magnitude.

So what are the true costs of letting dirty code linger? Each of
these are drawn from real examples at IMVU.

Team Costs

  1. Dirty code does not scale to larger teams.

In a code base where modules and objects have unclear
responsibilities, programmers tend to implement features by modifying
code all over the system. This causes conflicts as multiple people
change the same files. Following the open-closed principle helps
here. Every object should do one thing well and have a clear
interface to the rest of the system. Ideally, each feature would fit
into its own object or set of objects, and plug into the system in a
standard way.

  2. Dirty code reduces team morale.

Most programmers I know get pleasure and validation by shipping code
that makes people happy. Any frustration that gets in the way of that
basic need reduces morale. If an improvement to the product seems
like it should be a three-hour task but takes two days of
investigation and pain, the programmer is unlikely to feel like they
can make a difference.

  3. Dirty code makes programmers slower.

If there are two systems for doing X, and you want to make an
improvement to the way X is done, you have to change both systems,
increasing effort and the risk of regression.

If the concepts in the domain don’t map to your objects, programmers
have to struggle to find the right place for new code.

If A and B are unrelated aspects of the system and the logic for A and
B is glommed together, changes to A involve understanding B too. The
more aspects are coupled together, the more the cost of changing each
of those aspects goes up.

  4. Dirty code inhibits the formation of an ownership culture.

When the code is too complicated for anyone to fit it in their heads,
programmers will tend to blame the legacy code or architecture for any
bugs or regressions that crop up. If they perceive it’s too expensive
to fix the architecture, they will not feel responsible if the product
ends up being low-quality. To build a sustainable, high-quality
product, the programmers ultimately need to feel responsible, and the
feedback loop between customers and programmers needs to be closed.

Product Costs

  5. If product concepts are not reflected in the code, programmers
    might implement features in ways that don’t make sense in the product.

To explain this, I’ll give an example from the IMVU client: The
business rules around product loading are complicated to begin with.
Worse, the code does not directly reflect said rules. Because of
this, our attempts to implement a better loading experience (including
progress bar and object prioritization) have failed multiple times,
and we still don’t have it quite right.

  6. Dirty code incentivizes the business to invest in tangential
    revenue work rather than attacking core business problems.

For most startups*, the primary product or core competency should
ultimately derive the most revenue. If management perceives it’s too
expensive to work on the core product, they will tend to fund
tangential work such as new payment methods or bizdev deals. There’s
a point where that kind of work makes sense (and pays for itself), but
the core offering should be the biggest lever, and the company should
rally around that.

* During Web 2.0, your mileage may vary.

Quality Costs

  7. Even with good automated test coverage, dirty code increases the
    risk of introducing regressions.

If a module has too many dependencies or responsibilities, changes to
it can have unintended consequences. Automated test coverage helps a
great deal, but think of test coverage as approximating (number of
tested conditions) / (number of possible conditions). The
combinatorial explosion of states in dirty code effectively reduces
the coverage of your automated (and manual!) tests, allowing
regressions to slip through to customers. Threads, if statements, and
nullable objects are all examples of constructs that cut effective
test coverage.
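To make that approximation concrete, here's a back-of-the-envelope Python sketch (the numbers are illustrative, not measured) of how independently varying conditions shrink the fraction of states a fixed test suite can cover:

```python
# Suppose each boolean condition (an if, a nullable, a thread
# interleaving) varies independently, and each test pins down exactly
# one combination of them.
tests = 50

for conditions in (3, 10, 20):
    states = 2 ** conditions
    coverage = min(1.0, tests / states)
    print('%2d conditions -> %7d states, coverage ~ %g'
          % (conditions, states, coverage))
```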

  8. Wide or unclear dependencies reduce the quality of tests.

In our experience, unit tests around dirty code tend to involve lots
of mocks, partial mocks, and monkey patching, reducing the likelihood
that tests will actually catch regressions. Worse, these tests are
more likely to fail after benign refactorings. As we refactor our
objects to better reflect the actual domain (updating the tests as we
go), our confidence in the tests improves and they become dramatically
easier to understand.

  9. Dirty code hides real bugs.

If the code is too complicated for the programmers to understand and
has too many possible states to effectively test, what makes you think
you can know whether it reliably works? In our experience, every
dirty module corresponds to an area of the product that our users
report as intermittently breaking. When refactoring those systems, we
inevitably discover race conditions, bugs in edge cases, and
performance problems.

  10. Dirty code gets dirtier.

Finally, if your team does not get used to improving dirty code, it
will only get worse. Eventually, the programmers (and maybe even
management) will start calling for a rewrite (a.k.a. resetting the
shitty counter). Software design and refactoring are skills that take
practice. Honing them will keep your business and technology nimble.

Final Thoughts

I’ve talked a lot about why dirty code is expensive, so you may be
asking “Well, what can I do about it?” First, try to pay attention to
your code. After you finish writing some, ask yourself “Could I make
this clearer?” Then ask your neighbor the same question. Beyond
that, here are some resources that will help you improve your design
skills:

It’s definitely possible to keep that feeling of ‘newness’ in your
code. Hopefully I’ve convinced you that the extra few hours or
days to clean up after every project easily pays for itself. Code is
the lifeblood of our industry – keep it clean!

Refactor first!

Test-driven development teaches the red-green-refactor
mantra, which works wonderfully when starting a project from scratch.
Chances are good, however, that you’ll spend 80% of your programming
career working in existing codebases. In existing code, I’ve found it
easier to refactor first until the desired behavior change is a
30-minute task. By refactoring first, you’ll…

  • better understand the code you’re about to change.
  • better understand the consequences of your change.
  • make it easier for the person reviewing your change to understand
    its impact. (It’s much easier to review several independent
    refactorings and one simple change than a nonobvious change followed
    by refactorings; or, worse, all of the above in one commit.)
  • reduce risk and save time by preventing unexpected bugs.

Red-green-refactor is great, but try refactoring first!

Finding the shortest path between two objects in Python

I’ve been doing a bit of memory profiling and debugging in the IMVU client lately, and we’re starting to collect a nice set of tools. Today, I’m adding a function that can find the shortest path from one Python object to another. No guarantees about performance or correctness, but it seems to work so far.
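Since this post only links to the code, here's a hedged sketch of the idea (a plain breadth-first search over gc.get_referents; the real findShortestPath handles more edge cases and filters out noise such as the search's own bookkeeping):

```python
import gc
from collections import deque

def find_shortest_path(src, dst):
    # BFS over the object reference graph. The parents map records how
    # we reached each object; objs keeps them alive so ids stay stable.
    parents = {id(src): None}
    objs = {id(src): src}
    queue = deque([src])
    while queue:
        obj = queue.popleft()
        if obj is dst:
            # Walk the parent chain back to src and reverse it.
            path, key = [], id(obj)
            while key is not None:
                path.append(objs[key])
                key = parents[key]
            return list(reversed(path))
        for ref in gc.get_referents(obj):
            if id(ref) not in parents:
                parents[id(ref)] = id(obj)
                objs[id(ref)] = ref
                queue.append(ref)
    return None

a = {'child': [1, 2]}
path = find_shortest_path(a, a['child'])
print(len(path))  # 2: the dict, then the list
```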

The findShortestPath function.

Its tests.

Update: Fixed some ordering bugs in longer paths and optimized it a tad.

Reporting Crashes in IMVU: Catching Python Exceptions

For years now, I have been meaning to write a series of articles on the automated crash reporting system in the IMVU client. This first article will give a bit of background on the structure of the client and show how we handle Python exceptions.

At IMVU, we generally subscribe to the Fail Fast philosophy of handling errors: when the client encounters an unexpected error, we immediately crash the program and ask the user to submit a crash report. As part of the crash report, we send log files, stack traces, system information, and anything else that might help us debug the failure.

You might wonder why we crash the program whenever anything goes wrong rather than trying to catch the error and continue running. Counterintuitively, crashing the program forces us to act on crashes and immediately exposes bugs that might trigger unwanted behavior or lost data down the road.

Now let’s talk a little bit about how the client is structured. The IMVU client is written primarily in Python, with time-critical components such as the 3D renderer written in C++. Since the client is a cross between a normal interactive Windows program and a real-time game, the main loop looks something like this:

def main():
    while running:
        pumpWindowsMessages() # for 1/30th of a second
        updateAnimations()
        redrawWindows()

This structure assumes that no exceptions bubble into or out of the main loop. Let’s imagine that updateAnimations() has a bug and occasionally raises an uncaught exception. If running the client with a standard command-line python invocation, the program would print the exception and stack trace to the console window and exit. That’s all great, but our users don’t launch our client by invoking python from the command line: we use py2exe to build a standalone executable that users ultimately run. With an unmodified py2exe application, uncaught exceptions are printed to sys.stderr (as above), except there is no console window to display the error. Thus, the py2exe bootstrap code registers a handler so that errors are logged to a file, and when the program shuts down, a dialog box shows something like “An error has been logged. Please see IMVUClient.exe.log.”

From a crash reporting standpoint, this is not good enough. We can’t be asking our users to manually hunt down some log files on their hard drives and mail them to us. It’s just too much work – they will simply stop using our product. (Unfortunately, most of the software out there asks users to do exactly this!) We need a way for the client to automatically handle errors and prompt the users to submit the reports back to us. So let’s rejigger main() a bit:

def mainLoop():
    while running:
        pumpWindowsMessages()
        updateAnimations()
        redrawWindows()

def main():
    try:
        mainLoop()
    except:
        error_information = sys.exc_info()
        if OK == askUserForPermission():
            submitError(error_information)

This time, if a bug in updateAnimations() raises an exception, the top-level try: except: clause catches the error and handles it intelligently. In our current implementation, we post the error report to a Bugzilla instance, where we have built custom tools to analyze and prioritize the failures in the field.

This is the main gist of how the IMVU client automatically reports failures. The next post in this series will cover automatic detection of errors in our C++ libraries.

Open sourced our pstats viewer!

IMVU has benefited greatly from open source software. Large components of our client and website are built on top of various open source projects. In fact, you can see the list of software on our technology page.

Well, I’m happy to announce that we’re beginning to contribute source code back! We have created a project on SourceForge and released our first standalone tool: pstats_viewer.py. pstats_viewer is a tool for browsing the data stored in a .pstats file generated by the Python profiler.

Tasks and Futures and Concurrency in the IMVU Client

The IMVU client has a platform for developing asynchronous or long-running operations that we call the task system. The following describes why we built it. (Chris finally convinced me to post this information to the web.)


Most programs that we write can be split into two categories. There are short-lived programs like gzip that take command-line arguments and files as inputs and write output to other files or stdout.

There are also longer-lived programs that run indefinitely, handling events as they arrive. Think of MySQL or memcached: they take inputs from the network, time, and files, doing work and spitting output back onto the network or the file system.

The operating system kernel is one of those event-driven, long-lived programs. It schedules the execution of subsidiary short- and long-lived programs.

We can certainly map from one type to the other. Take PHP, for example. It runs within the context of a long-running server process, but the programming model for a single page is a lot closer to gzip’s. The program takes inputs (session data, request variables) and outputs HTML.

You can also think of a web service as a set of computers all running a bunch of short-lived processes that combine into one large program. PHP is the CPU, memcached is memory, the database is the file system, etc.

Desktop applications, including the IMVU client and other “rich internet clients”, are structured a lot like servers: they are long-lived programs that respond to events from the keyboard, mouse, time, the windowing system, and network requests, and interact with their users via graphics, sound, files, and the network.

The key notion here is that these are abstractions on top of an enormous state machine known as your processors and memory.


Now let’s talk a little bit about concepts and modeling. When you come into a code base, say, our client, you expect to see structures that look and behave like concepts you know: avatars, chat windows, buddy list, inventory, products from the catalog, etc. This is one reason OOP is so powerful – it’s an effective method for describing concepts and the way they interact.

Now let’s talk about a simpler concept. One so ingrained you probably don’t even think about it. Functions (in the imperative sense). Functions are an abstraction on top of the state machine that is your processor. We’re all very familiar with functions: “This operation takes X and Y as inputs, does steps A, B, and C, and returns Z.” Even if you throw in a few branches and loops and call some other functions, we’re good at following the logic. For example:

def _readNumber(self):
    isfloat = False
    result = self._next()
    peek = self._peek()
    while peek is not None and (peek.isdigit() or peek == "."):
        isfloat = isfloat or peek == "."
        result = result + self._next()
        peek = self._peek()
    try:
        if isfloat:
            return float(result)
        else:
            return int(result)
    except ValueError:
        raise ReadException, "Not a valid JSON number: '%s'" % result

This is a fairly elaborate function, but we have no problem following it. Still, it’s an abstraction on top of the processor and memory: it’s compiled by Python into bytecode, executed by the interpreter loop, which is written in C, which is compiled to x86 assembly, which is being converted by the processor into micro-ops… The point here is that we can build our own abstractions if they make us more effective at doing our jobs.

There is a concept in the IMVU client that may be unfamiliar to some of you: long-running processes. These are like functions, except that many can be running at the same time, and they can be blocked, waiting for results from the network or disk. Or perhaps just waiting for some time to elapse. Pseudocode for one of these processes might look like this:

def loadAvatar(chatId, userId):
    avatarInfo = getAvatarInfo(userId)
    if failure, show dialog box and stop loading
    products = getCurrentOutfit(chatId, userId)
    for p in products:
        loadProduct(p)
    loadAvatarIntoScene(products) # this might take a while

In fact, these processes can be fairly easily implemented on top of threads, and many people do it that way, to varying degrees of success.

However, threads have several disadvantages:

  • Each thread (on CPython/Windows) requires 1 MB of memory for the stack (plus any kernel objects), limiting you to at most 2000.
  • Because threads are scheduled by the operating system, they’re prone to nondeterminism and race conditions. In our case, threads are a large source of intermittent test failures and real bugs.
  • Threads in CPython don’t actually buy you any concurrency because of the global interpreter lock, so you might as well just run everything on the main thread.
  • Threads are GC roots and must be explicitly cleaned up.
  • You have to be extremely careful when interacting with the windowing system or any third-party system from threads. Many APIs are not thread safe.

Another way to manage these long-running operations is to build an explicit state machine and periodically pump it; on each pump, the machine checks its state and updates it if possible. This is generally how interactive games work, because they can get away with using up all of the CPU and having a fixed frame rate. Much of the code in the IMVU client uses this model, but we’ve found it 1) hard to understand the interactions between states and 2) inefficient. Most of the time you’re _not_ updating the state of a particular object, so you don’t need to check every single frame.

For example, some of our most avid users have thousands of items in their inventory, and our old inventory UI walks that entire list 30 times per second, asking “does this one have any events to process? does this one have any events to process? does this one have any events to process?” This takes up to 10% of the CPU on the user’s computer without a 3D window open!

Finally, this approach generally makes the application feel sluggish: the latency between an event (say, a mouse click) reaching the client and being processed depends on the rate at which the application is pumped, rather than on the client immediately handling messages as they come in.

A third technique is what you might call continuation-passing style, or “onComplete callbacks”: any function that can be asynchronous takes an onComplete callback that it invokes (with the result or error) when the operation completes. The problem with this style is that it hearkens back to the days of goto, with spaghetti callbacks and non-linear control flow. Moreover, if you accidentally forget to call an onComplete, or you call it twice, you’ve just introduced a very hard-to-find bug. We have had both types of bugs in our client multiple times.


We did not find any of the above approaches particularly compelling, so we built the task system on top of Python’s bidirectional generators. First, I’ll enumerate the advantages:

  • Tasks are a lightweight resource (sorta) – you can start as many as you’d like.
  • Tasks look like normal functions with ‘yield’ statements sprinkled in.
  • There’s no chance you’ll forget to call onComplete or call it twice.
  • Exceptions propagate automatically — with stack traces!
  • Tasks can be implicitly cancelled by releasing the reference to their result, so you won’t leak them the way it’s so easy to do with threads.
  • All tasks run interleaved on the main thread, so you don’t have to worry about most kinds of race conditions or calling into external libraries.
  • Tasks run as part of the Windows event loop, which means that popping up modal dialog boxes or modal menus does not block them from executing, keeping the application responsive.

Here’s what the hypothetical example from above would look like as a task:

@task
def loadAvatar(chatId, userId):
    try:
        avatarInfo = yield getAvatarInfo(userId)
    except Exception:
        showErrorDialog()
        return

    products = yield getCurrentOutfit(chatId, userId)
    for p in products:
        try:
            yield loadProduct(p)
        except Exception:
            pass # ignore products that fail to load?
    yield loadAvatarIntoScene(products)

At every ‘yield’, other tasks get a chance to run, and if that task does not return immediately, then the task is paused until the asynchronous call is done. getAvatarInfo, for example, will contact the server in order to fetch the avatar name. Tasks can yield on other task calls, suspending their execution until the subsidiary call returns a value or raises an exception. (Sound familiar?)
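To give a flavor of the mechanics, here is my simplified sketch of that suspend-and-resume behavior as a trampoline over generators, written in modern Python (where generators can return values); it is not IMVU's task system, which interleaves tasks on the Windows event loop rather than stepping subtasks to completion synchronously:

```python
def run_task(gen, to_send=None):
    # Drive one generator-based task to completion. Yielding another
    # generator suspends the caller until that subtask finishes; the
    # subtask's return value is sent back in at the yield point.
    while True:
        try:
            yielded = gen.send(to_send)
        except StopIteration as stop:
            return getattr(stop, 'value', None)
        if hasattr(yielded, 'send'):      # a subtask: run it first
            to_send = run_task(yielded)
        else:                             # a plain value: echo it back
            to_send = yielded

def getAvatarInfo(userId):
    # Hypothetical stand-in for the real asynchronous getAvatarInfo.
    yield                                 # pretend to wait on the network
    return {'name': 'avatar-%d' % userId}

def loadAvatarName(chatId, userId):
    avatarInfo = yield getAvatarInfo(userId)
    return avatarInfo['name']

print(run_task(loadAvatarName(1, 7)))     # avatar-7
```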

Tasks can start other tasks, with the admittedly strange “yield Start(task)” syntax which gives back a Future object. Think of futures as values that may not have materialized yet. You can later wait for the result of a future by yielding on it.

@task
def printFuture(future):
    # this will throw an exception if the future
    # contains an error
    result = yield future
    print result

@task
def printAvatarInfo():
    avatarInfo = yield Start(getAvatarInfo(userId))
    print 'started fetching avatarInfo'
    yield printFuture(avatarInfo)

You can use futures to start several tasks in parallel and then wait on all of them, making efficient use of parallel resources. Some examples with real code might help. Unfortunately you can’t run them, as we have not (yet) open sourced the task system.

Today, much of the client’s code uses threads, polling, and onCompletes, but we’re slowly converting things to tasks, usually to fix bugs. I hope this means that the quality of the client from a user’s perspective will continue to improve and engineers will be able to work in the code more easily.

Quality With A Name

As IMVU and its engineering team have grown, we’ve encountered an increasing amount of overhead while implementing features that touch the original code for the client and web site (you might even call some of it prototype code). This effort goes back to when I started: most of my early days at IMVU involved fixing bugs, adding test coverage, and making small changes to the product. There were a lot of things about the code that I didn’t particularly like, but I chalked that up to its being the first time I’d ever had to deal intimately with a large code base I did not write. That doesn’t mean I didn’t try to improve the things I found; in fact, it’s been an excellent exercise in articulating engineering practices I’ve internalized so deeply that I take them for granted.

At IMVU, we practice test driven development, which means that we always* add a unit test exposing a bug in the system before fixing the bug. In practice, this means we should never see the same bug twice. Some areas of the code have seen more love than others, of course.

Recently, Andy has been trying to implement a tiny new feature in the buddy window. It’s the kind of change that should only take 15 minutes to write the tests and maybe another 15 to implement. (And we wouldn’t have to worry about other features breaking, or other changes breaking this feature, and all the other TDD goodness.) That is, if we have good tests in that area of the code. But we don’t, so it’s not that easy. So he gets to be the lucky engineer adding tests to this module, which involves a lot of refactoring. The code’s style has a lot of internal consistency, and there definitely is a design, but adding tests around it has proven to be a huge time sink. So we’ve been trying to articulate why the code is such a pain. But James Shore explains it exactly:

“…the goodness of a design is inversely proportional to the cost of change.”

And there we go. That sums it up.

* Ideally. We’re always improving.

Agile Databases

There is a subgroup of the agile community focused on agile database development and database refactoring. Just do a search. At IMVU, we definitely do agile development of our databases and the application layer code that talks to them. We expect most new engineers to be capable of writing a database schema and pushing it to production within a week of starting, with automated tests that let us change things without fear of breakage, of course. We have no traditional DBAs, although one of our engineers acts as one, reviewing schemas before they’re applied in production.

The database refactoring resources that we have seen have been deficient in one area: they assume that you have small amounts of data, or the ability to take down your site or service so that you can apply schema changes (which lock tables until the change completes). The problem that we have not seen addressed yet is how to change table structure when you have millions of customers and heavy usage. Each hour the site is down translates into lost revenue. Yeah, there are ways to get around this, such as applying schemas to slaves and then failing the master over to a slave, but these require a fair amount of infrastructure and pain to support.

But if I could choose one thing to get from “the agile database community”, it would be a modification to MySQL that let you add or remove indices or columns in the background, even if that particular table was under heavy use, and even if the table had degraded performance during the alter. Maybe it could be implemented as a table with special metadata that shows how to get the data from the old table if it does not exist in the new one. And maybe Oracle supports this already, and I just want a new feature in MySQL.

Update: It turns out that if you have partitioned your customers across enough databases, and have some spare capacity, you can apply schema changes across your databases in parallel, and it’s not so bad. So maybe that’s how to be agile and have users. :)