Writing Solid PHP

I borrowed Eric’s Writing Solid Code by Steve Maguire, because I’ve always wanted to read it. It’s one of those software development classics, etc. At a high level, it describes Microsoft’s approach to writing bug-free code by using a set of techniques that catch bugs as soon as possible — before they reach the source repository, Testing, or, heaven forbid, the customer.

At first, I was surprised at how old it was — 1993! Before Win95 even! Then, it seemed quite out of date. It mostly talks about programming in C on some old, anemic platforms without the niceties that we take for granted, such as hardware faults when dereferencing invalid pointers. So, of course, I’m trying to take higher-level lessons from it, such as the approach he’s taken to develop these bug-eliminating techniques, and then apply that approach to the environment in which I work. Unfortunately, a lot of the techniques have to do with low-level C, where you don’t have a powerful and safe type system or GC to eliminate whole classes of errors. Most of the specific bugs he’s mentioned wouldn’t even arise in Haskell, Python, Java, or C#. Also, this book was written before unit testing and test-driven development were really commonplace, and I think that these supplant some of the techniques Steve mentions. (Not all; there is some overlap between test-driven development and writing bug-free code, but neither TDD nor after-the-fact comprehensive tests give you license to write sloppy code.)

And as I’m thinking about how spoiled we are with our strong typing, garbage collection, and hardware exceptions, I realized that I program every day in an unsafe language that tries as hard as it can to hide errors from you, just like the microcomputers of yore. PHP. I don’t think there’s been a day yet where learning about a PHP design decision hasn’t made me think “WTF”. I mean, why do you need a rich collection of container types such as in STL or java.util? PHP’s array does it all! If you pass too few arguments to a PHP function, that’s okay, the missed ones will just be null. If you pass too many, no problem! They’ll be ignored. Try to append something to an array that _doesn’t even exist_, no problem! It’ll magically create the array! I could go on and on. Anyway, the language actively prevents you from finding bugs.

So, from here on out, I’m going to read the book, trying to apply every technique to PHP. I think there is a lot we can learn. Here’s the first technique that would catch bugs in our code that popped into my head:

function foo($a, $b) {
  tep_assert(count(func_get_args()) == 2);
...

Another one: null, true, and false are considered ‘magic numbers’ without immediately visible context. What does…

updateProgress(true);

…mean? Compare that to…

updateProgress(showNewMessage=true);

It’s well-known that true and false make bad function parameters, but I’d never thought about ‘null’ being a magic number, however. And it is, for the same reasons as true and false above. When you see code like ‘new DialogBox(null)’, what does that null represent?

I’m sure more will come…

Download IMVU’s Cal3D modifications

Hi all,

IMVU uses a lot of open source software on both our client software and the web site (frontend and backend). One of the packages we use is Cal3D, a 3D skeletal animation system. We’ve made a few changes to Cal3D over the last year or two, including support for morph targets, exporter UI improvements, an improved animation scheduler, vertex colors, and others. In the past, we’ve published our changes directly back to the Cal3D project, but this wasn’t strictly in accordance with the LGPL, under which Cal3D is licensed. So now you can easily download our version of Cal3D directly from our technology page!

Enjoy!

Chad

Postmortem of a crash.

We’re getting to the point in our development at IMVU where we’re starting to have to fix some of the harder, more obscure, crash bugs. I spent the last week on one, and I wanted to write about what it was and the process we used to find it.

First, IMVU has a very good automated bug reporting system. Crashes, unclean shutdowns, and a couple other types of errant behavior get automatically reported and entered into our bug database. Most crashes (read: exceptions) are reported via our normal crash handling logic, along with some basic system information and both Python and C++ stack traces. There are some scenarios, however, where our normal mechanisms do not get a chance to handle the error. In these cases, we’ve got a “last-chance handler” which writes the exception information to a file, which is then submitted as part of the next report. The biggest omission from our automated reporting is deadlock detection.

Dusty discovered a deadlock in Python’s logging module:

import logging
logger = logging.getLogger('logger')
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.INFO)

class C:
    def __str__(self):
        logger.info('C.__str__')
        return 'C'

logger.info('test')
logger.info(C())

Logging from an object that’s being logged tries to acquire a non-recursive lock twice on the same thread. First rule of avoiding deadlocks: never call back into user code while a lock is held. In this simple case, a recursive lock would have sufficed. We decided to simply disable the logging.Handler lock; after all, garbled log file output isn’t so bad, and it’s better than deadlocking.

def nullfunc(*args): pass
logging.Handler.acquire = nullfunc
logging.Handler.release = nullfunc

That seemed to fix the problem. Except for the intermittent crash reports we started getting (to be specific, last chance crashes), inside of the CRT via Python’s file functions. The exception was a null pointer access and the stack looks like:

PYTHON23!0x000baa00 -                         - PyFile_Type
PYTHON23!0x0003ac27 -                         - PyFile_SetEncoding
  MSVCRT!0x00030a06 -                         - clearerr
   ntdll!0x00001010 -                         - RtlEnterCriticalSection

Well that’s weird. PyFile_Type isn’t a function… and PyFile_SetEncoding doesn’t call clearerr as far as I can tell. But a test program that calls clearerr(NULL) does indeed raise the same exception. But why would Python call clearerr on a null FILE* pointer?

It was around this time I noticed the crashes happened during log rollover. We use Python’s RotatingFileHandler to rotate through a fixed set of fixed-size log files. Things started to click in my mind, and after a few lucky google searches, I found out that Python file objects (PyFileObject) are not thread-safe. There was some discussion on Python-dev, but no actionable conclusion was reached. The long and short is that if you have two threads working on the same file at the same time, the results are undefined. More specifically, if one thread closes a file while another tries to write, you can get into a situation where f->f_fp is NULL, and clearerr(NULL) raises an exception. (Ah ha!)

But why are two threads working on a file object at the same time? Oh right! We disabled the logging synchronization earlier… so if one thread decides it’s time to rollover, it starts closing files and doing its internal shuffling, all while another thread is trying to output log statements. Oops! The “proper” fix involves reaching into the deadlock-prone logging module and replacing its nonrecursive lock with the recursive equivalent.

Lesson learned: it’s very hard to understand the consequences of changes to synchronization code. Also, without our crash reporting system, we would have never found this problem.

Guess the output of this code, answers.

My previous post showed a bit of code that demonstrates an actual bug I ran into this week. Here is the code again:

# Python
names = ['foo', 'bar', 'baz']
fns = [lambda: n for n in names]
print [f() for f in fns]

Here’s the output:

['baz', 'baz', 'baz']

Let’s try to figure out why. First factor out the list comprehensions:

names = ['foo', 'bar', 'baz']
fns = []
for n in names:
    fns.append(lambda: n)
result = []
for f in fns:
    result.append(f())
print result

Same result. Good. Now let’s factor out the lambda (anonymous function) and give it an explicit name.

names = ['foo', 'bar', 'baz']
fns = []
for n in names:
    def fn(): return n
    fns.append(fn)
result = []
for f in fns:
    result.append(f())
print result

Still the same result. The reason this program outputs what it does might be becoming clearer… Let’s think about what a for loop does again. “for N in LS” is translated into something like “if LS has anything, assign N to the first item in LS and run the loop body. if there is another item in LS, assign N to that item and run the body again. repeat.” That assignment to N happens in the current local scope.

Okay, getting closer… Inside a function created inside of a scope, variable lookups start in the current local scope; if the lookup fails, it continues to the next outer scope. And so on all the way up to the global scope. (If it’s not there, you get a NameError.) This lets you create inner functions that use variables from outside, such as globals. The important thing to note here is that these lookups only happen when you actually try to use a variable. So let’s see what’s happening here. The program builds a list of functions which return a variable n. Since n is not in the functions’ scopes, it is pulled from the outer scope (in this case, the globals). By the time the functions are called, however, n has changed value several times. Its most recent value is the one the functions return; in this case, ‘baz’. Thus, all of the functions return ‘baz’. For performance reasons, you wouldn’t want to take a snapshot of the values of every variable your functions access, but I think mentally that’s what we think is happening when we write the code.

Is there a way to take a snapshot of variables we know will change, though? Default argument initializers let us do just that. Replace the program with the following and it does exactly what we want.

names = ['foo', 'bar', 'baz']
fns = []
for n in names:
    def fn(n=n): return n
    fns.append(fn)
result = []
for f in fns:
    result.append(f())
print result

And translating back into the original form with lambdas and list comprehensions…

names = ['foo', 'bar', 'baz']
fns = [lambda n=n: n for n in names]
print [f() for f in fns]

Tada! Lesson: functions create scopes (or frames in Python terminology). loops don’t. Therefore, if you create functions inside of loops where variables will change value, use default arguments to capture their values at function creation time.

Kinesis keyboards worked better in my mind.

I tried a Kinesis keyboard, as I’ve always kind of wanted to buy one. I expected to take longer to get used to typing on it, but the motion is quite natural and I adjusted rapidly. But after an hour, the honeymoon is over… and for surprising reasons:

  • Control key is too far away.
  • Said control key doesn’t always work.
  • Windows key is the “context menu key” that nobody actually uses.
  • Said useless Windows key is easier to get to than Control and Alt.
  • Alt-tab, Control-C, and Control-V are ridiculously difficult to type.
  • No option to purchase with Dvorak key faces.
  • { and } are on opposite sides of the keyboard when using the Windows Dvorak keymap.

They still sell for a pretty penny on eBay though…