<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Chad Austin &#187; x86</title>
	<atom:link href="http://chadaustin.me/tag/x86/feed/" rel="self" type="application/rss+xml" />
	<link>http://chadaustin.me</link>
	<description></description>
	<lastBuildDate>Tue, 17 Aug 2010 08:51:43 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Reporting Crashes in IMVU: Who threw that C++ exception?</title>
		<link>http://chadaustin.me/2009/04/who-threw-that-exception/</link>
		<comments>http://chadaustin.me/2009/04/who-threw-that-exception/#comments</comments>
		<pubDate>Mon, 20 Apr 2009 03:32:38 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[crashes]]></category>
		<category><![CDATA[imvu]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://aegisknight.org/?p=1328</guid>
		<description><![CDATA[It&#8217;s not often that I get to write about recent work.  Most of the techniques in this series were implemented at IMVU years ago.  A few weeks ago, however, a common C++ exception (tr1::bad_weak_ptr) started intermittently causing crashes in the wild.  This exception can be thrown in a variety of circumstances, so [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s not often that I get to write about recent work.  Most of the techniques in this series were implemented at IMVU years ago.  A few weeks ago, however, a common C++ exception (<code>tr1::bad_weak_ptr</code>) started intermittently causing crashes in the wild.  This exception can be thrown in a variety of circumstances, so we had no clue which code was problematic.</p>

<p>We could have modified <code>tr1::bad_weak_ptr</code> so its constructor fetched a <code>CallStack</code> and returned it from <code>tr1::bad_weak_ptr::what()</code>, but fetching a <code>CallStack</code> is not terribly cheap, especially in such a frequently-thrown-and-caught exception.  Ideally, we&#8217;d only grab a stack after we&#8217;ve determined it&#8217;s a crash (in the top-level crash handler).</p>

<p>Allow me to illustrate:</p>

<pre>
void main_function(/*arguments*/) {
    try {
        try {
            // We don't want to grab the call stack here, because
            // we'll catch the exception soon.
            this_could_fail(/*arguments*/);
        }
        catch (const std::exception&amp; e) {
            // Yup, exception is fine.  Just swallow and
            // do something else.
            fallback_algorithm(/*arguments*/);
        }
    }
    catch (const std::exception&amp; e) {
        // Oh no! fallback_algorithm() failed.
        // Grab a stack trace now.
        report_crash(CallStack::here());
    }
}
</pre>

<p>Almost!  Unfortunately, the call stack generated in the catch clause doesn&#8217;t contain <code>fallback_algorithm</code>.  It starts with <code>main_function</code>, because the stack has already been unwound by the time the catch clause runs.</p>

<p>Remember the structure of the stack:</p>

<a href="http://aegisknight.org/wp-uploads/example_stack.png"><img src="http://aegisknight.org/wp-uploads/example_stack.png" alt="Example Stack" title="Example Stack" width="455" height="444" class="size-full wp-image-1337" /></a>

<p>We can use the <code>ebp</code> register, which points to the current stack frame, to walk and record the current call stack.  <code>[ebp+4]</code> is the caller&#8217;s address, <code>[[ebp]+4]</code> is the caller&#8217;s caller, <code>[[[ebp]]+4]</code> is the caller&#8217;s caller&#8217;s caller, and so on.</p>

<p>What can we do with this information?  Slava Oks at Microsoft <a href="http://blogs.msdn.com/slavao/archive/2005/01/30/363428.aspx">gives the clues we need</a>.  When you type <code>throw MyException()</code>, a temporary <code>MyException</code> object is constructed <em>at the bottom of the stack</em> and passed into the catch clause by reference or by value (as a copy deeper on the stack).</p>

<p>Before the catch clause runs, objects on the stack between the thrower and the catcher are destructed, and <code>ebp</code> is pointed at the catcher&#8217;s stack frame (so the catch clause can access parameters and local variables).</p>

<p>From within the outer catch block, here is the stack, <code>ebp</code>, and <code>esp</code>:</p>

<a href="http://aegisknight.org/wp-uploads/stack_in_catch.png"><img src="http://aegisknight.org/wp-uploads/stack_in_catch.png" alt="Stack From Catch Clause" title="Stack From Catch Clause" width="455" height="1168" class="size-full wp-image-1338" /></a>

<p>Notice that every time an exception is <em>caught</em>, the linked list of stack frames is truncated.  When an exception is caught, <code>ebp</code> is reset to the stack frame of the <em>catcher</em>, destroying our link to the thrower&#8217;s stack.</p>

<p>But there&#8217;s useful information between <code>ebp</code> and <code>esp</code>!  We just need to search for it.  We can find who threw the exception with this simple algorithm:</p>

<pre>
	For every possible pointer between ebp and esp,
	find the deepest pointer p,
	where p might be a frame pointer.
	(That is, where walking p eventually leads to ebp.)
</pre>
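
<p>To make that search concrete, here&#8217;s a minimal sketch that runs it against a toy model of the stack instead of real memory.  Everything in it is an assumption made for illustration: the <code>Stack</code> type, the use of vector indices as addresses, and the names <code>leads_to</code> and <code>find_deepest_frame</code> are mine, and it is not the code linked below.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: the stack is a vector of 32-bit words and an
// "address" is an index into it.  Lower indices are deeper, because the
// stack grows downward.  A frame at index f stores the caller's frame
// pointer in stack[f] and the return address in stack[f + 1].
using Stack = std::vector<std::uint32_t>;

// True if following saved frame pointers from p eventually reaches ebp.
// Each link must move strictly toward ebp, which rules out cycles.
bool leads_to(const Stack& stack, std::size_t p, std::size_t ebp) {
    while (p < ebp) {
        std::size_t next = stack[p];
        if (next <= p || next > ebp) {
            return false;
        }
        p = next;
    }
    return p == ebp;
}

// Scan every slot between esp and ebp and return the deepest (lowest)
// one whose chain leads to ebp.  Falls back to ebp if none qualifies.
std::size_t find_deepest_frame(const Stack& stack, std::size_t ebp, std::size_t esp) {
    assert(esp <= ebp && ebp < stack.size());
    for (std::size_t p = esp; p < ebp; ++p) {
        if (leads_to(stack, p, ebp)) {
            return p;
        }
    }
    return ebp;
}
```

<p>Note that this is a heuristic: any stale value on the stack that happens to chain to <code>ebp</code> will be accepted as a frame, which is why the algorithm says <em>might</em> be a frame pointer.</p>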

<p>Or you can just use <a href="http://imvu.svn.sourceforge.net/viewvc/imvu/imvu_open_source/CallStack/CallStack.cpp?view=markup#l_127">our implementation</a>.</p>

<p>With this in mind, let&#8217;s rewrite our example&#8217;s error handling:</p>

<pre>
void main_function(/*arguments*/) {
    try {
        try {
            this_could_fail(/*arguments*/);
        }
        catch (const std::exception&amp; e) {
            // that's okay, just swallow and
            // do something else.
            fallback_algorithm(/*arguments*/);
        }
    }
    catch (const std::exception&amp; e) {
        // oh no! fallback_algorithm() failed.
        // grab a stack trace - including thrower!<b>
        Context ctx;
        getCurrentContext(ctx);
        ctx.ebp = findDeepestFrame(ctx.ebp, ctx.esp);
        report_crash(CallStack(ctx));</b>
    }
}
</pre>

<p>Bingo, <code>fallback_algorithm</code> appears in the stack:</p>

<pre>
main_function
<b>fallback_algorithm</b>
__CxxThrowException@8
_KiUserExceptionDispatcher@8
ExecuteHandler@20
ExecuteHandler2@20
___CxxFrameHandler
___InternalCxxFrameHandler
___CxxExceptionFilter
___CxxExceptionFilter
?_is_exception_typeof@@YAHABVtype_info@@PAU_EXCEPTION_POINTERS@@@Z
?_CallCatchBlock2@@YAPAXPAUEHRegistrationNode@@PBU_s_FuncInfo@@PAXHK@Z
</pre>

<p>Now we&#8217;ll have no problems finding the source of C++ exceptions!</p>
]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2009/04/who-threw-that-exception/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>You Won&#8217;t Learn This in School: Disabling Kernel Functions in Your Process</title>
		<link>http://chadaustin.me/2009/03/disabling-functions/</link>
		<comments>http://chadaustin.me/2009/03/disabling-functions/#comments</comments>
		<pubDate>Tue, 31 Mar 2009 06:56:21 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[crashes]]></category>
		<category><![CDATA[imvu]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://aegisknight.org/?p=1277</guid>
		<description><![CDATA[
Detecting and reporting unhandled exceptions with SetUnhandledExceptionFilter seemed logical, and, in fact, it worked&#8230;  for a while.  Eventually, we started to notice failures that should have been reported as a last-chance exception but weren&#8217;t.  After much investigation, we discovered that both Direct3D and Flash were installing their own unhandled exception filters!  [...]]]></description>
			<content:encoded><![CDATA[<p>
<a href="http://aegisknight.org/2009/03/crash-reporting-in-imvu-last-chance-exceptions/">Detecting and reporting unhandled exceptions with SetUnhandledExceptionFilter</a> seemed logical, and, in fact, it worked&#8230;  for a while.  Eventually, we started to notice failures that <em>should</em> have been reported as a last-chance exception but weren&#8217;t.  After much investigation, we discovered that <em>both</em> Direct3D <em>and</em> Flash were installing their own unhandled exception filters!  Worse, they were fighting over it, installing their handlers several times per second!  In practice, this meant our last-chance crash reports were rarely generated, convincing us our crash metrics were better than they were.  (Bad, bad libraries!)
</p>

<p>
It&#8217;s pretty ridiculous that we had to solve this problem, but, as Avery Lee says, <a href="http://virtualdub.org/blog/pivot/entry.php?id=245">&#8220;Just because it is not your fault does not mean it is not your problem.&#8221;</a>
</p>

<p>
The obvious solution is to join the fray, calling <code>SetUnhandledExceptionFilter</code> every frame, right?  How about we try something a bit more reliable&#8230;  I hate implementing solutions that have obvious flaws.  Thus, we chose to disable (with code modification) the <code>SetUnhandledExceptionFilter</code> function immediately after installing our own handler.  When Direct3D and Flash try to call it, their requests will be ignored, leaving our exception handler installed.
</p>

<p>
Code modification&#8230;  isn&#8217;t that scary?  With <a href="http://aegisknight.org/2009/02/a-brief-introduction-to-modern-x86-assembly-language/">a bit</a> of <a href="http://aegisknight.org/2009/02/reporting-crashes-in-imvu-c-call-stacks/">knowledge</a> and defensive programming, it&#8217;s not that bad.  In fact, I&#8217;ll show you the code up front:
</p>

<pre>
// If this doesn't make sense, skip the code and come back!

void lockUnhandledExceptionFilter() {
    HMODULE kernel32 = LoadLibraryA("kernel32.dll");
    Assert(kernel32);

    if (FARPROC gpaSetUnhandledExceptionFilter = GetProcAddress(kernel32, "SetUnhandledExceptionFilter")) {
        unsigned char expected_code[] = {
            0x8B, 0xFF, // mov edi,edi
            0x55,       // push ebp
            0x8B, 0xEC, // mov ebp,esp
        };

        // only replace code we expect
        if (memcmp(expected_code, gpaSetUnhandledExceptionFilter, sizeof(expected_code)) == 0) {
            unsigned char new_code[] = {
                0x33, 0xC0,       // xor eax,eax
                0xC2, 0x04, 0x00, // ret 4
            };

            BOOST_STATIC_ASSERT(sizeof(expected_code) == sizeof(new_code));

            DWORD old_protect;
            if (VirtualProtect(gpaSetUnhandledExceptionFilter, sizeof(new_code), PAGE_EXECUTE_READWRITE, &amp;old_protect)) {
                CopyMemory(gpaSetUnhandledExceptionFilter, new_code, sizeof(new_code));

                DWORD dummy;
                VirtualProtect(gpaSetUnhandledExceptionFilter, sizeof(new_code), old_protect, &amp;dummy);

                FlushInstructionCache(GetCurrentProcess(), gpaSetUnhandledExceptionFilter, sizeof(new_code));
            }
        }
    }
    FreeLibrary(kernel32);
}
</pre>

<p>
If that&#8217;s obvious to you, then great: We&#8217;re <a href="http://www.imvu.com/jobs">hiring</a>!
</p>

<p>Otherwise, here is an overview:</p>

<p>Use <code>GetProcAddress</code> to grab the real address of <code>SetUnhandledExceptionFilter</code>.  (If you just type <code>&amp;SetUnhandledExceptionFilter</code> you&#8217;ll get the relocatable import thunk, not the actual <code>SetUnhandledExceptionFilter</code> function.)</p>

<p>Most Windows functions begin with five bytes of prologue:</p>

<pre>
mov edi, edi ; 2 bytes for <a href="http://blogs.msdn.com/ishai/archive/2004/06/24/165143.aspx">hotpatching</a> support
push ebp     ; stack frame
mov ebp, esp ; stack frame (cont'd)
</pre>

<p>
We want to replace those five bytes with <code>return 0;</code>.  Remember that <code>__stdcall</code> functions return values in the <code>eax</code> register.  We want to replace the above code with:
</p>

<pre>
xor eax, eax ; eax = 0
ret 4        ; pops 4 bytes (arg) and returns
</pre>

<p>
Also five bytes!  How convenient!  Before we replace the prologue, we verify that the first five bytes match our expectations.  (If not, we can&#8217;t feel comfortable about the effects of the code replacement.)  The <a href="http://msdn.microsoft.com/en-us/library/aa366898(VS.85).aspx">VirtualProtect</a> and <a href="http://msdn.microsoft.com/en-us/library/ms679350(VS.85).aspx">FlushInstructionCache</a> calls are standard fare for code modification.
</p>
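
<p>The &#8220;only replace code we expect&#8221; check is the part worth getting right, so here is a minimal sketch of it in isolation.  It patches an ordinary byte buffer rather than executable code, so there is no <code>VirtualProtect</code> or <code>FlushInstructionCache</code> involved, and the helper name <code>patch_if_expected</code> is hypothetical.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Overwrite `size` bytes at `target` with `replacement`, but only if the
// bytes currently there match `expected`.  Returns true if the patch was
// applied.  This sketch patches an ordinary data buffer; it does not
// touch page protections or the instruction cache.
bool patch_if_expected(unsigned char* target,
                       const unsigned char* expected,
                       const unsigned char* replacement,
                       std::size_t size) {
    assert(target && expected && replacement);
    if (std::memcmp(target, expected, size) != 0) {
        return false;  // unfamiliar bytes: leave them alone
    }
    std::memcpy(target, replacement, size);
    return true;
}
```

<p>Given a buffer that starts with the five prologue bytes, the call succeeds and the buffer then starts with the five-byte stub.  Given anything else, it returns false and leaves the buffer untouched.</p>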

<p>
After implementing this, it&#8217;s worth stepping through the assembly in a debugger to verify that <code>SetUnhandledExceptionFilter</code> no longer has any effect.  (If you really enjoy writing unit tests, it&#8217;s definitely possible to unit test the desired behavior.  I&#8217;ll leave that as an exercise for the reader.)
</p>

<p>
Finally, our last-chance exception reporting actually works!
</p>

]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2009/03/disabling-functions/feed/</wfw:commentRss>
		<slash:comments>25</slash:comments>
		</item>
		<item>
		<title>Reporting Crashes in IMVU: C++ Call Stacks</title>
		<link>http://chadaustin.me/2009/02/reporting-crashes-in-imvu-c-call-stacks/</link>
		<comments>http://chadaustin.me/2009/02/reporting-crashes-in-imvu-c-call-stacks/#comments</comments>
		<pubDate>Fri, 27 Feb 2009 07:35:00 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[crashes]]></category>
		<category><![CDATA[imvu]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://aegisknight.org/new/2009/02/27/reporting-crashes-in-imvu-c-call-stacks/</guid>
		<description><![CDATA[
Last time, we talked about including contextual information to help us
actually fix crashes that happen in the field.  Minidumps are a great
way to easily save a snapshot of the most important parts of a running
(or crashed) process, but it&#8217;s often useful to understand the
low-level mechanics of a C++ call stack (on x86).  Given [...]]]></description>
			<content:encoded><![CDATA[<p>
Last time, we talked about including contextual information to help us
actually fix crashes that happen in the field.  Minidumps are a great
way to easily save a snapshot of the most important parts of a running
(or crashed) process, but it&#8217;s often useful to understand the
low-level mechanics of a C++ call stack (on x86).  Given some basic
principles about function calls, we will derive the implementation
of code to walk a call stack.
</p>

<p>
C++ function call stack entries are stored on the x86 stack, which
grows downward in memory.  That is, pushing on the stack subtracts
from the stack pointer.  The <code>ESP</code> register points to the
most-recently-written item on the stack; thus, <code>push eax</code>
is equivalent to:
</p>

<pre>
sub esp, 4
mov [esp], eax
</pre>

<p>
Let&#8217;s say we&#8217;re calling a function:
</p>

<pre>
int __stdcall foo(int x, int y)
</pre>

<p>
The <code>__stdcall</code>
calling convention pushes arguments onto the stack from right to left
and returns the result in the <code>EAX</code> register, so calling
<code>foo(1, 2)</code> generates this code:
</p>

<pre>
push 2
push 1
call foo
; result in eax
</pre>

<p>
If you aren&#8217;t familiar with assembly, I know this is a lot to absorb,
but bear with me; we&#8217;re almost there.  We haven&#8217;t seen the
<code>call</code> instruction before.  It pushes the return address
(the value of the <code>EIP</code> register, which points at the
instruction after the <code>call</code>) onto the stack and then jumps
to the target function.
If we didn&#8217;t store the instruction pointer, the called function would
not know where to return when it was done.
</p>
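
<p>If it helps to see the bookkeeping spelled out, here is a minimal sketch of a toy machine that models <code>push</code>, <code>call</code>, and the <code>ret N</code> that a <code>__stdcall</code> function uses to clean up its arguments.  The <code>Machine</code> struct and the addresses fed to it are assumptions for illustration; a real CPU obviously does much more.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical toy machine: memory is an array of 32-bit words, and esp
// and eip are byte addresses, as on x86.
struct Machine {
    std::vector<std::uint32_t> memory;
    std::uint32_t esp;
    std::uint32_t eip;

    // The stack starts empty, with esp just past the end of memory.
    explicit Machine(std::uint32_t bytes)
        : memory(bytes / 4, 0), esp(bytes), eip(0) {}

    // push: make room below the current top of stack, then store.
    void push(std::uint32_t value) {
        assert(esp >= 4);
        esp -= 4;
        memory[esp / 4] = value;
    }

    // call: push the address of the next instruction, then jump.
    void call(std::uint32_t target, std::uint32_t return_address) {
        push(return_address);
        eip = target;
    }

    // ret N: pop the return address into eip, then discard N bytes of
    // arguments.
    void ret(std::uint32_t arg_bytes) {
        eip = memory[esp / 4];
        esp += 4 + arg_bytes;
    }
};
```

<p>Pushing 2, pushing 1, and then calling leaves the return address on top of the stack with the two arguments just above it, and <code>ret(8)</code> puts <code>esp</code> back where it started.</p>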

<p>
The final piece of information we need to construct a C++ call stack is
that functions live in memory, functions have names, and thus sections
of memory have names.  If we can get access to a mapping of memory
addresses to function names (say, with the <a
href="http://msdn.microsoft.com/en-us/library/k7xkk3e2(VS.80).aspx">/MAP
linker option</a>), and we can read instruction pointers up the call
stack, we can generate a symbolic stack trace.
</p>
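
<p>The address-to-name lookup is simple enough to sketch.  The version below assumes the map file has already been parsed into a sorted table of function start addresses; the <code>SymbolMap</code> type and <code>symbolize</code> function are hypothetical names, and any addresses you feed it are your own.</p>

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Maps each function's start address to its name, as might be loaded
// from a linker MAP file.
using SymbolMap = std::map<std::uint32_t, std::string>;

// Return the name of the function containing `address`: the symbol with
// the greatest start address that is not past `address`.
std::string symbolize(const SymbolMap& symbols, std::uint32_t address) {
    SymbolMap::const_iterator it = symbols.upper_bound(address);
    if (it == symbols.begin()) {
        return "<unknown>";  // address precedes every known function
    }
    return std::prev(it)->second;
}
```

<p>A return address almost never equals a function&#8217;s start address, which is why the lookup finds the nearest symbol at or below the address instead of requiring an exact match.</p>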

<p>
How do we read the instruction pointers up the call stack?
Unfortunately, just knowing the return address from the current
function is not enough.  How do you know the location of the caller&#8217;s
caller?  Without extra information, you don&#8217;t.  Fortunately, most
functions have that information in the form of a function prologue:
</p>

<pre>
push ebp
mov ebp, esp
</pre>

<p>
and epilogue:
</p>

<pre>
mov esp, ebp
pop ebp
</pre>

<p>
These bits of code appear at the beginning and end of most functions
(compilers may omit them as an optimization), allowing you
to use the <code>EBP</code> register as the &#8220;current stack frame&#8221;.
In such functions, arguments are accessed at positive offsets from EBP,
and locals at negative offsets:
</p>

<pre>
; int foo(int x, int y)
; ...
[EBP+12] = y argument
[EBP+8]  = x argument
[EBP+4]  = return address (set by call instruction)
[EBP]    = previous stack frame
[EBP-4]  = local variable 1
[EBP-8]  = local variable 2
; ...
</pre>

<p>
Look!  For any stack frame <code>EBP</code>, the caller&#8217;s address is
at <code>[EBP+4]</code> and the previous stack frame is at <code>[EBP]</code>.
By dereferencing <code>EBP</code>, we can walk
the call stack, all the way to the top!
</p>

<pre>
struct stack_frame {
    stack_frame*  previous;
    unsigned long return_address;
};

std::vector&lt;unsigned long&gt; get_call_stack() {
    std::vector&lt;unsigned long&gt; call_stack;

    stack_frame* current_frame;
    __asm mov current_frame, ebp

    while (!IsBadReadPtr(current_frame, sizeof(stack_frame))) {
        call_stack.push_back(current_frame->return_address);
        current_frame = current_frame->previous;
    }
    return call_stack;
}

// Convert the array of addresses to names with the aforementioned MAP file.
</pre>

<p>
Yay, now we know how to grab a stack trace from any location in the
code.  This implementation is not robust, but the concepts are
correct: functions have names, functions live in memory, and we can
determine which memory addresses are on the call stack.  Now that you
know how to manually grab a call stack, let Microsoft do the heavy
lifting with the <a
href="http://msdn.microsoft.com/en-us/library/ms680650(VS.85).aspx">StackWalk64</a>
function.
</p>

<p>
Next time, we&#8217;ll talk about setting up your very own Microsoft Symbol Server so you can
grab accurate function names from every version of your software.
</p>]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2009/02/reporting-crashes-in-imvu-c-call-stacks/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Reporting Crashes in IMVU: Call Stacks and Minidumps</title>
		<link>http://chadaustin.me/2009/02/reporting-crashes-in-imvu-call-stacks-and-minidumps/</link>
		<comments>http://chadaustin.me/2009/02/reporting-crashes-in-imvu-call-stacks-and-minidumps/#comments</comments>
		<pubDate>Thu, 26 Feb 2009 07:29:00 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[crashes]]></category>
		<category><![CDATA[imvu]]></category>
		<category><![CDATA[seh]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://aegisknight.org/new/2009/02/26/reporting-crashes-in-imvu-call-stacks-and-minidumps/</guid>
		<description><![CDATA[
So far, we&#8217;ve implemented reporting for Python exceptions that bubble
out of the main loop, C++ exceptions that bubble into Python (and then
out of the main loop), and structured exceptions that bubble into
Python (and then out of the main loop).  This is a fairly
comprehensive set of failure conditions, but there&#8217;s still a big piece
missing from [...]]]></description>
			<content:encoded><![CDATA[<p>
So far, we&#8217;ve implemented reporting for <a href="http://aegisknight.livejournal.com/136847.html">Python exceptions that bubble
out of the main loop</a>, <a href="http://aegisknight.livejournal.com/142469.html">C++ exceptions that bubble into Python</a> (and then
out of the main loop), and <a href="http://aegisknight.livejournal.com/142613.html">structured exceptions that bubble into
Python</a> (and then out of the main loop).  This is a fairly
comprehensive set of failure conditions, but there&#8217;s still a big piece
missing from our reporting.
</p>

<p>
Imagine that you implement this error reporting and have customers try
the new version of your software.  You&#8217;ll soon have a collection of
crash reports, and one thing will stand out clearly.  Without the
context in which crashes happened (call stacks, variable values,
perhaps log files), it&#8217;s very hard to determine their cause(s).  And
without determining their cause(s), it&#8217;s very hard to fix them.
</p>

<p>
Reporting log files is easy enough.  Just attach them to the error
report.  You may need to deal with privacy concerns or limit the size
of the log files that get uploaded, but those are straightforward
problems.
</p>

<p>
Because Python has <a href="http://www.python.org/doc/current/library/">batteries
included</a>, grabbing the call stack from a Python exception is
trivial.  Just take a quick look at the <a href="http://www.python.org/doc/current/library/traceback.html">traceback
module</a>.
</p>

<p>
Structured exceptions are a little harder.  The structure of a call
stack on x86 is machine- and sometimes compiler-dependent.
Fortunately, Microsoft provides an API to dump the relevant process
state to a file such that it can be opened in <a href="http://www.microsoft.com/visualstudio/en-us/default.mspx">Visual
Studio</a> or <a href="http://www.microsoft.com/whdc/devtools/debugging/default.mspx">WinDbg</a>,
which will let you view the stack trace and select other data.  These
files are called minidumps, and they&#8217;re pretty small.  Just call <a href="http://msdn.microsoft.com/en-us/library/ms680360(VS.85).aspx">MiniDumpWriteDump</a>
with the context of the exception and submit the generated file with your crash
report.
</p>

<p>
Grabbing a call stack from C++ exceptions is even harder, and maybe
not desired.  If you regularly use C++ exceptions for communicating
errors from C++ to Python, it&#8217;s probably too expensive to grab a call
stack or write a minidump every single time.  However, if you want to
do it anyway, here&#8217;s one way.
</p>

<p>
C++ exceptions are implemented on top of the Windows kernel&#8217;s
structured exception machinery.  Using the <code>try</code> and
<code>catch</code> statements in your C++ code causes the compiler to
generate SEH code behind the scenes.  However, by the time your C++
<code>catch</code> clauses run, the stack has already been unwound.  <a href="http://www.microsoft.com/msj/0197/Exception/Exception.aspx">Remember</a>
that SEH has three passes: first it runs filter expressions until it
finds one that can handle the exception; then it unwinds the stack
(destroying any objects allocated on the stack); finally it runs the
actual exception handler.  Your C++ exception handler runs as the last stage,
which means the stack has already been unwound, which means you can&#8217;t
get an accurate call stack from the exception handler.  However, we
can use SEH to grab a call stack at the point where the exception was
thrown, before we handle it&#8230;
</p>
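
<p>You can watch the &#8220;already unwound&#8221; part happen in plain C++, without touching SEH at all.  This is a minimal sketch, and the <code>Tracer</code> type, the <code>events</code> log, and the function names are made up for illustration: a local object in the throwing function records when it is destroyed, and the catch clause records when it starts running.</p>

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Records the order of events so it can be inspected afterwards.
static std::vector<std::string> events;

// A local object whose destructor logs when its frame is unwound.
struct Tracer {
    ~Tracer() { events.push_back("thrower's locals destroyed"); }
};

void thrower() {
    Tracer local;
    throw std::runtime_error("hi");
}

void run() {
    events.clear();
    try {
        thrower();
    } catch (const std::exception&) {
        // By the time this line runs, thrower()'s frame is already gone.
        events.push_back("catch clause entered");
    }
}
```

<p>After calling <code>run()</code>, the destructor&#8217;s entry comes first in <code>events</code>, so anything the catch clause does happens after the thrower&#8217;s frame has been torn down.</p>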

<p>
First, let&#8217;s determine the SEH exception code of C++ exceptions
(WARNING, this code is compiler-dependent):
</p>

<pre>
#include &lt;windows.h&gt;
#include &lt;stdio.h&gt;
#include &lt;exception&gt;

int main() {
    DWORD code;
    __try {
        throw std::exception();
    }
    __except (code = GetExceptionCode(), EXCEPTION_EXECUTE_HANDLER) {
        printf("%X\n", code);
    }
}
</pre>

<p>
Once we have that, we can write our exception-catching function like
this:
</p>

<pre>
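// The code printed by the program above.  With Microsoft's compiler it
// is 0xE06D7363; verify the value for your own toolchain.
const DWORD CPP_EXCEPTION_CODE = 0xE06D7363;

void throw_cpp_exception() {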
    throw std::runtime_error("hi");
}

bool writeMiniDump(const EXCEPTION_POINTERS* ep) {
    // ...
    return true;
}

void catch_seh_exception() {
    __try {
        throw_cpp_exception();
    }
    __except (
        (CPP_EXCEPTION_CODE == GetExceptionCode()) &amp;&amp; writeMiniDump(GetExceptionInformation()),
        EXCEPTION_CONTINUE_SEARCH
    ) {
    }
}

int main() {
    try {
        catch_seh_exception();
    }
    catch (const std::exception&amp; e) {
        printf("%s\n", e.what());
    }
}
</pre>

<p>
Now we&#8217;ve got call stacks and program state for C++, SEH, and Python
exceptions, which makes fixing reported crashes dramatically easier.
</p>

<p>
Next time I&#8217;ll go into more detail about how C++ stack traces work,
and we&#8217;ll see if we can grab them more efficiently.
</p>]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2009/02/reporting-crashes-in-imvu-call-stacks-and-minidumps/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>A brief introduction to modern x86 assembly language</title>
		<link>http://chadaustin.me/2009/02/a-brief-introduction-to-modern-x86-assembly-language/</link>
		<comments>http://chadaustin.me/2009/02/a-brief-introduction-to-modern-x86-assembly-language/#comments</comments>
		<pubDate>Sat, 21 Feb 2009 08:34:00 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://aegisknight.org/new/2009/02/21/a-brief-introduction-to-modern-x86-assembly-language/</guid>
		<description><![CDATA[Several people have personally requested that I give a brief
introduction to modern x86 (sometimes called IA32) assembly language.
For simplicity&#8217;s sake, I&#8217;ll stick with the 32-bit version with a flat
memory model.  AMD64
(sometimes called x64) just isn&#8217;t as popular as x86 yet, so this seems safe.

For some reason, there&#8217;s a mythos around assembly language.  People [...]]]></description>
			<content:encoded><![CDATA[<p>Several people have personally requested that I give a brief
introduction to modern x86 (sometimes called IA32) assembly language.
For simplicity&#8217;s sake, I&#8217;ll stick with the 32-bit version with a flat
memory model.  <a
href="http://unity3d.com/webplayer/hwstats/pages/web-2009Q1-os.html">AMD64
(sometimes called x64) just isn&#8217;t as popular as x86 yet</a>, so this seems safe.</p>

<p>For some reason, there&#8217;s a mythos around assembly language.  People associate it with bearded gurus, assuming only ninjas can program in it, when, in principle, assembly language
is one of the simplest programming languages there is.  Any complexity
stems from a particular architecture&#8217;s oddities, and even though x86 is one of the
oddest of them all, I&#8217;ll show you that it can be easy to read and write.</p>

<p>
First, I&#8217;ll describe the basic architecture.  When programming in assembly,
there are three main concepts:
</p>

<p>
<strong>Instructions</strong> are the individual commands that tell the
computer to perform an operation.  These include instructions for
adding, multiplying, comparing, copying, performing bit-wise operations,
accessing memory, and communicating with external devices.  The
computer executes instructions sequentially.
</p>

<p>
<strong>Registers</strong> are where temporary values go.  There is a
small, fixed set of registers available for use.  Since there aren&#8217;t many registers, nothing stays in
them for very long, as they are soon needed for other purposes.
</p>

<p>
<strong>Memory</strong> is where longer-lived data goes.  It&#8217;s a
giant, flat array of bytes (8-bit quantities).  It&#8217;s much slower to
access than registers, but there&#8217;s a lot of it.
</p>

<p>
Before I get into some examples, let me describe the registers
available on x86.  There are only 8 general-purpose registers, each of
which is 32 bits wide.  They are:
</p>

<ul>
<li><code>EAX</code></li>
<li><code>EBX</code></li>
<li><code>ECX</code></li>
<li><code>EDX</code></li>
<li><code>ESI</code></li>
<li><code>EDI</code></li>
<li><code>EBP</code> &#8211; used when accessing local variables or function arguments</li>
<li><code>ESP</code> &#8211; used when calling functions</li>
</ul>

<p>
On x86, most instructions have two operands, a destination and a
source.  For example, let&#8217;s add two and three:
</p>

<pre>
mov eax, 2   ; eax = 2
mov ebx, 3   ; ebx = 3
add eax, ebx ; eax = 2 + 3 = 5
</pre>

<p>
<code>add eax, ebx</code> adds the values in registers eax and ebx, and stores
the result back in eax.  (BTW, this is one of the oddities of x86.
Other modern architectures differentiate between destination and
source operands, which would look like <code>add eax, ebx, ecx</code>
meaning <code>eax = ebx + ecx</code>.  On x86, the first operand is read and written in the same instruction.)
</p>

<p>
<code>mov</code> is the data movement instruction.  It copies values
from one register to another, or from a constant to a register, or
from memory to a register, or from a register to memory.
</p>

<p>
Speaking of memory, let&#8217;s say we want to add 2 and 3, storing the
result at address 32.  Since the result of the addition is 32 bits, the result will
actually use addresses 32, 33, 34, and 35.  Remember, memory is
indexed in bytes.
</p>

<pre>
mov eax, 2
mov ebx, 3
add eax, ebx
mov edi, 32
mov [edi], eax ; copies 5 to address 32 in memory
</pre>
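
<p>To see why that store touches four addresses, here is a minimal sketch that models memory as an array of bytes.  The <code>store32</code> and <code>load32</code> helpers are hypothetical names; they spell out the byte order by hand (x86 stores the least significant byte first) so the result doesn&#8217;t depend on the machine running the sketch.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Store a 32-bit value at `address` in a byte-addressed memory, least
// significant byte first, which is the byte order x86 uses.
void store32(std::vector<std::uint8_t>& memory, std::size_t address, std::uint32_t value) {
    assert(address + 4 <= memory.size());
    for (std::size_t i = 0; i < 4; ++i) {
        memory[address + i] = static_cast<std::uint8_t>(value >> (8 * i));
    }
}

// Load the 32-bit value whose four bytes start at `address`.
std::uint32_t load32(const std::vector<std::uint8_t>& memory, std::size_t address) {
    assert(address + 4 <= memory.size());
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        value |= static_cast<std::uint32_t>(memory[address + i]) << (8 * i);
    }
    return value;
}
```

<p>Storing 5 at address 32 writes the byte 5 to address 32 and zero bytes to addresses 33, 34, and 35, leaving addresses 31 and 36 alone.</p>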

<p>
What about loading data from memory?  (Reads from memory are called
loads.  Writes are called stores.)  Let&#8217;s write a program that copies
1000 4-byte quantities (4000 bytes) from address 10000 to address
20000.
</p>

<pre>
mov esi, 10000 ; by convention, esi is often used as the 'source' pointer
mov edi, 20000 ; similarly, edi often means 'destination' pointer
mov ecx, 1000 ; let's copy 1000 32-bit items

begin_loop:
mov eax, [esi] ; load from source
mov [edi], eax ; store to destination
add esi, 4
add edi, 4

sub ecx, 1 ; ecx -= 1
cmp ecx, 0 ; is ecx 0?

; if ecx does not equal 0, jump to the beginning of the loop
jne begin_loop
; otherwise, we're done
</pre>

<p>
This is, in essence, how the C <code>memcpy</code> function works,
though real implementations are more heavily optimized.  Not so bad, is
it?  For reference, this is what our x86 code would look like in C:
</p>

<pre>
int* src = (int*)10000;
int* dest = (int*)20000;
int count = 1000;
while (count--) {
    *dest++ = *src++;
}
</pre>

<p>
From here, all it takes is a good <a
href="http://www.intel.com/products/processor/manuals/">instruction
reference</a>, some memorization, and a bit of practice.  x86 is full
of arcane details (it&#8217;s 30 years old!), but once you&#8217;ve got the basic
concepts down, you can mostly ignore them.  I hope I&#8217;ve shown you that writing x86
is easy.  Perhaps more importantly, I hope you won&#8217;t be intimidated the next time Visual Studio
shows you the assembly for your program.  Understanding how the machine is executing your code
can be invaluable when debugging.
</p>]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2009/02/a-brief-introduction-to-modern-x86-assembly-language/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Latency vs. Throughput</title>
		<link>http://chadaustin.me/2009/02/latency-vs-throughput/</link>
		<comments>http://chadaustin.me/2009/02/latency-vs-throughput/#comments</comments>
		<pubDate>Sat, 14 Feb 2009 05:57:00 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://aegisknight.org/new/2009/02/13/latency-vs-throughput/</guid>
		<description><![CDATA[
This is my last post about processors and performance, I swear!  Plus,
my wrists are starting to hurt from this bloodpact thing (as I&#8217;m
diagnosed with RSI), so I think this will be a light one.



As I&#8217;ve discussed previously,
modern desktop processors work really hard to exploit the inherent
parallelism in your programs.  This is called instruction-level
parallelism, [...]]]></description>
			<content:encoded><![CDATA[<p>
This is my last post about processors and performance, I swear!  Plus,
my wrists are starting to hurt from this <a href="http://www.egometry.com/bloodpact/">bloodpact thing</a> (as I&#8217;m
diagnosed with RSI), so I think this will be a light one.
</p>

<p>
As I&#8217;ve discussed <a
href="http://aegisknight.livejournal.com/139879.html">previously</a>,
modern desktop processors work really hard to exploit the inherent
parallelism in your programs.  This is called <a
href="http://en.wikipedia.org/wiki/Instruction-level_parallelism">instruction-level
parallelism</a>, and is one of the techniques processors use to get
more performance out of slower clock rates (along with data-level
parallelism (SIMD) or multiple cores (MIMD)<a href="#footnotesimdvsmimd">*</a>).  Previously, I waved my
hands a bit and said &#8220;The processor makes independent operations run
in parallel.&#8221;  Now I&#8217;m going to teach you how to count cycles in the presence of latency and parallelism.
</p>

<p>
Traditionally, when analyzing the cost of an algorithm, you would
simply count the operations involved, sum their costs in cycles, and
call it a day.  These days, it&#8217;s not that easy.  Instructions have two
costs: dependency chain latency and reciprocal throughput.
</p>

<p>
Reciprocal throughput is simply the reciprocal of the maximum
throughput of a particular instruction.  Throughput is measured in
instructions/cycle, so reciprocal throughput is cycles/instruction.
</p>

<p>
OK, that sounds like the way we&#8217;ve always measured performance.  So
what&#8217;s dependency chain latency?  When the results of a previous
calculation are needed for another calculation, you have a dependency
chain.  In a dependency chain, you measure the cost of an instruction
by its latency, not its reciprocal throughput.  Remember that our
processors are working really hard to exploit parallelism in our code.
When there is no instruction-level parallelism available, we get
penalized.
</p>

<p>
Let&#8217;s go back to our example of summing 10000 numbers, but unroll it a bit:
</p>

<pre>
float array[10000];
float sum = 0.0f;
for (int i = 0; i &lt; 10000; i += 8) {
    sum += array[i+0];
    sum += array[i+1];
    sum += array[i+2];
    sum += array[i+3];
    sum += array[i+4];
    sum += array[i+5];
    sum += array[i+6];
    sum += array[i+7];
}
return sum;
</pre>

<p>
In x86:
</p>

<pre>
xor ecx, ecx     ; ecx  = i   = 0
mov esi, array
xorps xmm0, xmm0 ; xmm0 = sum = 0.0

begin:
addss xmm0, [esi+0]
addss xmm0, [esi+4]
addss xmm0, [esi+8]
addss xmm0, [esi+12]
addss xmm0, [esi+16]
addss xmm0, [esi+20]
addss xmm0, [esi+24]
addss xmm0, [esi+28]

add esi, 32
add ecx, 8       ; i += 8 (eight elements per iteration)
cmp ecx, 10000
jl begin ; if ecx &lt; 10000, goto begin

; xmm0 = total sum
</pre>

<p>
Since each addition to <code>sum</code> in the loop depends on the previous
addition, these instructions are a dependency chain.  On a modern
processor, let&#8217;s say the reciprocal throughput of <code>addss</code> is 1 cycle.
However, the minimum latency is 4 cycles.  Since every instruction
depends on the previous, each addition costs 4 cycles.
</p>
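<p>
As a back-of-the-envelope check, the two costs can be combined into a rough estimate.  This is only a sketch under simplifying assumptions: loads and loop overhead are free, the additions are split evenly across some number of independent dependency chains, and the function name <code>estimate_cycles</code> is made up for illustration.
</p>

```cpp
#include <algorithm>
#include <cstdint>

// Rough cycle estimate for `count` additions split evenly across
// `chains` independent dependency chains. A simplified model only:
// it ignores loads, loop overhead, and everything else in the pipeline.
std::uint64_t estimate_cycles(std::uint64_t count, std::uint64_t chains,
                              std::uint64_t latency,
                              std::uint64_t reciprocal_throughput) {
    // Length of the longest chain, rounded up.
    const std::uint64_t per_chain = (count + chains - 1) / chains;
    // Within one chain, each addition waits for the previous result.
    const std::uint64_t latency_bound = per_chain * latency;
    // The adder cannot start additions faster than its throughput allows.
    const std::uint64_t throughput_bound = count * reciprocal_throughput;
    return std::max(latency_bound, throughput_bound);
}
```

<p>
With the numbers used here (latency 4, reciprocal throughput 1), <code>estimate_cycles(10000, 1, 4, 1)</code> gives 40000 cycles for a single chain, while <code>estimate_cycles(10000, 4, 4, 1)</code> gives 10000 cycles for four independent chains.
</p>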

<p>
As before, let&#8217;s try summing with four temporary sums:
</p>

<pre>
xor ecx, ecx     ; ecx  = i    = 0
mov esi, array
xorps xmm0, xmm0 ; xmm0 = sum1 = 0.0
xorps xmm1, xmm1 ; xmm1 = sum2 = 0.0
xorps xmm2, xmm2 ; xmm2 = sum3 = 0.0
xorps xmm3, xmm3 ; xmm3 = sum4 = 0.0

; four independent accumulators: no addss waits on the one just before it

begin:
addss xmm0, [esi+0]
addss xmm1, [esi+4]
addss xmm2, [esi+8]
addss xmm3, [esi+12]
addss xmm0, [esi+16]
addss xmm1, [esi+20]
addss xmm2, [esi+24]
addss xmm3, [esi+28]

add esi, 32
add ecx, 8       ; i += 8 (eight elements per iteration)
cmp ecx, 10000
jl begin ; if ecx &lt; 10000, goto begin

; accumulate sums
addss xmm0, xmm1
addss xmm2, xmm3 ; this instruction happens in parallel with the one above
addss xmm0, xmm2
</pre>

<p>
Here, the additions in the loop that depend on each other are 4 cycles apart,
meaning the minimum latency is no longer a problem.  This lets us hit
the maximum addition rate of one per cycle.
</p>
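<p>
For comparison, here is a sketch of the same four-accumulator idea in C++.  The function name and the tail loop for counts that are not a multiple of four are my additions for illustration.  Note that floating point addition is not associative, so the result can differ in the last bits from a strictly serial sum, which is also why compilers generally won't make this transformation on their own without something like /fp:fast.
</p>

```cpp
#include <cstddef>

// Sum `count` floats using four independent accumulators so that
// consecutive additions do not depend on each other.
float sum_four_accumulators(const float* data, std::size_t count) {
    float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        sum0 += data[i + 0];
        sum1 += data[i + 1];
        sum2 += data[i + 2];
        sum3 += data[i + 3];
    }
    // Handle leftovers when count is not a multiple of 4.
    for (; i < count; ++i) {
        sum0 += data[i];
    }
    // Combine the partial sums; the two inner additions are independent.
    return (sum0 + sum1) + (sum2 + sum3);
}
```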

<p>
Removing dependency chains is a critical part of optimizing on today&#8217;s
processors.  The Core 2 processor has <em>six</em> execution units,
three of which are fully 128-bit SIMD ALUs.  If you can restructure
your algorithm so calculations happen independently, you can take
advantage of all of them.  (And if you can pull off making full use of
the Core 2&#8217;s ALU capacity, you win.)
</p>

<p>
<a name="footnotesimdvsmimd">*</a> BTW, it&#8217;s sort of unrelated, but I couldn&#8217;t help but link this article.
Greg Pfister has an interesting comparison and history of SIMD
vs. MIMD <a
href="http://perilsofparallel.blogspot.com/2008/09/larrabee-vs-nvidia-mimd-vs-simd.html">here</a>.  Ignore the terminology blathering and focus on the history of and influences on SIMD and MIMD over time.
</p>]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2009/02/latency-vs-throughput/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Simple Introduction to Superscalar, Out-of-Order Processors</title>
		<link>http://chadaustin.me/2009/02/a-simple-introduction-to-superscalar-out-of-order-processors/</link>
		<comments>http://chadaustin.me/2009/02/a-simple-introduction-to-superscalar-out-of-order-processors/#comments</comments>
		<pubDate>Thu, 12 Feb 2009 06:09:00 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://aegisknight.org/new/2009/02/11/a-simple-introduction-to-superscalar-out-of-order-processors/</guid>
		<description><![CDATA[
Since the Pentium Pro/Pentium 2, we have all been using heavily superscalar, out-of-order processors.  I&#8217;d heard these terms a million times, but didn&#8217;t know what they meant until I read The Pentium Chronicles: The People, Passion, and Politics Behind Intel&#8217;s Landmark Chips (Practitioners).  (BTW, if you love processors, the history of technology, and [...]]]></description>
			<content:encoded><![CDATA[<p>
Since the Pentium Pro/Pentium 2, we have all been using heavily superscalar, out-of-order processors.  I&#8217;d heard these terms a million times, but didn&#8217;t know what they meant until I read <a href="http://www.amazon.com/gp/product/0471736171?ie=UTF8&#038;tag=aegisknightor-20&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=0471736171">The Pentium Chronicles: The People, Passion, and Politics Behind Intel&#8217;s Landmark Chips (Practitioners)</a><img src="http://www.assoc-amazon.com/e/ir?t=aegisknightor-20&#038;l=as2&#038;o=1&#038;a=0471736171" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />.  (BTW, if you love processors, the history of technology, and the fascinating dynamics at a company like Intel, that book is fantastic.)
</p>

<p>
Superscalar basically means &#8220;greater than 1&#8221;, implying that a superscalar processor can run code faster than its clock speed would suggest.  Indeed, a 3 GHz Pentium 4 can retire 4 independent integer additions per clock cycle, which is 12 billion integer additions per second!
</p>

<p>
Out-of-order means just that &#8211; the processor looks at your code at runtime and executes it in parallel if it can.  For example, imagine this code:
</p>

<pre>
// three independent, non-null pointers
int* p; int* q; int* r;
const int flag1, flag2, flag3;

if (*p &amp; flag1) {
    if (*q &amp; flag2) {
        if (*r &amp; flag3) {
            do_work();
        }
    }
}
</pre>

<p>
The processor can&#8217;t assume that <code>q</code> is a valid pointer until it checks <code>p</code>, and the same for <code>r</code> and <code>q</code>.  Accessing main memory costs ~200 cycles, so if none of the pointers point to cached memory, you just spent 600 cycles determining whether to <code>do_work()</code>.  This is called a &#8220;dependency chain&#8221;, where the result of a later calculation depends on the previous.  But what if you know that p, q, and r will all be valid pointers?  You can rewrite as:
</p>

<pre>
const int x = *p;
const int y = *q;
const int z = *r;
if ((x &amp; flag1) &amp;&amp; (y &amp; flag2) &amp;&amp; (z &amp; flag3)) {
    do_work();
}
</pre>

<p>
Now, the processor knows that all of those memory fetches are independent, so it runs them in parallel.  Then, it runs the <code>AND</code>s in parallel too, since they&#8217;re independent.  Your 600-cycle check just became 200 cycles.
</p>

<p>
Similarly, let&#8217;s say you want to add 10,000 numbers.
</p>

<pre>
int sum = 0;
for (int i = 0; i &lt; 10000; ++i) {
    sum += array[i];
}
return sum;
</pre>

<p>
Let&#8217;s assume the loop overhead and memory access are free, and each addition takes one cycle.  Since each addition depends on the previous value of sum, they must be executed serially, taking 10000 cycles.  However, since you know that addition is associative, you can sum with two variables:
</p>

<pre>
int sum1 = 0;
int sum2 = 0;
for (int i = 0; i &lt; 10000; i += 2) {
    sum1 += array[i];
    sum2 += array[i+1];
}
return sum1 + sum2;
</pre>

<p>
Now you have two independent additions, which can be executed in parallel!  The loop takes 5000 cycles now.  If you independently sum in <code>sum1</code>, <code>sum2</code>, <code>sum3</code>, and <code>sum4</code>, the loop will take 2500 cycles.  And so on, until you&#8217;ve hit the IPC (instructions per cycle) limit on your processor.  If you&#8217;re making effective use of your SIMD units, you&#8217;d be surprised at how much work you can do in parallel&#8230;
</p>

<p>
And that&#8217;s what an out-of-order, superscalar processor can do for you!
</p>]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2009/02/a-simple-introduction-to-superscalar-out-of-order-processors/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>The Real Benefit of Inlining Functions (or: Floating Point Calling Conventions)</title>
		<link>http://chadaustin.me/2009/02/the-real-benefit-of-inlining-functions-or-floating-point-calling-conventions/</link>
		<comments>http://chadaustin.me/2009/02/the-real-benefit-of-inlining-functions-or-floating-point-calling-conventions/#comments</comments>
		<pubDate>Mon, 09 Feb 2009 03:24:00 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://aegisknight.org/new/2009/02/08/the-real-benefit-of-inlining-functions-or-floating-point-calling-conventions/</guid>
		<description><![CDATA[
My mental model for the performance benefit of inlining a function call was:



code size increases
the overhead of the call, including argument and return value marshalling, is eliminated
the compiler knows more information, so it can generate better code



I had dramatically underestimated the value of #3, so this entry is an attempt to give a concrete example [...]]]></description>
			<content:encoded><![CDATA[<p>
My mental model for the performance benefit of <a href="http://en.wikipedia.org/wiki/Inline_expansion">inlining a function call</a> was:
</p>

<ol>
<li>code size increases</li>
<li>the overhead of the call, including argument and return value marshalling, is eliminated</li>
<li>the compiler knows more information, so it can generate better code</li>
</ol>

<p>
I had dramatically underestimated the value of #3, so this entry is an attempt to give a concrete example of how inlining can help.
</p>

<p>
As alluded to in my <a href="http://aegisknight.livejournal.com/138886.html">previous entry</a>, you can&#8217;t just leave the floating point state willy nilly across function calls.  Every function should be able to make full use of the floating point register stack, which doesn&#8217;t work if somebody has left stale values on it.  In general, these rules are called <a href="http://en.wikipedia.org/wiki/X86_calling_conventions">calling conventions</a>.  Agner Fog has <a href="http://www.agner.org/optimize/calling_conventions.pdf">excellent coverage</a> of the topic, as usual.
</p>

<p>
Anyway, back to inlining.  The specifics aren&#8217;t that important, but we had a really simple function in the IMVU client which continued to show up in the profiles.  It looked something like this:
</p>

<pre>
std::vector&lt;float&gt; array;

float function() {
    float sum = 0.0f;
    for (size_t i = 0; i &lt; array.size(); ++i) {
        sum += array[i];
    }
    return sum;
}
</pre>

<p>
This function never operated on very large lists, and it also wasn&#8217;t called very often, so why was it consistently in the profiles?  A peek at the assembly showed (again, something like):
</p>

<pre>
fldz
fstp dword ptr [sum] ; sum = 0.0

xor ecx, ecx ; i = 0
jmp cmp

loop:

push ecx
call array.operator[]

fadd [sum] ; return value of operator[] in ST(0)
fstp [sum] ; why the load and the store??

add ecx, 1

cmp:

call array.size()
cmp ecx, eax
jb loop ; continue if i &lt; return value

fld [sum] ; return value
</pre>

<p>
First of all, why all of the function calls?  Shouldn't std::vector be inlined?  But more importantly, why does the compiler keep spilling sum out to the stack?  Surely it could keep the sum in a floating point register for the entire calculation.
</p>

<p>
This is when I realized: due to the calling convention requirements on function calls, the floating point stack must be empty upon entry into the function, so the compiler has to store <code>sum</code> to memory before each call and reload it afterwards.  The memory stack is in L1 cache, but still, that's three cycles per access, plus a bunch of pointless load and store uops.
</p>

<p>
Now, I actually know why std::vector isn't inlined.  For faster bug detection, we compile and ship with bounds checking enabled on STL containers and iterators.  But in this particular situation, the bounds checking isn't helpful, since we're iterating over the entire container.  I rewrote the function as:
</p>

<pre>
std::vector&lt;float&gt; array;

float function() {
    const float* p = &amp;array[0];
    size_t count = array.size();
    float sum = 0.0f;
    while (count--) {
        sum += *p++;
    }
    return sum;
}
</pre>

<p>
And the compiler generated the much more reasonable:
</p>

<pre>
call array.size()
mov ecx, eax ; ecx = count

push 0
call array.operator[]
mov esi, eax ; esi = p

fldz ; ST(0) = sum

jmp cmp
loop:

fadd [esi] ; sum += *p

add esi, 4 ; p++
sub ecx, 1 ; count--

cmp:
cmp ecx, 0
jne loop

; return ST(0)
</pre>

<p>
This is the real benefit of inlining.  Modern compilers are awesome at making nearly-optimal use of the CPU, but only when they have enough information.  Inlining functions gives them that information.
</p>

<p>
p.s. I apologize if my pseudo-assembly had mistakes.  I wrote it from memory.
</p>]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2009/02/the-real-benefit-of-inlining-functions-or-floating-point-calling-conventions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>#IND and #QNaN with /fp:fast</title>
		<link>http://chadaustin.me/2009/02/ind-and-qnan-with-fpfast/</link>
		<comments>http://chadaustin.me/2009/02/ind-and-qnan-with-fpfast/#comments</comments>
		<pubDate>Sun, 08 Feb 2009 01:16:00 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[imvu]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://aegisknight.org/new/2009/02/07/ind-and-qnan-with-fpfast/</guid>
		<description><![CDATA[The other day Timothy and I were optimizing some floating-point-intensive lighting code.  Looking at the generated code, I realized we weren&#8217;t compiling with /fp:fast.  Due to the wonky state of floating point on 32-bit x86, Visual C++ frequently stores temporary results of floating point calculations to the stack and then reloads them, for [...]]]></description>
			<content:encoded><![CDATA[<p>The other day Timothy and I were optimizing some floating-point-intensive lighting code.  Looking at the generated code, I realized we weren&#8217;t compiling with <a href="http://msdn.microsoft.com/en-us/library/e7s85ffb(VS.80).aspx">/fp:fast</a>.  Due to the wonky state of floating point on 32-bit x86, Visual C++ frequently stores temporary results of floating point calculations to the stack and then reloads them, for the sake of consistent results.</p>

See, the problem is that the floating point registers on x86 are 80 bits wide, so if you compile &#8220;<code>float x, y, z, w; w = (x + y) * z</code>&#8221; as&#8230;

<pre>
fld [x]  ; ST0 = x
fadd [y] ; ST0 = x + y
fmul [z] ; ST0 = (x + y) * z
fstp [w] ; w = (x + y) * z
</pre>

<p>
&#8230; the temporary results are always stored in ST0 with 80 bits of precision.  However, since a float in memory is only 32 bits wide, you can wind up with different results depending on when (and whether) an intermediate value gets stored and reloaded, which in turn depends on compilers, optimization settings, register allocation, etc.  We often had problems like this at VRAC.  Some poor engineering student would send out a panicked e-mail at 9:00 p.m. asking why his program started producing different results in release mode than it did in debug mode.
</p>

<p>
Thus, Visual C++ takes a more cautious approach.  By default, it stores float intermediates back to memory to truncate them to 32 bits of precision:
</p>

<pre>
fld [x]
fadd [y]
fstp [temp] ; truncate precision
fld [temp]
fmul [z]
fstp [w]
</pre>
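<p>
The effect of a wider intermediate can be reproduced in portable C++ by using <code>double</code> as a stand-in for the wider x87 register.  This is only an analogy (a <code>double</code> is 64 bits, not 80), the function names are hypothetical, and it uses <code>(x + y) - x</code> rather than the multiplication above because that makes the difference easy to see.
</p>

```cpp
// Rounds the intermediate sum to float before subtracting, like the
// store-and-reload sequence. `volatile` forces the value through memory.
float difference_with_float_intermediate(float x, float y) {
    volatile float temp = x + y;
    return temp - x;
}

// Keeps the intermediate sum in a wider type, standing in for a wider
// x87 register (double is only 64 bits, so this is an analogy).
float difference_with_wide_intermediate(float x, float y) {
    const double temp = static_cast<double>(x) + static_cast<double>(y);
    return static_cast<float>(temp - static_cast<double>(x));
}
```

<p>
With <code>x = 1e8f</code> and <code>y = 1.0f</code>, the first function returns 0 because 100000001 is not representable as a float and the sum rounds back to 100000000, while the second returns 1.
</p>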

<p>
Tiny differences in precision don&#8217;t matter in IMVU, so enabling /fp:fast saved 50-100 CPU cycles per vertex in our vertex lighting loop.  However, with this option turned on, our automated tests started failing with crazy #IND and #QNAN errors!
</p>

<p>
After some investigation, we discovered that our 4&#215;4 matrix inversion routine (which calculates several 2&#215;2 and 3&#215;3 determinants) was using all 8 floating point registers with /fp:fast enabled.  The x87 registers are stored in a &#8220;<a href="http://www.website.masmforum.com/tutorials/fptute/fpuchap1.htm">stack</a>&#8221;, where ST0 is the top of the stack and STi is the i&#8217;th entry.  Load operations like fld, fld1, and fldz push entries on the stack.  Arithmetic operations like fadd and fmul operate on the top of the stack with the value in memory, storing the result back on the stack.
</p>

<p>
But what if the x87 register stack overflows?  In this case, an <a href="http://www.website.masmforum.com/tutorials/fptute/fpuchap2.htm#indefini">&#8220;indefinite&#8221; NAN</a> is loaded instead of the value you requested, indicating that you have lost information.  (The data at the bottom of the stack was lost.)  Here&#8217;s an example:
</p>

<pre>
fldz  ; ST0 = 0
fld1  ; ST0 = 1, ST1 = 0
fldpi ; ST0 = pi, ST1 = 1, ST2 = 0
fldz
fldz
fldz
fldz
fldz  ; ST0-4 = 0, ST5 = pi, ST6 = 1, ST7 = 0
fldz  ; ST0 = IND!
</pre>

<p>
Whoops, there&#8217;s a bug in your code!  You shouldn&#8217;t overflow the x87 register stack, so the processor has given you IND.
</p>
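<p>
The overflow behavior described above can be mimicked with a toy model in C++.  This is a sketch, not an FPU emulator: the class name <code>X87StackModel</code> is made up, it tracks only how many of the eight slots are in use, and it substitutes a generic quiet NaN for the specific "indefinite" bit pattern the real hardware produces.
</p>

```cpp
#include <array>
#include <cstddef>
#include <limits>

// A toy model of the eight-entry x87 register stack. It only tracks
// which slots are occupied; it is not a faithful FPU emulator.
class X87StackModel {
public:
    // Models a load such as fld/fldz/fld1: pushes a value. If all eight
    // slots are already in use, the new top of stack becomes NaN instead.
    void push(double value) {
        const bool overflow = (depth_ == kSlots);
        // Every entry moves one slot away from the top; on overflow the
        // entry at the bottom of the stack is lost.
        for (std::size_t i = kSlots - 1; i > 0; --i) {
            slots_[i] = slots_[i - 1];
        }
        if (overflow) {
            // Generic quiet NaN standing in for the "indefinite" value.
            slots_[0] = std::numeric_limits<double>::quiet_NaN();
        } else {
            slots_[0] = value;
            ++depth_;
        }
    }

    // Returns ST(i), where ST(0) is the top of the stack.
    double st(std::size_t i) const { return slots_[i]; }

    // Number of slots currently in use (0 through 8).
    std::size_t depth() const { return depth_; }

private:
    static constexpr std::size_t kSlots = 8;
    std::array<double, kSlots> slots_{};
    std::size_t depth_ = 0;
};
```

<p>
Pushing eight values fills the model; a ninth <code>push</code> leaves a NaN in <code>st(0)</code>, just as the ninth load above produces IND.
</p>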

<p>
Indeed, this is what happened in our matrix inversion routine.  But why?
</p>

<p>
Using a debugger, we determined that the x87 stack contained one value at the start of the function.  Moreover, it contained a value at the start of the test!  Something was fishy.  Somebody was leaving the x87 stack dirty, and we needed to find out who.
</p>

<pre>
void verify_x87_stack_empty() {
    unsigned z[8];
    __asm {
        fldz
        fldz
        fldz
        fldz
        fldz
        fldz
        fldz
        fldz
        fstp dword ptr [z+0x00]
        fstp dword ptr [z+0x04]
        fstp dword ptr [z+0x08]
        fstp dword ptr [z+0x0c]
        fstp dword ptr [z+0x10]
        fstp dword ptr [z+0x14]
        fstp dword ptr [z+0x18]
        fstp dword ptr [z+0x1c]
    }

    // Verify bit patterns. 0 = 0.0
    for (unsigned i = 0; i &lt; 8; ++i) {
        CHECK_EQUAL(z[i], 0);
    }
}
</pre>

<p>
The previous function, called before and after every test, discovered the culprit: we had a test that intentionally called printf() and frexp() with NaN values, which had the side effect of leaving the floating point stack in an unpredictable state.
</p>

<p>
Adding <code>__asm emms</code> to the end of the test fixed our problem: thereafter, /fp:fast worked wonderfully.  Case closed.
</p>]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2009/02/ind-and-qnan-with-fpfast/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Did you know&#8230;</title>
		<link>http://chadaustin.me/2005/02/did-you-know/</link>
		<comments>http://chadaustin.me/2005/02/did-you-know/#comments</comments>
		<pubDate>Tue, 08 Feb 2005 22:26:00 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://aegisknight.org/new/2005/02/08/did-you-know/</guid>
		<description><![CDATA[Did you know that floating point adding to infinity on the Pentium 4 takes 850 clock cycles?  Me either.

Interesting papers by Bruce Dawson.

]]></description>
			<content:encoded><![CDATA[<p>Did you know that floating point adding to infinity on the Pentium 4 takes <b><i>850 clock cycles</i></b>?  Me either.</p>

<p><a href="http://www.cygnus-software.com/papers/index.html">Interesting papers by Bruce Dawson</a>.</p>

]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2005/02/did-you-know/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
