<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Chad Austin &#187; performance</title>
	<atom:link href="http://chadaustin.me/tag/performance/feed/" rel="self" type="application/rss+xml" />
	<link>http://chadaustin.me</link>
	<description></description>
	<lastBuildDate>Mon, 07 Nov 2011 21:24:20 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Digging into JavaScript Performance, Part 2</title>
		<link>http://chadaustin.me/2011/11/digging-into-javascript-performance-part-2/</link>
		<comments>http://chadaustin.me/2011/11/digging-into-javascript-performance-part-2/#comments</comments>
		<pubDate>Sun, 06 Nov 2011 08:04:37 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[flash]]></category>
		<category><![CDATA[games]]></category>
		<category><![CDATA[javascript]]></category>
		<category><![CDATA[nativeclient]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://chadaustin.me/?p=1687</guid>
		<description><![CDATA[UPDATE.  After I posted these numbers, Alon Zakai, Emscripten&#8217;s author, pointed out options for generating optimized JavaScript.  I reran my benchmarks; check out the updated table below and the script used to generate the new results.

At the beginning of the year, I tried to justify my claim that JavaScript has a long way [...]]]></description>
			<content:encoded><![CDATA[<p><strong>UPDATE</strong>.  After I posted these numbers, Alon Zakai, Emscripten&#8217;s author, <a href="https://gist.github.com/1343182">pointed out</a> options for generating optimized JavaScript.  I reran my benchmarks; check out the updated table below and the <a href="https://github.com/chadaustin/Web-Benchmarks/blob/master/optimized_emscripten.sh">script</a> used to generate the <a href="https://github.com/chadaustin/Web-Benchmarks/blob/master/README">new results</a>.</p>

<p>At the beginning of the year, <a href="http://chadaustin.me/2011/01/digging-into-javascript-performance/">I tried to justify</a> my claim that JavaScript has a long way to go before it can compete with the performance of native code.</p>

<p>Well, 10 months have passed.  WebGL is catching on, Native Client has been launched, <a href="http://www.anandtech.com/show/4933/flash-11-supports-unreal-engine-3">Unreal Engine 3 targets Flash 11</a>, and Crytek has announced they might target Flash 11 too.  Exciting times!</p>

<p>On the GPU front, we&#8217;re in a good place.  With WebGL, iOS, and Flash 11 all roughly exposing shader model 2.0, it&#8217;s not a ton of work to target all of the above.  Even on the desktop you can&#8217;t assume higher than shader model 2.0: the Intel GMA 950 is <a href="http://unity3d.com/webplayer/hwstats/pages/web-2011Q3-gfxcard.html">still at the top</a>.</p>

<p>However, shader model 2.0 isn&#8217;t general enough to offload all of your compute-intensive workloads to the GPU.  With 16 vertex attributes and no vertex texture fetch, you simply can&#8217;t get enough data into your vertex shaders to do everything you need, e.g. blending morph targets.</p>

<p>Thus, for the foreseeable future, we&#8217;ll need to write fast CPU code that can run on the web, mobile devices, and the desktop.  Today, that means at least JavaScript and a native language like C++.  And, because Microsoft has not implemented WebGL, the Firefox and Chrome WebGL blacklists are so strict, and no major browsers fall back on software, you probably care about targeting Flash 11 too.  (It does have a software fallback!)  If you care about Flash 11, then your code had better target ActionScript 3 / AVM2 too.</p>

<p>How can we target native platforms, the web, and Flash at the same time?</p>

<p>Native platforms are easy: C++ is well-supported on Windows, Mac, iOS, and Android. SSE2 is ubiquitous on x86, ARM NEON is widely available, and both have high-quality intrinsics-based implementations.</p>

<p>As for Flash&#8230;  I&#8217;m just counting on <a href="http://blogs.adobe.com/flashplayer/2011/09/updates-from-the-lab.html">Adobe Alchemy</a> to ship.</p>

<p>On the web, you have two choices: write your code in C++ and cross-compile it to JavaScript with <a href="https://github.com/kripken/emscripten">Emscripten</a>, or write it in JavaScript and run it on the browser&#8217;s native JavaScript engine.  Ideally, cross-compiling C++ to JS via Emscripten would be as fast as writing your code in JavaScript.  If it is, then targeting all platforms is easy: just use C++ and the browsers will do as well as they would with native JavaScript.</p>

<p>Over the last two evenings, while weathering a dust storm, I set about updating my skeletal animation benchmark results: for math-heavy code, how does JavaScript compare to C++ today?  And how does Emscripten compare to hand-written JavaScript?</p>

<p>If you&#8217;d like, take a look at the <a href="https://github.com/chadaustin/Web-Benchmarks/blob/master/README">raw results</a>.</p>

<table>
<tr><th>Language</th><th>Compiler</th><th>Variant</th><th>Vertex Rate</th><th>Slowdown</th></tr>
<tr><td>C++</td><td>clang 2.9</td><td>SSE</td><td>101580000</td><td>1</td></tr>
<tr><td>C++</td><td>gcc 4.2</td><td>SSE</td><td>96420454</td><td>1.05</td></tr>
<tr><td>C++</td><td>gcc 4.2</td><td>scalar</td><td>63355501</td><td>1.6</td></tr>
<tr><td>C++</td><td>clang 2.9</td><td>scalar</td><td>62928175</td><td>1.61</td></tr>
<tr><td>JavaScript</td><td>Chrome 15</td><td>untyped</td><td>10210000</td><td>9.95</td></tr>
<tr><td>JavaScript</td><td>Firefox 7</td><td>typed arrays</td><td>8401598</td><td>12.1</td></tr>
<tr><td>JavaScript</td><td>Chrome 15</td><td>typed arrays</td><td>5790000</td><td>17.5</td></tr>
<tr><td>Emscripten</td><td>Chrome 15</td><td>scalar</td><td>5184815</td><td>19.6</td></tr>
<tr><td>JavaScript</td><td>Firefox 7</td><td>untyped</td><td>5104895</td><td>19.9</td></tr>
<tr><td>JavaScript</td><td>Firefox 9a2</td><td>untyped</td><td>2005988</td><td>50.6</td></tr>
<tr><td>JavaScript</td><td>Firefox 9a2</td><td>typed arrays</td><td>1932271</td><td>52.6</td></tr>
<tr><td>Emscripten</td><td>Firefox 9a2</td><td>scalar</td><td>734126</td><td>138</td></tr>
<tr><td>Emscripten</td><td>Firefox 7</td><td>scalar</td><td>729270</td><td>139</td></tr>
</table>
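
<p>The Slowdown column appears to be the fastest vertex rate (clang 2.9 with SSE) divided by each row&#8217;s vertex rate.  As a minimal sketch of that arithmetic in C++ (the helper name slowdown is made up for illustration and is not part of the benchmark code):</p>

<pre>
// Hypothetical helper: how many times slower a configuration is than the
// baseline, given both vertex rates in vertices per second.
double slowdown(double baselineVertexRate, double vertexRate) {
    return baselineVertexRate / vertexRate;
}
</pre>

<p>For example, slowdown(101580000, 10210000) comes out at roughly 9.95, which matches the Chrome 15 untyped row.</p>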

<p>Conclusions?</p>

<ul>
<li>JavaScript is still a factor of 10-20 away from well-written native code. Adding SIMD support to JavaScript will help, but obviously that&#8217;s not the whole story&#8230;</li>
<li>It&#8217;s bizarre that Chrome and Firefox disagree on whether typed arrays are faster than untyped arrays.</li>
<li>Firefox 9 clearly has performance issues that need to be worked out.  I included it because I wanted to benchmark its type inference capabilities.</li>
<li><del>Emscripten&#8230; ouch :( I wish it were even comparable to hand-written JavaScript, but it&#8217;s another factor of 10-20 slower&#8230;</del></li>
<li>Emscripten on Chrome 15 is within a factor of two of hand-written JavaScript.  I think that means you can target all platforms with C++, because hand-written JavaScript won&#8217;t be that much faster than cross-compiled C++.</li>
<li>Emscripten on Firefox 7 and 9 still has issues, but Alon Zakai informs me that the trunk version of SpiderMonkey is much faster.</li>
</ul>

<p>In the future, I&#8217;d love to run the same test on Flash 11 / Alchemy and Native Client but the former hasn&#8217;t shipped and the latter remains a small market.</p>

<p>One final note: it&#8217;s very possible my test methodology is screwed up, my benchmarks are wrong, or I suck at copy/pasting numbers.  Science should be reproducible: please try to reproduce these results yourself!</p>]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2011/11/digging-into-javascript-performance-part-2/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>How to Write an Interactive, 60 Hz Desktop Application</title>
		<link>http://chadaustin.me/2010/11/how-to-write-an-interactive-60-hz-desktop-application/</link>
		<comments>http://chadaustin.me/2010/11/how-to-write-an-interactive-60-hz-desktop-application/#comments</comments>
		<pubDate>Wed, 24 Nov 2010 11:17:28 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[imvu]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://chadaustin.me/?p=1592</guid>
		<description><![CDATA[This post is available on the IMVU Engineering Blog.

IMVU&#8217;s client application doesn&#8217;t fit neatly into a single development paradigm:


IMVU is a Windows desktop application.  Mouse clicks, window resizes, and dialog boxes must all respond with imperceptible latency.  Running IMVU should not significantly affect laptop battery life.
IMVU is an interactive 3D game.  The [...]]]></description>
			<content:encoded><![CDATA[<p>This post is available on the <a href="http://engineering.imvu.com/2010/11/24/how-to-write-an-interactive-60-hz-desktop-application/">IMVU Engineering Blog</a>.</p>

<p>IMVU&#8217;s client application doesn&#8217;t fit neatly into a single development paradigm:</p>

<ul>
<li>IMVU is a Windows desktop application.  Mouse clicks, window resizes, and dialog boxes must all respond with imperceptible latency.  Running IMVU should not significantly affect laptop battery life.</li>
<li>IMVU is an interactive 3D game.  The 3D scene must be simulated and drawn at smooth, interactive frame rates, 60 Hz if possible.</li>
<li>IMVU is a networked application.  Sending and receiving network packets must happen quickly and the UI should never have to wait for I/O.</li>
</ul>

<p>Thus, let us clarify some specific requirements:</p>

<ul>
<li>Minimal CPU usage (and thus battery consumption) when the application is minimized or obscured.</li>
<li>Minimal CPU usage in low-complexity scenes.  Unlike most games, IMVU must never unnecessarily consume battery life while waiting in spin loops.</li>
<li>Animation must continue while modal dialog boxes and menus are visible.  You don&#8217;t have control over these modal event loops, but it looks terrible if animation pauses while menus and dialogs are visible.</li>
<li>Animation must be accurate and precise.  It looks much better if every frame takes 22 milliseconds (45 Hz) than if some frames take 30 milliseconds and some take 15 milliseconds (averaging 45 Hz).</li>
<li>Animation must degrade gracefully.  In a really complex room with a dozen avatars, IMVU can easily spend all of a core&#8217;s CPU trying to animate the scene.  In this case, the frame rate should gradually drop while the application remains responsive to mouse clicks and other input events.</li>
<li>Support for Windows XP, Vista, and 7.</li>
</ul>

<h2>Naive Approach #1</h2>

<p>Windows applications typically have a main loop that looks something like:</p>

<pre>
MSG msg;
while (GetMessage(&amp;msg, 0, 0, 0) &gt; 0) {
    TranslateMessage(&amp;msg);
    DispatchMessage(&amp;msg);
}
</pre>

<h3>What went wrong</h3>

<p>Using <a href="http://msdn.microsoft.com/en-us/library/ms644906(VS.85).aspx">SetTimer/WM_TIMER</a> sounds like a good idea for simulation and painting, but it&#8217;s way <a href="http://www.virtualdub.org/blog/pivot/entry.php?id=272">too imprecise</a> for interactive applications.</p>

<h2>Naive Approach #2</h2>

<p>Games typically have a main loop that looks something like the following:</p>

<pre>
while (running) {
    // process input events
    MSG msg;
    while (PeekMessage(&amp;msg, 0, 0, 0, PM_REMOVE)) {
        TranslateMessage(&amp;msg);
        DispatchMessage(&amp;msg);
    }

    if (frame_interval_has_elapsed) {
        simulate_world();
        paint();
    }
}
</pre>

<h3>What went wrong</h3>

<p>The above loop never sleeps, draining the user&#8217;s battery and burning her legs.</p>

<h2>Clever Approach #1: Standard Event Loop + timeSetEvent</h2>

<pre>
void runMainLoop() {
    MSG msg;
    while (GetMessage(&amp;msg, 0, 0, 0) &gt; 0) {
        TranslateMessage(&amp;msg);
        DispatchMessage(&amp;msg);
    }
}

void customWindowProc(...) {
    if (message == timerMessage) {
        simulate();
        // schedules paint with InvalidateRect
    }
}

void CALLBACK TimerProc(UINT, UINT, DWORD_PTR, DWORD_PTR, DWORD_PTR) {
    if (0 == InterlockedExchange(&amp;inFlight, 1)) {
        PostMessage(frameTimerWindow, timerMessage, 0, 0);
    }
}

void startFrameTimer() {
    RegisterClass(customWindowProc, ...);
    frameTimerWindow = CreateWindow(...);
    timeSetEvent(FRAME_INTERVAL, 0, &amp;TimerProc, 0, TIME_PERIODIC);
}
</pre>

<h3>What went wrong</h3>

<p>The main loop&#8217;s GetMessage call always returns messages in a priority order.  Slightly oversimplified, posted messages come first, then WM_PAINT messages, then WM_TIMER.  Since timerMessage is a normal message, it will preempt any scheduled paints.  This would be fine for us, since simulations are cheap, but the dealbreaker is that if we fail to maintain frame rate, WM_TIMER messages are entirely starved.  This violates our graceful degradation requirement.  When frame rate begins to degrade, code dependent on WM_TIMER shouldn&#8217;t stop entirely.</p>

<p>Even worse, the modal dialog loop has a freaky historic detail.  It waits for the message queue to be empty <a href="http://blogs.msdn.com/b/oldnewthing/archive/2004/03/11/87941.aspx">before displaying modal dialogs</a>.  When painting can&#8217;t keep up, modal dialogs simply don&#8217;t appear.</p>

<p>We tried a bunch of variations, setting flags when stepping or painting, but they all had critical flaws.  Some continued to starve timers and dialog boxes and some degraded by ping-ponging between 30 Hz and 15 Hz, which looked terrible.</p>

<h2>Clever Approach #2: PostThreadMessage + WM_ENTERIDLE</h2>

<p>A standard message loop didn&#8217;t seem to be getting us anywhere, so we changed our timeSetEvent callback to post a custom message to the main loop with PostThreadMessage; the main loop knew how to handle it.  Messages posted via PostThreadMessage don&#8217;t go to a window, so the event loop needs to process them directly.  Since DialogBox and TrackPopupMenu modal loops won&#8217;t understand this custom message, we fell back on a different mechanism for them.</p>

<p>DialogBox and TrackPopupMenu send WM_ENTERIDLE to their owning windows.  Any window in IMVU that can host a dialog box or popup menu handles WM_ENTERIDLE by notifying a global idle handler, which can decide to schedule a new frame immediately or in N milliseconds, depending on how much time has elapsed.</p>

<h3>What went wrong</h3>

<p>So close!  In our testing under realistic workloads, timeSetEvent had horrible pauses and jitter.  Sometimes the multimedia thread would go 250 ms between notifications.  Otherwise, the custom event loop + WM_ENTERIDLE approach seemed sound.  I tried timeSetEvent with several flags, but they all had accuracy and precision problems.</p>

<h2>What Finally Worked</h2>

<p>Finally, we settled on MsgWaitForMultipleObjects with a calculated timeout.</p>

<p>Assuming the existence of a FrameTimeoutCalculator object which returns the number of milliseconds until the next frame:</p>

<pre>
int runApp() {
    FrameTimeoutCalculator ftc;

    for (;;) {
        const DWORD timeout = ftc.getTimeout();
        DWORD result = (timeout
            ? MsgWaitForMultipleObjects(0, 0, TRUE, timeout, QS_ALLEVENTS)
            : WAIT_TIMEOUT);
        if (result == WAIT_TIMEOUT) {
            simulate();
            ftc.step();
        }

        MSG msg;
        while (PeekMessage(&amp;msg, 0, 0, 0, PM_REMOVE)) {
            if (msg.message == WM_QUIT) {
                return static_cast&lt;int&gt;(msg.wParam);
            }

            TranslateMessage(&amp;msg);
            DispatchMessage(&amp;msg);
        }
    }
}
</pre>
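
<p>The FrameTimeoutCalculator itself is not shown in this post.  One way it might look, as a hedged sketch in portable C++: everything below, including the constructor parameters and the catch-up policy in step, is an assumption rather than IMVU&#8217;s actual implementation.</p>

<pre>
#include &lt;chrono&gt;
#include &lt;cstdint&gt;

// Hypothetical sketch; the real IMVU class is not shown in this post.
class FrameTimeoutCalculator {
public:
    typedef std::chrono::steady_clock Clock;

    // The default interval approximates 60 Hz.  The start parameter exists
    // only so the class can be exercised with a fake clock.
    explicit FrameTimeoutCalculator(
        Clock::duration interval = std::chrono::microseconds(16667),
        Clock::time_point start = Clock::now())
        : interval_(interval), nextFrame_(start + interval) {}

    // Whole milliseconds until the next frame is due; 0 means "simulate now".
    std::uint32_t getTimeout(Clock::time_point now = Clock::now()) const {
        if (now &gt;= nextFrame_) {
            return 0;
        }
        return static_cast&lt;std::uint32_t&gt;(
            std::chrono::duration_cast&lt;std::chrono::milliseconds&gt;(nextFrame_ - now).count());
    }

    // Call after simulating a frame to schedule the following one.
    void step(Clock::time_point now = Clock::now()) {
        nextFrame_ += interval_;
        if (nextFrame_ &lt; now) {
            // We fell behind: resume from now rather than bursting to catch up.
            nextFrame_ = now;
        }
    }

private:
    Clock::duration interval_;
    Clock::time_point nextFrame_;
};
</pre>

<p>Scheduling against an absolute deadline, rather than sleeping a fixed interval after each frame, keeps frame pacing steady when simulation time varies.  Resetting the deadline when the loop falls behind is one possible policy for letting the frame rate drop smoothly.</p>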

<h3>Well, what about modal dialogs?</h3>

<p>Since we rely on a custom message loop to animate 3D scenes, how do we handle standard message loops such as the modal DialogBox and TrackPopupMenu calls?  Fortunately, DialogBox and TrackPopupMenu provide us with the hook required to implement frame updates: <a href="http://msdn.microsoft.com/en-us/library/ms645422(VS.85).aspx">WM_ENTERIDLE</a>.</p>

<p>When the standard DialogBox and TrackPopupMenu modal message loops go idle, they send their parent window a WM_ENTERIDLE message.  Upon receiving WM_ENTERIDLE, the parent window determines whether it&#8217;s time to render a new frame.  If so, we animate all visible 3D windows, which will trigger a WM_PAINT, which triggers a subsequent WM_ENTERIDLE.</p>

<p>On the other hand, if it&#8217;s not time to render a new frame, we call timeSetEvent with TIME_ONESHOT to schedule a frame update in the future.</p>

<p>As we saw previously, timeSetEvent isn&#8217;t as reliable as a custom loop using MsgWaitForMultipleObjects, but if a modal dialog or popup menu is visible, the user probably isn&#8217;t paying very close attention anyway.  All that matters is that the UI remains responsive and animation continues while modal loops are open.  Code follows:</p>

<pre>
LRESULT CALLBACK ModalFrameSchedulerWndProc(HWND hwnd, UINT message, WPARAM wparam, LPARAM lparam) {
    if (message == idleMessage) {
        stepFrame();
    }
    return DefWindowProc(hwnd, message, wparam, lparam);
}

struct AlmostMSG {
    HWND hwnd;
    UINT message;
    WPARAM wparam;
    LPARAM lparam;
};

void CALLBACK timeForPost(UINT, UINT, DWORD_PTR user_data, DWORD_PTR, DWORD_PTR) {
    AlmostMSG* msg = reinterpret_cast&lt;AlmostMSG*&gt;(user_data);
    PostMessage(msg-&gt;hwnd, msg-&gt;message, msg-&gt;wparam, msg-&gt;lparam);
    delete msg;
}

void PostMessageIn(DWORD timeout, HWND hwnd, UINT message, WPARAM wparam, LPARAM lparam) {
    if (timeout) {
        AlmostMSG* msg = new AlmostMSG;
        msg-&gt;hwnd = hwnd;
        msg-&gt;message = message;
        msg-&gt;wparam = wparam;
        msg-&gt;lparam = lparam;
        timeSetEvent(timeout, 1, timeForPost, reinterpret_cast&lt;DWORD_PTR&gt;(msg), TIME_ONESHOT | TIME_CALLBACK_FUNCTION);
    } else {
        PostMessage(hwnd, message, wparam, lparam);
    }
}

class ModalFrameScheduler : public IFrameListener {
public:
    ModalFrameScheduler() { stepping = false; }

    // Call when WM_ENTERIDLE is received.
    void onIdle() {
        if (!frameListenerWindow) {
            idleMessage = RegisterWindowMessageW(L"IMVU_ScheduleFrame");
            Assert(idleMessage);

            WNDCLASSW wc;
            ZeroMemory(&amp;wc, sizeof(wc));
            wc.hInstance = GetModuleHandle(0);
            wc.lpfnWndProc = ModalFrameSchedulerWndProc;
            wc.lpszClassName = L"IMVUModalFrameScheduler";
            RegisterClassW(&amp;wc);

            frameListenerWindow = CreateWindowW(
                L"IMVUModalFrameScheduler",
                L"IMVUModalFrameScheduler",
                0, 0, 0, 0, 0, 0, 0,
                GetModuleHandle(0), 0);
            Assert(frameListenerWindow);
        }

        if (!stepping) {
            const unsigned timeout = ftc.getTimeout();
            stepping = true;
            PostMessageIn(timeout, frameListenerWindow, idleMessage, 0, 0);
            ftc.step();
        }
    }
    void step() { stepping = false; }

private:
    bool stepping;
    FrameTimeoutCalculator ftc;
};
</pre>

<h2>How has it worked out?</h2>

<p>A custom message loop and WM_ENTERIDLE neatly satisfy all of the goals we laid out:</p>

<ul>
<li>No unnecessary polling, and thus increased battery life and performance.</li>
<li>When possible, the 3D windows animate at 60 Hz.</li>
<li>Even degradation.  If painting a frame takes 40 ms, the frame rate will drop from 60 Hz to 25 Hz, not from 60 Hz to 15 Hz, as some of the implementations did.</li>
<li>Animation continues to play, even while modal dialogs and popup menus are visible.</li>
<li>This code runs well on XP, Vista, and Windows 7.</li>
</ul>
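
<p>The even-degradation figure is simple arithmetic: once painting takes longer than the target frame interval, each frame takes exactly as long as the work does.  A minimal model in C++, assuming the loop never waits when it is already behind (the function name is made up for illustration):</p>

<pre>
#include &lt;algorithm&gt;

// Hypothetical model: frame rate in Hz when each frame needs paintMs of work
// and the loop targets one frame every intervalMs.
double effectiveFrameRate(double paintMs, double intervalMs) {
    return 1000.0 / std::max(paintMs, intervalMs);
}
</pre>

<p>With a 40 ms paint and a 60 Hz target (an interval of 1000.0 / 60.0 ms), this gives 25 Hz.  With a 5 ms paint it stays at 60 Hz.</p>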
]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2010/11/how-to-write-an-interactive-60-hz-desktop-application/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Efficiently Rendering Flash in a 3D Scene</title>
		<link>http://chadaustin.me/2010/07/efficiently-rendering-flash-in-a-3d-scene/</link>
		<comments>http://chadaustin.me/2010/07/efficiently-rendering-flash-in-a-3d-scene/#comments</comments>
		<pubDate>Thu, 29 Jul 2010 09:12:38 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[flash]]></category>
		<category><![CDATA[imvu]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://chadaustin.me/?p=1561</guid>
		<description><![CDATA[The original source of this post is at the IMVU engineering blog.  Subscribe now!

Last time, I talked about how to embed Flash into your desktop application, for UI flexibility and development speed.  This time, I&#8217;ll discuss efficient rendering into a 3D scene.

Rendering Flash as a 3D Overlay (The Naive Way)

At first blush, rendering [...]]]></description>
			<content:encoded><![CDATA[<p>The original source of this post is at the <a href="http://engineering.imvu.com/2010/07/29/efficiently-rendering-flash-in-a-3d-scene/">IMVU engineering blog</a>.  <a href="http://engineering.imvu.com">Subscribe now!</a></p>

<p><a href="http://chadaustin.me/2010/07/how-to-embed-flash-into-your-3d-application/">Last time</a>, I talked about how to embed Flash into your desktop application, for UI flexibility and development speed.  This time, I&#8217;ll discuss efficient rendering into a 3D scene.</p>

<h2>Rendering Flash as a 3D Overlay (The Naive Way)</h2>

<p>At first blush, rendering Flash on top of a 3D scene sounds easy.  Every frame:</p>

<ol>
<li>Create a <a href="http://msdn.microsoft.com/en-us/library/dd183494(VS.85).aspx">DIB section</a> the size of your 3D viewport</li>
<li>Render Flash into the DIB section with <a href="http://msdn.microsoft.com/en-us/library/ms688655(VS.85).aspx">IViewObject::Draw</a></li>
<li>Copy the DIB section into an <a href="http://msdn.microsoft.com/en-us/library/bb205909(VS.85).aspx">IDirect3DTexture9</a></li>
<li>Render the texture on the top of the scene</li>
</ol>

<div id="attachment_1562" class="wp-caption aligncenter" style="width: 257px"><a href="http://aegisknight.org/wp-uploads/Flash-Rendering.png"><img src="http://aegisknight.org/wp-uploads/Flash-Rendering.png" alt="" title="Naive Flash Rendering" width="247" height="524" class="size-full wp-image-1562" /></a><p class="wp-caption-text">Naive Flash Rendering</p></div>

<p>Ta da!  But your frame rate dropped to 2 frames per second?  Ouch.  It turns out this implementation is horribly slow.  There are a few reasons.</p>

<p>First, asking the Adobe Flash player to render into a DIB isn&#8217;t a cheap operation.  In our measurements, drawing even a simple SWF takes on the order of 10 milliseconds.  Since most UI doesn&#8217;t animate every frame, we should be able to cache the captured framebuffer.</p>

<p>Second, main memory and graphics memory are on different components in your computer.  You want to avoid wasting time and bus traffic by unnecessarily copying data from the CPU to the GPU every frame.  If only the lower-right corner of a SWF changes, we should limit our memory copies to that region.</p>

<p>Third, modern GPUs are fast, but not everyone has them.  Let&#8217;s say you have a giant mostly-empty SWF and want to render it on top of your 3D scene.  On slower GPUs, it would be ideal if you could limit your texture draws to the regions of the screen that are non-transparent.</p>
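
<p>To illustrate the second point, here is a hedged sketch in portable C++ of copying only a changed rectangle between two pixel buffers.  The function name, parameters, and the 4-bytes-per-pixel layout are assumptions for illustration; in a real renderer the destination would be locked texture memory rather than a plain array.</p>

<pre>
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;cstring&gt;

// Copies only the pixels inside [left, right) x [top, bottom) from src to dst.
// Pitches are the byte lengths of one full row; pixels are assumed to be
// 4 bytes each, and both buffers share the same origin.
void copyDirtyRect(const std::uint8_t* src, std::size_t srcPitch,
                   std::uint8_t* dst, std::size_t dstPitch,
                   int left, int top, int right, int bottom) {
    if (left &gt;= right || top &gt;= bottom) {
        return;  // nothing changed, so nothing to copy
    }
    const std::size_t bytesPerPixel = 4;
    const std::size_t xOffset = static_cast&lt;std::size_t&gt;(left) * bytesPerPixel;
    const std::size_t rowBytes = static_cast&lt;std::size_t&gt;(right - left) * bytesPerPixel;
    for (int y = top; y &lt; bottom; ++y) {
        const std::size_t row = static_cast&lt;std::size_t&gt;(y);
        std::memcpy(dst + row * dstPitch + xOffset,
                    src + row * srcPitch + xOffset,
                    rowBytes);
    }
}
</pre>

<p>The cost of the copy scales with the area of the changed rectangle instead of the whole viewport, which is the saving the second point is after.</p>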

<h2>Rendering Flash as a 3D Overlay (The Fast Way)</h2>

<p>Disclaimer: I can&#8217;t take credit for these algorithms.  They were jointly developed over years by many smart engineers at IMVU.</p>

<p>First, let&#8217;s reduce an embedded Flash player to its principles:</p>

<ul>
<li>Flash exposes an IShockwaveFlash interface through which you can load and play movies.</li>
<li>Flash maintains its own frame buffer.  You can read these pixels with IViewObject::Draw.</li>
<li>When a SWF updates regions of the frame buffer, it notifies you through IOleInPlaceSiteWindowless::InvalidateRect.</li>
</ul>

<p>In addition, we&#8217;d like the Flash overlay system to fit within these performance constraints:</p>

<ul>
<li>Each SWF is rendered over the entire window.  For example, implementing a ball that bounces around the screen or a draggable UI component should not require any special IMVU APIs.</li>
<li>If a SWF is not animating, we do not copy its pixels to the GPU every frame.</li>
<li>We do not render the overlay in transparent regions.  That is, if no Flash content is visible, rendering is free.</li>
<li>Memory consumption (ignoring memory used by individual SWFs) for the overlay is O(framebuffer), not O(framebuffer * SWFs).  That is, loading three SWFs should not require allocation of three screen-sized textures.</li>
<li>If Flash notifies of multiple changed regions per frame, only call IViewObject::Draw once.</li>
</ul>
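
<p>The last constraint implies accumulating invalidated rectangles and drawing once per frame.  A minimal sketch in C++ of one way to do that, using a single bounding box as the dirty region; the DirtyRect and DirtyRegion types are hypothetical and not IMVU&#8217;s actual data structures:</p>

<pre>
#include &lt;algorithm&gt;

// Hypothetical rectangle: left/top inclusive, right/bottom exclusive.
struct DirtyRect {
    int left, top, right, bottom;
    bool empty() const { return left &gt;= right || top &gt;= bottom; }
};

// Collects every rectangle invalidated during a frame into one bounding box,
// so the frame needs at most one draw call covering the union.
class DirtyRegion {
public:
    DirtyRegion() { clear(); }

    // Call from the invalidation callback for each changed rectangle.
    void add(const DirtyRect&amp; r) {
        if (r.empty()) {
            return;
        }
        if (bounds_.empty()) {
            bounds_ = r;
            return;
        }
        bounds_.left = std::min(bounds_.left, r.left);
        bounds_.top = std::min(bounds_.top, r.top);
        bounds_.right = std::max(bounds_.right, r.right);
        bounds_.bottom = std::max(bounds_.bottom, r.bottom);
    }

    bool empty() const { return bounds_.empty(); }
    const DirtyRect&amp; bounds() const { return bounds_; }

    // Call once the frame has been drawn.
    void clear() {
        bounds_.left = bounds_.top = bounds_.right = bounds_.bottom = 0;
    }

private:
    DirtyRect bounds_;
};
</pre>

<p>A bounding box over-approximates the changed area when updates are far apart, but it keeps the bookkeeping trivial.  A list of rectangles would be tighter at the cost of more complexity.</p>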

<p>Without further ado, let&#8217;s look at the fast algorithm:</p>

<div id="attachment_1564" class="wp-caption aligncenter" style="width: 573px"><a href="http://aegisknight.org/wp-uploads/Fast-Flash-Rendering.png"><img src="http://aegisknight.org/wp-uploads/Fast-Flash-Rendering.png" alt="" title="Fast Flash Rendering" width="563" height="808" class="size-full wp-image-1564" /></a><p class="wp-caption-text">Fast Flash Rendering</p></div>

<p>Flash notifies us of visual changes via IOleInPlaceSiteWindowless::InvalidateRect.  We take any updated rectangles and add them to a per-frame dirty region.  When it&#8217;s time to render a frame, there are four possibilities:</p>

<ul>
<li>The dirty region is empty and the opaque region is empty.  This case is basically free, because nothing need be drawn.</li>

<li>The dirty region is empty and the opaque region is nonempty.  In this case, we just need to render our cached textures for the opaque regions of the screen.  This case is the most common.  Since a video memory blit is fast, there&#8217;s not much we could do to further speed it up.</li>

<li>The dirty region is nonempty.  We must IViewObject::Draw into our Overlay DIB, with one tricky bit.  Since we&#8217;re only storing one overlay texture, we need to render each loaded Flash overlay SWF into the DIB, not just the one that changed.  Imagine an animating SWF underneath another translucent SWF.  The top SWF must be composited with the bottom SWF&#8217;s updates.  After rendering each SWF, we scan the updated DIB for a minimalish opaque region.  Why not just render the dirty region?  Imagine a SWF with a bouncing ball.  If we naively rendered every dirty rectangle, eventually we&#8217;d be rendering the entire screen.  Scanning for minimal opaque regions enables recalculation of what&#8217;s actually visible.</li>

<li>The dirty region is nonempty, but the updated pixels are all transparent.  If this occurs, we no longer need to render anything at all until Flash content reappears.</li>
</ul>
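
<p>The scan for a minimal opaque region in the third and fourth cases might look something like the following C++ sketch.  It assumes tightly packed 32-bit pixels with alpha in the top byte and reduces the region to a single bounding box.  The type and function names are hypothetical, and the real implementation may well track several rectangles.</p>

<pre>
#include &lt;algorithm&gt;
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;vector&gt;

// Hypothetical result type: left/top inclusive, right/bottom exclusive.
struct OpaqueBounds {
    int left, top, right, bottom;
};

// Returns the bounding box of all pixels with nonzero alpha, or an all-zero
// box when every pixel is fully transparent.
OpaqueBounds findOpaqueBounds(const std::vector&lt;std::uint32_t&gt;&amp; pixels,
                              int width, int height) {
    OpaqueBounds b = {width, height, 0, 0};
    for (int y = 0; y &lt; height; ++y) {
        for (int x = 0; x &lt; width; ++x) {
            const std::size_t i = static_cast&lt;std::size_t&gt;(y) * width + x;
            if ((pixels[i] &gt;&gt; 24) != 0) {
                b.left = std::min(b.left, x);
                b.top = std::min(b.top, y);
                b.right = std::max(b.right, x + 1);
                b.bottom = std::max(b.bottom, y + 1);
            }
        }
    }
    if (b.left &gt;= b.right) {
        // No visible pixel was found: nothing needs to be rendered.
        const OpaqueBounds none = {0, 0, 0, 0};
        return none;
    }
    return b;
}
</pre>

<p>An all-zero result corresponds to the fourth case, where rendering can be skipped until Flash content reappears.  Any other result is the rectangle to draw from the cached texture.</p>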

<p>This algorithm has proven efficient.  It supports multiple overlapping SWFs while minimizing memory consumption and CPU/GPU draw calls per frame.  Until recently, we used Flash for several of our UI components, giving us a standard toolchain and a great deal of flexibility.  Flash was the bridge that took us from the dark ages of C++ UI code to UI on which we could actually iterate.</p>
]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2010/07/efficiently-rendering-flash-in-a-3d-scene/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to Embed Flash Into Your 3D Application</title>
		<link>http://chadaustin.me/2010/07/how-to-embed-flash-into-your-3d-application/</link>
		<comments>http://chadaustin.me/2010/07/how-to-embed-flash-into-your-3d-application/#comments</comments>
		<pubDate>Thu, 29 Jul 2010 08:52:16 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[flash]]></category>
		<category><![CDATA[imvu]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://chadaustin.me/?p=1550</guid>
		<description><![CDATA[The original source of this post is at the IMVU engineering blog.  Subscribe now!

[I wrote this post last year when IMVU still used Flash for a significant portion of our UI. Even though we now embed Gecko, I believe embedding Flash is still valuable.]

Writing user interfaces is hard.  Writing usable interfaces is harder. [...]]]></description>
			<content:encoded><![CDATA[<p>The original source of this post is at the <a href="http://engineering.imvu.com/2010/07/29/how-to-embed-flash-into-your-3d-application/">IMVU engineering blog</a>.  <a href="http://engineering.imvu.com">Subscribe now!</a></p>

<p><em>[I wrote this post last year when IMVU still used Flash for a significant portion of our UI. Even though we now embed Gecko, I believe embedding Flash is still valuable.]</em></p>

<p>Writing user interfaces is hard.  Writing usable interfaces is harder.  Yet, the design of your interface <em>is your product</em>.</p>

<p>Products are living entities.  They always want to grow, adapting to their users as users adapt to them.  In that light, why build your user interface in a static technology like C++ or Java?  It won&#8217;t be perfect the first time you build it, so prepare for change.</p>

<p>IMVU employs two technologies for rapidly iterating on and refining our client UIs: Flash and Gecko/HTML.  Sure, integrating these technologies has a sizable up-front cost, but the iteration speed they provide easily pays for them.  Rapid iteration has some obvious benefits:</p>

<ol>
<li>reduces development cost</li>
<li>reduces time to market</li>
</ol>

<p>and some less-obvious benefits:</p>

<ol>
<li>better product/market fit: when you can change your UI, you will.</li>
<li>improved product quality: little details distinguish mediocre products from great products.  Make changing details cheap and your Pinto will become a Cadillac.</li>
<li>improved morale: both engineers and designers <em>love</em> watching their creations appear on the screen right before them. It&#8217;s why so many programmers create games!</li>
</ol>

<p>I will show you how integrating Flash into a 3D application is easier than it sounds.</p>


<h2>Should I use Adobe Flash or Scaleform GFx?</h2>

<p>The two most common Flash implementations are Adobe&#8217;s ActiveX control (which has a <a href="http://www.adobe.com/products/player_census/flashplayer/version_penetration.html">97% installed base!</a>) and Scaleform GFx.</p>

<p>Adobe&#8217;s control has perfect compatibility with their tool chain (go figure!) but is closed-source and good luck getting help from Adobe.</p>

<p>Scaleform GFx is an alternate implementation of Flash designed to be embedded in 3D applications, but, last I checked, is not efficient on machines without GPUs.  (Disclaimer: this information is two years old, so I encourage you to make your own evaluation.)</p>

<p>IMVU chose to embed Adobe&#8217;s player.</p>

<h2>Deploying the Flash Runtime</h2>

<p>Assuming you&#8217;re using Adobe&#8217;s Flash player, how will you deploy their runtime?  Well, given Flash&#8217;s install base, you can get away with loading the Flash player already installed on the user&#8217;s computer.  If they don&#8217;t have Flash, just require that they install it from your download page.  Simple and easy.</p>

<p>Down the road, when Flash version incompatibilities and that last 5% of your possible market become important, you can request <a href="http://www.adobe.com/licensing/">permission from Adobe</a> to deploy the Flash player with your application.</p>

<h2>Displaying SWFs</h2>

<p>IMVU displays Flash in two contexts: traditional HWND windows and 2D overlays atop the 3D scene.</p>

<div id="attachment_1551" class="wp-caption aligncenter" style="width: 689px"><a href="http://aegisknight.org/wp-uploads/imvu_flash_window.png"><img src="http://aegisknight.org/wp-uploads/imvu_flash_window.png" alt="" title="IMVU Flash Window" width="679" height="353" class="size-full wp-image-1551" /></a><p class="wp-caption-text">IMVU Flash Window</p></div>

<div id="attachment_1568" class="wp-caption aligncenter" style="width: 485px"><a href="http://aegisknight.org/wp-uploads/imvu_flash_overlay1.png"><img src="http://aegisknight.org/wp-uploads/imvu_flash_overlay1.png" alt="" title="IMVU Flash Overlay" width="475" height="566" class="size-full wp-image-1568" /></a><p class="wp-caption-text">IMVU Flash Overlay</p></div>

<p>If you want to have something up and running in a day, buy <a href="http://www.f-in-box.com/">f_in_box</a>.  Besides its awesome name, it&#8217;s cheap, comes with source code, and the support forums are fantastic.  It&#8217;s a perfect way to bootstrap.  After a weekend of playing with f_in_box, Dusty and I had a YouTube video playing in a texture on top of our 3D scene.</p>

<p>Once you run into f_in_box&#8217;s limitations, you can use the IShockwaveFlash and IOleInPlaceObjectWindowless COM interfaces directly.  See Igor Makarav&#8217;s <a href="http://www.codeproject.com/KB/COM/flashcontrol.aspx?fid=321012">excellent tutorial</a> and CFlashWnd class.</p>

<h2>Rendering Flash as an HWND</h2>

<p>For top-level UI elements use f_in_box or CFlashWnd directly.  They&#8217;re perfectly suited for this.  Seriously, it&#8217;s just a few lines of code.  Look at their samples and go.</p>

<h2>Rendering Flash as a 3D Overlay</h2>

<p>Rendering Flash to a 3D window gets a bit tricky&#8230;  Wait for Part 2 of this post!</p>
]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2010/07/how-to-embed-flash-into-your-3d-application/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Visualizing Python Import Dependencies</title>
		<link>http://chadaustin.me/2009/05/visualizing-python-import-dependencies/</link>
		<comments>http://chadaustin.me/2009/05/visualizing-python-import-dependencies/#comments</comments>
		<pubDate>Sun, 03 May 2009 02:37:19 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[imvu]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://aegisknight.org/?p=1368</guid>
		<description><![CDATA[
In a large Python program such as IMVU, startup time is dominated by Python module imports.  Take these warm timings:



$ time python -c 'None'

real    0m0.096s
user    0m0.077s
sys     0m0.061s

$ time python -c 'import urllib2'

real    0m0.314s
user    0m0.155s
sys     [...]]]></description>
			<content:encoded><![CDATA[<p>
In a large Python program such as IMVU, startup time is dominated by Python module imports.  Take these warm timings:
</p>

<pre>
$ time python -c 'None'

real    0m0.096s
user    0m0.077s
sys     0m0.061s

$ time python -c 'import urllib2'

real    0m0.314s
user    0m0.155s
sys     0m0.186s
</pre>

<p>
That&#8217;s over 300ms of wall-clock time, roughly 220ms more than starting the interpreter alone, for a single basic dependency.  Importing the entire IMVU client takes 1.5s warm and 20s cold on a typical user&#8217;s machine.
</p>

<p>
<a href="http://aegisknight.org/wp-uploads/result.png"><img src="http://aegisknight.org/wp-uploads/result.png" alt="Loading" title="Loading" width="200" height="196" class="aligncenter size-full wp-image-1372" /></a>
</p>

<p>
The IMVU client&#8217;s loading progress bar imports modules bottom-up; that is, leaf modules are imported before their parents.  The root module is imported last.
</p>

<p>
Implementing a bottom-up import sequence requires generating a graph of dependencies between modules:
</p>

<pre style="height: 30em">
import sys
import types
import __main__

def get_dependencies(module_name):
    """\
    Takes a module name as input (e.g. 'xml.dom') and returns a set of
    (lhs, rhs) tuples where lhs and rhs are module names and lhs
    imports rhs.
    """
    
    # module_from_key is a dict from a module key, an arbitrary
    # object, to a module object.  While importing, we discover
    # dependencies before we have access to the actual module objects.
    
    # import_dependencies is a list of (lhs, rhs) tuples where lhs and
    # rhs are module keys, and module_from_key[lhs] imported
    # module_from_key[rhs].

    root_key = object()
    module_from_key = {root_key: __main__}
    import_dependencies = []
    stack = [root_key]

    def import_in_stack(key, name, globals, locals, fromlist, level):
        stack.append(key)
        try:
            return original_import(name, globals, locals, fromlist, level)
        finally:
            stack.pop()

    import __builtin__
    original_import = __builtin__.__import__

    def my_import(name, globals=globals(), locals=locals(), fromlist=[], level=-1):
        # fromlist is a whore.  Most of the complexity in this
        # function stems from fromlist's semantics.  See
        # http://docs.python.org/library/functions.html#__import__
        
        # If a module imports 'xml.dom', then the module depends on
        # both 'xml' and 'xml.dom' modules.
        dotted = name.split('.')
        for i in range(1, len(dotted)):
            my_import('.'.join(dotted[0:i]), globals, locals, [], level)

        module_key = object()
        parent_key = stack[-1]

        def add_dependency_from_parent(key, m):
            module_from_key[key] = m
            import_dependencies.append((parent_key, key))

        submodule = import_in_stack(module_key, name, globals, locals, ['__name__'], level)
        add_dependency_from_parent(module_key, submodule)

        for f in (fromlist or []):
            from_key = object()
            module = import_in_stack(from_key, name, globals, locals, [f], level)
            if f == '*':
                continue
            submodule = getattr(module, f)
            if isinstance(submodule, types.ModuleType):
                add_dependency_from_parent(from_key, submodule)

        return original_import(name, globals, locals, fromlist, level)

    # Import module_name with import hook enabled.
    original_modules = sys.modules.copy()
    __builtin__.__import__ = my_import
    try:
        my_import(module_name)
    finally:
        __builtin__.__import__ = original_import
        sys.modules.clear()
        sys.modules.update(original_modules)

    assert stack == [root_key], stack

    return sorted(set(
        (module_from_key[lhs].__name__, module_from_key[rhs].__name__)
        for lhs, rhs in import_dependencies
    ))
</pre>

<p>
(You can view <a href="http://imvu.svn.sourceforge.net/viewvc/imvu/imvu_open_source/importdep/importdep.py?view=markup#l_1">all of the code at SourceForge</a>).
</p>

<p>
First, we install an <a href="http://docs.python.org/library/functions.html#__import__">__import__</a> hook that discovers import dependencies between modules, and convert those relationships into a directed graph:
</p>

<p>
<a href="http://aegisknight.org/wp-uploads/xmldomminidomdot.png"><img src="http://aegisknight.org/wp-uploads/xmldomminidomdot-300x240.png" alt="xml.dom.minidom" title="xml.dom.minidom" width="300" height="240" class="aligncenter size-medium wp-image-1386" /></a>
</p>

<p>
Then, we merge cycles.  If module A imports B, B imports C, and C imports A, then it doesn&#8217;t matter which you import first.  Importing A is equivalent to importing B or C.  After this step, we have a DAG:
</p>

<p><a href="http://aegisknight.org/wp-uploads/xmldomminidomdagdot.png"><img src="http://aegisknight.org/wp-uploads/xmldomminidomdagdot-300x182.png" alt="xml.dom.minidom DAG" title="xml.dom.minidom DAG" width="300" height="182" class="aligncenter size-medium wp-image-1390" /></a></p>

<p>
Finally, we can postorder traverse the DAG to determine a good import sequence and cost (approximated as the number of modules in the cycle) per import:
</p>

<pre>
1 xml
3 xml.dom
1 copy_reg
1 types
1 copy
1 xml.dom.NodeFilter
1 xml.dom.xmlbuilder
1 xml.dom.minidom
1 __main__
</pre>
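
<p>
The last two steps can be sketched in a few lines.  This is a minimal Python 3 sketch, not the code from importdep.py: it assumes the dependency graph is a dict mapping each module name to the set of names it imports, and the <code>import_order</code> function and the sample graph are made up for illustration.  It uses Tarjan&#8217;s algorithm, which merges each cycle into one group and, as a side effect, emits the groups in postorder (dependencies first).
</p>

<pre>
def import_order(graph, root):
    """Return a list of (cost, modules) groups, dependencies first.

    graph maps a module name to the set of module names it imports.
    Modules on a common import cycle end up in the same group, and
    cost is the number of modules in that group.
    """
    index = {}      # visit number of each discovered module
    lowlink = {}    # smallest visit number reachable from the module
    stack = []      # modules not yet assigned to a group
    on_stack = set()
    groups = []

    def visit(name):
        index[name] = lowlink[name] = len(index)
        stack.append(name)
        on_stack.add(name)
        for dep in sorted(graph.get(name, ())):
            if dep not in index:
                visit(dep)
                lowlink[name] = min(lowlink[name], lowlink[dep])
            elif dep in on_stack:
                lowlink[name] = min(lowlink[name], index[dep])
        if lowlink[name] == index[name]:
            # name is the first-visited module of its cycle: pop the group.
            group = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                group.append(member)
                if member == name:
                    break
            groups.append((len(group), sorted(group)))

    visit(root)
    return groups


# Hypothetical graph: a and b import each other, and a also needs leaf.
example = {
    '__main__': {'a'},
    'a': {'b', 'leaf'},
    'b': {'a'},
    'leaf': set(),
}
for cost, modules in import_order(example, '__main__'):
    print(cost, ' '.join(modules))
</pre>

<p>
For the sample graph this lists <code>leaf</code> first, then the merged <code>a</code>/<code>b</code> cycle with a cost of 2, and <code>__main__</code> last.  The recursion is fine for a sketch, but a very deep import graph would need an explicit stack.
</p>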

<p>Now let&#8217;s look at some less-trivial examples.  urllib2:</p>

<p><a href="http://aegisknight.org/wp-uploads/urllib2dagdot.png"><img src="http://aegisknight.org/wp-uploads/urllib2dagdot-300x104.png" alt="urllib2" title="urllib2" width="300" height="104" class="aligncenter size-medium wp-image-1392" /></a></p>

<p>SimpleXMLRPCServer:</p>

<p><a href="http://aegisknight.org/wp-uploads/simplexmlrpcserverdagdot.png"><img src="http://aegisknight.org/wp-uploads/simplexmlrpcserverdagdot-300x84.png" alt="SimpleXMLRPCServer" title="SimpleXMLRPCServer" width="300" height="84" class="aligncenter size-medium wp-image-1393" /></a></p>

<p><a href="http://thespeedbump.livejournal.com/63798.html">imvu.task</a>:</p>

<p><a href="http://aegisknight.org/wp-uploads/imvutaskdagdot.png"><img src="http://aegisknight.org/wp-uploads/imvutaskdagdot-300x80.png" alt="imvu.task" title="imvu.task" width="300" height="80" class="aligncenter size-medium wp-image-1394" /></a></p>

<p>
Final notes: <a href="http://www.tarind.com/depgraph.html">Other</a> <a href="http://guichaz.free.fr/misc/">people</a> have solved this problem with bytecode scanning, but we wanted to know the actual modules imported for an accurate progress bar.  A simpler __import__ hook could have calculated the correct import sequence, but I find a visual representation of module dependencies to have additional benefits.
</p>

]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2009/05/visualizing-python-import-dependencies/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Flushing the Windows Disk Cache</title>
		<link>http://chadaustin.me/2009/04/flushing-disk-cache/</link>
		<comments>http://chadaustin.me/2009/04/flushing-disk-cache/#comments</comments>
		<pubDate>Wed, 22 Apr 2009 08:46:35 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[imvu]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://aegisknight.org/?p=1350</guid>
		<description><![CDATA[
Occasionally, I want to test the performance of a program after a cold boot, or maybe after the computer has been idle for hours and the program has been paged out.  For example, the IMVU client starts relatively quickly when the disk cache is warm, but at system boot, it can take quite a [...]]]></description>
			<content:encoded><![CDATA[<p>
Occasionally, I want to test the performance of a program after a cold boot, or maybe after the computer has been idle for hours and the program has been paged out.  For example, the IMVU client starts relatively quickly when the disk cache is warm, but at system boot, it can take quite a while for the login dialog to even appear.  Iterating in these situations is a pain in the butt because you have to reboot or leave your computer idle for hours.
</p>

<p>
I&#8217;m sure there exists a program which will flush the disk caches and force programs out of memory and into the page file, but I can&#8217;t find it.  So <a href="http://aegisknight.org/flushmem/">I wrote one</a>.
</p>

<p>
First, a caveat: programs these days <a href="http://stackoverflow.com/questions/763159/should-i-bother-detecting-oom-out-of-memory-errors-in-my-c-code">rarely handle out-of-memory situations</a>, so running <code>flushmem.exe</code> might cause open applications to explode like popcorn.  Buyer beware, etc.
</p>

<p>
After running <code>flushmem.exe</code>, you should find that your computer becomes painfully slow as applications are paged back into memory and the disk cache is refilled.  Perfect.  Now I can realistically simulate the experiences of our users.
</p>

<p>You can download the program here or on the <a href="http://aegisknight.org/flushmem/">FlushMem page</a>.</p>

<ul>
<li><a href="http://aegisknight.org/download/flushmem.exe">flushmem.exe</a></li>
<li><a href="http://aegisknight.org/download/flushmem.cpp">flushmem.cpp</a> (source code)</li>
</ul>


<p>
Implementation details: in Windows, each process has a 2 GB user mode address space limit by default.  If physical memory + page file size is greater than 2 GB, flushmem spawns multiple processes.  Each process allocates memory in <a href="http://blogs.msdn.com/oldnewthing/archive/2003/10/08/55239.aspx">64 KiB chunks</a> until it can&#8217;t anymore, and then writes to each page, forcing older pages out to the page file.
</p>
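
<p>
The process count follows from simple arithmetic.  Here is a rough Python 3 sketch, assuming the 2 GB per-process limit above; the function name and the sample sizes are made up, and the real flushmem.cpp may compute this differently.
</p>

<pre>
GIB = 1024 ** 3
USER_ADDRESS_SPACE = 2 * GIB   # default per-process limit mentioned above


def processes_needed(physical_bytes, page_file_bytes):
    """Number of 2 GB processes needed to touch all committable memory."""
    total = physical_bytes + page_file_bytes
    # Ceiling division: a partial chunk still needs its own process.
    return max(1, -(-total // USER_ADDRESS_SPACE))


# Hypothetical machine: 4 GiB of RAM plus a 6 GiB page file.
print(processes_needed(4 * GIB, 6 * GIB))   # prints 5
</pre>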

]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2009/04/flushing-disk-cache/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Latency vs. Throughput</title>
		<link>http://chadaustin.me/2009/02/latency-vs-throughput/</link>
		<comments>http://chadaustin.me/2009/02/latency-vs-throughput/#comments</comments>
		<pubDate>Sat, 14 Feb 2009 05:57:00 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://aegisknight.org/new/2009/02/13/latency-vs-throughput/</guid>
		<description><![CDATA[
This is my last post about processors and performance, I swear!  Plus,
my wrists are starting to hurt from this bloodpact thing (as I&#8217;m
diagnosed with RSI), so I think this will be a light one.



As I&#8217;ve discussed previously,
modern desktop processors work really hard to exploit the inherent
parallelism in your programs.  This is called instruction-level
parallelism, [...]]]></description>
			<content:encoded><![CDATA[<p>
This is my last post about processors and performance, I swear!  Plus,
my wrists are starting to hurt from this <a href="http://www.egometry.com/bloodpact/">bloodpact thing</a> (as I&#8217;m
diagnosed with RSI), so I think this will be a light one.
</p>

<p>
As I&#8217;ve discussed <a
href="http://aegisknight.livejournal.com/139879.html">previously</a>,
modern desktop processors work really hard to exploit the inherent
parallelism in your programs.  This is called <a
href="http://en.wikipedia.org/wiki/Instruction-level_parallelism">instruction-level
parallelism</a>, and is one of the techniques processors use to get
more performance out of slower clock rates (along with data-level
parallelism (SIMD) or multiple cores (MIMD)<a href="#footnotesimdvsmimd">*</a>).  Previously, I waved my
hands a bit and said &#8220;The processor makes independent operations run
in parallel.&#8221;  Now I&#8217;m going to teach you how to count cycles in the presence of latency and parallelism.
</p>

<p>
Traditionally, when analyzing the cost of an algorithm, you would
simply count the operations involved, sum their costs in cycles, and
call it a day.  These days, it&#8217;s not that easy.  Instructions have two
costs: dependency chain latency and reciprocal throughput.
</p>

<p>
Reciprocal throughput is simply the reciprocal of the maximum
throughput of a particular instruction.  Throughput is measured in
instructions/cycle, so reciprocal throughput is cycles/instruction.
</p>

<p>
OK, that sounds like the way we&#8217;ve always measured performance.  So
what&#8217;s dependency chain latency?  When the results of a previous
calculation are needed for another calculation, you have a dependency
chain.  In a dependency chain, you measure the cost of an instruction
by its latency, not its reciprocal throughput.  Remember that our
processors are working really hard to exploit parallelism in our code.
When there is no instruction-level parallelism available, we get
penalized.
</p>

<p>
Let&#8217;s go back to our sum 10000 numbers example, but unroll it a bit:
</p>

<pre>
float array[10000];
float sum = 0.0f;
for (int i = 0; i &lt; 10000; i += 8) {
    sum += array[i+0];
    sum += array[i+1];
    sum += array[i+2];
    sum += array[i+3];
    sum += array[i+4];
    sum += array[i+5];
    sum += array[i+6];
    sum += array[i+7];
}
return sum;
</pre>

<p>
In x86:
</p>

<pre>
xor ecx, ecx     ; ecx  = i   = 0
mov esi, array
xorps xmm0, xmm0 ; xmm0 = sum = 0.0

begin:
addss xmm0, [esi+0]
addss xmm0, [esi+4]
addss xmm0, [esi+8]
addss xmm0, [esi+12]
addss xmm0, [esi+16]
addss xmm0, [esi+20]
addss xmm0, [esi+24]
addss xmm0, [esi+28]

add esi, 32
add ecx, 8       ; eight elements per iteration
cmp ecx, 10000
jl begin ; if ecx &lt; 10000, goto begin

; xmm0 = total sum
</pre>

<p>
Since each addition to <code>sum</code> in the loop depends on the previous
addition, these instructions are a dependency chain.  On a modern
processor, let&#8217;s say the reciprocal throughput of <code>addss</code> is 1 cycle.
However, the minimum latency is 4 cycles.  Since every instruction
depends on the previous, each addition costs 4 cycles.
</p>

<p>
As before, let&#8217;s try summing with four temporary sums:
</p>

<pre>
xor ecx, ecx     ; ecx  = i    = 0
mov esi, array
xorps xmm0, xmm0 ; xmm0 = sum1 = 0.0
xorps xmm1, xmm1 ; xmm1 = sum2 = 0.0
xorps xmm2, xmm2 ; xmm2 = sum3 = 0.0
xorps xmm3, xmm3 ; xmm3 = sum4 = 0.0

; four independent accumulators, one per register

begin:
addss xmm0, [esi+0]
addss xmm1, [esi+4]
addss xmm2, [esi+8]
addss xmm3, [esi+12]
addss xmm0, [esi+16]
addss xmm1, [esi+20]
addss xmm2, [esi+24]
addss xmm3, [esi+28]

add esi, 32
add ecx, 8       ; eight elements per iteration
cmp ecx, 10000
jl begin ; if ecx &lt; 10000, goto begin

; accumulate sums
addss xmm0, xmm1
addss xmm2, xmm3 ; this instruction happens in parallel with the one above
addss xmm0, xmm2
</pre>

<p>
Here, the additions in the loop that depend on each other are 4 cycles apart,
meaning the minimum latency is no longer a problem.  This lets us hit
the maximum addition rate of one per cycle.
</p>
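
<p>
To put numbers on this, here is a back-of-the-envelope model in Python 3.  It assumes the figures used above (4-cycle latency and 1-cycle reciprocal throughput for <code>addss</code>), ignores loop overhead and memory access, and the function name is made up for illustration.
</p>

<pre>
def cycles_for_sum(additions, accumulators, latency=4, reciprocal_throughput=1):
    """Estimate cycles to perform `additions` adds spread evenly over
    `accumulators` independent dependency chains."""
    # Each chain can start one add per `latency` cycles, so all chains
    # together start `accumulators` adds per `latency` cycles, but never
    # faster than the unit's reciprocal throughput allows.
    cycles_per_add = max(reciprocal_throughput, latency / accumulators)
    return additions * cycles_per_add


for k in (1, 2, 4, 8):
    print(k, cycles_for_sum(10000, k))
</pre>

<p>
With one accumulator the model gives 40,000 cycles.  With four it bottoms out at 10,000, one addition per cycle, and adding more accumulators beyond that buys nothing.
</p>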

<p>
Removing dependency chains is a critical part of optimizing on today&#8217;s
processors.  The Core 2 processor has <em>six</em> execution units,
three of which are fully 128-bit SIMD ALUs.  If you can restructure
your algorithm so calculations happen independently, you can take
advantage of all of them.  (And if you can pull off making full use of
the Core 2&#8217;s ALU capacity, you win.)
</p>

<p>
<a name="footnotesimdvsmimd">*</a> BTW, it&#8217;s sort of unrelated, but I couldn&#8217;t help but link this article.
Greg Pfister has an interesting comparison and history of SIMD
vs. MIMD <a
href="http://perilsofparallel.blogspot.com/2008/09/larrabee-vs-nvidia-mimd-vs-simd.html">here</a>.  Ignore the terminology blathering and focus on the history of and influences on SIMD and MIMD over time.
</p>]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2009/02/latency-vs-throughput/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Running Time -&gt; Algebra -&gt; Hardware</title>
		<link>http://chadaustin.me/2009/02/running-time-algebra-hardware/</link>
		<comments>http://chadaustin.me/2009/02/running-time-algebra-hardware/#comments</comments>
		<pubDate>Fri, 13 Feb 2009 06:16:00 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://aegisknight.org/new/2009/02/12/running-time-algebra-hardware/</guid>
		<description><![CDATA[
I&#8217;m going to talk about something which should be obvious, but I continue to see people optimizing code in the wrong order (*cough* including myself *cough*).  So, here&#8217;s a reminder.  When you&#8217;re optimizing a bit of code&#8230;


FIRST


Make sure your algorithmic running time is right.  O(1) is almost always faster than O(N), and [...]]]></description>
			<content:encoded><![CDATA[<p>
I&#8217;m going to talk about something which should be obvious, but I continue to see people optimizing code in the wrong order (*cough* including myself *cough*).  So, here&#8217;s a reminder.  When you&#8217;re optimizing a bit of code&#8230;
</p>

<h2>FIRST</h2>

<p>
Make sure your algorithmic running time is right.  O(1) is almost always faster than O(N), and O(N^2) is right out.  Often these optimization exercises involve changing some O(N) to O(M), where M is smaller than N.
</p>

<p>
I&#8217;ll give an example.  Drawing a frame of 3D graphics in IMVU is O(N) where N is the sum of all vertices from all products on all objects loaded into a scene.  We recently implemented <a href="http://en.wikipedia.org/wiki/Viewing_frustum">view frustum culling</a>, which skips drawing objects that are known to be off-screen.  This reduces the rendering time from O(N) to O(M) where M&lt;N and M is the number of vertices from products that are visible.  If we implemented <a href="http://www.cbloom.com/3d/techdocs/vipm.txt">View Independent Progressive Meshes</a>, we could reduce the time to O(P) where P is the number of vertices that contribute to the visible detail of the scene, and P&lt;M&lt;N.
</p>

<p>
However, make sure to avoid algorithms with good running times but huge constants.  This is why, when CPUs got fast and random memory accesses got slow, searching a small O(N) array (or std::vector) is often faster than searching an O(log N) tree (or std::map).  The tree will miss cache far more often.
</p>

<h2>SECOND</h2>

<p>
Then, use all of the algebra, set theory, and logic you know to reduce the number of operations required, in order of operation cost.
</p>

<p>
Let&#8217;s say we&#8217;re going to calculate the <a href="http://en.wikipedia.org/wiki/Phong_lighting">diffuse reflectance</a> on a surface: N dot L, where N and L are three-vectors, N is the normal of the surface, and L is the direction to the light.
</p>

<p>
The naive <code>normalize(N) dot normalize(L)</code> is&#8230;
</p>

<pre>
float lengthN = sqrtf(N.x*N.x + N.y*N.y + N.z*N.z);
float lengthL = sqrtf(L.x*L.x + L.y*L.y + L.z*L.z);
float dot =
    (N.x / lengthN) * (L.x / lengthL) +
    (N.y / lengthN) * (L.y / lengthL) +
    (N.z / lengthN) * (L.z / lengthL);
</pre>

<p>
&#8230; which turns out to be 6 additions, 9 multiplications, 6 divisions,
and 2 square roots.  Let&#8217;s say additions and multiplications are 2
cycles, and divisions and square roots are 40 cycles.  This gives us a
total of 6*2 + 9*2 + 6*40 + 2*40 = 350 cycles.
</p>

<p>
Instead, let&#8217;s do a bit of algebra:
</p>

<pre>
  normalize(N) dot normalize(L)
= N/|N| dot L/|L|
= (N dot L) / (|N||L|)
= (N dot L) / sqrt((N dot N) * (L dot L))
</pre>

<p>
The new calculation is&#8230;
</p>

<pre>
float lengthSquaredN = N.x*N.x + N.y*N.y + N.z*N.z;
float lengthSquaredL = L.x*L.x + L.y*L.y + L.z*L.z;
float NdotL          = N.x*L.x + N.y*L.y + N.z*L.z;
float dot =          NdotL / sqrtf(lengthSquaredN * lengthSquaredL);
</pre>

<p>
&#8230; 6 additions, 10 multiplications, 1 division, and 1 sqrt: 6*2 +
10*2 + 1*40 + 1*40 = 112 cycles.  Huge improvement just by applying basic math.
</p>
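
<p>
A quick way to convince yourself that the rewrite is safe is to run both forms on the same input and tally the cost.  Here is a small Python 3 sketch; the per-operation costs are the rough figures assumed above, and the vectors and helper names are arbitrary.
</p>

<pre>
import math


def naive(n, l):
    """normalize(N) dot normalize(L), normalizing each vector first."""
    length_n = math.sqrt(sum(c * c for c in n))
    length_l = math.sqrt(sum(c * c for c in l))
    return sum((a / length_n) * (b / length_l) for a, b in zip(n, l))


def rewritten(n, l):
    """(N dot L) / sqrt((N dot N) * (L dot L))."""
    n_dot_n = sum(c * c for c in n)
    l_dot_l = sum(c * c for c in l)
    n_dot_l = sum(a * b for a, b in zip(n, l))
    return n_dot_l / math.sqrt(n_dot_n * l_dot_l)


def cost(adds, muls, divs, sqrts):
    """Cycle estimate: 2 per add or multiply, 40 per divide or sqrt."""
    return 2 * (adds + muls) + 40 * (divs + sqrts)


N = (1.0, 2.0, 2.0)
L = (0.0, 3.0, 4.0)
print(naive(N, L), rewritten(N, L))          # both close to 14/15
print(cost(6, 9, 6, 2), cost(6, 10, 1, 1))   # prints 350 112
</pre>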

<h2>THIRD</h2>

<p>
Once you&#8217;re done optimizing algebraically, read your
<a href="http://www.intel.com/products/processor/manuals/">processor manuals</a>
and take full advantage of the hardware.  If you&#8217;ve got SSE4, you can
do the dot products in one instruction (DPPS), and an approximate
reciprocal square root in another (RSQRTSS), which can give another
huge improvement.
</p>

<p>
The reason you want to optimize in this order is that algorithmic improvements reduce the amount of work you have to do, making it less important to make that work fast.  A hardware-optimized O(N^2) algorithm can be easily beaten by an unoptimized O(N log N) algorithm.  Remember, Chad, the next time you schedule optimization projects, consider downstream effects such as these.
</p>]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2009/02/running-time-algebra-hardware/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>A Simple Introduction to Superscalar, Out-of-Order Processors</title>
		<link>http://chadaustin.me/2009/02/a-simple-introduction-to-superscalar-out-of-order-processors/</link>
		<comments>http://chadaustin.me/2009/02/a-simple-introduction-to-superscalar-out-of-order-processors/#comments</comments>
		<pubDate>Thu, 12 Feb 2009 06:09:00 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[x86]]></category>

		<guid isPermaLink="false">http://aegisknight.org/new/2009/02/11/a-simple-introduction-to-superscalar-out-of-order-processors/</guid>
		<description><![CDATA[
Since the Pentium Pro/Pentium 2, we have all been using heavily superscalar, out-of-order processors.  I&#8217;d heard these terms a million times, but didn&#8217;t know what they meant until I read The Pentium Chronicles: The People, Passion, and Politics Behind Intel&#8217;s Landmark Chips (Practitioners).  (BTW, if you love processors, the history of technology, and [...]]]></description>
			<content:encoded><![CDATA[<p>
Since the Pentium Pro/Pentium 2, we have all been using heavily superscalar, out-of-order processors.  I&#8217;d heard these terms a million times, but didn&#8217;t know what they meant until I read <a href="http://www.amazon.com/gp/product/0471736171?ie=UTF8&#038;tag=aegisknightor-20&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=0471736171">The Pentium Chronicles: The People, Passion, and Politics Behind Intel&#8217;s Landmark Chips (Practitioners)</a><img src="http://www.assoc-amazon.com/e/ir?t=aegisknightor-20&#038;l=as2&#038;o=1&#038;a=0471736171" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />.  (BTW, if you love processors, the history of technology, and the fascinating dynamics at a company like Intel, that book is fantastic.)
</p>

<p>
Superscalar basically means &#8220;greater than 1&#8221;: a superscalar processor can complete more than one instruction per clock cycle, so it runs code faster than its clock speed alone would suggest.  Indeed, a 3 GHz Pentium 4 can execute up to 4 independent integer additions per clock cycle, which is 12 billion integer additions per second!
</p>

<p>
Out-of-order means just that &#8211; the processor looks at your code at runtime and executes it in parallel if it can.  For example, imagine this code:
</p>

<pre>
// three independent, non-null pointers
int* p; int* q; int* r;
const int flag1, flag2, flag3;

if (*p &amp; flag1) {
    if (*q &amp; flag2) {
        if (*r &amp; flag3) {
            do_work();
        }
    }
}
</pre>

<p>
The processor can&#8217;t assume that <code>q</code> is a valid pointer until it checks <code>p</code>, and the same for <code>r</code> and <code>q</code>.  Accessing main memory costs ~200 cycles, so if none of the pointers point to cached memory, you just spent 600 cycles determining whether to <code>do_work()</code>.  This is called a &#8220;dependency chain&#8221;, where the result of a later calculation depends on the previous.  But what if you know that p, q, and r will all be valid pointers?  You can rewrite as:
</p>

<pre>
const int x = *p;
const int y = *q;
const int z = *r;
if ((x &amp; flag1) &amp;&amp; (y &amp; flag2) &amp;&amp; (z &amp; flag3)) {
    do_work();
}
</pre>

<p>
Now, the processor knows that all of those memory fetches are independent, so it runs them in parallel.  Then, it runs the <code>AND</code>s in parallel too, since they&#8217;re independent.  Your 600-cycle check just became 200 cycles.
</p>
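
<p>
The arithmetic behind that claim is simply sum versus max.  Here is a toy Python 3 model, assuming the ~200-cycle miss cost above and a processor that can keep all three loads in flight at once; the function names are made up.
</p>

<pre>
MISS_CYCLES = 200  # rough cost of one uncached load, as assumed above


def dependent_loads(count):
    """Each load must finish before the next can start."""
    return count * MISS_CYCLES


def independent_loads(count):
    """All loads are in flight at once, so we wait for the slowest."""
    return max([MISS_CYCLES] * count)


print(dependent_loads(3), independent_loads(3))   # prints 600 200
</pre>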

<p>
Similarly, let&#8217;s say you want to add 10,000 numbers.
</p>

<pre>
int sum = 0;
for (int i = 0; i &lt; 10000; ++i) {
    sum += array[i];
}
return sum;
</pre>

<p>
Let&#8217;s assume the loop overhead and memory accesses are free, and each addition takes one cycle.  Since each addition depends on the previous value of sum, they must be executed serially, taking 10000 cycles.  However, since you know that addition is associative, you can sum with two variables:
</p>

<pre>
int sum1 = 0;
int sum2 = 0;
for (int i = 0; i &lt; 10000; i += 2) {
    sum1 += array[i];
    sum2 += array[i+1];
}
return sum1 + sum2;
</pre>

<p>
Now you have two independent additions, which can be executed in parallel!  The loop takes 5000 cycles now.  If you independently sum in <code>sum1</code>, <code>sum2</code>, <code>sum3</code>, and <code>sum4</code>, the loop will take 2500 cycles.  And so on, until you&#8217;ve hit the IPC (instructions per cycle) limit on your processor.  If you&#8217;re making effective use of your SIMD units, you&#8217;d be surprised at how much work you can do in parallel&#8230;
</p>

<p>
And that&#8217;s what an out-of-order, superscalar processor can do for you!
</p>]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2009/02/a-simple-introduction-to-superscalar-out-of-order-processors/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Logic vs. Array Processing</title>
		<link>http://chadaustin.me/2009/02/logic-vs-array-processing/</link>
		<comments>http://chadaustin.me/2009/02/logic-vs-array-processing/#comments</comments>
		<pubDate>Wed, 11 Feb 2009 07:20:00 +0000</pubDate>
		<dc:creator>Chad Austin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://aegisknight.org/new/2009/02/11/logic-vs-array-processing/</guid>
		<description><![CDATA[I&#8217;ve always been amused by the Java vs. C++ performance arguments:


&#8220;Java&#8217;s faster than C++!&#8221;
&#8220;No it&#8217;s not!&#8221;
&#8220;Yeah it is, look at this benchmark!&#8221;
&#8220;Well look how much longer the Java version of program takes to start!&#8221;



Back and forth and back and forth.  The fact is, they&#8217;re both right, and here&#8217;s why.  I mentally separate code [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve always been amused by the Java vs. C++ performance arguments:</p>

<ul>
<li>&#8220;Java&#8217;s faster than C++!&#8221;</li>
<li>&#8220;No it&#8217;s not!&#8221;</li>
<li>&#8220;Yeah it is, look at this benchmark!&#8221;</li>
<li>&#8220;Well look how much longer the Java version of the program takes to start!&#8221;</li>
</ul>

<p>
Back and forth and back and forth.  The fact is, they&#8217;re both right, and here&#8217;s why.  I mentally separate code into either of two categories, logic or array processing:
</p>

<ol>
<li>3D rasterization is obviously array processing.</li>
<li>Video playback is also array processing.</li>
<li>Calculating your tax refund is logic.</li>
<li>Loading a PDF is definitely logic.</li>
</ol>

<p>
Often the line is blurry, but array processing involves running a relatively <a href="http://www.nobugs.org/developer/htrace/htrace.hs">small set of rules</a> over a <em>lot</em> of <a href="http://software.intel.com/en-us/articles/optimizing-the-rendering-pipeline-of-animated-models-using-the-intel-streaming-simd-extensions/">homogeneous data</a>.  Computers are very, very good at <a href="http://en.wikipedia.org/wiki/SIMD">this kind of computation</a>, and specialized hardware such as a GPU can increase performance by orders of magnitude.  Ignoring memory bandwidth, a desktop CPU can multiply billions of floating point numbers per second, and a fast GPU can multiply trillions.
</p>
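
<p>
To put a rough number on &#8220;billions&#8221;, here is a back-of-the-envelope sketch in Python.  Every figure in it (clock speed, core count, SIMD width, one multiply issued per cycle) is an assumption I picked for illustration, not a measurement of any particular chip, and the function name is made up for this example.
</p>

<pre>
```python
# Back-of-the-envelope peak throughput for a hypothetical desktop CPU.
# All of these figures are illustrative assumptions, not measurements.
clock_hz = 3_000_000_000   # assumed 3 GHz clock
cores = 4                  # assumed quad-core part
simd_lanes = 4             # a 128-bit SSE register holds four 32-bit floats
multiplies_per_cycle = 1   # assume one SIMD multiply issued per core per cycle


def peak_multiplies_per_second(clock_hz, cores, simd_lanes, per_cycle):
    """Upper bound on multiplies per second, ignoring memory bandwidth."""
    return clock_hz * cores * simd_lanes * per_cycle


peak = peak_multiplies_per_second(clock_hz, cores, simd_lanes, multiplies_per_cycle)
print(f"{peak / 1e9:.0f} billion multiplies per second")  # prints "48 billion multiplies per second"
```
</pre>

<p>
Under those assumptions the ceiling is in the tens of billions per second.  Real code only approaches it when the data is laid out so the SIMD units and caches stay fed, which is exactly the array-processing case.
</p>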

<p>
At the other extreme, logic code tends to be full of <a href="http://mxr.mozilla.org/mozilla-central/source/js/src/xpconnect/src/xpcconvert.cpp#1050">branches, function calls, dependent memory accesses</a>, and it often executes code that hasn&#8217;t been run in minutes.  Just think about the set of operations that happen when you open a file in Word.  Computers aren&#8217;t so good at these types of operations, and as Moore&#8217;s Law continues, their performance on logic code tends not to improve as rapidly as it does on array computation.
</p>

<p>
Back to Java vs. C++.  The synthetic benchmarks that compare Java and C++ performance tend to be tight loops, simply because accurate measurement requires it.  This gives the JVM time to prime its JIT/prediction engines/what have you, so I&#8217;d expect a good result.  Heck, I&#8217;d expect a good result from the modern JavaScript tracing engines.<a href="#footnote">*</a>
</p>

<p>
The lesson here is that, for array processing, it&#8217;s very little work to make full use of the hardware at hand.  Because the amount of code is limited (and the amount of data is large), time spent on optimization has high leverage.
</p>

<p>
On the other hand, logic code is messy and spread out, often written by entire teams of people.  Its performance is dominated by your programming language and the team&#8217;s vocabulary of idioms.  Truly optimizing this kind of code is hard or impossible.  <a href="http://weblogs.mozillazine.org/roadmap/archives/009727.html">It can be done</a>, but you often have to retrain your team to make sure the benefits stick. 
</p>

<p>
This is a reason that the choice of programming language(s) and libraries has such a big effect on the responsiveness of a desktop application, and one of the reasons why people can &#8220;feel&#8221; the programming language in which a project was written.  Typical desktop application usage patterns are dominated by random, temporally sparse actions, so code size, &#8220;directness&#8221;, and working set are primary performance factors.  (Anecdote: <a href="http://thespeedbump.livejournal.com/">Andy</a>&#8217;s rewriting the IMVU client&#8217;s windowing framework so it&#8217;s a bajillion times simpler, and when he had the client running again, he exclaimed &#8220;Hey, resizing the 3D window is twice as responsive!&#8221;)
</p>

<p>
Perhaps there&#8217;s an argument here for the creation of more project-specific programming languages (<a href="http://en.wikipedia.org/wiki/Game_Oriented_Assembly_Lisp">GOAL</a>, <a href="https://developer.mozilla.org/en/Treehydra">TreeHydra</a>, <a href="http://www.martinfowler.com/bliki/DomainSpecificLanguage.html">DSLs</a>), so that performance improvements can be applied universally across the codebase.
</p>

<p>
With <a href="http://discuss.joelonsoftware.com/default.asp?joel.3.731942.7">disk and memory speeds improving so much more slowly than CPU speeds</a>, the difference between a snappy desktop application and a sluggish application is a handful of page faults.  When choosing a technology platform for a project, it&#8217;s worth considering the impact on overall responsiveness down the road.  And I&#8217;m pretty sure I just recommended writing your entire application in C++, which sounds insane, even to me.  I&#8217;ll leave it at that.
</p>

<p>
<a name="footnote">*</a> By the way, I&#8217;m not picking on Java or promoting C++ in particular.  You could make these same arguments between any &#8220;native&#8221; language and &#8220;managed&#8221; language.  The blocking and tackling of loading applications, calling functions, and keeping memory footprint low are important.
</p>]]></content:encoded>
			<wfw:commentRss>http://chadaustin.me/2009/02/logic-vs-array-processing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

