A brief introduction to modern x86 assembly language
Several people have personally requested that I give a brief introduction to modern x86 (sometimes called IA32) assembly language. For simplicity's sake, I'll stick with the 32-bit version with a flat memory model. AMD64 (sometimes called x64) just isn't as popular as x86 yet, so this seems safe.
For some reason, there's a mythos around assembly language. People associate it with bearded gurus, assuming only ninjas can program in it, when, in principle, assembly language is one of the simplest programming languages there is. Any complexity stems from a particular architecture's oddities, and even though x86 is one of the oddest of them all, I'll show you that it can be easy to read and write.
First, I'll describe the basic architecture. When programming in assembly, there are three main concepts:
Instructions are the individual commands that tell the computer to perform an operation. These include instructions for adding, multiplying, comparing, copying, performing bit-wise operations, accessing memory, and communicating with external devices. The computer executes instructions sequentially.
Registers are where temporary values go. There is a small, fixed set of registers available for use. Since there aren't many registers, nothing stays in them for very long, as they are soon needed for other purposes.
Memory is where longer-lived data goes. It's a giant, flat array of bytes (8-bit quantities). It's much slower to access than registers, but there's a lot of it.
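To make these three concepts concrete, here is a minimal C sketch of a toy machine. It is only an illustration, not how real hardware is built: the Machine struct, the helper functions, the 1024-byte memory size, and the addresses 100 and 101 are all made up for this example. Each helper stands in for one instruction, and run() executes three of them in order.

```c
#include <stdint.h>

enum { MEM_SIZE = 1024 };        /* assumed size, chosen for illustration */

typedef struct {
    uint32_t reg[8];             /* a small, fixed set of 32-bit registers */
    uint8_t  mem[MEM_SIZE];      /* memory: a flat array of bytes */
} Machine;

/* Each helper below plays the role of a single instruction. */

/* Load one byte from memory into register r. */
static void load_byte(Machine *m, int r, uint32_t addr) {
    m->reg[r] = m->mem[addr];
}

/* Add a constant to register r. */
static void add_const(Machine *m, int r, uint32_t k) {
    m->reg[r] += k;
}

/* Store the low byte of register r to memory. */
static void store_byte(Machine *m, uint32_t addr, int r) {
    m->mem[addr] = (uint8_t)m->reg[r];
}

/* Instructions execute one after another, top to bottom. */
static void run(Machine *m) {
    load_byte(m, 0, 100);        /* memory -> register          */
    add_const(m, 0, 1);          /* arithmetic happens in a register */
    store_byte(m, 101, 0);       /* register -> memory          */
}
```

If mem[100] holds 7 before run() is called, mem[101] holds 8 afterwards, and register 0 is free to be reused for something else.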
Before I get into some examples, let me describe the registers available on x86. There are only 8 general-purpose registers, each of which is 32 bits wide. They are:
EAX
EBX
ECX
EDX
ESI
EDI
EBP - used when accessing local variables or function arguments
ESP - used when calling functions
On x86, most instructions have two operands, a destination and a source. For example, let's add two and three:
mov eax, 2      ; eax = 2
mov ebx, 3      ; ebx = 3
add eax, ebx    ; eax = 2 + 3 = 5
add eax, ebx adds the values in registers eax and ebx, and stores the result back in eax. (BTW, this is one of the oddities of x86. Other modern architectures differentiate between destination and source operands, which would look like add eax, ebx, ecx, meaning eax = ebx + ecx. On x86, the first operand is read and written in the same instruction.)
mov is the data movement instruction. It copies values from one register to another, or from a constant to a register, or from memory to a register, or from a register to memory.
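The difference between the two-operand add described above and a three-operand add can be modelled in C. This is a rough sketch, with registers treated as plain 32-bit variables; the Regs struct and the function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Registers modelled as ordinary variables (illustration only). */
typedef struct { uint32_t eax, ebx, ecx; } Regs;

/* x86 style: "add dst, src" reads dst, adds src, and writes dst. */
static void add_two_operand(uint32_t *dst, uint32_t src) {
    *dst += src;
}

/* Three-operand style: "add dst, a, b" writes dst without reading it. */
static void add_three_operand(uint32_t *dst, uint32_t a, uint32_t b) {
    *dst = a + b;
}

static Regs two_plus_three(void) {
    Regs r = {0, 0, 0};
    r.eax = 2;                                /* mov eax, 2   */
    r.ebx = 3;                                /* mov ebx, 3   */
    add_two_operand(&r.eax, r.ebx);           /* add eax, ebx */
    /* Hypothetical three-operand form, for comparison only:
       add ecx, eax, ebx, i.e. ecx = eax + ebx */
    add_three_operand(&r.ecx, r.eax, r.ebx);
    return r;
}
```

After the two-operand add, the original value of eax (2) is gone. If you still need it, you have to mov it somewhere else first, which is one reason x86 code contains so many mov instructions.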
Speaking of memory, let's say we want to add 2 and 3, storing the result at address 32. Since the result of the addition is 32 bits, the result will actually use addresses 32, 33, 34, and 35. Remember, memory is indexed in bytes.
mov eax, 2
mov ebx, 3
add eax, ebx
mov edi, 32
mov [edi], eax  ; copies 5 to address 32 in memory
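To see how one 32-bit store touches four addresses, here is a small C sketch of the same sequence. It adds one detail the text above does not cover: x86 is little-endian, so the least significant byte goes at the lowest address. The store32 helper and the 64-byte memory array are assumptions made for this example.

```c
#include <stdint.h>

enum { MEM_SIZE = 64 };   /* tiny memory, sized for this example only */

/* Store a 32-bit value the way x86 does: least significant byte first. */
static void store32(uint8_t *mem, uint32_t addr, uint32_t value) {
    mem[addr + 0] = (uint8_t)(value & 0xFF);
    mem[addr + 1] = (uint8_t)((value >> 8) & 0xFF);
    mem[addr + 2] = (uint8_t)((value >> 16) & 0xFF);
    mem[addr + 3] = (uint8_t)((value >> 24) & 0xFF);
}

static void add_and_store(uint8_t *mem) {
    uint32_t eax = 2;          /* mov eax, 2     */
    uint32_t ebx = 3;          /* mov ebx, 3     */
    uint32_t edi;
    eax += ebx;                /* add eax, ebx   */
    edi = 32;                  /* mov edi, 32    */
    store32(mem, edi, eax);    /* mov [edi], eax */
}
```

Starting from zeroed memory, address 32 ends up holding 5 and addresses 33, 34, and 35 hold 0. Address 36 is not touched.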
What about loading data from memory? (Reads from memory are called loads. Writes are called stores.) Let's write a program that copies 1000 4-byte quantities (4000 bytes) from address 10000 to address 20000.
mov esi, 10000  ; by convention, esi is often used as the 'source' pointer
mov edi, 20000  ; similarly, edi often means 'destination' pointer
mov ecx, 1000   ; let's copy 1000 32-bit items
begin_loop:
mov eax, [esi]  ; load from source
mov [edi], eax  ; store to destination
add esi, 4
add edi, 4
sub ecx, 1      ; ecx -= 1
cmp ecx, 0      ; is ecx 0?
; if ecx does not equal 0, jump to the beginning of the loop
jne begin_loop
; otherwise, we're done
This is, in essence, how the C memcpy function works. Not so bad, is it? For reference, this is what our x86 code would look like in C:
int* src = (int*)10000;
int* dest = (int*)20000;
int count = 1000;
while (count--) {
    *dest++ = *src++;
}
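If you want a C version that lines up with the assembly instruction by instruction, a do-while loop is a closer match, because the assembly checks the counter after the loop body rather than before it. The sketch below is one way to write that; the function name copy_words and its parameters are made up for this example, and the source and destination are passed in rather than hard-coded to addresses 10000 and 20000.

```c
#include <stdint.h>

/* Mirrors the assembly loop step by step. Like the assembly, it tests the
   counter after the body, so it assumes count is at least 1. */
static void copy_words(uint32_t *dest, const uint32_t *src, uint32_t count) {
    const uint32_t *esi = src;   /* mov esi, <source address>      */
    uint32_t *edi = dest;        /* mov edi, <destination address> */
    uint32_t ecx = count;        /* mov ecx, <count>               */
    do {
        uint32_t eax = *esi;     /* mov eax, [esi] */
        *edi = eax;              /* mov [edi], eax */
        esi += 1;                /* add esi, 4 (C scales pointer math by 4) */
        edi += 1;                /* add edi, 4 */
        ecx -= 1;                /* sub ecx, 1 */
    } while (ecx != 0);          /* cmp ecx, 0 then jne begin_loop */
}
```

The only behavioural difference from the while (count--) version is a count of zero: the while loop copies nothing, whereas this loop, like the assembly, would run the body once and then keep going until the counter wraps around. With a fixed count of 1000, the two behave the same.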
From here, all it takes is a good instruction reference, some memorization, and a bit of practice. x86 is full of arcane details (it's 30 years old!), but once you've got the basic concepts down, you can mostly ignore them. I hope I've shown you that writing x86 is easy. Perhaps more importantly, I hope you won't be intimidated the next time Visual Studio shows you the assembly for your program. Understanding how the machine is executing your code can be invaluable when debugging.
This is pretty good stuff to know, but any time I'm tempted to play with asm I get caught up in little details like what assembler do I use, how do I call built in OS functions from it, and where do I find libraries for asm rather than C. Just so I know that I can produce some results. I never did decide on where to start with that.