Understanding Windows x64 Assembly

Chinese translation: https://www.cnblogs.com/ioriwellings/p/17265697.html

 
Contents

Code Repository
Introduction
  Why should I care about assembly?
  Requirements
    What you will need
    What you should know
  Setting up
    Netwide Assembler
    Hello, world
CPUs: A hardware refresher
  Types of memory
  Registers
    General-purpose registers
    Floating point registers
    Extensions: SSE/AVX/AVX2/AVX512
    Special purpose registers
Windows: the window to the hardware
  The stack and heap, revisited
  Other program segments
  The Virtual Memory Address System
  Bits, bytes and sizes
  The Microsoft x64 Calling Convention
    What is a calling convention?
    Alignment requirements
    Function parameters and return values
    Volatile and non-volatile
    The shadow space
Hello, world revisited
  Assembly directives: a refresher
  Breaking the code down
    Architecture directives
    Addressing modes
    rip-relative addressing
    Initializing data in assembly
    Importing and Exporting symbols
  The Microsoft C Runtime Library
  WinMain and main
  Making a stack
  Calling functions in assembly
  Shutting down the program
  Writing a build script
Using assembly in C/C++ programs
  A factorial function
  Making a DLL from assembly
  Comparing our work
Debugging assembly
  The tools
    AsmDude
    WinDbg Preview
    radare2/Cutter and IDA
  Recognizing patterns
  Debugging: A real life issue
    The symptoms
    The suspect (code)
    Conventional debugging
    Dropping to assembly
Beating the compiler
  A correlation function
    NASM macros
    SIMD in assembly
    Packing values in SIMD
    SIMD math operations in assembly
    Known limitations
Working with the compiler
  Auto-vectorization
  Correlation, even faster?
Closing thoughts
Additional Resources
Credits

 

Welcome.

For those of you who have decided to take this step closer to the silicon that powers most of the modern world, I hope this document will help you just that little bit more when things seem confusing or obtuse. If this document itself proves to be either of those things, please do contact me and I will try to remedy such issues.

Code Repository

All the source code for this tutorial is available here.

Introduction

This is not a tutorial for absolute beginners on how to get started with programming assembly, or an “assembly for dummies” guide. Rather, it is a set of notes and observations I have made while on my own journey into the Microsoft x64 calling conventions that I hope will be useful to others who attempt the same path, especially since there seems to be a dearth of useful information out there regarding some of the stumbling blocks I've come across.

If you are looking for a tutorial that will walk you through the basics of assembly step-by-step, I highly recommend taking a look at the Additional Resources section.

Why should I care about assembly?

If you've never touched x64 assembly before, it's unlikely that I'll win you over in a few paragraphs, but let me give my hot take:

 

  • Dropping to assembly has helped me figure out bugs in release code under real production scenarios. (A particularly nasty one ended up being triggered by a driver bug! Thanks, “gaming” drivers.) I've also triaged issues in production code from third-party vendors that were sent back to them to resolve. External code aside, it's also proven useful to debug issues that occurred only in non-debug builds of Maya plugins, for example. And this is only with the limited knowledge that I have available! Imagine what you could do if you were Raymond Chen, for example.

     

     

  • Possible performance optimizations, or understanding how to modify your code to take better advantage of compiler optimizations. Some programmers resort to algorithmic changes or increased resource usage (threading, GPGPU, etc.) in order to achieve better performance; most of the time, there's actually a ton of compute being left on the table that could be better utilized instead. There's also the fact that if you don't know what to look for in your assembly, you won't know exactly if the compiler's doing what you wanted.

     

     

  • Ultimately, whatever code you write in whatever language you choose gets executed on actual, real hardware, through abstractions that we take for granted. Without understanding assembly, which is a lower-level abstraction closer to the hardware/metal, you'll never be able to understand the reasoning behind some of your decisions at the higher-level of abstractions. Once you have a clear sense of what actually happens on the hardware, it becomes very obvious what programming decisions are going to actually help improve performance. You'll be able to understand which optimizations are actually premature, and which should be done “by default”.

     

     

  • A better understanding of unexpected behaviour. Don't understand why your code changes are triggering a performance hit or a crash? Profilers can only tell you so much; if you understand how assembly works and how the compiler generates it, you'll be better equipped to handle odd degenerate cases, hardware quirks, or even compiler bugs. The only thing left between you and solving the most perplexing of issues will be the hardware itself.

 

Requirements

This tutorial will focus on writing x64 assembly on Windows. Linux and OSX, which use the System V ABI calling conventions, will not be covered here, as I feel that there is already a plethora of information out there covering how to write assembly on those platforms. However, in the event that I find it useful to cover information regarding those platforms, it will appear in the following format:

 

Crossing the platforms

 

Platform-specific information goes here.

 

What you will need

 

  • A PC running Windows 10.

     

     

  • C/C++ development environment set up and ready to go. (If you want to see what my Emacs setup looks like, it's available here.)

     

     

  • You will need a relatively modern version of Visual Studio. I will be using 2017 for this tutorial. This may also serve as your C/C++ development environment, if you so prefer.

     

     

  • The Netwide Assembler compiler, for compiling our assembly code itself. This tutorial was written using 2.13.03, but any modern version should work.

 

 

Why not MASM?

 

On Windows, the officially supported assembler is the Microsoft Macro Assembler (MASM). Unfortunately, its feature set is rather limited, especially in the area of macros. It's also not really maintained by Microsoft much these days, and so I am using NASM instead, since it has more feature-rich macros, which I make liberal use of. You are free to practise with MASM instead; apart from some basic syntax differences, the concepts are pretty much identical and shouldn't be too difficult to port over.

To learn how to use MASM, please refer to the MSDN documentation for it.

 

 

Reliability issues

 

I've noticed that NASM's website seems to be rather poor in terms of reliability. If you find that the website is down at the time of reading this tutorial, please use your Google-fu and try to find mirrors to the documentation or the assembler itself. If all else fails, version 2.14 of the assembler is available here and the manual is available here.

 

What you should know

 

  • Basic knowledge of C/C++. More importantly, you should be innately familiar with how to compile and run a program, understand what is involved in the compilation process, and know how it all comes together to generate that EXE/DLL. You should also have basic competence with writing C code and with debugging and inspecting the execution of your programs.

     

     

  • Vague knowledge of assembly. While this tutorial is not aimed at experienced programmers, as previously mentioned, it's not going to be an assembly-from-scratch introduction either. Having a vague inkling of assembly and how it relates to hardware, whether from MIPS, ARM or whatever other architecture you might have previous experience with, will be incredibly helpful in transitioning to the x64 ISA. If you have zero experience with the topic, please do feel free to continue reading, and pay attention to the refresher and resources sections of the tutorial.

 

In the next section, we'll go over setting up a basic toolchain for starting x64 assembly programming on Windows.

Setting up

Now, we'll get started with compiling a simple “Hello, world” program, which will verify the toolchain that you have installed on your computer works before we go any further. I find it helps to always be able to actually run the code that you're trying to understand.

Netwide Assembler

NASM on Windows is distributed via the website itself. Download the latest stable version (either the installer or the standalone version is fine), and make sure that you can run nasm -v from the command line. If all goes well, you should get something like the following:

nasm -v
NASM version 2.13.03 compiled on Feb  7 2018

Hello, world

Once you have verified that the assembler is available on your PATH, you're ready to go! To test if the assembler is working, create the following file and name it hello_world.asm:

bits 64
default rel

segment .data
    msg db "Hello world!", 0xd, 0xa, 0

segment .text
global main
extern ExitProcess

extern printf

main:
    push    rbp
    mov     rbp, rsp
    sub     rsp, 32

    lea     rcx, [msg]
    call    printf

    xor     rax, rax
    call    ExitProcess

You might be asking, “what on earth am I typing here?” at this point. We'll go over this later. For now, though, let's just make sure that your toolchain is working.

Let's go ahead and try to assemble this text into an object file. Run the following command in a command prompt.

nasm -f win64 -o hello_world.obj hello_world.asm

If it worked, you should see the hello_world.obj file appear in the same directory as your hello_world.asm file. We can then use the linker to create an executable out of this object file, just like when compiling C/C++ programs. Run the following command from a command prompt that has the Visual Studio environment variables set (e.g. the Developer Command Prompt for VS2017):

link hello_world.obj /subsystem:console /entry:main /out:hello_world_basic.exe

You should now have a hello_world_basic.exe file in the directory. Run it, and if you get the following output, congratulations, you've just written an assembly program that runs on Windows!

hello_world_basic.exe
Hello world!

Now that we've verified that you're able to assemble code and create an executable that runs it, let's take a step back from the keyboard, and go to the whiteboard for a bit to get a brief refresher course on the basics of the hardware we're dealing with here in this tutorial.

CPUs: A hardware refresher

Let's focus on the hardware that exists for the majority of the population when they run programs on their computers: the Intel x86-64 CPU. For those of you who already have a good understanding of assembly, you can skip this section. If you only have an understanding of C/C++ and have never really understood the details regarding what actually lives on a CPU, you might want to read this section before moving on.

 

Put on your reading glasses

 

For those who want a more detailed explanation of the topics covered in this section, I recommend reading What Every Programmer Should Know About Memory by Ulrich Drepper. It is a fantastic resource that goes into more details in a far more comprehensive fashion than I could ever hope to achieve. It's a long read, so make yourself comfy before getting started!

 

Types of memory

This could be an entire article on its own, so we're just going to go over the basics here. By now, you should be familiar with the fact that everything that happens with computers happens by moving memory around; all we're doing is just writing code at an abstraction level higher than that to help us manage that movement of memory. As such, how fast the hardware can move memory from one place to another is one of the most important considerations when we write code that uses different types of memory in different ways.

The common types of memory available to us when programming x64 assembly are, in order of speed of access (for the current generation of Intel Core i7 processors):

 

  • CPU registers (mostly 64 bits per register, with exceptions for some special registers that could go up to 512 bits wide)

     

     

  • CPU caches (L1, L2 and L3, with L1 and L2 being ~64 KB and ~256 KB per core, and L3 being shared across all cores with sizes ranging from 4 MB to 24 MB) An illustration of how these caches are shared among the cores is shown below:

 


Figure 1: A representation of the L1/2/3 caches in relation to the CPU.

 

As you can see, each physical core has its own L1 and L2 cache, with the L3 cache being shared among all the cores. Please note that the diagram is not representative of how the CPU is actually laid out physically in the real world.

 

  • RAM (The sticks that you put on your motherboard, generally measuring anywhere from 2 GB all the way to 128 GB or more)

     

     

  • PCI-E NVMe (If you have a solid state drive that uses this interface, which can be anywhere from a few hundred gigabytes to 4 terabytes' worth)

     

     

  • Hard disk (Through the SATA cables, regardless of whether your disk is a solid-state or spinning platter drive, which have similar sizes to SSDs)

     

     

  • Network (over a CAT5e cable or similar, which then must go through this entire list of memory all over again from whichever PC it's retrieving the information from)

 

It's no coincidence that memory speed seems to be a function of distance and size; in other words, the faster the memory is, the smaller its capacity tends to be, and the closer it tends to sit to the CPU on the motherboard. Turns out that high-school physics comes into play here as well, even in the real world!

Also worth noting is that when we talk about the speed of access, we're talking about differences of whole orders of magnitude; even accessing memory from the L2 cache can be more than 6 times slower than accessing a CPU register! While CPU manufacturers devise new and ingenious solutions to help memory designs get faster and faster (and also more complex in the process), we must make sure that as programmers, we keep this in mind when we write our code. Memory access is not free, and abstractions have a cost, no matter how insignificant they may seem.

For now, we're going to talk a little bit more about the major type of memory that we'll be dealing with when writing assembly, which also happen to be the fastest type of memory available that we have access to: the CPU registers.

Registers

Central Processing Units (CPUs) are much more complicated these days than they were in the early days of computing. Nevertheless, some things have persisted through the decades, and one of those things is how CPU registers work.

What are they? Simply put, they're storage locations on the CPU die that are incredibly fast. In fact, they're the fastest type of memory that you have access to on all the hardware available to you. The x64 registers evolved from the x86 registers, which in turn evolved from the 8086 architecture, which in turn evolved from the 8080 microprocessor...anyway, you get the idea. CPU registers on the x86-64 instruction set architecture have a very long history, which influences their naming conventions, their purposes (which have become more general rather than their original specialized ones), and their sizes. This has resulted in an (unfortunately) slightly complex way of referring to the registers, in the name of preserving backwards compatibility across architecture changes over the years.

Let's talk about the various kinds of registers available to us.

General-purpose registers

There are 16 GPRs in the x64 instruction set; they are referred to as rax, rbx, rcx, rdx, rbp, rsi, rdi, rsp, and r8 through r15. The term “general-purpose” is a little misleading; while they are technically general-purpose in the sense that the CPU itself doesn't govern how they should be used, some of these registers have specific purposes and need to be treated in specific ways according to what is known as the OS calling conventions, which we will cover in a little bit. For now, let's try to visualize how these registers are addressed a little better.

 

Crossing the platforms

 

There are two major calling conventions in practice for the x64 instruction set; the Microsoft x64 calling convention used in Windows, and the System V ABI used in Linux and OSX. As the name of this tutorial implies, we will only focus on the Microsoft one here, though if you work on Linux or OSX as a target platform for your software, I recommend getting familiar with that convention as well.

 

|<---------------------------- rax (64 bits) ---------------------------->|
                                    |<---------- eax (32 bits) ---------->|
                                                      |<-- ax (16 bits) ->|
                                                      | ah (8b) | al (8b) |

Figure 2: Example of a single 64-bit GPR extended from the 8080 ISA.

 

The diagram above shows how different parts of the register rax can be accessed. Each of the GPRs in the x64 instruction set is 64 bits wide. To access different parts of the register, the names al, ah, ax and eax are used, where a can be substituted with any of the other registers (i.e. to access the lower 8 bits of rbx, you would use bl, and to access the lower 32 bits of the rcx register, you would use ecx). We'll see why this is important when we start writing actual assembly code, but since you'll be seeing these names a lot, it's best to get familiar with how they relate to different parts of the register.

If you have trouble remembering the suffixes/prefixes, the way I like to think about it is that l stands for “low”, h for “high”, e for extended and r for “register”, as in the full register (all 64 bits) itself.

There is an exception to how these prefixes/suffixes work, which applies to the r8 through r15 registers. To access the lower 32 bits of these registers, you add the d suffix to the end (i.e. r8d), for the lower 16 bits, the w suffix is used (i.e. r8w), and finally to access the lower 8 bits, the b suffix is used (i.e. r8b). You can think of d meaning “double word” (where a word on the x64 ISA is 16 bits), w meaning “word”, and b meaning “byte” (where a byte is 8 bits). To make sure we're crystal-clear here, the diagram below illustrates this.

|<---------------------------- r8 (64 bits) ----------------------------->|
                                    |<---------- r8d (32 bits) ---------->|
                                                      |<- r8w (16 bits) ->|
                                                                | r8b (8b)|

Figure 3: Example of a 64-bit GPR introduced in the x64 ISA.

 

We see a difference here compared to the rax register; there is no way to address the higher 8 bits of the r8 register. This is because the registers r8 through r15 were introduced in the x64 ISA and were not part of the original x86 ISA. As such, there was no need to worry about backwards compatibility with these new registers being introduced.
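
To make this concrete, here is a small illustrative snippet (the values in the comments follow from the initial mov):

mov     rax, 0x1122334455667788    ; load a full 64-bit value
mov     ebx, eax                   ; ebx = 0x55667788 (lower 32 bits; note that
                                   ; writing to a 32-bit register also zeroes
                                   ; the upper 32 bits of the full register)
mov     cx, ax                     ; cx = 0x7788 (lower 16 bits)
mov     dl, al                     ; dl = 0x88   (lowest 8 bits)
mov     dh, ah                     ; dh = 0x77   (bits 8-15)
mov     r8d, eax                   ; the newer registers use the d/w/b suffixes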

Floating point registers

Without getting into too much detail about how floating-point works in terms of memory storage right now, suffice to say that the GPRs are (usually) only used for storing integers and memory addresses, while the floating point registers are used for storing actual floating point numbers.

In the x64 instruction set, there are 16 floating point registers, known as the xmm registers. These are 128 bits wide, and unlike the GPRs, their lower bits cannot be addressed separately. They number xmm0 through xmm15, and as you might expect, are generally used for SIMD operations as well.

 

A long, long time ago...

 

You may have noticed the existence of several oddly named registers named st0 through st7 when you open the registers window in Visual Studio while debugging a program. These registers are 80 bits wide, and exist as an artifact of history from the x87 coprocessor, which was also known as the Floating Point Unit (FPU). These days, the registers still exist for legacy reasons, but they are generally unused by modern calling conventions. The CDECL calling convention treated them the same way that the XMM registers are treated by the modern OS calling conventions of today: to be used for working with floating-point values.

 

It's worth noting that since the ratification of the IEEE 754 floating point standards, floating point numbers use either 32 bits (1 sign bit, 8 exponent bits and 23 bits for the mantissa) or 64 bits (1 sign bit, 11 exponent bits and 52 bits for the mantissa) each, so the xmm registers are well-suited to operations involving more than one floating point value at once, since you can stuff 4 floats or 2 doubles into each XMM register.
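
As a quick illustrative sketch of that packing (the vals constant here is made up for the example):

segment .data
    align 16
    vals    dd 1.0, 2.0, 3.0, 4.0  ; four 32-bit floats = 128 bits

segment .text
    movaps  xmm0, [vals]           ; load all four floats into xmm0 at once
    addps   xmm0, xmm0             ; add each lane to itself: 2.0, 4.0, 6.0, 8.0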

 

How floating point works in computing

 

If you're not familiar with how floating point works, I strongly recommend getting a good grasp on the subject. There are plenty of resources available regarding this topic that cover it in greater detail than I will be expanding on here, and there are even visualizers available that can show you how a floating point number is represented in binary format, which is also how they will get stored in the floating-point registers.

 

We generally will not need to convert between binary/hexadecimal/floating point representations by ourselves, since debuggers have facilities to help us do the math, but it's good to know how they're stored so that you can recognize patterns and regions of memory that might have data you're interested in.

Extensions: SSE/AVX/AVX2/AVX512

As previously mentioned, the XMM registers are 128 bits wide. They were introduced by Intel in 1999 as an extension to the x86 instruction set, known as the Streaming SIMD Extensions (SSE). However, just like the GPRs have been expanded over the years, so have these new wide registers. The Advanced Vector Extensions (AVX) introduced wider, 256-bit registers, and AVX-512 bumped this up to whopping 512-bit wide registers.

In terms of addressing them, they follow a similar pattern to the GPRs, where xmm addresses the lower 128 bits, ymm the lower 256 bits, and zmm the entire width of the register (if you're lucky enough to have a CPU that expensive), which totals 512 bits.

|<---------------------------- zmm (512 bits) ----------------------------->|
                                    |<----------- ymm (256 bits) ---------->|
                                                      |<-- xmm (128 bits) ->|

Figure 4: Illustration of a SIMD register in the x64 ISA.

 

 

Spot the difference?

 

You might have noticed that Intel has a distinction between the AVX and AVX2 terms when referring to their CPU feature sets. For all intents and purposes, both instruction set extensions use the same registers in the same fashion; AVX2 adds new instructions designed for use with the ymm registers that allow for more efficient math operations that weren't available in AVX. We will go over some of these operations later when we write some SIMD code in assembly.

 

Most consumer-grade CPUs these days (as of 2018, anyway) will support AVX2. AVX-512 itself is still limited to only higher-grade CPUs that are generally seen only in high-end PC workstations and servers. We will not be covering usage of these registers in this tutorial, but the concepts you learn from working with the narrower xmm and ymm registers will expand to that of the zmm registers as well.

It's important to note here that the diagram above does not mean that a SIMD register has separate lanes for xmm, ymm and zmm instructions; they all form part of the same register; the names just allow you to access different parts of it. It's also worth noting that while the xmm/ymm registers number 0 through 15 (for a total of 16 SIMD registers), AVX-512 expands on this and offers a whopping 32 SIMD registers, all 512 bits wide, numbering from 0 through 31.

Special purpose registers

In addition to the registers that are available for use, there are also other registers on the CPU that technically can be used, but really shouldn't under almost all circumstances. These are registers that are reserved for special uses, or part of the evolution of the ISA and left behind for compatibility reasons. We'll talk about some of the more significant ones that you might run into on a more frequent basis when dealing with assembly programming.

Firstly, there's the instruction pointer, or rip. This 64-bit register holds the address of the next instruction to be executed. While you can do some neat tricks by moving rip around to simulate jumping around the program, this is almost always a bad idea, and you should leave this register alone under most circumstances. It is, however, useful for orienting yourself when you're looking at the disassembly of an application, since it tells you where the program's execution currently is.

If this is a little hard to understand, here's a diagram that should clear things up.

foo:
    push    rbp
    mov     rbp, rsp
    sub     rsp, 32
    mov     rax, 0      ; <-- execution halted here
    leave               ; <-- the address of this instruction is now in `rip`
    ret

Figure 5: Explanation of how the rip register contents work.

 

The next instruction to be executed would be the leave instruction, which would mean that the address pointing to its location in the program's memory would be in the register rip at this point.

Secondly, there's the stack pointer, or rsp. While this is technically considered a general-purpose register, it is meant to point to the bottom of the stack. Messing with it is allowed, but if you break the calling conventions (e.g. by leaving it aligned on a non-16-byte boundary), you might trigger a crash. We'll go over the calling conventions a little bit later.

Third, the next special-purpose register that I want to cover is the base pointer, or the rbp register. This is usually used to hold the value that rsp had on entry to the current scope, before new items were pushed onto the stack. It thus allows us to remember where the stack frame started, and to unwind the stack when we leave the current scope (i.e. when returning from a function call, for example).

Finally, the last special-purpose register you should know about is the flags register, also referred to as the FLAGS/EFLAGS/RFLAGS/efl/rfl register. Unlike the GPRs, this register stores bit flags that are set after certain assembly instructions have executed. A list of the commonly used bit flags is shown below:

CF = Carry flag         PF = Parity flag        AF = Adjust flag
ZF = Zero flag          SF = Sign flag          TF = Trap flag
IF = Interrupt flag     DF = Direction flag     OF = Overflow flag
IOPL = I/O privilege level flag (legacy)        NT = Nested task flag (legacy)
N/A = Reserved

Figure 6: Illustration of the bit flags in the FLAGS register.

 

While this is a register that you normally would not be setting values in yourself, it is useful for referring to in order to infer the result of a previous operation. For example, if you notice that the sign flag is set, it means that the previous instruction was probably a test or cmp instruction. Instructions like jl (Jump short if Less) rely on checking the bits set in this register in order to know which branch that they should execute.

It's worth noting that there are only 16 bits laid out here, with one of them already unused (the last bit, which is reserved for future standards extensions). While the register itself is technically a 64-bit register (rfl), only the lower 16 bits are used in this case. The rest of the bits are reserved, and should not be tampered with or relied upon.

We will mostly be referring to the sign, zero, carry, direction, parity and overflow flags when working in this tutorial, so I would advise you to try and remember what their semantic meanings are.

 

  • DF: Determines the direction in which string data is processed (i.e. when moving or comparing string data). 0 means low-to-high addresses (left-to-right), and 1 means the reverse.

     

     

  • SF: Shows the sign of the result of an arithmetic operation, as indicated by the high-order (left-most) bit of the result. A negative result sets this bit to 1, and vice versa.

     

     

  • ZF: Set to 1 when the result of an operation is zero, and 0 when it is non-zero.

     

     

  • OF: Used to indicate when an arithmetic operation has resulted in overflow, which means that the result did not fit in the number of bits used for the operation by the Arithmetic Logic Unit (ALU). This can happen when you add a large enough integer to another, resulting in a value greater than can be represented in a 32-bit operation, causing integer overflow. 1 indicates an overflow occurred, and vice versa.

     

     

  • PF: Used to indicate whether the number of bits set in the result (strictly speaking, in its least-significant byte) is even. For example, if the result of an operation was 01100, this flag would be set to 1, indicating an even number of set bits (2), and vice versa.
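
To see a couple of these flags in action, here's an illustrative fragment (the labels are hypothetical):

    cmp     rax, rbx    ; computes rax - rbx, discards the result, and
                        ; sets ZF, SF, OF and CF based on the outcome
    je      .equal      ; jump if ZF = 1 (the operands were equal)
    jl      .less       ; jump if SF != OF (rax was less than rbx, signed)
    jmp     .done
.equal:
    ; rax == rbx
    jmp     .done
.less:
    ; rax < rbx
.done: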

 

Windows: the window to the hardware

Now that we understand the basics of the physical hardware that we're going to be working with, it's about time to take a look at the interface that we're going to be using in order to access it, that is, the operating system itself.

Let's start by taking a look at how the rsp register is used more thoroughly when working with the stack.

The stack and heap, revisited

While you should be familiar with the concept of the heap and stack by now if you've been programming in C/C++, let's take this opportunity to do a quick refresher. The stack and heap are just different memory segments of a program; while there are OS kernel functions (HeapAlloc, VirtualAlloc, etc.) that are designed to work specifically with the heap segment of a program's memory address space, both are essentially just regions of memory in the layout of a program's address space.

(2^64 - 1)  higher memory addresses

    +------------------------------+
    |            stack             |  grows downwards (subtract from rsp)
    |              |               |
    |              v               |
    |                              |
    |              ^               |
    |              |               |
    |             heap             |  grows upwards (e.g. via malloc())
    +------------------------------+
    |   bss (uninitialized data)   |
    +------------------------------+
    |   data (initialized data)    |
    +------------------------------+
    |         text (code)          |
    +------------------------------+

            lower memory addresses

Figure 7: Illustration of a program executable's memory layout.

 

From the illustration shown, we can see that in order to grow the stack, we actually subtract from the stack pointer; thus, as new objects are pushed onto the stack, their memory address decreases. This may seem counter-intuitive, but this is how both Windows and Linux/OS X have defined their memory layouts, so it's important to keep this in mind. In contrast, the heap grows by expanding upwards towards the stack's memory addresses, which means that new objects allocated on the heap will have increasingly higher memory addresses.

Generally, you won't allocate objects on the heap by yourself directly; you'll be using the OS kernel functions to do so. The main takeaway here is that while there is historically a distinction between the two regions of memory, the fact is that they are just that; two distinct regions of memory with specific semantics that apply to either of them. Neither is inherently faster than the other since they occupy the same virtual address space and map to the same physical hardware. The only difference is up to us and how we choose to make use of these separate regions of memory.

 

Direction of heap growth

 

John Calsbeek points out that this diagram is slightly misleading; the heap isn't guaranteed to always grow upwards as a contiguous pool of memory. The implementation of how the OS heap functions work is technically not mandated by the x64 ABI, and thus heap addresses should not be given the same treatment of predictability as you would give stack addresses.

 

Other program segments

You may have noticed in the diagram above that in addition to the stack and heap, there are separate, distinct regions of memory in the program's address space that I drew, namely the bss, data and text segments. If you recall the Hello, world section example, you might also have realized that there are labels in that example that line up with the names of these program segments.

Well, that's no coincidence; it turns out that is exactly what is being defined in those regions, and these regions exist as a result of the PE executable file format, a standardized format that Microsoft dictates for how an executable's memory segments should be arranged in order for the operating system's loader to be able to load it into memory.

 

Crossing the platforms

 

On Linux/OS X, the executable file format is known as the Executable and Linkable Format, or ELF. While there are differences between it and the PE file format used by Windows, the segments we'll be dealing with here exist on both, and function the same way on both. On a fun note, Fabian Giesen has an amusing tweet regarding the name.

 

Let's go over what each of the major segments are for.

 

  • Block Started by Symbol (BSS): This segment of memory is meant for variables that are uninitialized (e.g. int a;). The name, as is the case with many things in assembly, has a very long history.

     

     

  • Data segment: This segment of memory contains constants or initialized data (e.g. int a = 5; or const int a = 5;). This is fairly straightforward.

     

     

  • Text segment, also known as the code segment: This contains the actual assembly instructions, hence the name.

 

 

Holy Segmentation Fragmentation Batman, what on earth is all that?

 

If you did click on the MSDN link above, you might have been taken aback at all the different types of possible segments that can exist in a x64 program's memory address space. For this tutorial, we will only be focusing on the ones just described, but if you're going to be debugging code in production that wasn't originally written in assembly (which is likely to be the case), you might want to keep that reference page handy.

 

Now that we have a better understanding of how memory is mapped by the operating system into the address space of a process, let's take a closer look at how that mapping process works.

The Virtual Memory Address System

If you've never heard of the term virtual memory before, this section is for you. Basically, every time you've stepped into a debugger and seen a hexadecimal address for your pointers, your stack variables, or even your program itself, this has all been a lie; the addresses you see there are not the actual addresses that the memory resides at.

What actually happens here is that every time a user-mode application asks the operating system for an allocation of memory, whether by calling HeapAlloc, reserving it on the stack through int a[5] or even doing it in assembly, the operating system performs a translation from a physical address (a real, addressable location on the hardware) to a virtual address, which it then provides you with to perform your operations as you deem fit. Processes are mapped into a virtual address space, and memory is reserved by the operating system for that space, but only in virtual memory; the physical hardware could still have memory unmapped to that address space until the user-mode process actually requests an allocation.

If the memory is already reserved and available for use, the translation proceeds and the virtual addresses are handed back to the user-mode process for use. If the physical memory is not available yet (e.g. because the pages have not been mapped yet, or have been paged out to slower storage), the OS will page in memory from whatever physical storage is available, whether that be DDR RAM, the hard drive, etc.

The translation of the physical memory addresses of the hardware to virtual memory addresses that the operating system can distribute to applications is done using a special piece of hardware known as the Memory Management Unit, or MMU. As you might expect, if every single memory request from a program required a corresponding translation operation, this might prove costly. Hence, there is another specialized piece of hardware known as the Translation Look-aside Buffer, or TLB cache, which also sits on the CPU and caches the results of previous address translations; these are then handed to the operating system, which in turn hands them to the application requesting the allocation.

Both pieces of hardware are present on all modern CPUs. The exact way in which both of them work to do the actual translation process is implementation-dependent and can vary across different CPUs, so we won't worry about them in this tutorial.

In addition to the TLB cache, there is also the presence of what's known as a page table that the VMAS implements. This table is where the operating system itself stores the mappings of the virtual and physical addresses, with each individual association referred to as a page table entry. This essentially acts as a larger version of the TLB cache, but due to its implementation it is slower (the TLB cache is what's known as a fully associative cache, as opposed to the page table, which is usually implemented similarly to a linked list in terms of data structure and thus has to be walked in order to find an entry). If the TLB cache and the PTEs are both missed, then a page fault occurs, and new physical memory addresses are mapped into the PTEs and returned to the CPU.

Core --(virtual address)--> TLB cache --(physical address)--> CPU caches (L1/L2/L3)
                                |
                           (TLB miss)
                                v
                               MMU --> Page Table Entries (PTE) --> main memory

Figure 8: How a virtual address is translated to a physical address.

 

There are two important things to note here. The first is that the TLB cache is in practice a very small cache, and thus both temporal and spatial locality are very important when we program in order to avoid TLB cache misses (i.e. having to rely on the MMU to do a translation because the previous translation is no longer in-cache). The second is that while TLB cache misses are costly, page faults are even more so, and so we should write our code to reduce the possibility of either scenario from taking place if we want our programs to run quickly. So even though we are working at the level closest to the bare metal possible, there are still layers of abstraction between us and the actual, physical hardware, imposed both by the operating system and the CPU.

 

Trick or travesty?

 

If you're wondering what the purpose of all this CPU machinery is, there was a time when there was no such thing as a VMAS, and programs were allowed to use actual physical memory addresses on the hardware. This was a much simpler, but also much more dangerous time, since an application could easily bring down the machine by utilizing reserved memory addresses (perhaps for hardware interrupts, I/O, etc.) in a manner that was unsupported. For example, the Game Boy CPU (similar to an Intel 8080) did not have the concept of virtual memory.

The Intel 286 was the first processor to introduce the concept of a protected mode, which introduced the virtual address system that has become ubiquitous among all modern CPUs. This allowed for safer multi-tasking between concurrently running applications (since the OS could intercept calls to invalid memory addresses when translating the virtual addresses to physical ones and trigger a segmentation fault instead) and concepts such as privilege levels to become possible.

So, while there's no doubt that this concept has certainly increased the complexity of how we approach programming these days, there's clear benefits to the paradigms introduced here that were deemed significant enough to outweigh the drawbacks of added complexity and a slight performance hit.

 

 

A black hole of knowledge...

 

This topic's rabbit hole goes very deep, and I will not cover all the implementation details regarding the TLB here. However, I highly recommend perusing the Resources section, especially the book by Ulrich Drepper to get more details regarding the TLB cache and how it affects your code. Specifically, I would recommend looking at the way the TLB maps page-table entries to the segment-table entries in order to generate the final virtual address from a physical one. There's also good information in there about how PTEs are implemented on the x64 ISA, which should make it clear why they're slower than the TLB cache.

 

Bits, bytes and sizes

At this point, it's also probably a good idea to give a quick refresher for the non-Win32 API programmers out there who might not be as familiar with the nomenclature of the Windows data types.

 

  • Bit: 0 or 1. The smallest unit of memory.

     

     

  • Nibble: 4 bits.

     

     

  • Byte: 8 bits. In C/C++, you might be familiar with the term char.

     

     

  • WORD: On the x64 architecture, the word size is 16 bits. In C/C++, you might be more familiar with the term short.

     

     

  • DWORD: Short for “double word”, this means 2 × 16 bit words, which means 32 bits. In C/C++, you might be more familiar with the term int.

     

     

  • oword: Short for “octa-word”, this means 8 × 16 bit words, which totals 128 bits. This term is used in NASM syntax.

     

     

  • yword: Also used only in NASM syntax, this refers to 256 bits in terms of size (i.e. the size of ymm register.)

     

     

  • float: This means 32 bits for a single-precision floating point number under the IEEE 754 standard.

     

     

  • double: This means 64 bits for a double-precision floating point under the IEEE 754 standard. This is also referred to as a quad word.

     

     

  • Pointers: On the x64 ISA, pointers are all 64-bit addresses.

 

These terms will come up often in assembly programming, so make sure you have them all memorized.
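
In NASM, these sizes map directly onto the data-definition and reservation pseudo-instructions. Here's a quick illustrative reference (the label names are arbitrary):

segment .data
    myByte   db 0x12                 ; byte: 8 bits
    myWord   dw 0x1234               ; WORD: 16 bits
    myDword  dd 0x12345678           ; DWORD: 32 bits
    myFloat  dd 1.5                  ; a 32-bit float is also declared with dd
    myQword  dq 0x123456789abcdef0   ; quad word: 64 bits
    myDouble dq 2.5                  ; a 64-bit double is declared with dq

segment .bss
    myOword  reso 1                  ; reserve one uninitialized oword (128 bits)
    myYword  resy 1                  ; reserve one uninitialized yword (256 bits)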

 

Crossing the platforms

 

This nomenclature is specific to the Intel x86-64 ISA. For example, on the ARM64 architecture, the word size is 32 bits, so don't always assume the size specifiers mean the same thing on different ISAs!

 

The Microsoft x64 Calling Convention

Finally, we get to one of the most confusing parts of assembly: the calling conventions. At the end of the day, our assembly code must run on an operating system, and in this case, we're focusing on Windows 10, so let's take a look at what it will take for the operating system to run our assembly code.

The Microsoft compiler (MSVC) actually has a couple of calling convention modifiers that developers can elect to use which change the rules of the conventions, but we'll be focusing on the most common one here, which is known as the Microsoft x64 Calling Convention.

 

For more information on these other calling convention modifiers that MSVC has available, you can refer to the MSDN article. These conventions are usually applicable only in very unusual situations where performance is of the utmost priority, and should not be used under normal circumstances.

 

What is a calling convention?

Simply put, it's a set of strict guidelines that our code must adhere to in order for the operating system to be able to run our assembly code. This is because assembly is, by nature, as close to the hardware as we can possibly get, and so calling conventions standardize how our functions should be implemented and how they should be called on the actual hardware. This allows code from multiple sources to all run together harmoniously, instead of having every developer invent their own conventions around how the hardware should be used and potentially causing incompatibilities between each other's code. When the operating system can make assumptions about how the code runs, it can itself use the hardware to perform certain operations behind-the-scenes (i.e. fix up calls to HeapAlloc() to the right addresses in memory). If you break those assumptions, you risk incorrect execution of the code.

The conventions themselves, much like the ISA, have evolved over the years into a long, gruelling set of standards that will most likely take a bunch of trial-and-error before you get the hang of them. Keep practising, and eventually they'll become second nature to you once you've hit enough crashes.

We won't go over the entire list of requirements here, since that would become more of an academic exercise, but we'll focus on the most practical ones that impact the way you write your code, and talk about any others that come up as we go along.

Alignment requirements

 

  • Most data structures will be aligned to their natural alignment. This means that the compiler will insert padding as needed if a data structure needs to be aligned to a specific boundary. Most of the time, this won't be the case, and your data structures will be left as-is. However, the compiler will sometimes add padding to your structures if it thinks that it will help increase performance by aligning it to certain byte boundaries, depending on the architecture your code is being compiled on. Since we're working with assembly and don't have the benefit of that, we must fulfill any special alignment requirements ourselves if necessary.

     

     

  • Most importantly, the stack pointer rsp must be aligned to a 16-byte boundary before making a call. This means that if we move the stack pointer by 8 bytes, we must move it by another 8 bytes to restore the 16-byte alignment, or we will be violating the calling convention.

 

Function parameters and return values

The calling convention also dictates how functions should be called, and how they should return their results. In order to ensure that we are able to craft assembly that works harmoniously with other code, we must adhere to these rules.

This is a common source of confusion for those new to the calling convention, so I'm going to do my best to illustrate it here.

Integer arguments

From the MSDN documentation:

The first four integer arguments are passed in registers. Integer values are passed in left-to-right order in RCX, RDX, R8, and R9, respectively. Arguments five and higher are passed on the stack.

 

This might seem a little confusing, so let's illustrate this a little better with an example function that takes 5 integers.

void foo(int a, int b, int c, int d, int e)
{
    /// Some stuff happens here with the inputs passed in...
    return;
}

In order to follow the x64 calling convention, we must therefore pass a, b, c and d in the registers rcx, rdx, r8 and r9 respectively, with e being pushed onto the stack before calling the function foo(). Thus, if I wanted to call foo() like so:

foo(1, 3, 5, 7, 9);

Then I would have to make sure that the registers held the appropriate values, along with the first item on the stack as well:

rcx = 1          stack
rdx = 3          -----
r8  = 5          9    (1st item; `rsp` points here)
r9  = 7          ...  (grows down)

Figure 9: How the memory should be laid out before calling foo().
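
A sketch of what this setup might look like in NASM (this assumes the stack space, including the 32-byte shadow space we'll cover shortly, has already been allocated and that rsp is properly aligned):

    mov     rcx, 1                 ; a
    mov     rdx, 3                 ; b
    mov     r8, 5                  ; c
    mov     r9, 7                  ; d
    mov     qword [rsp + 32], 9    ; e goes on the stack, just above the
                                   ; 32-byte shadow space
    call    foo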

 

Floating-point arguments

Again, from the MSDN documentation:

Any floating-point and double-precision arguments in the first four parameters are passed in XMM0 - XMM3, depending on position.

 

For floating-point arguments, the xmm registers 0 through 3 are used, for a total of 4 arguments being passed in the SIMD registers, with the rest being pushed onto the stack.

Let's see how this works with another example:

void foo_fp(float a, float b, float c, float d, float e)
{
    // Do something with these floats...
    return;
}

Assuming I wanted to call foo_fp() like so:

foo_fp(0.1f, 0.2f, 0.3f, 0.4f, 0.5f);

I would need to have the following memory set up before making the function call:

xmm0 = 0.1f      stack
xmm1 = 0.2f      -----
xmm2 = 0.3f      0.5f (1st item)
xmm3 = 0.4f      ...  (grows down)

Figure 10: How the memory should be laid out before calling foo_fp().
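
A sketch of the corresponding setup in NASM, where a_val through e_val are hypothetical single-precision constants declared in the data segment:

    movss   xmm0, [a_val]          ; a = 0.1f
    movss   xmm1, [b_val]          ; b = 0.2f
    movss   xmm2, [c_val]          ; c = 0.3f
    movss   xmm3, [d_val]          ; d = 0.4f
    movss   xmm4, [e_val]          ; e = 0.5f is the 5th argument...
    movss   [rsp + 32], xmm4       ; ...so it goes on the stack instead
    call    foo_fp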

 

As we can see, the concept applies much like it does for integers. But what if we have something like this?

void foo_mixed(int a, int b, float c, int d, float e)
{
    // Variable types of arguments...now what?
    return;
}

Which values go in which registers now?

Mixing parameter types

The answer is that the position of the argument dictates which register it goes in. Therefore, if we called foo_mixed() like so:

foo_mixed(1, 2, 0.3f, 4, 0.5f);

Then we would need to have the following memory layout:

xmm0 = ///       rcx = 1        stack
xmm1 = ///       rdx = 2        -----
xmm2 = 0.3f      r8  = ///      0.5f (1st item)
xmm3 = ///       r9  = 4        ...  (grows down)

Figure 11: How the memory should be laid out before calling foo_mixed().
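
Again as a sketch, with c_val and e_val being hypothetical float constants:

    mov     ecx, 1                 ; a (integer, 1st position: rcx)
    mov     edx, 2                 ; b (integer, 2nd position: rdx)
    movss   xmm2, [c_val]          ; c (float, 3rd position: xmm2)
    mov     r9d, 4                 ; d (integer, 4th position: r9)
    movss   xmm4, [e_val]          ; e (float, 5th position)...
    movss   [rsp + 32], xmm4       ; ...goes on the stack
    call    foo_mixed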

 

Hopefully this makes some sense. There are other conventions that govern parameter passing as well, which I'll try to summarize here:

 

  • Intrinsic types, arrays and strings are never passed into a register directly. A pointer to their memory locations is what goes into the registers instead.

     

     

  • Structs/unions 8/16/32/64 bits in size may be passed as if they were integers of the same size. Those of other sizes are passed as a pointer as well.

     

     

  • For variadic arguments (i.e. foo_var(int a, ...)), the aforementioned conventions apply depending on the type of the arguments that are passed in. It is the caller's responsibility to ensure that the memory layout is correct before calling the function. Additionally, for floating-point values, both the integer and floating-point registers must hold the same argument's value, in case the callee expects the value in the integer registers.

     

     

  • For unprototyped functions (e.g. forward-declarations), the caller passes integer values as integers and floating-point values as double-precision. The same rule about floating-point values needing to be in both the integer and floating-point registers applies as well.

     

    For example:

    void foo_unprototyped();
    void foo2()
    {
        foo_unprototyped(1, 0.2f, 3);
    
        return;
    }

    Would mean the following memory layout is required to be set up by the caller:

 

xmm0 = ///       rcx = 1
xmm1 = 0.2       rdx = 0.2 (the same value goes in both registers)
xmm2 = ///       r8  = 3
xmm3 = ///       r9  = ///

Figure 12: The memory layout for a variadic/unprototyped function.

 

Return values

Now that we've sorted our inputs, it's time to do the same for our outputs as well. The Microsoft x64 calling convention, thankfully, has simpler rules when it comes to return values from functions.

 

  • Any scalar return value 64 bits or less in size is returned in rax.

     

     

  • Any floating-point value is returned in xmm0.

 

Thus, a function int foo() would return its value in the rax register, and a function float foo() or double foo() would return its value in the xmm0 register. If your function is returning a user-defined type (such as a struct), the same rules apply to it as if it were being passed in a register: it must be of a size of 1/2/4/8/16/32/64 bits. If it is, it will be returned in the rax register; if not, a pointer to its memory will be returned instead. The caller is responsible for allocating this memory and passing the pointer in the appropriate integer register before making the call to the function.

An example to illustrate this case is shown below:

struct Foo
{
    int a, b, c; // Total of 96 bits. Too big to fit in one of the GPRs.
};

Foo foo_struct(int a, float b, int c)
{
    // Stuff happens...
    return result; // This is a `Foo` struct.
}

Foo myStruct = foo_struct(1, 2.0f, 3);

The caller must ensure the memory in the registers is like so:

xmm0 = ///       rcx = &myStruct (pointer to the return memory)
xmm1 = ///       rdx = 1
xmm2 = 2.0f      r8  = ///
xmm3 = ///       r9  = 3

Figure 13: The memory layout for a user-defined struct.

 

The return value for the function will be a pointer to the myStruct result, in the rax register.
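
A sketch of that caller in NASM, where myStruct is the caller-allocated return memory and two_f is a hypothetical constant holding 2.0f:

    lea     rcx, [myStruct]    ; hidden 1st argument: where to write the Foo
    mov     edx, 1             ; a shifts to the 2nd position (rdx)
    movss   xmm2, [two_f]      ; b = 2.0f stays in the 3rd position (xmm2)
    mov     r9d, 3             ; c stays in the 4th position (r9)
    call    foo_struct
    ; on return, rax also holds the address of myStruct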

Hopefully, just from reading these conventions, you start to realize just how important simple things like the layout of your data structures and function prototypes can be in affecting the behaviour of the hardware! While it's not something you should sweat over every time you write a new function, it's definitely something to keep in mind as you design new APIs, especially if performance is a requirement.

 

Crossing the platforms

 

Remember, this is only for Windows platforms! On Linux/OS X, the System V ABI calling convention calls for the first 6 integer arguments to be passed in rdi, rsi, rdx, rcx, r8, r9 and the first 8 floating-point arguments to be passed in xmm0 through xmm7, with the rest of the respective arguments being pushed onto the stack.

There is also the requirement of a red zone of 128 bytes below the stack pointer, which is a region of memory that is guaranteed not to be clobbered by the operating system (e.g. by signal or interrupt handlers), and which leaf functions may therefore use without adjusting rsp. On current Windows architectures, there is no red zone.

For light reading, you might be interested in Raymond Chen's musings regarding the subject.

 

Volatile and non-volatile

The concept of volatility differs from the volatile keyword in C/C++ slightly, though the general meaning doesn't change. In the Microsoft x64 calling convention, certain registers are treated as volatile, meaning that they are subject to change and are not guaranteed to be preserved between function calls or scope changes. However, some registers are what's known as non-volatile, which means that we are potentially responsible for preserving the state of the registers and restoring their states after our function call is complete. If we do not, then the caller might act upon invalid data in those registers, since we failed to preserve their original values that the caller was dependent on.

Under the Microsoft x64 calling convention, the rax, rcx, rdx, r8, r9, r10 and r11 registers are considered volatile, while the rbx, rbp, rdi, rsi, rsp, and r12 through r15 registers are considered non-volatile and must have their states saved and preserved through function calls. We'll see how this is done in actual assembly code later.
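
As a brief preview, a sketch of the pattern looks like this (some_other_func is a hypothetical callee):

my_func:
    push    rbx                ; rbx is non-volatile: save the caller's copy
    sub     rsp, 32            ; shadow space (also keeps rsp 16-byte aligned)
    mov     rbx, rcx           ; rbx will survive the call below...
    call    some_other_func    ; ...while rcx, rdx, r8-r11 may be clobbered
    mov     rax, rbx           ; safe: rbx still holds our value
    add     rsp, 32
    pop     rbx                ; restore the caller's rbx before returning
    ret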

The shadow space

Under the Microsoft x64 calling convention, there is a unique concept of what's known as a shadow space, also referred to as a home space. This is a space of at least 32 bytes (enough to hold 4 register arguments) that is reserved on the stack every time a function is called. It gives the callee a place to spill the register arguments onto the stack, which among other things allows debuggers to inspect those values later on. While the calling convention does not explicitly require the callee to use the shadow space, you should allocate it regardless when you are utilizing the stack, especially in a non-leaf function.

Also, as a reminder, no matter how much space you allocate for the shadow space and your own function's variables, you still need to ensure that the stack pointer is aligned on a 16-byte boundary after all is said and done.
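
This is exactly what the prologue of our “Hello, world” example was doing. Annotated as a sketch:

main:
    push    rbp         ; on entry, rsp is misaligned by 8 bytes (the `call`
    mov     rbp, rsp    ; pushed an 8-byte return address onto an aligned
                        ; stack); this push restores 16-byte alignment
    sub     rsp, 32     ; reserve the 32-byte shadow space; since 32 is a
                        ; multiple of 16, rsp stays aligned for any calls below
    ; ... rest of the function body ...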

Hello, world revisited

OK, that's enough theory for now. Let's take what we've learned so far, and go back to the “Hello, world” example again.

Assembly directives: a refresher

Before we delve back into the code, here's a quick reminder on how assembly instructions work in the Intel x86-64 ISA.

Assembly instructions generally follow a specific pattern that can be described as follows:

mov     rax, rbx
;  ^      ^    ^
;  |      |    +--- source operand
;  |      +-------- destination operand
;  +--------------- pseudo-op / directive / mnemonic
; the value of rbx is copied to rax

Figure 14: Illustration of what happens after an assembly mov instruction.

 

In this example, the directive/pseudo-op mov is issued with the rbx register as the source operand, and the rax register as the destination operand. Thus, when this instruction is executed by the CPU, the contents of the rbx register will be copied to the rax register.

Some assembly instructions have more than two operands, especially specialized instructions that are designed for SIMD. One such instruction is mulss which, in its AVX-encoded form vmulss, can take 3 operands instead of 2. Let's take a look at the latter:

vmulss  xmm0, xmm1, xmm2
;  ^      ^     ^     ^
;  |      |     |     +--- 2nd source operand
;  |      |     +--------- 1st source operand
;  |      +--------------- destination operand
;  +---------------------- pseudo-op / mnemonic
; xmm2 is multiplied by xmm1, and the result is stored in xmm0

Figure 15: Illustration of what happens after an assembly vmulss instruction.

 

As we can see, this allows us to do work on 3 registers using only one single instruction, which, depending on the number of CPU cycles it costs, might be faster than using separate instructions for multiplying the xmm2 and xmm1 registers first, before moving the result to the xmm0 register. It's important to be aware of these specialized instructions that are available for use when writing bare assembly, as they can yield significant performance gains when utilized correctly.

When it comes to operands, we have other ways of specifying them besides using register names or constant values explicitly. NASM has support for expressions to be used in operands, which boils down to:

 

  • Having a starting address of the memory segment we're referring to.
  • Having an effective address from that memory, which is composed of a displacement and/or a base register.

 

Examples of such instructions are:

mov    rax, [rbx + 8]    ; 8 is the displacement.
mov    rax, [rbx + rcx]  ; rbx is the base register.
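
NASM also accepts the full base + index × scale + displacement form, where the scale can be 1, 2, 4 or 8:

mov    rax, [rbx + rcx*4 + 8]  ; base + index*scale + displacement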

There are hundreds of assembly instructions, and there's no hope of reviewing them comprehensively within the scope of a single tutorial, let alone a single section, even if we limit ourselves to only the most common ones. I advise the reader to refer to the Intel Instruction Manuals, available in the Additional Resources section of this tutorial, for the full reference. For now, I will continue and explain new directives as we encounter them in the “Hello, world!” example and others on a case-by-case basis.

Breaking the code down

Let's look at the code again, line-by-line.

Architecture directives

bits 64
default rel

segment .data
    msg db "Hello world!", 0xd, 0xa, 0

segment .text
global main
extern ExitProcess
extern _CRT_INIT

extern printf

main:
    push    rbp
    mov     rbp, rsp
    sub     rsp, 32

    call    _CRT_INIT

    lea     rcx, [msg]
    call    printf

    xor     rax, rax
    call    ExitProcess

The first line is the bits directive and tells the NASM assembler to generate code designed to run on 64-bit processors, which is what our hardware is. (At least, for the purposes of this tutorial.) This directive is not strictly required, since you can specify the target object format to the NASM assembler during compilation, but we're going to leave it here to make it very clear as to what architecture our assembly code is meant to be run on.

The next directive is a little more complex and requires a bit more explanation: it's basically telling the NASM assembler to use rip-relative addressing.

Addressing modes

Before we talk about the specific type of addressing mode known as rip-relative addressing, let's have a refresher course on what addressing modes are in the first place. Thankfully, this is a fairly simple concept.

Addressing modes, put simply, are the different conventions available by which an assembly instruction can access registers or other memory. For example, the instruction:

mov    rax, 0

Is what's known as an immediate addressing mode. This is because the operand has a constant value (in lieu of an explicit constant, an expression is also allowed, as long as it evaluates to a constant).

In contrast, the instruction:

mov    rax, rbx

Is simply known as register addressing, for obvious reasons.

We can also do what is known as indirect register addressing:

mov    rax, [rbx]

Which basically results in the following operation:

[Figure: the rbx register holds the address 0xa54; the 64-bit value stored at address 0xa54 in memory is copied into the rax register.]

Figure 16: Illustration of what happens after an indirect address operation.

 

As we can see, the address stored in rbx was dereferenced, and the value in memory at that address was then moved into rax as a result of the operation.

There are various types of addressing modes available in the Intel x64 ISA, which are essentially the same as the x86 ISA (since it evolved from there anyway), with the exception of the rip-relative addressing mode; that was introduced with the x64 ISA and is only available on that architecture.

rip-relative addressing

What is rip-relative addressing? To understand the answer to this question, we need to understand a little bit of history.

Position-Independent Code

In the older x86 ISA, the assembly instructions for loading/storing memory referenced absolute addresses; there was no way to reference data by using a displacement off the current instruction pointer. This made it very difficult to generate code that was position-independent (PIC), which is to say, code that does not depend on where exactly it is loaded into memory by the operating system.

Obviously, programs generally weren't written by hardcoding base addresses all over the code, since you couldn't assume that your program would always be loaded by the operating system at the same base address each time it was run. Instead, the loader had to fix the code up dynamically at the time of loading your program into memory: it relocated every memory address that was specified in an absolute manner, adding the displacement between the assumed base address and the address your program was actually loaded at, in order to make sure everything worked. What this meant in practice was that even the simplest of programs required a tremendous amount of fix-up work from the loader.

Ok. So how does rip-relative addressing help?

Well, the x64 ISA allows assembly instructions to reference memory by specifying a displacement (32-bit signed) from the current rip instruction pointer. Even though the program may be loaded at a different base address each time, because the data is now referenced relative to rip instead of through an absolute address, the loader no longer has to perform any fixup process on that code. x64 code that uses this form of addressing, as opposed to the flat addressing model (i.e. hardcoding the address references), therefore requires no fixups from the loader, and it also results in a smaller binary, since the relocation information the loader would otherwise need (stored in the .reloc section of the PE file format) no longer has to be kept around.

To tell NASM to compile our program with rip-relative addressing, the following directive is used:

default rel

Thus, we get what we want; the loader does less work, and the program still runs, and we get our position-independent code.
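You can also override the default on a per-instruction basis with NASM's rel and abs keywords, which is handy to know when reading other people's code:

default rel               ; rip-relative addressing by default

lea    rcx, [msg]         ; assembled rip-relative: no load-time fixup needed
mov    rax, [rel msg]     ; explicitly rip-relative
mov    rax, [abs msg]     ; explicitly absolute: the loader must fix this up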

Initializing data in assembly

The next line of “Hello, world!” is the data segment. (If you don't remember what that is, go back to the Other program segments section and refresh your memory.)

segment .data
    msg db "Hello world!", 0xd, 0xa, 0

This is fairly straightforward, if a little terse. Basically, we define a variable in NASM syntax called msg of type byte (db is a mnemonic for “define byte”) that contains the string “Hello world!”, followed by the CR and LF line terminator characters using their hexadecimal ASCII code equivalents, and finally a null terminator to indicate the end of the string.

 

Crossing the platforms

 

As a reminder, the CRLF (carriage return, line feed) line ending is Windows-specific; Linux and OS X use just LF to indicate a newline.

 

There are several ways to initialize data in NASM, which you should read up on in the manual.
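For reference, here's a quick sketch of some of the more common NASM data definition pseudo-instructions (the variable names are just for illustration):

segment .data
    msg    db "Hello world!", 0xd, 0xa, 0  ; define bytes (a C-style string)
    wval   dw 0x1234                       ; define a 16-bit word
    dval   dd 3.14159                      ; define a 32-bit dword (a float here)
    qval   dq 0x1122334455667788           ; define a 64-bit qword
    buf    times 64 db 0                   ; repeat a definition: 64 zeroed bytes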

Importing and Exporting symbols

Now let's look over the next few lines of the example.

segment .text
global main
extern _CRT_INIT
extern ExitProcess

extern printf

The first line, much like the segment .data directive, tells the NASM assembler that this is the beginning of the .text section, or where all the actual assembly instructions should go. No big surprise. But what do the next few lines mean?

If you have worked with DLLs/.so files before (or possibly read my previous tutorials?) then you'll most likely recognize the extern keyword, except it seems to be used backwards here: normally, in C, the extern keyword is used for exporting symbols so that they can be used by other code. And what does the global keyword mean, anyway?

It turns out that those two keywords mean exactly the opposite in NASM: extern is used to import symbols, and global is used to export them.

Ok, fine. But what's the odd extern _CRT_INIT symbol we're importing? What's that for?

The Microsoft C Runtime Library

If you're experienced enough with C/C++, you'll no doubt have guessed that the CRT refers to the C standard run-time library; more specifically, the Microsoft Visual C++ Runtime Library (MSVCRT), which is Microsoft's implementation of the C99 ISO standard library.

 

Crossing the platforms

 

For those of you who are still beginners in C/C++ and might not understand what this means: basically, functions such as printf, rand, or even the entry point main are defined as part of the C standard library specification. Every C-compliant compiler, whether it's MSVC, GCC or Clang, must supply its own implementation of the specification. For Microsoft, the aforementioned MSVCRT is what we use when we need functions defined by the C standard; libc and libc++ are the equivalents on Linux/OS X. The implementations differ for both platform and historical reasons, as we shall see shortly.

 

Thus, by the powers of deduction, we can guess that _CRT_INIT means “hey, initialize the C runtime library”. But so what? Why do we need to call this anyway?

WinMain and main

Well, let's look further down the code, to the first label in the .text section of our assembly program:

main:

All C/C++ programmers should be familiar by now with the concept of their program's entry point, that is, the initial point where code starts to execute when their program is loaded. The signature usually appears as int main(int argc, char *argv[]).

However, if you're at all familiar with Win32 API programming, you'll know that things are not that simple. Technically, the entry point for Windows programs is defined as WinMain, not main; specifying the subsystem to the MSVC linker determines which entry point symbol is chosen by default. Additionally, main and WinMain are actually called from the MSVCRT's startup code, which means they're essentially wrapped functions. More than being simple wrappers, however, there's one other important thing that the MSVCRT's startup code does before handing control to them, and that is to also perform any static initialization of variables required.

Thus, the following advice from MSDN:

If your project is built using /ENTRY, and if /ENTRY is passed a function other than _DllMainCRTStartup, the function must call _CRT_INIT to initialize the CRT.

 

If you don't do this, you'll get linker warnings when attempting to produce the final executable, such as:

warning LNK4210: .CRT section exists; there may be unhandled static initializers or terminators

Which is the linker telling us, “hey, it looks like you've got some data you wanted to statically initialize, but the appropriate actions for actually initializing it haven't been taken.” In other words, there are global variables that need to be initialized before main() runs, but the VCRuntime library startup code (which handles running the static initializers and terminators) is never being run.

 

How deep does this rabbit hole go?

 

Savvy readers may have remembered that the MSVC linker has the /entry flag, which allows you to specify a specific entry point for a program (and which we will be making use of very shortly). The truth is, WinMain is just a convention for the name given to user-provided entry points. The real entry point of the application is technically the CRT startup function (mainCRTStartup or WinMainCRTStartup, or wmainCRTStartup/wWinMainCRTStartup for the Unicode variants, depending on the subsystem chosen), which ultimately calls your main or WinMain.

 

So hold on a minute: other than our msg variable, we don't actually have anything that should rely on static initializers. What is this warning for, then?

It turns out that printf from the VCRuntime library is the culprit here; it relies on static initializers for its implementation, thus requiring us to initialize the CRT before making use of it, which makes sense, seeing as printf is part of the CRT after all!

Making a stack

Let's continue looking at the example code.

    push    rbp
    mov     rbp, rsp
    sub     rsp, 32

The push pseudo-op takes the operand passed to it, decrements the stack pointer, and then stores the operand on top of the stack. We do this to the base pointer so that we can save the current position of the stack. (That way, if we need to refer to variables on the stack, we have a base address to refer to; since we could be adding/removing objects from the stack all the time, the stack pointer alone would be insufficient.)

We then set the base pointer to the stack pointer's current position, which is pointing to the top of the stack, and then subtract 32 from the stack pointer, giving us the shadow space necessary to follow the Microsoft x64 calling convention.

You'll likely see this snippet of code (or some variation thereof) almost always at the beginning of every function you write in assembly, so get used to it. It's also very useful for identifying where the start of a function is if you're reading the assembly of optimized code (i.e. sometimes it could indicate the start of an inlined function call).

You then see we initialize the C runtime library as previously mentioned, with the following line:

    call    _CRT_INIT

Calling functions in assembly

The next few lines are the real business logic of the example assembly code:

    lea     rcx, [msg]
    call    printf

The first instruction here is a little confusing. Isn't lea the name of the protagonist in that one game?

Well, LEA actually stands for Load Effective Address, which comes no closer to explaining what it actually does. The way I like to think about it is that lea serves much the same purpose as mov: both place a value from the second operand into the first. The difference is in how they do it: mov copies the contents of memory or a register directly, while lea computes the effective address of the second operand and stores that address in the first operand. As such, while both instructions can sometimes achieve the same result (just with slightly different syntax), the way they function is different enough to distinguish their use cases.
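A quick sketch of the difference in practice:

lea    rax, [rbx + 8]    ; rax = rbx + 8; only the address arithmetic is
                         ; performed, no memory is actually accessed.
mov    rax, [rbx + 8]    ; rax = the 64-bit value stored at address rbx + 8.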

 

All together now...!

 

There's also another reason why the lea instruction is important: on the x64 ISA, where we are generally making use of rip-relative addressing, the offset that we use for rip-relative addressing is limited to 32 bits. Thus, we usually load the address of the data that we want using the lea operation first, instead of loading 64-bit values directly from that address.

 

In this case, we use lea to load the address of the msg variable (which is a pointer to our “Hello, world!” string) into the rcx register. If you recall the calling conventions, this is the register that should contain the first argument for a function call.

We then call printf to print the string that we passed in as the first argument. It's as simple as that.

Shutting down the program

Now that we've printed our line out to the console, we're good. It's time to shut it down!

    xor     rax, rax
    call    ExitProcess

Remember that according to the calling convention, the return value for a function goes into rax for integers. Well, main is no different, and so we exclusive-or the rax register with itself, effectively zeroing it out, before calling the Win32 ExitProcess function, thus ending the application.
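As a side note, this xor trick is the idiomatic way of zeroing a register on x64:

xor    eax, eax    ; zeroes eax; on x64, writing to a 32-bit register also
                   ; clears the upper 32 bits, so all of rax is now 0.
mov    rax, 0      ; the same effect, but with a longer instruction encoding.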

With that, we've successfully gone over a very simple assembly program. Let's continue.

Writing a build script

Even though we compiled our assembly code and linked it earlier in the tutorial via a few simple commands, let's go ahead and make a little build.bat that we can use to compile more programs easily, while allowing us to specify build options as well. Plus, we also want to set it up so that we don't have to keep opening a special Visual Studio Developer Command Prompt every time we want to build our program; the build script should take care of that automatically for us.

If you're not familiar with Batch files on Windows, here's a good resource for getting caught up, though frankly, the syntax used here shouldn't be terribly complicated to muddle through.

What we're going to do here is make a build script that we can call like so:

build.bat [debug|release|clean] <project_name> [exe|dll] <additional_linker_arguments>

Examples:
build.bat debug hello_world                    will build hello_world.asm in debug mode
build.bat release goodbye_nothing              will build goodbye_nothing.asm in release mode
build.bat release goodbye_nothing dll          will build goodbye_nothing.asm in release mode as a DLL instead of an exe

Let's start with the basics of parsing the command-line arguments:

echo Build script started executing at %time% ...

REM Process command line arguments. Default is to build in release configuration.
set BuildType=%1
if "%BuildType%"=="" (set BuildType=release)

set ProjectName=%2
if "%ProjectName%"=="" (set ProjectName=hello_world)

set BuildExt=%3
if "%BuildExt%"=="" (set BuildExt=exe)

set AdditionalLinkerFlags=%4

echo Building %ProjectName% in %BuildType% configuration...

This essentially stores the first, second, third and fourth command-line arguments passed to build.bat in the BuildType, ProjectName, BuildExt and AdditionalLinkerFlags variables respectively. To retrieve the value stored in a variable, you use the %<variable_name>% syntax.

The next thing to do, assuming that we're running this batch script from a normal command prompt (since we don't want to have to launch the special developer command prompt), is to set up the environment for the MSVC compiler/linker ourselves:

if not defined DevEnvDir (
    call "%vs2017installdir%\VC\Auxiliary\Build\vcvarsall.bat" x64
)

Checking if DevEnvDir has been defined before calling the helper batch script ensures that we don't invoke the environment setup script twice.

 

Older versions of Visual Studio

 

If you're using Visual Studio 2015, replace the line above with the following line:

call "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat" x64

 

 

Next, let's create a directory to store all build artifacts in.

set BuildDir=%~dp0msbuild

if "%BuildType%"=="clean" (
    setlocal EnableDelayedExpansion
    echo Cleaning build from directory: %BuildDir%. Files will be deleted^^!
    echo Continue ^(Y/N^)^?
    set /p ConfirmCleanBuild=
    if "!ConfirmCleanBuild!"=="Y" (
        echo Removing files in %BuildDir%...
        del /s /q %BuildDir%\*.*
    )
    goto end
)

echo Building in directory: %BuildDir% ...

if not exist %BuildDir% mkdir %BuildDir%
pushd %BuildDir%

Because the focus of this tutorial isn't about batch scripting, I'm going to go over this quickly.

 

  • %~dp0 is a special variable in batch script syntax that expands to the drive and path of the directory that the batch script resides in, including the trailing slash. Thus, if we ran build.bat from C:\tmp\build.bat, %~dp0 would expand to C:\tmp\.

     

     

  • The if statement is used for conditional execution in batch syntax, much like other programming/scripting languages. In this case, we're checking if the user entered clean as an argument, in which case we prompt the user for confirmation before removing the directory and all files in it to perform a clean build.

     

     

  • The setlocal EnableDelayedExpansion line allows variables to be expanded at execution time, instead of at parse time. This is an important distinction in batch files, because we want to read the user's input to the confirmation prompt at the moment the script reaches this point during execution, not at the time the script is parsed (otherwise the user input would not be read correctly). We use setlocal here since we only want this to affect the local scope of the batch file, and not the user's global environment.

     

     

  • The set command with the /p flag tells the command prompt to wait for user input, parse it, and then we store it in the ConfirmCleanBuild variable. We then basically delete the build directories if the user gives a Y answer.

 

if not exist %BuildDir% mkdir %BuildDir%
pushd %BuildDir%

set EntryPoint="%~dp0%ProjectName%.asm"

set IntermediateObj=%BuildDir%\%ProjectName%.obj
set OutBin=%BuildDir%\%ProjectName%.%BuildExt%

set CommonCompilerFlags=-f win64 -I%~dp0 -l "%BuildDir%\%ProjectName%.lst"
set DebugCompilerFlags=-gcv8

We then create the build artifacts directory if it doesn't exist, and push that directory onto the stack (yes, even the Command Prompt has the concept of a stack!). This places us in the build artifacts directory, but allows us to return to the original directory we were in later on (which is nice when you're running this build script from somewhere else, since it saves you the trouble of having to navigate back to where you were).

We set the entry point for compilation to the project name that is intended to be passed to the build script at compilation time. That way, we can keep using the same build script to build multiple projects just by changing the name passed to the script on the command line. As long as there's a .asm file that has the same project name, we can assemble it.

We also set the output assembled intermediate object's path explicitly, along with the final executable to be produced.

Then comes the NASM flags. We set the binary type that we're producing as a 64-bit PE format file, set the include path for the assembler to the current directory, and also specify that a listing file should be generated by the assembler, using the -l argument.

 

What is a Listing File?

 

A listing file, also known as a source-listing file, is basically a version of the assembly source code that has the addresses of the generated code and the actual source code associations clearly visible. It also expands all macros and other shorthand that may have been used as part of special syntax to the assembler to make it easier to debug when things go wrong.

 

You'll also notice that in the distinction between the common assembler flags and debug flags, there's the odd -gcv8 flag. Basically, the -g flag itself enables the assembler to generate debugging information for the binary generated, and we explicitly specify the debug format to be cv8, which is shorthand for CodeView 8, the debugging format that Microsoft uses. This will allow us to debug the assembly code more effectively in Visual Studio when the time comes.

if "%BuildExt%"=="exe" (
    set BinLinkerFlagsMSVC=/subsystem:console /entry:main
) else (
    set BinLinkerFlagsMSVC=/dll

)

This bit is pretty self-explanatory: if we're building a normal executable, we default to using main as the symbol for our entry point; otherwise, we tell the linker that we're building a dynamic link library instead.

set CommonLinkerFlagsMSVC=%BinLinkerFlagsMSVC% /defaultlib:ucrt.lib /defaultlib:msvcrt.lib /defaultlib:legacy_stdio_definitions.lib /defaultlib:Kernel32.lib /defaultlib:Shell32.lib /nologo /incremental:no
set DebugLinkerFlagsMSVC=/opt:noref /debug /pdb:"%BuildDir%\%ProjectName%.pdb"
set ReleaseLinkerFlagsMSVC=/opt:ref

These should be fairly recognizable to anyone who's worked with MSVC and C/C++ for any modest amount of time; we're just specifying the MSVC linker options here, and making two separate configurations for whether we're making a release or a debug build, with the latter turning off all optimizations and generating a .pdb file for the debugger to use later.

 

I have a bad feeling about this...

 

There is one thing of interest here, though, if you look carefully: we're linking against a library called legacy_stdio_definitions.lib, which is not something you'd normally see in C/C++ projects, even those that make extensive use of the Win32 API. We're also linking against both ucrt.lib and the older msvcrt.lib, which seems odd, since the former is supposed to supersede the latter. What gives?

Well, in our sample code, we make use of a function from the CRT known as printf. Now, just because the C standard says that printf has to exist in your implementation, it doesn't actually govern how it should be implemented. As it turns out, printf has become an inlined function since Visual Studio 2015 and as such, we need to link to legacy_stdio_definitions.lib library in order to maintain access to that CRT function. (This isn't normally a problem in C/C++, since Microsoft fixed up their stdio.h header to automatically resolve this).

As for linking against both the newer and older CRT libraries, it turns out that some other CRT functions such as malloc and rand are only available in the latter, older CRT library, while the rest of the CRT's functionality is in the UCRT one instead. Yes, it is indeed a mess.

 

if "%BuildType%"=="debug" (
    set CompileCommand=nasm %CommonCompilerFlags% %DebugCompilerFlags% -o "%IntermediateObj%" %EntryPoint%
    set LinkCommand=link "%IntermediateObj%" %CommonLinkerFlagsMSVC% %DebugLinkerFlagsMSVC% %AdditionalLinkerFlags% /out:"%OutBin%"
) else (
    set CompileCommand=nasm %CommonCompilerFlags% -o "%IntermediateObj%" %EntryPoint%
    set LinkCommand=link "%IntermediateObj%" %CommonLinkerFlagsMSVC%  %ReleaseLinkerFlagsMSVC% %AdditionalLinkerFlags% /out:"%OutBin%"
)

Finally, we set the actual assembly and linking commands that are to be executed...

echo.
echo Compiling (command follows below)...
echo %CompileCommand%

%CompileCommand%

if %errorlevel% neq 0 goto error

echo.
echo Linking (command follows below)...
echo %LinkCommand%

%LinkCommand%

if %errorlevel% neq 0 goto error
if %errorlevel% == 0 goto success

:error
echo.
echo ***************************************
echo *      !!! An error occurred!!!       *
echo ***************************************
goto end


:success
echo.
echo ***************************************
echo *    Build completed successfully!    *
echo ***************************************
goto end


:end
echo.
echo Build script finished execution at %time%.
popd
exit /b %errorlevel%

...and execute them. The full build script is available in the Code Repository section for reference.

Now that you have a build toolchain set up, and a good initial understanding of how assembly programs work, I encourage you to take a look at the other examples in the Code repository section. Build them, play with them and try to understand how they work. The Additional Resources section also has links to material that contains other samples for you.

Using assembly in C/C++ programs

The next part of this tutorial will focus on actually starting to write some assembly. To get ourselves comfortable with writing assembly that we can actually use, let's start off with something simple. Let's write a basic mathematical function in assembly that we or others can then call from our C/C++ code: a factorial function.

A factorial function

Let's refresh our memory a little bit on what a factorial is with a few examples:

factorial(5) = 5! = 5 x 4 x 3 x 2 x 1

factorial(8) = 8! = 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1

factorial(10000) = 10000 x 9999 x 9998 x ... x 1

Ok, so by simple deduction:

factorial(n) = n! = n x (n - 1)!

Writing this as a simple C function gives us:

int factorial(int n)
{
    if (n == 0) {
        return 1;
    }

    int result = 1;
    for (int c = 1; c <= n; ++c) {
        result = result * c;
    }

    return result;
}

Fairly straightforward, then? Let's start writing the assembly version of this. We'll start with a skeleton function, with just the bare minimum definition.

default rel
bits 64

segment .text

global factorial

factorial:
    push    rbp
    mov     rbp, rsp
    sub     rsp, 32

    leave
    ret

This code should be fairly straightforward. We define a label factorial that will act as the anchor for where our function's text (i.e. instruction code) actually lives in the DLL's memory. We also declare that factorial is to be exported with the global directive.

After that, within factorial itself, we set up the stack for our function, and reserve the basic amount of shadow space as required by the Microsoft x64 calling convention.

Finally, we call leave to undo the stack frame changes we've made (it copies rbp to rsp and then pops the saved rbp, which is the opposite of what we did when entering the function), and then ret to return to the site of the call instruction (which would have placed the address to return to on the stack) and exit the function.
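In other words, the leave/ret pair at the end of our function is equivalent to the following sequence:

    mov     rsp, rbp    ; tear down our local stack allocation ("leave", part 1)
    pop     rbp         ; restore the caller's base pointer ("leave", part 2)
    ret                 ; pop the return address off the stack and jump to it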

Because our function takes a single argument, n, let's go ahead and implement the first part of the logic, which returns 1 if n is equal to 0.

...
factorial:
    push    rbp
    mov     rbp, rsp
    sub     rsp, 32

    test    ecx, ecx
    jz     .zero

.zero:
    mov    eax, 1

    leave
    ret

The first argument to the function is of type int, which means that it will be passed in ecx according to the Microsoft x64 calling convention. Since int refers to a 32-bit signed integer on MSVC, we can just use ecx to address the lower 32 bits of the register and save a tiny bit on performance (by not having to address the entire register).

We then test the register with itself to check if it's equal to 0; if it is, the zero flag in the EFLAGS register is set, which is used by the jz (jump if zero) instruction immediately after to perform the conditional branch. If n was indeed zero, we jump to a new label, .zero, which has a single mov eax, 1 instruction within it. Recall that an int return value under the Microsoft x64 calling convention should be returned in the rax register, and it becomes obvious why this instruction is required: our function is returning 1 if the input n is zero.
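As an aside, test reg, reg is the idiomatic way of checking a register for zero; a quick sketch of the two equivalent sequences:

test   ecx, ecx    ; ANDs ecx with itself, setting ZF if ecx == 0
jz     .zero       ; shorter encoding, since no immediate value is needed

cmp    ecx, 0      ; sets ZF in the same way, but is a longer instruction
je     .zero       ; jz and je are two names for the same opcode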

Let's continue implementing the logic of the factorial function.

...
factorial:
    push    rbp
    mov     rbp, rsp
    sub     rsp, 32

    test    ecx, ecx    ; n
    jz     .zero

    mov    ebx, 1       ; counter c
    mov    eax, 1       ; result

    inc    ecx

.for_loop:
    cmp    ebx, ecx
    je     .end_loop

    mul    ebx          ; multiply ebx * eax and store in eax

    inc    ebx          ; ++c
    jmp    .for_loop

.zero:
    mov    eax, 1

.end_loop:
    leave
    ret

We add a new .for_loop label to indicate the start of the loop, and the .end_loop label to indicate where execution should resume once the loop condition has been met. Simple enough.

A few new instructions are also introduced here: we start with the cmp instruction, which is a comparison operation. Depending on the result of the comparison, it will set various bit flags in the EFLAGS register to indicate the result.

We then use the je instruction to check if the value in ebx is equal to ecx + 1 (we added the 1 using the inc pseudo-op, which means “increment”). If the values are equal, we exit the loop by jumping to the end of the function. If not, we use the mul pseudo-op to perform our multiplication for the current iteration of the loop, storing the result in eax, which is our return value.

And that's it! Our factorial function is complete. But how do we test if it works? Well, before wasting a lot of time exporting it into a DLL and calling it from C/C++, let's just take an easier approach and implement a test right in assembly first:

default rel
bits 64

segment .data
    fmt db "factorial is: %d", 0xd, 0xa, 0

segment .text

global main
global factorial

extern _CRT_INIT
extern ExitProcess
extern printf


factorial:
    ...


main:
    push    rbp
    mov     rbp, rsp
    sub     rsp, 32

    mov     rcx, 5
    call    factorial

    lea     rcx, [fmt]
    mov     rdx, rax
    call    printf

    xor     rax, rax
    call    ExitProcess

We declare the .data segment of our assembly program, and initialize a string in it that we can pass to printf for writing out the result of our factorial function. We then make our main entry point, and basically do the equivalent of:

char fmt[] = "factorial is: %d\n";
int result = factorial(5);
printf(fmt, result);

Now compile that using the build script as mentioned, or manually from the command-line if you wish:

nasm -f win64 -o "factorial_test.obj" "factorial_test.asm"

link "factorial_test.obj" /subsystem:console /entry:main /defaultlib:ucrt.lib /defaultlib:msvcrt.lib /defaultlib:legacy_stdio_definitions.lib /defaultlib:Kernel32.lib /nologo /incremental:no  /opt:ref /out:"factorial_test.exe"

If you run the resulting factorial_test.exe, you should get the following output:

factorial is: 120

Which is indeed the factorial of 5.

Making a DLL from assembly

Now that we've verified that our factorial function seems to work, let's undertake another exercise, and see how to call this function from a normal C program. To do this, we'll generate a DLL that contains our factorial function. Since it follows the Microsoft x64 calling convention, it will allow us to link a normal C program with it and call functions from within it as you would expect from a normal DLL.

If you've written the build script as described in the Writing a build script section, you should be able to just build the DLL version by using the following command:

build.bat release factorial_test dll

If you haven't made the build script, you can try doing so on the command line the exact same way you make DLLs when doing normal C/C++ programming by passing the /dll flag to the linker:

nasm -f win64 -o "factorial_test.obj" "factorial_test.asm"

link "factorial_test.obj" /dll /defaultlib:ucrt.lib /defaultlib:msvcrt.lib
/defaultlib:legacy_stdio_definitions.lib /defaultlib:Kernel32.lib /nologo
/incremental:no /opt:ref /export:factorial /out:"factorial_test.dll"

You should get a factorial_test.dll file, along with its accompanying factorial_test.lib and factorial_test.exp files generated by the linker. We can then go ahead and make a simple C program to test this with:

#include <stdio.h>

/* Our factorial function lives in the DLL we just built, so we declare it
 * here for the compiler; the linker resolves it via factorial_test.lib. */
extern int factorial(int n);

int main(int argc, char **argv)
{
    int result = factorial(5);

    printf("Factorial is: %d", result);
    return 0;
}

Nothing in there should be foreign to anyone who's worked with C/C++. We can then go ahead and compile this program, linking against our factorial_test.lib file with the following command:

cl factorial_test_main.c /Ox /link /subsystem:console /entry:main /incremental:no /machine:x64 /nologo /defaultlib:Kernel32.lib /defaultlib:User32.lib /defaultlib:ucrt.lib /defaultlib:msvcrt.lib /opt:ref factorial_test.lib

Finally, upon running the factorial_test.exe executable generated, you should get the following output:

Factorial is: 120

As you can see, as long as you've followed the OS calling conventions, being able to use assembly from your C programs is extraordinarily easy. In this way, you can have high-performance versions of certain functions readily available to be swapped in/out in your code as the situation dictates, while keeping the rest of your codebase in C/C++ and continuing to reap the benefits of the optimizing compilers.

Comparing our work

So now that we know how to write some assembly, let's return to reality a bit: while you're potentially a pretty smart person (unlike the author of this tutorial), chances are you're not going to be functioning at 100% all the time, whereas compilers like GCC, Clang and MSVC are. It's therefore tremendously helpful to see what the compiler's output is compared to the assembly you've written by hand. While tools like Godbolt are amazing resources for looking at the assembly of your functions, I'd also like to make sure that you're familiar with how to do this using the compiler itself.

The MSVC compiler's /FA flag generates a listing file containing the assembly for our C source files in MASM syntax. So let's go ahead and compare our factorial function with what MSVC thinks a good factorial function is.

We add an empty main function to our C program, test.c, so that MSVC can compile it:

int factorial(int n)
{
    if (n == 0) {
        return 1;
    }

    int result = 1;
    for (int c = 1; c <= n; ++c) {
        result = result * c;
    }

    return result;
}

int main(int argc, char **argv)
{
    return 0;
}

and compile it using the following command:

cl test.c /FA /Ox

To generate a test.asm listing that contains our assembly:

; Listing generated by Microsoft (R) Optimizing Compiler Version 19.16.27025.1

include listing.inc

INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES

PUBLIC	factorial
PUBLIC	main
; Function compile flags: /Ogtpy
_TEXT	SEGMENT
argc$ = 8
argv$ = 16
main	PROC
; File c:\users\sonictk\tmp\test.c
; Line 17
	xor	eax, eax
; Line 18
	ret	0
main	ENDP
_TEXT	ENDS
; Function compile flags: /Ogtpy
_TEXT	SEGMENT
n$ = 8
factorial PROC
; File c:\users\sonictk\tmp\test.c
; Line 2
	mov	edx, ecx
; Line 3
	mov	eax, 1
	test	ecx, ecx
	je	SHORT $LN3@factorial
; Line 7
	mov	ecx, eax
; Line 8
	cmp	edx, eax
	jl	SHORT $LN3@factorial
$LL9@factorial:
; Line 9
	imul	eax, ecx
	inc	ecx
	cmp	ecx, edx
	jle	SHORT $LL9@factorial
$LN3@factorial:
; Line 13
	ret	0
factorial ENDP
_TEXT	ENDS
END

Reading MASM syntax might be a little confusing, but basically, the PROC and ENDP directives indicate the start and the end of a procedure, or function. Thus, the important part we're interested in is this, with other bits and bobs removed:

factorial PROC
	mov	edx, ecx
	mov	eax, 1
	test	ecx, ecx
	je	SHORT $LN3@factorial
	mov	ecx, eax
	cmp	edx, eax
	jl	SHORT $LN3@factorial
$LL9@factorial:
	imul	eax, ecx
	inc	ecx
	cmp	ecx, edx
	jle	SHORT $LL9@factorial
$LN3@factorial:
	ret	0
factorial ENDP

Comparing this to our assembly:

[Figure: our hand-written factorial assembly shown side-by-side with MSVC's generated factorial, as listed above.]

Figure 17: Comparison of the hand-written assembly vs MSVC's output.

 

So first of all, we can see that the compiler elected not to set up a stack frame or the shadow space (no push rbp and so on). It did that because it could determine that our function was a leaf function: it never calls any other functions, and it does not make use of the stack in the first place. (In that simple program, our function was never even called at all, as a matter of fact!) You too can elect to skip this when you can make the same assumptions about your own assembly functions.

Other than that, we see that the compiler also generated a more efficient version of the assembly that doesn't require a separate inc instruction. (I originally elected to do that since I knew the je instruction is technically a little faster than jge, as it tests against just the zero flag as opposed to the sign and overflow flags.) Clearly, the compiler felt differently, and in this case it's probably right: the separate inc instruction alone eats up an entire additional μop, while the various instructions in the Jcc family all take roughly the same time anyway.

 

What is a μop?

 

A micro-operation, or μop, is one way we measure how expensive an assembly pseudo-operation is. These are the decoded instructions from the higher-level macro-operations (i.e. our add, mul, cmp instructions) that tell the CPU exactly what to execute in terms of micro-operations. Some macro-operations get broken down into more μops than others, and information on their timings is available thanks to Agner Fog's fantastic work. There are plenty of CPU implementation details that tie into this (for example, the decoded micro-ops go into a μop cache that can be used to avoid the cost of a decode operation for a macro-op if the cache still has the decoded μop in it), but the most important thing to note here is that just because you have fewer instructions in your assembly, it doesn't mean that you actually will have better performance. It depends on a number of factors, and micro-operation timings are just one of them.

 

We also see that the compiler elected to use the imul operation instead of the mul one that I did; imul performs signed multiplies. This is also correct behaviour according to the C standard, since our function takes a signed integer, even if the factorial of a negative integer doesn't make sense.
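For reference, here's a sketch of the forms the two multiply instructions take:

mul    ebx             ; unsigned: edx:eax = eax * ebx (one-operand form only)
imul   eax, ecx        ; signed: eax = eax * ecx (two-operand form)
imul   eax, ecx, 10    ; signed: eax = ecx * 10 (three-operand form)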

 

An interesting behaviour...

 

For a kicker, try modifying the C function so that all inputs, outputs and operations are done unsigned, and re-generate the assembly from MSVC. You'll notice that the multiplication operation still uses the imul instruction instead of mul. Why is this? I don't know for sure, but my guess is that mul is actually a more limited instruction, whereas the imul instruction allows for 2, or even 3 operands to be used for multiplication in one instruction. They both should take the same amount of CPU μops in terms of performance.

 

Debugging assembly

Assembly is verbose. There's no denying that. It's also incredibly dense and easy to get lost in when trying to figure things out. So how do you navigate around and find the memory you want when things go wrong?

As it turns out, it takes a combination of experience and tooling to help us in this regard.

The tools

On Windows, we're lucky enough to have (in my opinion) much better tooling for working with assembly than is offered on Linux or OS X. We'll go over some of the ones I personally use when I need to drop to assembly.

AsmDude

AsmDude is an extension for Visual Studio that provides some significant quality of life improvements when working with assembly in Visual Studio. Code completion, syntax highlighting, nice little popups that remind you of what a mnemonic means or does...honestly, I don't know why the features in this extension don't come standard with Visual Studio.

Now if only it could provide a better registers window than the glorified edit control that comes standard...

WinDbg Preview

The venerable tool of choice for debugging crash dumps on Windows, WinDbg now has an upgraded user interface. Unfortunately, for some infuriating reason, Microsoft decided that it would be a good idea to now limit its distribution to only be via the Windows Store exclusively. While you should (hopefully) be fine using it from there, if you're running into problems using it in corporate environments like I am, do what you can to try and get it working. Trust me, the improvements made are worth the trouble compared to using the old WinDbg.

It also has features that the Visual Studio debugger does not have, such as a suite of useful commands for examining and disassembling memory, a proper disassembly window (that doesn't require a 3rd-party extension to get syntax highlighting for!), a much better registers window, among other things. You can take a look at how to use WinDbg using the MSDN documentation, which is fairly comprehensive.

radare2/Cutter and IDA

For those of you who are more adventurous (or who work on Linux/OS X), there's radare2 and a front-end GUI for it called Cutter. There's also IDA, which is favoured in the Infosec community as a disassembler. I've dabbled with both, but I mostly use WinDbg and Visual Studio on Windows, and radare2/gdb on Linux when I need to drop to disassembly.

I've also heard good things about Binary Ninja, but I haven't had the chance to try it yet.

Recognizing patterns

This is where the experience part comes in. By understanding how assembly code works, how the OS calling conventions work, and how compilers and optimizers work, we can make several guesses and inferences when looking at raw assembly. This is not a comprehensive list, and you'll notice new cases over time as you deal with more and more assembly code.

 

Halt, plagiarist! You violated the law!

 

Most of this information is available from Jorge Rodriguez's video on the subject, but in the interests of summarizing the video, I'll be posting some key points here, along with my own notes on the topic.

 

 

  • If you're noticing a section such as:

     

    push    rbp
    mov     rbp, rsp
    sub     rsp, 32 ; or some multiple of 16

    You can be almost certain that it is the work of a function creating a home space, which means that it's the start of a function boundary. If it also appears in an optimized build, that provides a clue that this is most likely a non-leaf function: the compiler did not omit the home space because the function could go on to call sub-routines that make use of it.

     

  • Additionally, if you notice code that starts moving values from rcx, rdx, r8, r9, or xmm0 through xmm3, you can infer from knowing the Microsoft x64 calling convention that you are most likely at the start of a function call, since these most likely hold values from arguments passed to the function. Since passing the wrong arguments to a function is a common cause of crashes, inspecting the registers here (and the memory at rsp as well, if there are additional parameters passed on the stack) can be crucial to gathering clues about what might have gone wrong.

     

     

  • Even though compilers are fairly good at inlining, they might not be able to do so for more complicated cases (i.e. recursive functions). This means that you might see the call instructions for those cases when looking at the assembly, even if the first call is inlined, which you could use to identify where you are at in the source code in terms of execution.

     

     

  • In debug builds, switch statements will compile to jump tables; in optimized builds, this doesn't always happen, so don't count on being able to identify sections of your source code this way!

     

     

  • In debug builds, values will in all likelihood be kept on the stack so that the debugger is able to display the values to you interactively. In release builds, values are more likely to be kept in registers without being copied to the stack, since it's faster to operate on them there. One way to take advantage of this fact: if you're debugging assembly from a debug build, you can check the memory around rsp to see if it provides any clues as to what might be happening in the current scope.

     

     

  • For MSVC, uninitialized memory gets filled with the byte pattern 0xCC in debug builds. This is not the case in release builds. So if you notice that the value in a register or on the stack is equal to this, and you're getting an odd crash thanks to it, you know that your C/C++ code likely has a bug in it related to uninitialized memory.

     

     

  • If you're unfortunate enough to be debugging C++ code that was written in an “object-oriented” manner, remember that code such as:

     

    obj.method(param)

    implicitly passes the this pointer to the method being executed. I.e. the call above effectively becomes:

    method(&obj, param)

    This means that if you're looking at a function call where the rcx register contains a pointer address that can be cast to an obj *, you know that the function being called is one of the methods available on that object.

     

  • Furthermore, if you see a call to a method like the following:

     

    call    qword ptr [rdi + 18h]

    This is most likely a virtual method call, since the call target is at an offset off a base address held in a register. To find out which exact method it is, we can inspect the __vfptr member (the pointer to the virtual function table) and try to match the offset to the corresponding index in the virtual function table. Since the virtual function table pointer lives at the beginning of the this object, and since a pointer's size on the x64 ISA is 64 bits (or 8 bytes), we know that for an offset of 18h (or 24 in base 10), this call is for index 3 of the table, which you can then match up in a debugger to find out which exact virtual method call the instruction is referencing. (A sketch of the full call sequence a compiler might emit for such a call appears after this list.)

 

Object                             Virtual Function Table
+0h:  vtable pointer  -->          +0h:  virtual function 0
+8h:  class data                   +8h:  virtual function 1
...                                +10h: virtual function 2
      subclass data                +18h: virtual function 3 (obj->method)
                                   +20h: ...

Figure 18: An illustration of the Virtual Function Table and how it works.

 

This is an incredibly useful tool to have in your arsenal when debugging convoluted code, as “object-oriented” code is wont to be.

 

  • When trying to find your code, don't always assume that a __declspec(noinline) will work; the compiler has the right to make the final decision on whether to inline your function or not. Additionally, other optimizations such as tail-call optimization can affect the inlining of your function as well.

     

     

  • To add to that, remember that things like Return Value Optimization (RVO), where compilers may choose to elide a copy operation to a temporary object before it is returned, do affect how the resulting assembly is generated: the compiler may place the return values of subroutines in registers so that they look like function parameters, or it may use the caller's stack frame to hold the memory for the to-be-returned object. Every compiler implements RVO and copy elision differently, and there's no guarantee about their behaviour between versions of the compiler, either. Common cases for RVO include things like return Object(), where the temporary object created by calling Object's constructor gets elided.

     

     

  • As we've noted before, the optimizer can re-order your code around and change the order of execution in order to improve performance. Don't always assume that your source code will map in the same execution order to the assembly generated in release builds!

     

     

  • If you have a hunch that a value in one of the GPRs looks like a pointer to an object or structure, you can try casting it in the debugger (i.e. (Object *)rcx) and seeing if you get the debugger to display the contents correctly. You could also navigate to the memory directly and view the contents to inspect the memory layout and see if it matches that of the object/structure's.

     

     

  • If you see statements like:

     

    mov    ebx, dword ptr [rdx + rax + 3ch]

    where one of the operands has a base register, an index register and an immediate offset, what is most likely happening is that rdx would be containing the base address of an array, rax would be the offset into that array to a specific object, and 3ch would be the access to a member of that object. Thus, you can do things like:

    &((Object *)0)->member

    to see what the offset is for that particular member. If it matches 3ch, you know that is the specific member that the instruction is attempting to mov into ebx.

     

  • If you see statements like:

     

    mov    rcx, qword ptr [rsp + 78h]

    You know immediately that this is looking for a variable on the stack. Thus you can check your source code and try to figure out what might have been allocated on the stack, and keep comparing the memory at that address to what you see in the source until the data layouts line up.

     

  • If you suspect that you may have floating-point exception errors (i.e. caused by quiet NaNs), you can try checking the xmm registers for clues; perhaps the registers show NaN results in them. One thing to try is that if you suspect that a value in an xmm register is a NaN, you can try zero-ing the portion of the register that contains the NaN value and see if the exception still gets raised. If it doesn't, you've found a clue as to what the problem might be.

     

     

  • Usually, when you see a combination of cmp and jcc instructions, you can probably assume that you're in a loop. As we implemented a loop ourselves earlier, you can use this to try to help you determine where you are at with relation to the source code and determine the point of execution.

     

     

  • On Windows, the stack size defaults to 1 MB, and is typically located below 0x400000, so if you're looking at an address region that rsp references beyond those limits, you can make a guess that you might be about to hit a segmentation fault.

     

     

  • Data/text segments for user processes are limited to the first 2 GB of virtual address space available if /largeaddressaware is set to no. Additionally, data is allocated only when the process is loaded into memory.
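To tie the virtual method call pattern from earlier together, here's a hedged sketch of the kind of sequence a compiler might emit for obj->method(42), assuming method sits at index 3 of the virtual function table (the choice of rsi/rdi here is hypothetical):

mov    rcx, rsi             ; the implicit "this" pointer goes in rcx
mov    edx, 42              ; the first explicit parameter shifts to rdx
mov    rdi, qword [rcx]     ; load the vtable pointer stored at offset 0
call   qword [rdi + 18h]    ; call the entry at index 3: 3 * 8 bytes = 18h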

 

These are just some of the notes that I've picked up from various tutorials and my own experiences. It is by no means an exhaustive list of techniques; the only real thing to do here is to keep practising and keep reading about other people's experiences on the subject.

Debugging: A real life issue

Problems that occur only in release builds instead of debug builds are usually tremendously annoying to solve, simply because it becomes much harder to just attach a debugger and find what the problem is (especially if your release build doesn't have debugging information or symbols!) However, sometimes even with the power of a debugger, it's difficult to figure out the issue, since the compiler/linker has done its round of optimizations and more than likely erased whatever information you needed in order for the debugger to inspect the memory you're looking for. Or, you might be dealing with interactive behaviour (i.e. a bug that only happens when you move the mouse a certain speed while clicking furiously with the left mouse button). In such situations, dropping to assembly is one possible approach to triaging and solving such issues in a timely manner.

Now let's see a real-world example of using assembly knowledge to solve an actual problem that I encountered recently when working on an application. I've chosen an example that, while relatively simple to diagnose and fix, should illustrate the benefits of understanding the fundamentals when trying to tackle larger, more obtuse problems in production.

The symptoms

I was working on a 3D engine that was exhibiting an all-too-familiar sounding problem: when the engine was built using debug compiler/linker settings (/Od /Zi /D_DEBUG /opt:noref /debug /pdb for the exact flags), everything worked as expected. Running the application would pop up the engine window instantly, I'd see my 3D view, and there wouldn't be any problems.

However, when building in release mode (/Ox /Zo /Zi /DNDEBUG /pdb /opt:ref), something odd happened: the engine window would start, but instead of displaying my 3D view, the window would be blank, without the telltale pink default colour that I had set my backbuffer render target clear colour to. I would hear audio, indicating my audio engine had initialized and that I was in the main engine loop, but whatever I did to try to gain focus on the window wouldn't work to kickstart the draw of the 3D view until I resized the window or moved it.

Because the code for handling window messages in the engine was very straightforward (very similar to Raymond Chen's scratch program template), I was fairly confident the problem lay somewhere in the realm of how the application was processing the messages that Windows was sending it. The problem was, where was it? And why was it only happening in release, and not in debug builds?

The suspect (code)

Here's a little challenge: I'm going to post the entirety of the window procedure callback function, and you try to spot where the problem is. Take your best guess, and read on to see if your instincts were right!

/**
 * Callback for processing the messages that are sent to the window. This is executed
 * each time the window receives a message from Windows.
 *
 * @param hwnd		A handle to the window.
 * @param uMsg		The message.
 * @param wParam	Additional message information.
 * @param lParam	Additional message information.
 *
 * @return			The result of the message processing.
 */
LRESULT CALLBACK WindowProc(HWND hwnd, UINT uMsg, WPARAM wParam, LPARAM lParam)
{
	switch(uMsg) {
	// NOTE: (sonictk) Sent whenever the window is activated/deactivated.
	case WM_ACTIVATE:
	{
		WORD wParam = LOWORD(wParam);
		switch (wParam) {
		case WA_INACTIVE:
		{
			globalApplicationState = DX11_ENG_APPLICATION_PAUSED;
			BOOL status = ClipCursor(NULL);
			ShowCursor(TRUE);
			if (status == 0) {
				OutputDebugString("Unable to release the cursor!\n");
			}
			break;
		}
		case WA_CLICKACTIVE:
			ShowCursor(FALSE);
		case WA_ACTIVE:
			globalApplicationState = DX11_ENG_APPLICATION_RUNNING;
			break;
		}

		return 0;
	}
	case WM_SETCURSOR:
		// NOTE: (sonictk) Checking against ``HTCLIENT`` ensures cursor is only
		// applied when we are actually within the client area to avoid overriding built-in cursors.
		if (LOWORD(lParam) == HTCLIENT) {
			SetCursor(globalMouseCursorHandle);

			return 0;
		}
		break;
	// NOTE: (sonictk) Sent when the user is resizing the window.
	case WM_SIZE:
		// NOTE: (sonictk) Save the new client dimensions.
		globalWindowWidth = (int)LOWORD(lParam);
		globalWindowHeight = (int)HIWORD(lParam);

		if (globalD3DDevice) {
			switch (wParam) {
			case SIZE_MINIMIZED:
				globalApplicationState = DX11_ENG_APPLICATION_PAUSED;
				globalWindowState |= DX11_ENG_WINDOW_STATE_MINIMIZED;
				globalWindowState &= DX11_ENG_WINDOW_STATE_MAXIMIZED;

				break;

			case SIZE_MAXIMIZED:
				globalApplicationState = DX11_ENG_APPLICATION_RUNNING;
				globalWindowState |= DX11_ENG_WINDOW_STATE_MAXIMIZED;
				globalWindowState &= DX11_ENG_WINDOW_STATE_MINIMIZED;

				break;

			case SIZE_RESTORED:
				if (globalWindowState & DX11_ENG_WINDOW_STATE_MINIMIZED) {
					globalApplicationState = DX11_ENG_APPLICATION_RUNNING;
					globalWindowState |= DX11_ENG_WINDOW_STATE_MAXIMIZED;
					globalWindowState &= DX11_ENG_WINDOW_STATE_MINIMIZED;

					OnResize();

				} else if (globalWindowState & DX11_ENG_WINDOW_STATE_MAXIMIZED) {
					globalApplicationState = DX11_ENG_APPLICATION_PAUSED;
					globalWindowState |= DX11_ENG_WINDOW_STATE_MINIMIZED;
					globalWindowState &= DX11_ENG_WINDOW_STATE_MAXIMIZED;

					OnResize();

				} else if (globalWindowState & DX11_ENG_WINDOW_STATE_RESIZING) {
					// NOTE: (sonictk) If a user is dragging the resize bars,
					// do not resize the buffers since ``WM_SIZE`` messages are
					// being sent continuously to the window. Instead, only resize
					// the buffers once the user releases the bars, which sends a
					// ``WM_EXITSIZEMOVE`` message.
					break;

				} else {
					OnResize();
				}

				break;
			default:
				break;
			}
		}

		return 0;

	// NOTE: (sonictk) Sent when the user is grabbing the resize handles.
	case WM_ENTERSIZEMOVE:
		globalApplicationState = DX11_ENG_APPLICATION_PAUSED;
		globalWindowState |= DX11_ENG_WINDOW_STATE_RESIZING;

		return 0;

	// NOTE: (sonictk) Sent when the user releases the resize handles.
	case WM_EXITSIZEMOVE:
		globalApplicationState = DX11_ENG_APPLICATION_RUNNING;
		globalWindowState &= DX11_ENG_WINDOW_STATE_RESIZING;

		return 0;

	case WM_DESTROY:
		PostQuitMessage(0);

		return 0;

	// NOTE: (sonictk) This is sent when a menu is active and the user presses a
	// key that does not correspond to any mnemonic or accelerator.
	case WM_MENUCHAR:
		// NOTE: (sonictk) We do this to avoid the beep when hitting alt-enter.
		return MAKELRESULT(0, MNC_CLOSE);

	// NOTE: (sonictk) Gets sent when the size/pos of the window is about to change.
	case WM_GETMINMAXINFO:
		// NOTE: (sonictk) We intercept this to prevent the window from becoming too small.
		((MINMAXINFO *)lParam)->ptMinTrackSize.x = globalMinimumWindowWidth;
		((MINMAXINFO *)lParam)->ptMinTrackSize.y = globalMinimumWindowHeight;

		return 0;

	case WM_MOUSEMOVE:
		OnMouseMove(wParam, GET_X_LPARAM(lParam), GET_Y_LPARAM(lParam));

		return 0;

	case WM_INPUT:
	{
		// NOTE: (sonictk) Only move the camera if the UI modifier key is not being
		// held down actively.
		if (globalModifiersDown & globalModifiersMask_Control) {
			return 0;
		}

		UINT dwSize;
		GetRawInputData((HRAWINPUT)lParam, RID_INPUT, NULL, &dwSize, sizeof(RAWINPUTHEADER));

		globalVar LPBYTE lpb = (LPBYTE)HeapAlloc(globalProcessHeap, 0, (SIZE_T)dwSize);
		if (lpb == NULL) {
			return 0;
		}

		if (GetRawInputData((HRAWINPUT)lParam, RID_INPUT, lpb, &dwSize, sizeof(RAWINPUTHEADER)) != dwSize) {
			OutputDebugString("GetRawInputData returns mismatched sizes!\n");

			