Maximum FPS: Three Tips for Faster Code

Welcome back to Maximum FPS! Last month I spent a long time discussing the issues involved with writing to vertex buffers in AGP memory. If you downloaded and looked at the sample code that accompanied the column, you may have noticed that I did some tricky stack allocations in the UpdateWorld function. This month I'm going to explain the cache alignment concerns that led to that code. Afterwards, we'll look briefly at 'store forwarding' and 'fast string moves' to bring the number of tips in this column to three. As usual, if you have comments or suggestions for topics you'd like to hear about in the future, please drop me a line. Now let's get started with a little bit of alignment background information.

Background

If you've done any processor specific optimizations, beginning with the introduction of the Intel® Pentium® processor with MMX* technology and continuing through the Streaming SIMD Extensions 2 (SSE2) instructions added to the Intel Pentium® 4 processor, you're probably well aware of the data alignment requirements of the MMX, SSE, and SSE2 instruction set additions. With some of these instructions (the aligned 128-bit moves such as MOVAPS, for example), you're required to meet the alignment specification or your code will cause an exception. I won't go into the chip design reasons for why some instructions require aligned data and why others can work with unaligned data. The point is to help you realize the benefits that can be gained by taking care to properly align your data in all speed critical code.

Data Size (Bits)	C/C++ Declaration	Natural Alignment (Bytes)
8	char	1
16	short	2
32	int or long or float	4
64	double or __int64 or long long (icl)	8
80	N/A	8
128	__m128	16

Table 1 - Natural Data Alignment Values

On the simplest level, make sure to align your data on natural boundaries: on even addresses for WORDs, multiples of four for DWORDs, etc., as shown in Table 1. The one exception to the simple rule is when working with 80-bit extended precision floating point numbers. These values should be aligned on 8-byte boundaries for optimal performance. Now let's take a look at the cache hierarchy of a system so we can see how to optimize our code for the caches.
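The rule in Table 1 can be sketched in a few lines of C++. This is only an illustration: the helper names natural_alignment and is_aligned are my own and not part of any library, and the function only handles the sizes that appear in the table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Natural alignment in bytes for the data sizes listed in Table 1.
// Power-of-two sizes align to their own size; the 80-bit (10-byte)
// extended precision case is the exception and aligns to 8 bytes.
// Hypothetical helper: sizes outside Table 1 are not handled.
inline std::size_t natural_alignment(std::size_t size_in_bytes) {
    return size_in_bytes == 10 ? 8 : size_in_bytes;
}

// True when `address` is a multiple of `alignment`.
// `alignment` must be a power of two, as every value in Table 1 is.
inline bool is_aligned(std::uintptr_t address, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (address & (alignment - 1)) == 0;
}
```

For example, is_aligned(0x1002, natural_alignment(4)) is false, because a DWORD at address 0x1002 does not sit on a multiple of four.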

Cache Alignment Concerns

When you write data to or read data from normal system memory, several levels of cache (typically two) are used to improve the speed of access to frequently accessed data. The size of the data caches differs among P6 family processors (Pentium® Pro, Pentium® II, Pentium® III and Celeron* brand processors) and the Intel® Pentium® 4 processor. Table 2 shows the various data cache sizes and some additional information about them. Instruction caches may share the same cache space as the data or they may be entirely separate. It's possible to obtain all this information within your program using the CPUID instruction, but I won't go into the details here.

Processor	Cache (Data for L1, unified for L2)	Size	Number of Ways	Cache Line Size (Set Size)
Pentium®, Pentium® Processor with MMX* Technology	L1	16KB (8KB on early Pentium® processors)	4 (2 on early processors)	32 bytes
Pentium®, Pentium® Processor with MMX* Technology	L2	256KB or 512KB	4	32 bytes
Pentium® Pro, Pentium® II, Pentium® III, Pentium®, Xeon®, Celeron*	L1	16KB (8KB on early P6 family processors)	4 (2 on early processors)	32 bytes
Pentium® Pro, Pentium® II, Pentium® III, Pentium®, Xeon®, Celeron*	L2	128KB, 256KB, 512KB, 1MB, 2MB	4	32 bytes
Pentium® 4	L1	8KB	4	64 bytes
Pentium® 4	L2	256KB	8	128 bytes (2, 64-byte sectors)

Table 2 - Data Cache Characteristics for Intel Processors

When cache descriptions are specified, you'll often see them described as '4-way set associative' or '8-way set associative'. There are actually three parts to this description. Let's start with the middle: set. In a set associative cache, the cache consists of a number of 'sets' that are used to cache memory addresses. On Intel Architecture processors, a set is simply a cache line and varies in size from 32 bytes to 128 bytes. Set associative means that each set is 'associated' with multiple regions of memory. So, one cache line may be associated with hundreds or more likely thousands of memory addresses. So say that a given set (cache line) is associated with memory regions X, Y, and Z. To enable all three of these memory regions to be cached simultaneously, the cache is divided into a number of ways. So, in the example just given, each of X, Y, and Z would be mapped to the same set but in a different way. Figure 1 shows an example of this with three regions of memory mapping to three different ways in the cache.

Figure 1 - Example Main Memory to Cache "Way" Mapping

To determine how many sets are in a given cache, take the cache size and divide by the number of ways and then divide by the size of each set. So, the L1 data cache of the Intel Pentium® 4 processor has 32 sets (32 = 8192 / 4 / 64).
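The same division can be written as a small compile-time calculation. This is a sketch only: the CacheGeometry struct and the constant names are my own, and the numbers are copied from Table 2.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of one cache level, filled in from Table 2.
struct CacheGeometry {
    std::size_t size_bytes;  // total cache size
    std::size_t ways;        // associativity
    std::size_t line_bytes;  // cache line (set) size
};

// Number of sets in each way: size / ways / line size.
constexpr std::size_t sets_per_way(const CacheGeometry& c) {
    return c.size_bytes / c.ways / c.line_bytes;
}

// Pentium 4 L1 data cache: 8KB, 4 ways, 64-byte lines.
constexpr CacheGeometry kPentium4L1{8 * 1024, 4, 64};
// P6 family L1 data cache (16KB, 4-way variant): 32-byte lines.
constexpr CacheGeometry kP6L1{16 * 1024, 4, 32};

static_assert(sets_per_way(kPentium4L1) == 32, "matches the worked example");
```

With the P6 family numbers the same function gives 128 sets, which is one reason data layouts tuned for one processor may not behave the same way on another.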

A simple example can better illustrate the workings of the caches. Let's look at two 32-bit memory addresses in hexadecimal notation: 0x01234008 and 0x56789008 and assume we're on an Intel Pentium® 4 processor-based system. We want to determine where these addresses get mapped to the L1 cache. For starters, from Table 2 we see that the L1 data cache is 8KB in size and from the calculation we did previously, we know there are 32 sets in each way of the cache. So, we can determine which set each of the addresses will be in by shifting off the lower 6 bits (because there are 64 bytes in a cache line and 2⁶ = 64) and then looking at just 5 bits (because there are 32 sets and 2⁵ = 32). After shifting off the lower 6 bits, our addresses become: 0x48D00 and 0x159E240. Now we can look at just the lower 5 bits of each of these addresses. What we see is that they both map to set '0' because a 5-bit mask is 0x1F in hexadecimal and 0x48D00 & 0x1F = 0 and 0x159E240 & 0x1F = 0. Here '&' is the C/C++ operator that does a bitwise AND. This means that to both be in the cache at the same time, they must be put in different ways.
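Here is the shift-and-mask calculation from the example as code. It's a minimal sketch: set_index is a name I made up, and it assumes the line size and the set count are both powers of two, which is true for every cache in Table 2.

```cpp
#include <cassert>
#include <cstdint>

// Set index = (address / line size) modulo number of sets.
// `line_bytes_log2` is log2 of the line size (6 for 64-byte lines) and
// `set_count` must be a power of two so that (set_count - 1) is a bit mask.
constexpr std::uint32_t set_index(std::uint32_t address,
                                  std::uint32_t line_bytes_log2,
                                  std::uint32_t set_count) {
    return (address >> line_bytes_log2) & (set_count - 1);
}

// The two addresses from the text, on the Pentium 4 L1 data cache.
static_assert(set_index(0x01234008u, 6, 32) == 0, "first address maps to set 0");
static_assert(set_index(0x56789008u, 6, 32) == 0, "second address maps to set 0");
```

Moving the first address up by one cache line, to 0x01234048, moves it to set 1, so the two items would no longer compete for ways in the same set.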

I've taken a bit of time to describe the workings of the caches. If you think about doing operations that work on multiple pieces of memory at the same time, you'll see that you need to be careful to align your data so that it doesn't all end up mapping to the same sets in the cache. And this alignment can vary from processor to processor because the sizes of the caches change as do the number of ways and numbers of sets in each way. Computations that require more than four items will need to take special care to align the data properly to make maximum use of the caches.

Avoiding Cache Line Splits

The last thing I want to mention about the caches is that if a single data item crosses a cache line boundary (known as a cache-line split), the processor takes a performance hit. So, if you're working with 64-bit (8-byte) double precision values, and you have one that has 4 bytes on one cache line and the next 4 bytes on the next cache line, when you read or write that item there will be a performance penalty. I previously mentioned that you should align data on natural boundaries. If you do, then you'll avoid the problems with cache-line splits for simple data types. Avoiding cache line splits for blocks of data larger than the simple data types can improve the performance of your code. Let's take a look at the tricky alignment I did last month to see how it fits into our understanding of data alignment and the caches.
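If you want to check for cache-line splits in a debug build, a test along these lines would do it. This is a sketch under my own naming (crosses_cache_line is hypothetical), and the line size is passed in because it differs between processors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when the byte range [address, address + size) touches more than
// one cache line, i.e. the first and last byte fall in different lines.
inline bool crosses_cache_line(std::uintptr_t address,
                               std::size_t size,
                               std::size_t line_bytes) {
    assert(size > 0 && line_bytes > 0);
    const std::uintptr_t first_line = address / line_bytes;
    const std::uintptr_t last_line = (address + size - 1) / line_bytes;
    return first_line != last_line;
}
```

An 8-byte double at offset 60 of a 64-byte line splits (bytes 60 to 67), while the same double at offset 56 does not.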

Take a quick look at that piece of code, shown in Listing 1.

    BYTE tmpBuffer[128];

    D3DVERTEX *temp = (D3DVERTEX *)(((DWORD)tmpBuffer + 63) & ~63);

Listing 1 - Aligning a D3DVERTEX to a 64-byte boundary

Here, I've allocated a temporary buffer, tmpBuffer, and then a pointer that points into that buffer at the first 64-byte aligned offset. I get the alignment by adding one less than the alignment (63) and then masking off the lower bits (~63 = bitwise complement of 63 = 0xFFFFFFC0). Note that this will only work if you're aligning to a power of two (which all the alignments I've discussed should be). I chose 64-byte alignment primarily because of the L1 cache line size of the Intel® Pentium® 4 processor. In this way, no cache line splits will occur when copying the data from the temporary memory to the write combining AGP memory, giving me excellent performance for that piece of code.
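The add-and-mask trick generalizes to any power-of-two alignment. The sketch below is not the code from last month's sample; align_up is a hypothetical helper, and it uses std::uintptr_t instead of DWORD so the arithmetic stays correct when pointers are wider than 32 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rounds `address` up to the next multiple of `alignment`.
// Only valid for power-of-two alignments, because the mask
// (alignment - 1) must be a run of low-order one bits.
inline std::uintptr_t align_up(std::uintptr_t address, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    return (address + mask) & ~mask;
}
```

Note that the buffer has to be over-allocated: rounding up can skip as many as 63 bytes, which is why a 128-byte buffer is used to guarantee at least 64 usable aligned bytes. On a C++11 compiler, declaring the buffer with alignas(64) is likely a simpler way to get the same result, provided the compiler supports that alignment for stack variables.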

Store Forwarding

The discussion of the caches that we just went through is simple enough that you can work directly in a high-level language like C or C++ to achieve good performance. The next topic, store forwarding, requires that you be writing assembly code. Most likely, if you're already writing in assembly or you think you'll need to in order to achieve the highest performance possible, this won't be a big deal. If your performance critical sections of code are written in C or C++, then ensuring your data is properly aligned as described previously will help the compiler to take advantage of store forwarding where possible.

Beginning with the P6 family processors and continuing through the Pentium® 4 processor, Intel® processors have been able to perform out-of-order execution. This means that if your instruction stream contains instructions in the order A, B, and then C, the processor may execute them in another order, like C, B, A, if the earlier instructions (A and B) have to wait for data to become available but the later instruction (C) doesn't. Fortunately, this all happens behind the scenes and the processor makes sure that the results of the instructions appear in the same order as the program. So for the most part, we don't have to worry about this behind-the-scenes reordering in our code. It's useful, though, to know some things you can do in your code to ensure that the processor can take full advantage of its ability to perform out-of-order execution.

One mechanism that is provided to improve performance is the concept of store-to-load forwarding also known as store forwarding. When instructions are executed out-of-order and speculatively (which means that assumptions may have been made about the results of a branch), in order to commit the results of a store to memory, the store instruction must have retired. The deep pipeline of the P6 family processors and the even deeper pipeline of the Pentium® 4 processor means that instructions may take many cycles to retire. A special case that is optimized within the processor is the case where a load instruction is dependent on the result of a store instruction, like in Figure 2.

Figure 2 - Example of Store Forwarding

Special buffers within the processor enable the load to proceed before the store has retired if certain conditions are met:

1. The load must start at the same address as the store.
2. The data in the load must be the same size or smaller than the data in the store.
3. The load cannot require forwarding from two partial stores.
4. For 128-bit data on both load and store, the data must be 16-byte aligned.
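The conditions above can be modeled as a simple predicate over one store and one later load. This is only a model of the rules as listed here, not a description of the hardware: MemAccess and may_forward are hypothetical names, and only a single store is considered. The two-partial-stores rule involves more than one store, so it isn't modeled separately; with one store in view, a load that would need a second store already fails the size check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of one memory access.
struct MemAccess {
    std::uintptr_t address;
    std::size_t size_bytes;
};

// True when a load that follows `store` satisfies the listed conditions.
inline bool may_forward(const MemAccess& store, const MemAccess& load) {
    // The load must start at the same address as the store.
    if (load.address != store.address) return false;
    // The load must be the same size as the store, or smaller.
    if (load.size_bytes > store.size_bytes) return false;
    // 128-bit store followed by 128-bit load: data must be 16-byte aligned.
    if (store.size_bytes == 16 && load.size_bytes == 16 &&
        (store.address % 16) != 0) {
        return false;
    }
    return true;
}
```

For instance, a 4-byte store followed by a 2-byte load from the same address passes, while a 2-byte store followed by a 4-byte load from that address does not.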

Figure 3 shows the various combinations of stores and loads that can and can't forward.

Figure 3 - Store Forwarding Rules

Figure 4a gives an example in which store forwarding is blocked, and Figure 4b shows an alteration to the code that enables the store forwarding. Figure 4a violates rule number (3) from above.

Figure 4 - (a) Store Forwarding Blocked, (b) Store Forwarding Enabled

As mentioned previously, to take full advantage of store forwarding you'll need to work in assembly code. What you might want to do, though, if you have a piece of performance critical C or C++ code that isn't performing the way you want, is take a look at the code generated by the compiler. Look for code that violates the store forwarding guidelines. Perhaps you can improve on it by switching to assembly code.

Fast String Moves

Last month, using the 64-byte aligned temporary buffer that I mentioned previously, I used a bit of inline assembly code to perform a string move operation as shown in Listing 2.

Listing 2 - String Move Operation

The value of Floats ranged from zero to 16. If the value could have gone to 64 (which it could by changing the striding to every eighth vertex), a special thing called a 'Fast String' move could have taken place. In a fast string move, the processor moves data into and out of the processor on a cache line by cache line basis. In order for a fast string move to occur, five conditions must be met:

The source and destination addresses must be 8-byte aligned.
The string operation (rep movs) must operate on the data in ascending order.
The initial count (ECX) must be at least 64 (this is the restriction we didn't meet in the code from last month).
The source and the destination can't overlap by less than a cache line.
The memory types of both source and destination must be either write back cacheable or write combining.
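As a checklist, the five conditions might be coded like this. Treat it as a sketch with several assumptions: every name in it is hypothetical, the memory types are supplied by the caller because ordinary application code can't query them directly, and the overlap condition is modeled as "the two start addresses are at least one cache line apart", which is my reading of that rule rather than something stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class MemoryType { WriteBack, WriteCombining, Uncacheable };

// Hypothetical description of one rep movs operation.
struct StringMove {
    std::uintptr_t source;
    std::uintptr_t destination;
    std::size_t initial_count;  // value loaded into ECX
    bool ascending;             // direction flag clear
    MemoryType source_type;
    MemoryType destination_type;
};

inline bool is_wb_or_wc(MemoryType t) {
    return t == MemoryType::WriteBack || t == MemoryType::WriteCombining;
}

// True when all five listed conditions hold for the given cache line size.
inline bool qualifies_for_fast_string(const StringMove& m,
                                      std::size_t cache_line_bytes) {
    const bool aligned = (m.source % 8 == 0) && (m.destination % 8 == 0);
    const bool long_enough = m.initial_count >= 64;
    // Assumption: "overlap by less than a cache line" is modeled as the
    // start addresses being closer together than one cache line.
    const std::uintptr_t distance = m.source > m.destination
                                        ? m.source - m.destination
                                        : m.destination - m.source;
    const bool far_enough_apart = distance >= cache_line_bytes;
    return aligned && m.ascending && long_enough && far_enough_apart &&
           is_wb_or_wc(m.source_type) && is_wb_or_wc(m.destination_type);
}
```

With a count of 16, as in last month's code, the check fails on the third condition even when everything else is in order.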

In the case of the AGP performance application we didn't meet all the criteria so the fast string move didn't happen. By making a few tweaks to the code, I got the fast string move to happen and found a performance increase of about a percent or so over the best performance achieved previously.

Conclusion

We've covered several ways in which you can modify your code to enable the processor to work most efficiently on your data. Optimization is as much of an art as a science, but knowing data alignment issues and special things that happen under the hood of the processor can help you get better performance with possibly minor changes. All of the information collected in this column can be found in the manuals for the Intel® Architecture processors. Check out reference material at http://developer.intel.com/products/processor/manuals/index.htm.