by Dean Macri, Solutions Enabling Group, Intel Corp.
Welcome back to Maximum FPS! Last month I spent a long time discussing the issues involved with writing to vertex buffers in AGP memory. If you downloaded and looked at the sample code that accompanied the column, you may have noticed that I did some tricky stack allocations in the UpdateWorld function. This month I'm going to explain the cache alignment concerns that led to that code. Afterwards, we'll look briefly at two more topics, 'store forwarding' and 'fast string moves', to bring the number of tips in this column to three. As usual, if you have comments or suggestions for topics you'd like to hear about in the future, please drop me a line. Now let's get started with a little bit of alignment background information.
Background
If you've done any processor-specific optimizations, beginning with the introduction of the Intel® Pentium® processor with MMX technology and continuing through the Streaming SIMD Extensions 2 (SSE2) instructions added to the Intel Pentium 4 processor, you're probably well aware of the data alignment requirements of the MMX, SSE, and SSE2 instruction set additions. Many of these instructions require their memory operands to meet a specific alignment, and your code will cause an exception if they don't. I won't go into the chip design reasons why some instructions require aligned data while others can work with unaligned data. The point is to help you realize the benefits that can be gained by taking care to properly align your data in all speed-critical code.
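To make the alignment requirement concrete, here is a minimal sketch (not taken from the column's sample code) of checking whether an address sits on a given boundary. The helper `is_aligned` and the `Vec4` type are hypothetical names introduced for illustration. `alignas` is the C++11 spelling; older compilers used vendor extensions such as `__declspec(align(16))` for the same purpose.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true when p lies on a multiple of
// `alignment` bytes. `alignment` is assumed to be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical four-float vector forced onto a 16-byte boundary, the
// alignment that 128-bit aligned loads and stores expect.
struct alignas(16) Vec4 {
    float x, y, z, w;
};

static_assert(alignof(Vec4) == 16, "Vec4 should be 16-byte aligned");
static_assert(sizeof(Vec4) == 16, "Vec4 should occupy exactly 16 bytes");
```

A check such as `is_aligned(&v, 16)` on a `Vec4 v` should hold wherever the object lives, while the address of `v.y` (4 bytes into the struct) is only 4-byte aligned. A debug-build assertion like this is one cheap way to catch a misaligned pointer before an aligned instruction faults on it.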
| Data Size (Bits) | C/C++ Declaration | Natural Alignment (Bytes) |
| --- | --- | --- |
| 8 | char | 1 |
| 16 | short | 2 |
| 32 | int or long or float | 4 |
| 64 | double or __int64 or long long (icl) | 8 |
| 80 | N/A | 8 |
| 128 | __m128 | 16 |
Table 1. Natural Data Alignment Values
On the simplest level, make sure to align your data on natural boundaries: even addresses for WORDs, addresses that are multiples of four for DWORDs, and so on, as shown in Table 1. The one exception to this simple rule is the 80-bit extended-precision floating-point number. These values should be aligned on 8-byte boundaries for optimal performance. Now let's take a look at the cache hierarchy of a system so we can see how to optimize our code for the caches.
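The rule in Table 1 can be written down as a short sketch. Everything named here (`natural_alignment`, `align_up`, `aligned_ptr`) is a hypothetical helper introduced for illustration, not part of any library or of the column's sample code. The lookup only covers the sizes listed in Table 1, and the rounding helpers assume a power-of-two alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Table 1 as code: natural alignment in bytes for a data size in bits.
// Only meant for the sizes listed in Table 1. The 80-bit extended
// precision type is the exception and is placed on an 8-byte boundary.
constexpr std::size_t natural_alignment(std::size_t bits) {
    return bits == 80 ? 8 : bits / 8;
}

static_assert(natural_alignment(16) == 2, "WORDs sit on even addresses");
static_assert(natural_alignment(32) == 4, "DWORDs sit on multiples of 4");
static_assert(natural_alignment(80) == 8, "80-bit values use 8 bytes");

// Round addr up to the next multiple of `alignment` (a power of two).
inline std::uintptr_t align_up(std::uintptr_t addr, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    return (addr + mask) & ~mask;
}

// Return the first suitably aligned address at or after `raw`.
// The caller must over-allocate the buffer by (alignment - 1) bytes.
inline void* aligned_ptr(void* raw, std::size_t alignment) {
    return reinterpret_cast<void*>(
        align_up(reinterpret_cast<std::uintptr_t>(raw), alignment));
}
```

For example, with a raw buffer declared as `unsigned char raw[64 + 15]`, `aligned_ptr(raw, 16)` yields a 16-byte-aligned pointer with at least 64 usable bytes behind it, at the cost of up to 15 wasted bytes.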