by Michael Stoner, senior applications engineer, Intel Corp.
Misaligned memory access is a common problem when optimizing code with Streaming SIMD Extensions 2 (SSE2). An SSE2 algorithm often requires loading and storing data 16 bytes at a time to match the size of the XMM registers. If alignment cannot be guaranteed, part of the performance gained by processing multiple data elements in parallel is lost, because the compiler or the assembly programmer must fall back on unaligned move instructions (MOVDQU rather than MOVDQA).
How large is the penalty? Empirical evaluation on a 2.8-GHz Pentium 4 processor system shows that an unaligned 16-byte load contained within one cache line (128 bytes) is only moderately slower, about 40 percent, than an aligned access. The cost rises sharply, though, when the 16-byte chunk crosses a cache line boundary. Such cache-line-splitting loads can be up to five times slower!
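The three cases described above (aligned, unaligned within one cache line, and split across two lines) can be told apart from the load address alone. The following is a minimal sketch, not code from the article: the names `load_kind` and `classify_16byte_load` are hypothetical, and the cache line size is passed in as a parameter rather than assumed, since it varies by processor and cache level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of a 16-byte load by its address. */
typedef enum {
    LOAD_ALIGNED,              /* address is a multiple of 16 */
    LOAD_UNALIGNED_SAME_LINE,  /* misaligned, but all 16 bytes sit in one cache line */
    LOAD_SPLITS_LINE           /* the 16 bytes straddle a cache line boundary */
} load_kind;

/* line_size is an assumption supplied by the caller; it must be a
   nonzero multiple of 16 so that an aligned load can never split a line. */
static load_kind classify_16byte_load(uintptr_t addr, size_t line_size)
{
    size_t offset_in_line;

    assert(line_size != 0 && line_size % 16 == 0);

    if ((addr & 15u) == 0)
        return LOAD_ALIGNED;

    offset_in_line = (size_t)(addr % line_size);
    return (offset_in_line + 16 > line_size) ? LOAD_SPLITS_LINE
                                             : LOAD_UNALIGNED_SAME_LINE;
}
```

A helper like this can be useful when instrumenting a routine to count how many of its loads fall into the expensive split case for a given line size.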
These penalties can sometimes be avoided by forcing 16-byte alignment on the data structures from which the SIMD operands are being drawn. You can do this rather cleanly using either the "__declspec(align(16))" directive for static variables, or the "_mm_malloc()" call for dynamic memory allocation. Both of these language extensions originated with the Intel® Compiler but are also supported in more recent versions of the Microsoft Visual C++ compiler (see the Intel Compiler documentation for details).
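As a rough illustration of the two mechanisms just mentioned, the sketch below declares a 16-byte-aligned static buffer and allocates an aligned dynamic buffer with `_mm_malloc()`. The `ALIGN16` macro, the buffer, and the helper names are assumptions made for this example; the macro falls back to the GCC-style `__attribute__((aligned(16)))` on compilers that do not accept `__declspec`.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; pulls in _mm_malloc/_mm_free on common toolchains */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portability macro: __declspec(align(16)) where available,
   otherwise the GCC/Clang attribute with the same effect. */
#if defined(_MSC_VER)
#define ALIGN16 __declspec(align(16))
#else
#define ALIGN16 __attribute__((aligned(16)))
#endif

/* Statically allocated 16x16 block of one-byte pixels, 16-byte aligned. */
ALIGN16 static uint8_t static_block[16 * 16];

static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Dynamic allocation with 16-byte alignment; release with _mm_free(). */
static uint8_t *alloc_aligned16(size_t bytes)
{
    return (uint8_t *)_mm_malloc(bytes, 16);
}

/* Copy a contiguous 16x16 block using aligned 16-byte loads and stores.
   Both pointers must be 16-byte aligned, or the aligned moves will fault. */
static void copy_block_aligned(uint8_t *dst, const uint8_t *src)
{
    int row;

    assert(is_aligned16(dst) && is_aligned16(src));
    for (row = 0; row < 16; ++row) {
        __m128i v = _mm_load_si128((const __m128i *)(src + row * 16));
        _mm_store_si128((__m128i *)(dst + row * 16), v);
    }
}
```

Memory obtained from `_mm_malloc()` should be released with `_mm_free()`, not `free()`.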
For some algorithms, however, misalignment is inherent and cannot be designed away. The motion estimation and motion compensation algorithms used in a video codec are good examples. These algorithms typically process pixel data in 16x16 chunks, also known as macroblocks. While this matches the XMM register size well (each pixel is one byte, so one row of a macroblock fills one register), misaligned loads are prevalent because the 16x16 block can reside anywhere within the video frame.
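To make the macroblock case concrete, here is a hedged sketch of a sum-of-absolute-differences (SAD) comparison of the kind a motion estimation search might perform. It is not the routine from this paper; `block_at` and `sad_16x16` are hypothetical names. The current block is assumed to be 16-byte aligned, while the candidate reference block starts at an arbitrary (x, y) position in the frame and therefore has to be read with unaligned loads.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Address of the pixel at column x, row y in a frame with one byte per
   pixel and `stride` bytes per row. Any x is allowed, so the result is
   generally not 16-byte aligned. */
static const uint8_t *block_at(const uint8_t *frame, int stride, int x, int y)
{
    return frame + (size_t)y * (size_t)stride + (size_t)x;
}

/* SAD between an aligned current block and an arbitrarily placed
   reference block. Assumes cur is 16-byte aligned and cur_stride is a
   multiple of 16, so every row of cur can use an aligned load. */
static unsigned sad_16x16(const uint8_t *cur, int cur_stride,
                          const uint8_t *ref, int ref_stride)
{
    __m128i acc = _mm_setzero_si128();
    int row;

    assert(((uintptr_t)cur & 15u) == 0 && cur_stride % 16 == 0);

    for (row = 0; row < 16; ++row) {
        /* Aligned load: alignment of the current block is guaranteed. */
        __m128i c = _mm_load_si128((const __m128i *)(cur + row * cur_stride));
        /* Unaligned load: the reference block can start at any byte. */
        __m128i r = _mm_loadu_si128((const __m128i *)(ref + row * ref_stride));
        /* _mm_sad_epu8 yields two partial sums, one per 64-bit half. */
        acc = _mm_add_epi32(acc, _mm_sad_epu8(c, r));
    }

    /* Add the low-half and high-half partial sums. */
    return (unsigned)_mm_cvtsi128_si32(acc) +
           (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
}
```

Every iteration issues one unaligned load, and depending on x some of those loads also cross a cache line boundary, which is where the larger penalty described earlier applies.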
Where unaligned access is unavoidable, several techniques can be employed to minimize the performance loss. This paper presents guidelines to consider, using a quarter-pixel interpolation routine as an illustration.