Reducing the impact of misaligned memory accesses
Misalignment of memory access is a problem commonly encountered when optimizing code with Streaming SIMD Extensions 2. An SSE2 algorithm often requires loading and storing data 16 bytes at a time to match the size of the XMM registers. If alignment cannot be guaranteed, some part of the performance gain achieved by processing multiple data elements in parallel will be lost because either the compiler or assembly programmer must use unaligned move instructions.

by Michael Stoner, senior applications engineer, Intel Corp.

How large is the penalty? Empirical evaluation on a 2.8-GHz Pentium 4 processor system shows that an unaligned 16-byte load contained within one cache line (128 bytes) is only moderately slower, about 40 percent, than an aligned access. The cost rises sharply, though, when the 16-byte chunk crosses a cache line boundary. Such cache-line-splitting loads can be up to five times slower!
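To make the three cases concrete, here is a minimal sketch that classifies a 16-byte load by its starting address. The names `classify_load16` and `load_kind` are hypothetical, not from any library, and the cache line size is passed as a parameter because it differs between processors (128 is the figure quoted above).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum load_kind {
    LOAD_ALIGNED,     /* starts on a 16-byte boundary              */
    LOAD_UNALIGNED,   /* misaligned, but stays inside one line     */
    LOAD_LINE_SPLIT   /* misaligned and straddles two cache lines  */
};

/* Classify a 16-byte load that starts at addr.
   line_size must be a power of two and at least 16. */
enum load_kind classify_load16(uintptr_t addr, size_t line_size)
{
    assert(line_size >= 16 && (line_size & (line_size - 1)) == 0);

    if ((addr & 15u) == 0)
        return LOAD_ALIGNED;

    /* Offset of the first byte within its cache line. */
    size_t offset = (size_t)(addr & (uintptr_t)(line_size - 1));

    /* The load covers bytes offset .. offset + 15 of the line. */
    if (offset + 16 > line_size)
        return LOAD_LINE_SPLIT;

    return LOAD_UNALIGNED;
}
```

With a 128-byte line, a load at offset 8 is merely unaligned, while a load at offset 120 runs past the end of the line and falls into the expensive split case.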

These penalties can sometimes be avoided by forcing 16-byte alignment on the data structures from which the SIMD operands are being drawn. You can do this rather cleanly using either the "__declspec(align(16))" directive for static variables, or the "_mm_malloc()" call for dynamic memory allocation. Both of these language extensions originated with the Intel® Compiler but are also supported in more recent versions of the Microsoft Visual C++ compiler (see the Intel Compiler documentation for details).
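As a rough sketch of both approaches, the fragment below aligns a static table and a dynamically allocated buffer, then uses the aligned load and store intrinsics on them. The `ALIGN16` macro, the table contents, and the function names are assumptions made for illustration; the `__attribute__((aligned(16)))` branch is the GCC/Clang spelling and is not mentioned in this article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics; also brings in _mm_malloc/_mm_free */

/* Hypothetical convenience macro: __declspec(align(16)) on Microsoft and
   Intel (Windows) compilers, the GCC/Clang attribute elsewhere. */
#if defined(_MSC_VER)
#define ALIGN16 __declspec(align(16))
#else
#define ALIGN16 __attribute__((aligned(16)))
#endif

/* Static data forced onto a 16-byte boundary. */
static ALIGN16 uint8_t k_table[16] = {
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
};

int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Copies k_table into `blocks` consecutive 16-byte slots of an aligned
   heap buffer and returns the sum of all bytes written, or -1 if the
   allocation fails. Call with blocks >= 1. */
long add_table_aligned(size_t blocks)
{
    uint8_t *buf = (uint8_t *)_mm_malloc(blocks * 16, 16);
    if (buf == NULL)
        return -1;

    /* Both pointers are aligned, so the aligned forms are safe. */
    assert(is_aligned16(buf) && is_aligned16(k_table));
    __m128i t = _mm_load_si128((const __m128i *)k_table);

    for (size_t i = 0; i < blocks; ++i)
        _mm_store_si128((__m128i *)(buf + 16 * i), t);

    long sum = 0;
    for (size_t i = 0; i < blocks * 16; ++i)
        sum += buf[i];

    _mm_free(buf);   /* memory from _mm_malloc must go back via _mm_free */
    return sum;
}
```

Note that `_mm_load_si128` and `_mm_store_si128` fault on a misaligned address, which is why the alignment has to be guaranteed at the point of declaration or allocation rather than assumed.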

Depending on the algorithm, some misalignments cannot be eliminated by any means. Motion estimation and motion compensation algorithms used in a video codec are good examples. These algorithms typically process pixel data in 16x16 blocks, also known as macroblocks. While a 16-pixel row matches the XMM register size well (each pixel is one byte in length), misaligned loads are prevalent because the 16x16 block can reside anywhere within the video frame.
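The following sketch shows why the unaligned load is forced in this situation. It computes a sum of absolute differences (SAD), a common motion estimation cost, between a current macroblock and a candidate block at an arbitrary (x, y) position in a reference frame. The function names, the frame layout (one byte per pixel, rows `stride` bytes apart), and the choice of SAD are assumptions for illustration, not the routine discussed later in this paper.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Pointer to the top-left pixel of the block at (x, y). Because x can be
   any value, the result is 16-byte aligned only by coincidence. */
const uint8_t *block_at(const uint8_t *frame, ptrdiff_t stride, int x, int y)
{
    return frame + (ptrdiff_t)y * stride + x;
}

/* SAD between two 16x16 blocks. Both are loaded with the unaligned form
   here for simplicity; if the current block is known to be aligned, its
   loads could use _mm_load_si128 instead. */
unsigned sad_16x16(const uint8_t *cur, ptrdiff_t cur_stride,
                   const uint8_t *ref, ptrdiff_t ref_stride)
{
    __m128i acc = _mm_setzero_si128();

    for (int row = 0; row < 16; ++row) {
        __m128i c = _mm_loadu_si128((const __m128i *)(cur + row * cur_stride));
        __m128i r = _mm_loadu_si128((const __m128i *)(ref + row * ref_stride));

        /* _mm_sad_epu8 yields two partial sums, one per 64-bit half. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(c, r));
    }

    /* Add the low and high partial sums together. */
    return (unsigned)_mm_cvtsi128_si32(acc) +
           (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
}
```

A motion search calls a routine like this for many candidate positions, so nearly every reference load is misaligned and a fraction of them split a cache line.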

Where unaligned access is unavoidable, several techniques can be employed to minimize the performance loss. This paper presents guidelines to consider, using a quarter-pixel interpolation routine to illustrate.
