Reducing the impact of misaligned memory accesses
by Michael Stoner, senior applications engineer, Intel Corp.

Misaligned memory access is a common problem when optimizing code with Streaming SIMD Extensions 2 (SSE2). An SSE2 algorithm often loads and stores data 16 bytes at a time to match the size of the XMM registers. If alignment cannot be guaranteed, part of the performance gained by processing multiple data elements in parallel is lost, because the compiler or the assembly programmer must fall back on unaligned move instructions.

How large is the penalty? Empirical evaluation on a 2.8-GHz Pentium 4 processor system shows that an unaligned 16-byte load contained within one cache line (128 bytes) is only moderately slower, about 40 percent, than an aligned access. The cost rises sharply, though, when the 16-byte chunk crosses a cache line boundary. Such cache-line-splitting loads can be up to five times slower!
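
The three cases described above (aligned, unaligned within a line, and split across a line boundary) can be told apart from the address alone. The following is a minimal sketch, not code from the article: the names `load_kind` and `classify_load16` are hypothetical, and the cache line size is passed in as a parameter rather than assumed, so the 128-byte figure quoted above is only one possible argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the three cases discussed in the text. */
typedef enum { LOAD_ALIGNED, LOAD_UNALIGNED, LOAD_LINE_SPLIT } load_kind;

/* Classify a 16-byte load that starts at addr.
 * line_size is the cache line size in bytes and must be a power of two >= 16. */
static load_kind classify_load16(uintptr_t addr, size_t line_size)
{
    assert(line_size >= 16 && (line_size & (line_size - 1)) == 0);

    if ((addr & 15u) == 0)
        return LOAD_ALIGNED;              /* starts on a 16-byte boundary */

    /* Offset of the first byte within its cache line. */
    size_t offset_in_line = (size_t)(addr & (line_size - 1));

    if (offset_in_line + 16 > line_size)
        return LOAD_LINE_SPLIT;           /* last byte lands in the next line */

    return LOAD_UNALIGNED;                /* misaligned, but within one line */
}
```

A profiling pass could call a helper like this on the addresses a hot loop touches to estimate how often the expensive split case occurs.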

These penalties can sometimes be avoided by forcing 16-byte alignment on the data structures from which the SIMD operands are being drawn. You can do this rather cleanly using either the "__declspec(align(16))" directive for static variables, or the "_mm_malloc()" call for dynamic memory allocation. Both of these language extensions originated with the Intel® Compiler but are also supported in more recent versions of the Microsoft Visual C++ compiler (see the Intel Compiler documentation for details).
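
As a rough illustration of the two mechanisms just mentioned, the sketch below declares a 16-byte-aligned static table and allocates a 16-byte-aligned buffer. The `ALIGN16` macro and the names `coeff_table`, `alloc_frame`, `free_frame`, and `is_aligned16` are assumptions made for this example only. The macro falls back to the GCC/Clang `__attribute__((aligned(16)))` spelling where `__declspec` is not available. Memory from `_mm_malloc()` must be released with `_mm_free()`, not `free()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   /* declares _mm_malloc and _mm_free */

/* Hypothetical portability macro: __declspec(align(16)) on Microsoft-compatible
 * compilers, the equivalent attribute elsewhere. */
#if defined(_MSC_VER)
#define ALIGN16 __declspec(align(16))
#else
#define ALIGN16 __attribute__((aligned(16)))
#endif

/* Static data forced onto a 16-byte boundary. */
ALIGN16 static unsigned char coeff_table[64];

/* Returns nonzero if p is 16-byte aligned. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Dynamically allocate a buffer whose start is 16-byte aligned. */
static unsigned char *alloc_frame(size_t bytes)
{
    return (unsigned char *)_mm_malloc(bytes, 16);
}

/* Buffers from _mm_malloc must be released with _mm_free. */
static void free_frame(unsigned char *frame)
{
    _mm_free(frame);
}
```

Note that alignment of the buffer start only guarantees aligned access at offsets that are themselves multiples of 16.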

For some algorithms, misalignment is inherent and cannot be removed by aligning the data. Motion estimation and motion compensation algorithms used in a video codec are good examples. These algorithms typically process pixel data in 16x16 chunks, also known as macroblocks. Each pixel is one byte, so one row of a macroblock fits an XMM register exactly, but misaligned loads are prevalent because the 16x16 block can reside anywhere within the video frame.
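
To make the macroblock case concrete, here is a minimal sketch of a 16x16 sum of absolute differences (SAD), a typical motion-estimation kernel, written with SSE2 intrinsics. It is an illustration under stated assumptions, not the routine discussed in this article: the function name `sad_16x16` and the `stride` parameter (bytes between consecutive rows of the frame) are hypothetical. Because the block may start at any pixel position, every row is read with the unaligned load `_mm_loadu_si128`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Sum of absolute differences between two 16x16 blocks of 8-bit pixels.
 * cur and ref point at the top-left pixel of each block; consecutive rows
 * are stride bytes apart. Neither pointer needs to be 16-byte aligned. */
static unsigned sad_16x16(const uint8_t *cur, const uint8_t *ref, size_t stride)
{
    __m128i acc = _mm_setzero_si128();

    for (int row = 0; row < 16; ++row) {
        /* Unaligned 16-byte loads: one full macroblock row each. */
        __m128i a = _mm_loadu_si128((const __m128i *)(cur + row * stride));
        __m128i b = _mm_loadu_si128((const __m128i *)(ref + row * stride));

        /* _mm_sad_epu8 yields one partial sum per 64-bit half. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));
    }

    /* Add the low-half and high-half partial sums. */
    return (unsigned)_mm_cvtsi128_si32(acc) +
           (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
}
```

Replacing `_mm_loadu_si128` with the aligned `_mm_load_si128` here would fault for any block whose rows do not start on a 16-byte boundary, which is why the unaligned form is the safe default for this access pattern.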

Where unaligned access is unavoidable, several techniques can be employed to minimize the performance loss. This paper will present several guidelines to consider, using a quarter-pixel–interpolation routine to illustrate.
