Data placement in threaded programs
The move to multithreading is inevitable - as is the need to understand the intricacies, such as the problem of data that is modified by threads on different physical processors.

by Andrew Binstock, principal analyst, Pacific Data Works LLC. Intel Corp.

The previous two installments of this column have discussed adjustments that programmers need to make when moving from single-threaded to multi-threaded programming. I paid specific attention to the aspects that apply to Hyper-Threading Technology, which allows multiple threads to execute on the same physical processor. In the first installment I discussed the need for the new conception of program and data design based around threads as units of execution. Then, in the second installment I presented an overview of the challenges caused by thread interactions, with specific focus on how operating system APIs and special processor instructions are effective ways of synchronizing threads if done correctly. If done incorrectly, however, I noted these techniques are punished severely by the system with poor performance. The key is to know what you're doing and to thoroughly test performance of the finished software.

While that discussion focused on thread synchronization via synchronization variables, it is here expanded to the problem of data that is modified by threads on different physical processors.

Sharing data between physical processors
Threads on separate physical processors that modify common data encounter both physical and programmatic challenges. The programmatic challenge arises from a thread's need to control the data in order to modify it. If several threads are accessing the same data item, the thread must observe an operating-system–specific protocol for getting solo access to the data item in order to update it. All modern operating systems have extensive tools and APIs that enable programmers to accomplish this task. They rely on mutexes, locks, and critical sections—all concepts that manage access among threads—to handle the problem. Because these tools operate at the operating-system level, they tend to be implemented differently, and the literature specific to the given operating system should be consulted. Some useful references are included at the end of this article.

However, while operating system APIs for handling shared data are effective, they tend to be slow. In addition, the necessary interactions between threads can be complicated and it is easy to have threads affect each other detrimentally. Hence, programmers are well advised to redesign programs, even change algorithms to minimize sharing of modifiable data.

The silicon-level challenge caused by this sharing is the effect that data updates have on the individual processors. This problem is distinct from the ownership contention just discussed. When a shared data item is modified, all processors that have a copy of the item in cache are notified their copy is invalid, and they are required to refresh their cache with the new data. This activity ensures cache coherency—a principle that obliges all cache instances of a data item on the system to be identical.

The overhead caused by cache coherency is substantial and can be reduced only by diminishing its frequency. One way that works in a subset of cases is to have each thread maintain its own copy of the data and update the common data only when necessary. For example, the common data might contain fields of varying importance, such that some fields do not have to be updated immediately upon modification. In such cases, the private data can hold the updated version and refresh the common data all at once when it is imperative to do so.

The problem of cache coherency, however, can have a costly side effect that occurs even when two programs are not sharing common data. This problem is called false sharing.

False Sharing
On Pentium® 4 and Xeon™ processors, the minimum size of cache blocks is 128 bytes. (The term cache line is frequently used to refer to the minimum block size loaded into cache.) This block concept means that when a variable is loaded into cache, the entire 128-byte block in which it's located is moved into cache. The cache itself does not know or care which specific bytes of the 128 are to be used by the thread; the cache simply fetches the whole block.

False sharing occurs when two threads store data items private to each of them in the same 128-byte data block. Since the block with these different variables is read as a whole into cache, each time one thread updates its own variable the entire block is updated in all the various caches. As a result, a processor is forced to update the block in its cache even though the data element it's interested in has not changed. This activity is caused by the purely accidental placement of thread 1's variable in the same cache block that contains thread 2's data.

The penalty for false sharing can be extremely high. In many cases it can degrade performance by as much as 50%. To solve the problem of false sharing, programmers should make sure that frequently used data items assigned to different threads are separated in memory by more than the size of a cache line, that is by more than 128 bytes for Intel® Pentium 4 and Intel Xeon processors. Padding out the space is the simple way to guarantee this.

As described here, false sharing is primarily a concern when separate physical processors share the same cache line. However, on Hyper-Threading Technology enabled systems—where two threads share the same cache—the problem exists in a different form. Due to the way that the Intel NetBurst™ microarchitecture sequences the reads and writes as it speculatively executes instructions, false sharing can induce a "machine clear" exception on Hyper-Threading Technology enabled systems. This event forces the processor to retire all pending instructions and restart the execution pipeline. As we have seen in previous columns, restarting an execution pipeline is an expensive proposition that causes an immediate drag on performance. Data from Intel suggests that false sharing in such systems can cause a tenfold increase in machine clear counts. Avoid false sharing at all costs.


If you're interested in this topic, these articles may be helpful:

Getting Started with OpenMP*
Abstract By Richard Gerber As you probably know by now, to get the m...
Threading Games for High Performance on Intel Processors
The evolution of the multi-threaded processor design is the trend for ...
Multithreaded Technology and Multi-Core Processors
Introduction by Craig Szydlowski Infrastructure Processor Divi...
Transforming Education with Technology: An overview of Intel’s Worldwide Programs
Intel: More Than 35 Years of Revolutionary Technology By enabling ...
Backgrounder: Intel Government-Assisted PC Programs
Introduction Increasingly, governments are realising that Information...

Related Jobs:

Data Architect #3101985 - TX - Houston - Ajilon
Description : Ajilon Consulting, a leader in the placement of Informa...
Data Center Support / Operator (Linux, Unix, Cisco) - CA - Santa Clara county - Sunrise Systems Inc.
Job Description: Data Center Support / Operator (Linux, Unix, Cisco) ...
Business Analyst - PI #972 - MO- Fenton - Maritz Inc.
This position will report to the Director, Marketing and Business Inte...
Security And Development Manager #951 - NJ - Montvale - RCG Information Technology
Description: All applicants must have a minimum of 3 years IT Indus...
Sr. J2EE/Java Developer #3265527 - CA - Irvine - Ajilon
Description : OUTSTANDING PAY - FULL BENEFITS MUST BE US CITI...
SPADOC/SDS Subject Matter Expert (IBM/UNIX) #361276 - CA - Lompoc - Zel Technologies, LLC
Placement will be immediate with job candidates required to report w...
Senior Software Engineer - Windows Mobile Developer - GA - Alpharetta - iAnywhere, a Sybase Company
iAnywhere, a Sybase company, enables success at the front lines of bu...
Enterprise Architect #508082 - VA - Chantilly - BAE Systems
Description: Act in the capacity of the BAE SYSTEMS Technical Lead; r...
Senior Software Engineer, Online Search Development #13B990DAFA945403 - NJ - Edison - Ask Jeeves,Inc.
Title Senior Software Engineer, Online Search Development City Ediso...
Development Release Manager #1807 - MA - Tewksbury - Avid Technology. Inc.
Avid Technology, Inc. is the world wide leader in media creation/editi...