|
|
Intel BlogsWhitepapers
|
Data placement in threaded programsThe move to multithreading is inevitable - as is the need to understand the intricacies, such as the problem of data that is modified by threads on different physical processors.
by Andrew Binstock, principal analyst, Pacific Data Works LLC. Intel Corp. The previous two installments of this column have discussed adjustments that programmers need to make when moving from single-threaded to multi-threaded programming. I paid specific attention to the aspects that apply to Hyper-Threading Technology, which allows multiple threads to execute on the same physical processor. In the first installment I discussed the need for the new conception of program and data design based around threads as units of execution. Then, in the second installment I presented an overview of the challenges caused by thread interactions, with specific focus on how operating system APIs and special processor instructions are effective ways of synchronizing threads if done correctly. If done incorrectly, however, I noted these techniques are punished severely by the system with poor performance. The key is to know what you're doing and to thoroughly test performance of the finished software. While that discussion focused on thread synchronization via synchronization variables, it is here expanded to the problem of data that is modified by threads on different physical processors.
Sharing data between physical processors However, while operating system APIs for handling shared data are effective, they tend to be slow. In addition, the necessary interactions between threads can be complicated and it is easy to have threads affect each other detrimentally. Hence, programmers are well advised to redesign programs, even change algorithms to minimize sharing of modifiable data. The silicon-level challenge caused by this sharing is the effect that data updates have on the individual processors. This problem is distinct from the ownership contention just discussed. When a shared data item is modified, all processors that have a copy of the item in cache are notified their copy is invalid, and they are required to refresh their cache with the new data. This activity ensures cache coherency—a principle that obliges all cache instances of a data item on the system to be identical. The overhead caused by cache coherency is substantial and can be reduced only by diminishing its frequency. One way that works in a subset of cases is to have each thread maintain its own copy of the data and update the common data only when necessary. For example, the common data might contain fields of varying importance, such that some fields do not have to be updated immediately upon modification. In such cases, the private data can hold the updated version and refresh the common data all at once when it is imperative to do so. The problem of cache coherency, however, can have a costly side effect that occurs even when two programs are not sharing common data. This problem is called false sharing.
False Sharing False sharing occurs when two threads store data items private to each of them in the same 128-byte data block. Since the block with these different variables is read as a whole into cache, each time one thread updates its own variable the entire block is updated in all the various caches. As a result, a processor is forced to update the block in its cache even though the data element it's interested in has not changed. This activity is caused by the purely accidental placement of thread 1's variable in the same cache block that contains thread 2's data. The penalty for false sharing can be extremely high. In many cases it can degrade performance by as much as 50%. To solve the problem of false sharing, programmers should make sure that frequently used data items assigned to different threads are separated in memory by more than the size of a cache line, that is by more than 128 bytes for Intel® Pentium 4 and Intel Xeon processors. Padding out the space is the simple way to guarantee this. As described here, false sharing is primarily a concern when separate physical processors share the same cache line. However, on Hyper-Threading Technology enabled systems—where two threads share the same cache—the problem exists in a different form. Due to the way that the Intel NetBurst™ microarchitecture sequences the reads and writes as it speculatively executes instructions, false sharing can induce a "machine clear" exception on Hyper-Threading Technology enabled systems. This event forces the processor to retire all pending instructions and restart the execution pipeline. As we have seen in previous columns, restarting an execution pipeline is an expensive proposition that causes an immediate drag on performance. Data from Intel suggests that false sharing in such systems can cause a tenfold increase in machine clear counts. Avoid false sharing at all costs.
If you're interested in this topic, these articles may be helpful:
Related Jobs: |
|
|