by Alan Zeichick, principal analyst, Camden Associates. Intel Corp.
It's fast, efficient, and ideal for desktop and server applications, but Intel Corp.'s Pentium® 4 processor, like all microprocessors, has its quirks. That means some application code, whether written in C, C++, or assembler, might look great and might indeed execute correctly. But the app might be v-e-r-y slow because of specific features of the processor's architecture or (more commonly) the design of the compiler.
Floating away
Take the task of converting a floating-point number to a 32-bit integer. It's a common enough operation, which, according to the ANSI C/C++ definitions, should be handled by simply truncating the fractional portion of the number.
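As a minimal illustration of that rule, the cast below discards the fractional part, so the result always moves toward zero. The function name is hypothetical and exists only for this example.

```c
/* Float-to-int conversion as the C language defines it: the fractional
   part is discarded, so the value moves toward zero. */
int to_int_truncating(float value)
{
    return (int)value;
}
```

Calling to_int_truncating(2.7f) yields 2, and to_int_truncating(-2.7f) yields -2 rather than -3, because truncation is not the same as rounding down.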
Microsoft Visual C++, version 6.0, handles this operation by issuing a call to the _ftol (float to long int) function, which is included in the Microsoft C runtime library. And indeed, a properly ANSI-compliant integer does appear after you make the call.
That _ftol function is a general-purpose call, which can handle both 32-bit and 64-bit conversion. It temporarily modifies the floating-point rounding mode to perform a straight truncation, performs the conversion, and then sets the rounding mode back to its previous state. This method, according to Intel research, is a long-latency operation, much longer than you might think.
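The save, truncate, convert, restore sequence can be sketched roughly as follows. This is not Microsoft's implementation: it assumes an x86 compiler that provides xmmintrin.h, and it switches the rounding mode in the SSE control register (MXCSR) instead of the x87 control word that a routine like _ftol would touch. The function name is hypothetical.

```c
#include <xmmintrin.h>  /* SSE control-register macros and conversions */

/* Hypothetical helper that mirrors the four steps described above. */
int convert_with_mode_switch(float value)
{
    /* volatile discourages the compiler from moving the conversion
       across the two rounding-mode changes */
    volatile float input = value;
    volatile int result;

    unsigned int saved_mode = _MM_GET_ROUNDING_MODE();  /* 1. save mode   */
    _MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO);       /* 2. truncate    */
    result = _mm_cvtss_si32(_mm_set_ss(input));         /* 3. convert     */
    _MM_SET_ROUNDING_MODE(saved_mode);                  /* 4. restore     */
    return result;
}
```

Every call pays for two control-register updates in addition to the conversion itself, which gives a feel for why the pattern is slower than a single conversion instruction.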
How can you see if your application has this problem? In a couple of ways. You can create an assembly listing and scan it for an explicit float-to-int cast that's being compiled as a call to _ftol. You can also check the compiler messages to see if there's an implicit cast that generates a warning.
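The two kinds of conversion to look for might resemble the sketch below; both function names are hypothetical. With Visual C++, the /FAs option produces an assembly listing with the source interleaved, and the implicit form typically draws warning C4244 (conversion with possible loss of data), depending on the warning level in use.

```c
/* Explicit cast: easy to find by searching the source for "(int)". */
int scale_explicit(float sample, float gain)
{
    return (int)(sample * gain);
}

/* Implicit conversion: no cast appears in the source, so the compiler
   warning or the assembly listing is the practical way to spot it. */
int scale_implicit(float sample, float gain)
{
    int scaled = sample * gain;
    return scaled;
}
```

Both functions perform the same truncating conversion, so both are candidates for the slow path described above.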
If you have float-to-int operations in your code, there are several steps you can take. One is to enable the /QIfist compiler switch. This switch suppresses the use of _ftol when performing conversions and makes the operation go much faster. That's the good news. The bad news is that the conversion then uses whatever rounding mode is currently in effect, so the results may not quite conform to the ANSI C standard unless you change the default rounding mode to "truncate" using the _controlfp C runtime function declared in the float.h header. However, use this trick with caution: changing that default might have unwanted side effects elsewhere in your application, so be sure to test the app thoroughly.
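The difference between a conversion that honors the current rounding mode and one that truncates can be seen in a small sketch. It assumes an x86 compiler with xmmintrin.h and uses the SSE instruction behind _mm_cvtss_si32 as a stand-in for the inline conversion that /QIfist produces; all three function names are hypothetical. In Visual C++ itself, the rounding-mode change the article describes would be written as _controlfp(_RC_CHOP, _MCW_RC).

```c
#include <xmmintrin.h>  /* SSE control-register macros and conversions */

/* Stand-in for a conversion that uses whatever rounding mode is
   currently set, instead of always truncating. */
int convert_current_mode(float value)
{
    volatile float input = value;  /* keeps the conversion at run time */
    volatile int result = _mm_cvtss_si32(_mm_set_ss(input));
    return result;
}

/* Switch the SSE rounding mode to truncation; returns the old mode. */
unsigned int enter_truncate_mode(void)
{
    unsigned int previous = _MM_GET_ROUNDING_MODE();
    _MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO);
    return previous;
}

/* Put back the mode returned by enter_truncate_mode(). */
void restore_rounding_mode(unsigned int previous)
{
    _MM_SET_ROUNDING_MODE(previous);
}
```

Under the usual round-to-nearest default, convert_current_mode(2.7f) returns 3, while a C cast returns 2. After enter_truncate_mode() the same call returns 2, but every other operation that depends on the rounding mode is affected as well until restore_rounding_mode() is called, which is the side effect to test for.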
Another option is to use the Streaming SIMD Extensions (SSE and SSE2) found in the Intel® Pentium 4 and Intel Xeon™ processors to accelerate the operation. Here, the appropriate instruction is CVTTSS2SI for single-precision values (it dates from the original SSE set) or its SSE2 counterpart, CVTTSD2SI, for double-precision values. Intel recommends that you implement the conversion using intrinsics rather than inline assembly code. The only problem, of course, is that you'll have to set up code branches for older processors that lack the extension you rely on; the Intel Pentium III chip, for example, lacks SSE2.
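A minimal sketch of the intrinsic approach follows, assuming an x86 compiler that ships xmmintrin.h and emmintrin.h. The intrinsic _mm_cvttss_si32 maps to CVTTSS2SI, and _mm_cvttsd_si32 is the double-precision counterpart that maps to CVTTSD2SI. The wrapper names are hypothetical.

```c
#include <xmmintrin.h>  /* _mm_set_ss, _mm_cvttss_si32 */
#include <emmintrin.h>  /* _mm_set_sd, _mm_cvttsd_si32 */

/* Truncating float-to-int conversion via CVTTSS2SI. */
int float_to_int_simd(float value)
{
    return _mm_cvttss_si32(_mm_set_ss(value));
}

/* Truncating double-to-int conversion via CVTTSD2SI. */
int double_to_int_simd(double value)
{
    return _mm_cvttsd_si32(_mm_set_sd(value));
}
```

Both instructions always truncate, regardless of the current rounding mode, so no mode switch is needed. In a shipping application these wrappers would sit behind the processor check mentioned above, with a plain cast as the fallback path.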
For more, see Microsoft's online C++ reference on SSE2-based numerical conversions. Another resource is Microsoft's "Floating-Point Intrinsics Using Streaming SIMD Extensions."