by Khang Nguyen, senior applications engineer, Software and Solutions Group, Intel Corp.
Media transcoding, which enables media interoperation, plays an important role in the digital home. The Intel Networked Media Product Requirements (INMPR) promotes interoperation between networked devices in the digital home. Optimizing the codec engine (the encoder/decoder, the heart of the transcoder) makes the media transcoding process more efficient, which in turn improves the user experience in the digital home. This paper features practical tips and tricks for increasing the performance of the codec engine. These tips include using Intel® VTune™ Performance Analyzer events, OpenMP for threading, and Prescott New Instructions (Streaming SIMD Extensions 3 (SSE3)). We also discuss when to use faster instructions, how to employ different execution units to improve parallelism, and when to use MMX™ instead of SSE for speed. You will also learn when to take advantage of the Intel compiler's optimization switches.
What is Transcoding?
Since content comes in many different formats, transcoding is necessary to tailor the content, converting it from one media format to another before it arrives at the other device. The most common way to convert one media format to another is to first decode to raw data, then encode to the target format. Since an MPEG stream consists of audio and video, we need to split the stream into its audio and video parts, decode each part into raw data, re-encode each to the desired format, and then merge them again.
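The decode-then-encode flow described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the two "formats" (unsigned 8-bit PCM as the source, signed 16-bit PCM as the target) and the function names `decode_u8`, `encode_s16`, and `transcode_u8_to_s16` are hypothetical and stand in for real codecs, which are far more involved.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy formats, used only to illustrate the pipeline:
   source = unsigned 8-bit PCM, target = signed 16-bit PCM. */

/* Decode step: source format -> raw samples in [-1.0, 1.0). */
static void decode_u8(const uint8_t *src, float *raw, size_t n)
{
    for (size_t i = 0; i < n; i++)
        raw[i] = ((float)src[i] - 128.0f) / 128.0f;
}

/* Encode step: raw samples -> target format, with clamping. */
static void encode_s16(const float *raw, int16_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float scaled = raw[i] * 32768.0f;
        if (scaled > 32767.0f)  scaled = 32767.0f;
        if (scaled < -32768.0f) scaled = -32768.0f;
        dst[i] = (int16_t)scaled;
    }
}

/* Transcode = decode to raw data, then encode to the target format.
   The caller supplies a scratch buffer of n floats for the raw data. */
static void transcode_u8_to_s16(const uint8_t *src, int16_t *dst,
                                float *scratch, size_t n)
{
    decode_u8(src, scratch, n);
    encode_s16(scratch, dst, n);
}
```

In a real transcoder the decode and encode steps are where nearly all of the time goes, which is why the rest of this paper concentrates on the codec.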

Codec Optimization
A codec performs the compression and decompression. It is the heart, or engine, of the transcoder.
Optimizing the codec means reducing the time it takes to encode and/or decode a file or stream. We can also improve the engine by reducing CPU utilization, which lets us pack more features or data into the same time frame: for example, more voices to represent more people in a game. Finally, we need to cut down the codec's size for size-sensitive or mobile applications, since media applications run on desktop, laptop, PDA, and smartphone form factors.
General
The optimization process starts with the following steps:
- Use better hardware
- Use the Intel VTune Performance Analyzer to find hotspots
  - Look at functions that have the highest clock ticks and clock ticks per instruction retired (CPI)
  - Turn on counters for branch misprediction, store forwarding, 64K aliasing, cache split, and trace cache miss
- Follow general optimization rules
  - Loop unrolling, reduce branching, use SSE2/SSE3
- Use the Intel compiler
- Use the Intel Performance Library Suite
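As a rough illustration of two of the general rules listed above (loop unrolling and reduced branching), here is a minimal sketch. The kernel, a sum of absolute differences over two byte arrays, is a hypothetical example chosen because it resembles the inner loops found in video encoders; it is not taken from any particular codec. The function names are likewise assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward version: one element per iteration, with a branch. */
static uint32_t sad_simple(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] > b[i]) sum += (uint32_t)(a[i] - b[i]);
        else             sum += (uint32_t)(b[i] - a[i]);
    }
    return sum;
}

/* Branch-free absolute difference: mask is 0 when d >= 0, -1 when d < 0. */
static uint32_t absdiff(uint8_t x, uint8_t y)
{
    int32_t d = (int32_t)x - (int32_t)y;
    int32_t mask = -(int32_t)(d < 0);
    return (uint32_t)((d ^ mask) - mask);
}

/* Unrolled four times with independent accumulators, so the additions
   do not all depend on one another. A short tail loop handles the
   elements left over when n is not a multiple of 4. */
static uint32_t sad_unrolled(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += absdiff(a[i],     b[i]);
        s1 += absdiff(a[i + 1], b[i + 1]);
        s2 += absdiff(a[i + 2], b[i + 2]);
        s3 += absdiff(a[i + 3], b[i + 3]);
    }
    for (; i < n; i++)
        s0 += absdiff(a[i], b[i]);
    return s0 + s1 + s2 + s3;
}
```

Whether the unrolled version is actually faster depends on the compiler and processor, so measure both with the profiler before committing to either.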
Keep the following points in mind at all times:
- Check for all the common pitfalls (cache split, branch misprediction, store forwarding, etc.).
- Thread at the highest level possible to avoid running out of resources. Because the codec is an engine used by other applications, its functions can be called many times, especially when the calling applications are themselves threaded.
- Pay attention when threading applications that make use of Intel performance libraries, since some of their functions are threaded.
- Do not unroll loops too much, or you risk thrashing the trace cache.
- Do not ignore MMX. It can be faster than SSE/SSE2 when an application makes extensive use of 64-bit data and it takes effort to rearrange that data to fit into 128-bit registers.
- Watch out for battery life on mobile applications.
- Use the Intel compiler switches /O3, /QaxW, /QaxN, /QaxP, /Qipo, /Qparallel, and /Qopenmp. Often you can gain a significant amount of performance just by using the Intel compiler with the right switches.
- Use special functions like reciprocal (rcp and rcp_nr) to replace division with multiplication and speed up the application.
- Use SSE3 instruction LDDQU instead of MOVDQU whenever possible.
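- Use faster instructions where an equivalent exists.
- Spread work across different execution units to improve parallelism.
- Use MOVNTxx to store values with a non-temporal hint, which prevents caching of the data.
- Use combined instructions such as PMADDWD, which multiplies packed words and adds adjacent products in a single instruction.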
Examples
When to Use Threading
Before:
When Not to Use Threading
Before:
After:
At first, this loop seems to be a good candidate for threading. In fact, threading it will improve performance if the loop is at the outermost level. However, if this loop is in a function that is deeply buried in many sub-levels, threading it may mean running out of resources. In one case, this loop was implemented within a function that accounts for only about 8.8% of the total execution time. After threading only 2 such loops, the whole system ran 5X slower.
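The following is a minimal sketch of threading at the outermost level with OpenMP. It is not the original example from this article; the block size, the `scale_block` kernel, and the `scale_frame` driver are hypothetical names used only to show where the pragma goes. The inner kernel is deliberately left serial, and only the loop over whole blocks is threaded. Build with `/Qopenmp` on the Intel compiler (or `-fopenmp` on GCC); without OpenMP support the pragma is compiled out and the code runs serially with the same result.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 64  /* hypothetical number of samples per block */

/* Inner kernel: small and called many times, so it stays serial.
   Threading a loop like this one is what risks exhausting resources. */
static void scale_block(int16_t *block, int gain)
{
    for (int i = 0; i < BLOCK; i++)
        block[i] = (int16_t)(block[i] * gain);
}

/* Outermost loop: each thread processes whole, independent blocks. */
static void scale_frame(int16_t *frame, int nblocks, int gain)
{
    int b;
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (b = 0; b < nblocks; b++)
        scale_block(frame + (size_t)b * BLOCK, gain);
}
```

Because each iteration of the outer loop touches a separate block, the threads share no data and need no synchronization, and the cost of starting the parallel region is paid once per frame rather than once per block.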