We all crave high-performance code, and in the process we work hard to optimize algorithms, reorder instructions, unroll loops, avoid branches, reduce pointer usage so compilers can optimize more aggressively, replace dynamic allocation with static allocation where the size is known, and so on. One such optimization concerns data loads and stores from memory, which consume a majority of processing cycles in data-intensive applications. Here, I'll take you through one such optimization with respect to data alignment when using SSE (Streaming SIMD Extensions) instructions.
Why use SSE instructions?
SSE instructions operate on 16 bytes of data in parallel. We can load 16 bytes of data at a time and compute those 16 bytes of packed data using a single SSE instruction.
Example: ADDPS xmm1, xmm2 – Adds the 4 single-precision floating-point elements packed in the xmm1 register to the corresponding elements packed in xmm2 and stores the result back in xmm1.
SSE instructions are widely used in developing computation-intensive multimedia applications. Typically, these applications process large amounts of sequential data through the following steps:
- Load data from memory
- Perform computation on the data
- Store data back to memory
First we will discuss the intricacies involved in optimizing memory operations using SSE instructions on the AMD-K8™ family of processors (first and second generation AMD Opteron™ processors) and then we will discuss the architectural enhancements provided by the “Barcelona” or Family 10h processors (including Quad-Core AMD Opteron and AMD Phenom™ X4 Quad-Core and X3 Triple-Core processors).
SSE provides two types of load and store instructions. The first type is aligned loads and stores (e.g., MOVDQA, MOVAPD, MOVAPS), which require 16-byte-aligned memory addresses. The second type is unaligned loads and stores (e.g., MOVDQU, MOVUPD, MOVUPS), which operate on both aligned and unaligned memory addresses. On the AMD-K8 family of processors, the aligned versions of the load and store operations are faster than the unaligned versions even if the memory is 16-byte aligned. For details on the latencies of the various types of load and store instructions, refer to the AMD Software Optimization Guide for AMD Family 10h Processors.
If we use the aligned version of the memory operations without verifying memory address alignment, there are two possible outcomes. First, if the memory is aligned, the memory operations are fast. Second, if the memory is unaligned, the processor raises an exception and the application crashes (Bang!!!). One solution is to align the input data, which both gains performance and eliminates the exceptions and crashes. This may not always work, since the end user of the application may not align the data, or enforcing such a rule may be inappropriate at times. The easy alternative is to use the safer unaligned loads and stores, sacrificing performance irrespective of the data alignment.
If you are a programmer looking for the best possible performance, saving every single processing cycle, then the solution is to handle both cases: check the alignment of the data at runtime and call the appropriate function that handles either aligned or unaligned data.
The code to handle aligned and unaligned data is as follows:
if ( isAligned(data) ) {
    process_aligned(data);
} else {
    process_unaligned(data);
}

// The 16 byte alignment check is as follows. Note that the pointer must be
// cast to an integer type before applying the modulo operator:
bool isAligned(const void* data) {
    return (((uintptr_t)data % 16) == 0);
}
Typically, the process_aligned and process_unaligned routines have identical code except for the type of load and store instructions.
Architectural enhancements in AMD Family 10h processors (“Barcelona” processors)
“Barcelona” comes with load instructions that are twice as fast as those of the previous generation of processors. For example, aligned loads take 2 processor cycles on “Barcelona,” compared to 4 processor cycles on the AMD-K8 architecture. This is only the latency of the instruction execution; there can be additional latency depending on whether the actual data resides in cache or in main memory.
Unaligned loads on “Barcelona” run at the speed of aligned loads if the data is aligned. Thus, it is safer to use unaligned loads whenever the alignment of the data is not guaranteed, eliminating the check for 16-byte alignment at runtime. If the data is unaligned, the instruction is slightly slower than an aligned load, but still faster than unaligned loads on AMD-K8 processors. The FPU in “Barcelona” has also been widened from 64 bits to 128 bits, and the load instructions are fast-path instructions. (Note: On AMD-K8 processors, SSE loads are vector-path instructions, which block the execution units from executing any other instruction in parallel.)
The above optimizations do not apply to SSE stores: unaligned stores remain slower than aligned stores even when the data is aligned.
–Ravindra Babu
Software Engineer, AMD
This post is the opinion of the author and may not represent AMD’s positions, strategies or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.