We all crave high-performance code, and in the process we work hard to optimize algorithms, reorder instructions, unroll loops, avoid branches, reduce pointer usage so compilers can optimize more aggressively, replace dynamic allocation with static allocation where the size is known, and so on. One such optimization concerns data loads and stores from memory, which consume a majority of processing cycles in data-intensive applications. Here, I'll take you through one such optimization with respect to data alignment when using SSE (Streaming SIMD Extensions) instructions.
Why use SSE instructions?
SSE instructions operate on 16 bytes of data in parallel. We can load 16 bytes of data at a time and compute those 16 bytes of packed data using a single SSE instruction.
Example: ADDPS xmm1, xmm2 – Adds the 4 single-precision floating-point elements packed in the xmm1 register to the corresponding elements packed in xmm2 and stores the result back in xmm1.
SSE instructions are widely used in developing computation-intensive multimedia applications. Typically, these applications process large amounts of sequential data through the following steps:
- Load data from memory
- Perform computation on the data
- Store data back to memory
First we will discuss the intricacies involved in optimizing memory operations using SSE instructions on the AMD-K8™ family of processors (first and second generation AMD Opteron™ processors) and then we will discuss the architectural enhancements provided by the “Barcelona” or Family 10h processors (including Quad-Core AMD Opteron and AMD Phenom™ X4 Quad-Core and X3 Triple-Core processors).
SSE provides two types of load and store instructions. The first type is aligned loads and stores (e.g., MOVDQA, MOVAPD, MOVAPS), which require 16-byte-aligned memory addresses. The second type is unaligned loads and stores (e.g., MOVDQU, MOVUPD, MOVUPS), which operate on both aligned and unaligned memory addresses. On the AMD-K8 family of processors, the aligned versions of the load and store operations are faster than the unaligned versions even if the memory is 16-byte aligned. For details on the latencies of the various types of load and store instructions, refer to the AMD Software Optimization Guide for AMD Family 10h Processors.
If we use the aligned version of the memory operations without verifying memory address alignment, there are two possible outcomes. First, if the memory is aligned, the memory operations are fast. Second, if the memory is unaligned, the processor raises an exception and the application crashes (Bang!!!). One solution is to align the input data, which both gains performance and eliminates the exceptions and crashes. This may not always work, since the end user of the application may not align the data, or enforcing such a rule may be inappropriate at times. The easy alternative is to use the safer unaligned loads and stores, sacrificing performance irrespective of the data alignment.
If you are a programmer looking for the best possible performance, saving every single processing cycle, then the solution is to handle both cases: check the alignment of the data at runtime and call the appropriate function that handles either aligned or unaligned data.
The code to handle aligned and unaligned data is as follows:
if ( isAligned(data) ) {
    process_aligned(data);
} else {
    process_unaligned(data);
}

// The 16 byte alignment check is as follows. Note that the pointer must be
// cast to an integer type before applying the modulo operator:
bool isAligned(const void* data) {
    return (((uintptr_t)data % 16) == 0);
}
Typically, the process_aligned and process_unaligned routines have identical code except for the type of load and store instructions.
Architectural enhancements in AMD Family 10h processors (“Barcelona” processors)
“Barcelona” comes with load instructions that are twice as fast as those of the previous generation of processors. For example, aligned loads take 2 processor cycles on “Barcelona,” compared to 4 processor cycles on the AMD-K8 architecture. This is only the latency of the instruction execution; there can be additional latency depending on whether the actual data resides in cache or in main memory.
Unaligned loads on “Barcelona” run at the speed of aligned loads if the data is aligned. Thus, it is safer to use unaligned loads whenever the alignment of the data is not guaranteed, eliminating the check for 16-byte alignment at runtime. If the data is unaligned, the instruction is slightly slower than an aligned load, but still faster than unaligned loads on AMD-K8 processors. The FPU in “Barcelona” has also been widened from 64 bits to 128 bits, and the load instructions are fast-path instructions. (Note: On AMD-K8 processors, SSE loads are vector-path instructions, which block the execution units from executing any other instruction in parallel.)
The above optimizations do not apply to SSE stores: unaligned stores remain slower than aligned stores even when the data is aligned.
–Ravindra Babu
Software Engineer, AMD
This post is the opinion of the author and may not represent AMD’s positions, strategies or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.