“Barcelona” Processor Feature: 128-bit FPU

The new AMD “Barcelona” processors deliver dramatically better numerical performance with the standard SSE, SSE2 and SSE3 instruction extensions. Previous AMD processors could typically execute a vectorized SSE instruction (for example, MULPS, which performs four single-precision multiplies) only once every two clock cycles. The AMD “Barcelona” processor doubles this rate, so a new vectorized SSE instruction such as MULPS can typically be issued every cycle. This feature is called SSE128 because an entire 128-bit SSE register is processed on each clock tick. A detailed discussion of SSE128 can be found in the article “SSE128: AMD’s New Floating-Point Enhancements.”
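To make the "four multiplies per instruction" point concrete, here is a minimal sketch in C using the standard SSE intrinsics from `<xmmintrin.h>`. The helper name `mul4_ps` is hypothetical and chosen only for illustration, and the code assumes an x86 target with SSE enabled.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, ... */

/* Multiply four packed single-precision floats: out[i] = a[i] * b[i].
   mul4_ps is a hypothetical helper name used only for this sketch. */
void mul4_ps(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);     /* load 4 floats, no alignment required */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vp = _mm_mul_ps(va, vb);  /* one packed multiply; typically emitted as MULPS */
    _mm_storeu_ps(out, vp);          /* store all 4 products */
}
```

Calling `mul4_ps` with `{1, 2, 3, 4}` and `{5, 6, 7, 8}` fills `out` with the four element-wise products. The whole 128-bit register is handled by a single packed multiply, which is the operation SSE128 can issue every cycle.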

Furthermore, with separate pipelines for add-class and multiply-class instructions, each core of the new processor has a theoretical peak throughput of 8 single-precision floating-point operations per clock cycle: four additions and four multiplications. Integer SSE instructions get a similar boost. For complete timing details on all instructions, see Appendix C of the Software Optimization Guide for AMD Family 10h Processors.
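The arithmetic behind the peak figure can be sketched in a few lines of C. This is only a back-of-the-envelope model: the function names are hypothetical, and the clock speed in the usage note is an assumed value for illustration, not a figure from this article.

```c
/* Number of elements that fit in one SIMD register,
   e.g. 128 bits / 32 bits = 4 single-precision floats. */
int lanes_per_register(int register_bits, int element_bits)
{
    return register_bits / element_bits;
}

/* Theoretical peak floating-point operations per cycle, assuming every
   pipeline issues one full-width packed instruction on every cycle. */
int peak_flops_per_cycle(int register_bits, int element_bits, int pipelines)
{
    return lanes_per_register(register_bits, element_bits) * pipelines;
}

/* Theoretical peak in GFLOPS = operations per cycle * clock (GHz) * cores. */
double peak_gflops(int flops_per_cycle, double clock_ghz, int cores)
{
    return flops_per_cycle * clock_ghz * cores;
}
```

With 128-bit registers, 32-bit elements and two pipelines (add and multiply), `peak_flops_per_cycle(128, 32, 2)` gives the 8 operations per cycle quoted above; the same model with 64-bit elements gives 4 for double precision. As an illustration only, `peak_gflops(8, 2.0, 4)` models a quad-core part at an assumed 2.0 GHz. Real code reaches this peak only if it keeps both pipelines fed with independent adds and multiplies.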

The easiest way to realize the benefits of SSE128 in real applications is to use existing library code that has already been optimized with vectorized SSE instructions. The AMD Performance Library (APL) is one such library: a collection of popular software routines designed to accelerate application development, debugging, and optimization on x86-class processors. In addition, the new release of the AMD Core Math Library (ACML), version 4.0, features new kernels tuned for the new processors. Specifically, DGEMM, SGEMM and CFFT have all been optimized to take advantage of the improved floating-point performance. ACML 4.0 also upgrades its LAPACK routines to the new LAPACK 3.1, and many of these routines use OpenMP to take advantage of AMD’s new quad-core processors. ACML will continue to improve, with more optimized functions in future releases.

Intermediate and advanced programmers can write custom vectorized SSE code to improve performance. With Microsoft’s compiler intrinsic functions for SSE, developers can write a single version of their SSE code that compiles natively for both 32-bit and 64-bit platforms, which is not possible with pure assembly code. For an easy-to-follow tutorial with demo code, see the article “Performance Optimization of 64-bit Windows Applications for AMD Athlon™ 64 and AMD Opteron™ Processors using Microsoft Visual Studio 2005.”
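As a rough sketch of what such intrinsic-based code can look like (this is not the demo code from the tutorial mentioned above), the following C function computes y[i] = a * x[i] + y[i] four floats at a time. The name `saxpy_sse` is hypothetical. The same source is intended to build for 32-bit and 64-bit x86 targets with any compiler that provides `<xmmintrin.h>`; some compilers need SSE enabled explicitly for 32-bit builds.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* y[i] = a * x[i] + y[i] for i in [0, n).
   saxpy_sse is a hypothetical name used only for this sketch. */
void saxpy_sse(size_t n, float a, const float *x, float *y)
{
    const __m128 va = _mm_set1_ps(a);  /* broadcast a into all four lanes */
    size_t i = 0;

    /* Main loop: process four floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);          /* unaligned loads */
        __m128 vy = _mm_loadu_ps(y + i);
        vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);  /* packed multiply, then packed add */
        _mm_storeu_ps(y + i, vy);
    }

    /* Scalar tail for the remaining n % 4 elements. */
    for (; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The loop mixes a multiply-class and an add-class instruction, so it exercises both floating-point pipelines. Unaligned loads and stores are used here for simplicity; if the arrays are known to be 16-byte aligned, `_mm_load_ps` and `_mm_store_ps` could be used instead.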


This post is the opinion of the author and may not represent AMD’s positions, strategies or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

