“Barcelona” Processor Feature: Advanced Bit Manipulation (ABM)

One of the new instruction sets introduced in the Third Generation AMD Opteron™ processor is Advanced Bit Manipulation (ABM), comprising two instructions that operate on general purpose registers: LZCNT and POPCNT. We’ll first explore what POPCNT can do for you.

In almost every interview I have given to date, I have been asked the question, “How would you calculate the number of bits set in a given 32-bit word?” Of course, by the thirtieth time I was asked that question, I had finally figured out an answer to give, though not a very efficient one. If you’ve tried to calculate this number yourself, or have tried to answer this question for others, I hope this discussion will be helpful because there are many ways to do it in software. One way is to use lookup tables, but these access memory and need multiple lookups per word (unless you have a 4 GB table covering all 32 bits!). Alternatively, you can use another common algorithm: subtract one from the number, then AND the result with the original number. Each such step clears the lowest set bit. Repeat until the number is 0; the number of iterations it takes is the number of bits set. A typical pop count function using this method would look like this:

unsigned int popcount(unsigned int x)
{
	unsigned int count;

	/* x & (x - 1) clears the lowest set bit, so the loop runs once per set bit */
	for (count = 0; x; x &= x - 1, count++)
		;
	return count;
}

This function is generic and can be adapted to other integer types. If your integer size is fixed, there are a few more techniques floating around (easily Googled), but none of them is as efficient as a single instruction.
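
For comparison, here is a minimal sketch of the lookup-table approach mentioned earlier, assuming a 256-entry table indexed by byte so that a 32-bit word needs four lookups. The names `bits_in_byte`, `init_bits_in_byte`, and `popcount_table` are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical table: bits_in_byte[i] holds the number of set bits in i. */
static uint8_t bits_in_byte[256];

/* Fill the table once, before the first lookup. Entry 0 is already 0. */
static void init_bits_in_byte(void)
{
	for (int i = 1; i < 256; i++)
		bits_in_byte[i] = (uint8_t)((i & 1) + bits_in_byte[i / 2]);
}

/* Count the set bits in a 32-bit word with four byte-sized lookups. */
static unsigned popcount_table(uint32_t x)
{
	return bits_in_byte[x & 0xff] +
	       bits_in_byte[(x >> 8) & 0xff] +
	       bits_in_byte[(x >> 16) & 0xff] +
	       bits_in_byte[x >> 24];
}
```

Call `init_bits_in_byte()` once at startup. Unlike the loop above, this version is tied to a 32-bit word, and every count costs four memory reads.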

Before I describe POPCNT (“pop count” or population count), the first of two advanced bit manipulation instructions that are provided in the new AMD Family 10h processors, you might have the exact same question that I had the first several times I was asked this in an interview:

Why on earth would anyone need this?

As it turns out, counting the number of bits set in a word (a machine word, that is), can be quite useful. I started realizing this when I moved to using bit arrays for computations.

Let me give you a quick scenario. I have implemented an array which stores the results of a network transmit operation. Each element represents a true or a false, depending on whether that particular block transmitted correctly or not. I need to use this data to calculate how much packet loss I have experienced.

Let’s say that block numbers 7, 32, and 62 were not transmitted. The values at array index 7, 32, and 62 would be set to 1 and the rest to 0. If I am transmitting megabytes of information, this array could grow very large and it would be using a minimum of 8 bits of storage for each 1 or 0 it needs to store (if I am using the smallest data type provided to me) unless I use a bit array.

If I use a bit array, my array becomes much smaller, which means that I need to do fewer memory accesses to traverse the entire array, less memory is being used, etc. The only problem is with accessing an element in this array. To see if a bit is set in the bit array, I need to read one chunk of the array into a word and then shift bit by bit to see if anything has failed.
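
A bit array like this could be sketched in C as follows, assuming 32-bit words with block 0 stored in the lowest bit of the first word. The helper names `bitarray_set` and `bitarray_test` are hypothetical.

```c
#include <stdint.h>

#define BITS_PER_WORD 32u

/* Mark block `index` as failed by setting its bit. */
static void bitarray_set(uint32_t *words, unsigned index)
{
	words[index / BITS_PER_WORD] |= UINT32_C(1) << (index % BITS_PER_WORD);
}

/* Return 1 if block `index` is marked as failed, 0 otherwise. */
static int bitarray_test(const uint32_t *words, unsigned index)
{
	return (int)((words[index / BITS_PER_WORD] >> (index % BITS_PER_WORD)) & 1u);
}
```

With this layout, 1,000 blocks fit in 32 words (128 bytes) instead of 1,000 bytes.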

Enter, pop count! Pop count would simply tell me how many bits were set in the word I’ve just accessed, with just one instruction! Let’s take a look at the gain I realize by using POPCNT.

For 1 MB of data with a 1 KB block size, I have roughly 1,000 elements. Therefore, the number of instructions taken by each approach would have been:

Original [byte array based]:
Execution: For each element, I need to read the byte value and check if it is 1. If it is, I need to increment my counter.
Cost: 3 instructions [read, compare, increment] x 1000 = 3,000
Results: 3k instructions. Not a very good idea.

Bit array [without pop count]:
Execution: For every 32 elements, I need to read one word, shift the bits out, check if the left-most bit is 1 or not (check the sign of the resultant number), and then increment my counter if the bit is 1.
Cost: (1 read + 32 shifts + 32 compares + up to 32 increments) x (1000/32) ≈ 3,032
Results: The instruction count is about the same as for the byte array, but there are far fewer reads (roughly 32 instead of 1,000), so this approach would still be a lot faster because of the reduced memory accesses.

Bit array [with pop count]:
Execution: For every 32 elements, I need to read one word, do one pop count, and increment my counter by the return value from the pop count.
Cost: (1 read + 1 POPCNT + 1 add to the counter) x (1000/32) ≈ 94
Results: Using the POPCNT instruction here gives me a whopping 32x reduction in instruction count, representing a significant performance gain! This is with 32-bit words. With 64-bit words, the gain is even greater.
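
The three approaches might look like this in C. This is only a sketch: the function names are made up, and `__builtin_popcount` is a GCC/Clang builtin (an assumption about your compiler) that turns into a single POPCNT only when the compiler is told the target supports it, for example with `-mpopcnt`. Otherwise it falls back to a software routine.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte array: one byte per block, 1 means the block failed. */
static unsigned count_bytes(const uint8_t *flags, size_t n)
{
	unsigned count = 0;
	for (size_t i = 0; i < n; i++)
		if (flags[i] == 1)
			count++;
	return count;
}

/* Bit array without pop count: test the top bit, then shift it out. */
static unsigned count_bits_shift(const uint32_t *words, size_t nwords)
{
	unsigned count = 0;
	for (size_t i = 0; i < nwords; i++) {
		uint32_t w = words[i];
		for (int bit = 0; bit < 32; bit++) {
			if (w & UINT32_C(0x80000000))
				count++;
			w <<= 1;
		}
	}
	return count;
}

/* Bit array with pop count: one count and one add per word. */
static unsigned count_bits_popcnt(const uint32_t *words, size_t nwords)
{
	unsigned count = 0;
	for (size_t i = 0; i < nwords; i++)
		count += (unsigned)__builtin_popcount(words[i]);
	return count;
}
```

All three return the same total. Only the amount of work per block differs.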

NOTE: There are other algorithms that could result in fewer instructions without using pop count, but we have stuck with simple approaches such as the x and x-1 loop because they are easily portable. Other algorithms that could perform this function faster often require a fixed number of bits, and hence are not suitable for all purposes. Even so, pop count is faster than the best approach without pop count.

In addition to this specific scenario, there are several applications where pop count can substantially increase performance. Pop counts are used in cryptography (in fact, this instruction is also commonly called the ‘canonical NSA instruction’ because of the fact that the NSA refused to buy processors which didn’t support this instruction), encoding/decoding, databases (for quickly assessing information about data), and many others. One application that I find POPCNT most useful for is to quickly calculate Hamming distances. A Hamming distance is essentially a measure of how different one word is from another. Remember, this is not how different the values held by the words are (we could just use a subtract instruction to find that out!) but how the words themselves differ. For machine words, it is defined as the number of bits that are different between the two words.

For example, take the following 8-bit words:

00110001
11010001
   ^^^^^

The lower 5 bits, denoted by the carets, are the same; only the upper three bits are different. Therefore, the Hamming distance between these two words is 3.

A POPCNT instruction gives us the Hamming weight of a word, which is its Hamming distance from the all-zero word of the same length. Because a word differs from the all-zero word in exactly the bits that are set, the Hamming weight is the number of set bits, and that is exactly what POPCNT gives us!

Of course, this doesn’t give us the Hamming distance directly, but that’s easily fixed. All we have to do is zero out the common bits between the two words and the result holds only the bits that are different. Our friendly neighborhood XOR instruction can do that for us, leaving us with the following sequence of instructions for calculating the Hamming distance between two words:

; RAX and RBX contain the two words

mov rcx, rax
xor rcx, rbx
popcnt rcx, rcx

; RCX now contains the Hamming distance between RAX and RBX
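
The same XOR-then-count idea can be sketched in C. The function name `hamming_distance` is made up, and `__builtin_popcountll` is a GCC/Clang builtin, so this assumes one of those compilers. Whether it compiles to a single POPCNT depends on the target flags.

```c
#include <stdint.h>

/* XOR keeps only the bits that differ; pop count tallies them. */
static unsigned hamming_distance(uint64_t a, uint64_t b)
{
	return (unsigned)__builtin_popcountll(a ^ b);
}
```

For the 8-bit example above, `hamming_distance(0x31, 0xD1)` XORs the words to 0xE0 (11100000) and returns 3.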

Hamming distances can be used to measure how much error exists in a piece of data. They can be used as thresholds in encoding or decoding of audio/video. In fact, any place where you need any sort of fuzzy logic, Hamming distances could be useful. There are many other potential applications, far too many to cover here.

This covers the first of the two new advanced bit manipulation instructions introduced with the new Family 10h architecture. That leaves us with another interesting instruction, LZCNT, which counts the number of leading zeros in a given word. But I’ll leave that for next time.

–Rahul Chaturvedi

This post is the opinion of the author and may not represent AMD’s positions, strategies or opinions. Links to third party sites and references to third party trademarks are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.
