Removing characters from strings faster with AVX-512

In software, it is a common problem to want to remove specific characters from a string. To make the problem precise, let us consider the removal of all ASCII control characters and spaces. In practice, it means the removal of all byte values less than or equal to 32.
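
As a point of reference, the problem statement can be written as a plain byte-by-byte loop. This is only a minimal baseline sketch, not the fast approach discussed in this post; the function name despace_scalar is a hypothetical choice.

```c
#include <stddef.h>

/* Removes every byte with value <= 32 (ASCII control characters and the
   space) from bytes[0..len), in place, and returns the new length.
   The name despace_scalar is hypothetical, chosen for this sketch. */
size_t despace_scalar(unsigned char *bytes, size_t len) {
  size_t pos = 0; /* next write position; never ahead of the read position */
  for (size_t i = 0; i < len; i++) {
    unsigned char c = bytes[i];
    if (c > 32) {
      bytes[pos++] = c; /* keep the byte, compacting toward the front */
    }
  }
  return pos;
}
```

The bytes are treated as unsigned, so values of 128 and above (for example, UTF-8 continuation bytes) are kept.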

I covered a related problem before, the removal of all spaces from strings. At the time, I concluded that the fastest approach might be to use SIMD instructions coupled with a large lookup table. A SIMD instruction can operate on many words at once: most commodity processors have instructions able to operate on 16 bytes at a time. Thus, using a single instruction, you can compare 16 consecutive bytes and identify the location of all spaces, for example. Once that is done, you must somehow remove the unwanted characters. Most instruction sets do not have instructions for that purpose; however, x64 processors have an instruction (pshufb) that can move bytes around as long as you have a precomputed shuffle mask. ARM NEON has similar instructions as well. Thus you proceed in the following manner:

Identify all unwanted characters in a block (e.g., 16 bytes).
Look up a shuffle mask in a large table.
Move the bytes using the shuffle mask so that the unwanted ones are dropped.
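
The three steps above can be modeled in portable C to make the data flow concrete. This is a scalar sketch under stated assumptions, not SIMD code: the loops stand in for the comparison and shuffle instructions, the shuffle mask is computed on the fly rather than read from a precomputed table, and all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 16 };

/* Step 1: bit k of the result is set when block[k] is unwanted (<= 32).
   With SIMD, a comparison plus a movemask-style instruction does this. */
static uint16_t unwanted_mask(const unsigned char *block) {
  uint16_t mask = 0;
  for (int k = 0; k < BLOCK; k++) {
    if (block[k] <= 32) mask |= (uint16_t)(1u << k);
  }
  return mask;
}

/* Step 2: builds what a table entry at index `mask` would contain: the
   indexes of the kept bytes, in order. Returns the number of kept bytes. */
static int make_shuffle(uint16_t mask, unsigned char shuffle[BLOCK]) {
  int kept = 0;
  for (int k = 0; k < BLOCK; k++) {
    if (!((mask >> k) & 1)) shuffle[kept++] = (unsigned char)k;
  }
  for (int k = kept; k < BLOCK; k++) shuffle[k] = 0; /* padding, ignored */
  return kept;
}

/* Step 3: out[k] = in[shuffle[k]], which is what a byte shuffle such as
   pshufb computes for indexes in the range 0..15. */
static void apply_shuffle(const unsigned char *in,
                          const unsigned char *shuffle, unsigned char *out) {
  for (int k = 0; k < BLOCK; k++) out[k] = in[shuffle[k]];
}

/* Compacts one 16-byte block into out (which needs 16 bytes of room) and
   returns how many leading bytes of out are meaningful. */
int prune_block16(const unsigned char *in, unsigned char *out) {
  unsigned char shuffle[BLOCK];
  int kept = make_shuffle(unwanted_mask(in), shuffle);
  apply_shuffle(in, shuffle, out);
  return kept;
}
```

In a table-driven version, make_shuffle would run once per possible mask value ahead of time, and step 2 would reduce to a single load indexed by the mask.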

Such an approach is fast but it may require large tables. Indeed, if you load 16 bytes, you need a table with 65536 shuffle masks, one for each possible pattern of unwanted bytes. At 16 bytes per mask, that is 1 MiB. Storing such large tables is not very practical.

Recent Intel processors have handy new instructions that do exactly what we want: they prune out unwanted bytes (vpcompressb). This requires a recent processor with AVX-512 VBMI2 such as Ice Lake, Rocket Lake, Alder Lake, or Tiger Lake processors. Intel makes it difficult to figure out which features are available on which processor, so you need to do some research to find out if your favorite Intel processor supports the desired instructions. At the time of writing, AMD processors do not support VBMI2.
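
Since it is hard to tell from documentation alone, one option is to ask the processor at runtime. The sketch below assumes a recent GCC or Clang targeting x86, where the builtins __builtin_cpu_init and __builtin_cpu_supports are available and accept the feature name "avx512vbmi2"; on any other compiler or target it simply reports no support. The function name has_avx512_vbmi2 is hypothetical.

```c
/* Returns 1 when the running CPU reports AVX-512 VBMI2, and 0 when it does
   not or when this compiler/target offers no way to ask (treated as "no").
   Assumption: GCC/Clang builtins on x86; other toolchains hit the fallback. */
int has_avx512_vbmi2(void) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init(); /* harmless if the CPU model was already detected */
  return __builtin_cpu_supports("avx512vbmi2") ? 1 : 0;
#else
  return 0;
#endif
}
```

A program could call this once at startup and choose between a vpcompressb code path and a fallback accordingly.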

On top of the new instructions, AVX-512 also allows you to process the data in larger blocks (64 bytes). Using Intel intrinsic functions, the code is almost readable. I create a register containing only the space byte, and I then iterate over my data, each time loading 64 bytes of data. I compare it with the space: I only want to keep values that are larger (in byte values) than the space. I then call the compress instruction, which takes out the unwanted bytes. I read at regular intervals (every 64 bytes) but I write a variable number of bytes, so I advance the write pointer by the number of set bits in my mask: I count those using a fast instruction (popcnt).

__m512i spaces=_mm512_set1_epi8(' ');
size_t i=0;
for (; i + 63
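
The listing above is cut off in this copy. What follows is a sketch of the loop as the preceding paragraph describes it, not the author's exact code. The names remove_le32, howmany and pos are hypothetical, and the AVX-512 path is only compiled when the build enables AVX-512 F, BW and VBMI2 on x86-64 (for example with -mavx512bw -mavx512vbmi2 on GCC or Clang); otherwise the scalar loop handles everything. The sketch uses an unsigned comparison so that byte values of 128 and above are kept, which is an assumption about the intended behavior.

```c
#include <stddef.h>

#if defined(__x86_64__) && defined(__AVX512F__) && defined(__AVX512BW__) && \
    defined(__AVX512VBMI2__)
#include <immintrin.h>
#define HAVE_VBMI2_BUILD 1
#else
#define HAVE_VBMI2_BUILD 0
#endif

/* Removes all bytes <= 32 from bytes[0..howmany), in place.
   Returns the number of bytes kept. Hypothetical name and signature. */
size_t remove_le32(unsigned char *bytes, size_t howmany) {
  size_t i = 0;   /* read position, advances 64 bytes per SIMD iteration */
  size_t pos = 0; /* write position, advances by the number of kept bytes */
#if HAVE_VBMI2_BUILD
  const __m512i spaces = _mm512_set1_epi8(' ');
  for (; i + 63 < howmany; i += 64) {
    __m512i x = _mm512_loadu_si512(bytes + i);
    /* One mask bit per byte, set when the byte is greater than the space.
       The unsigned form is used so that bytes >= 128 are kept. */
    __mmask64 keep = _mm512_cmpgt_epu8_mask(x, spaces);
    /* vpcompressb: store only the selected bytes, packed contiguously. */
    _mm512_mask_compressstoreu_epi8(bytes + pos, keep, x);
    pos += (size_t)_mm_popcnt_u64(keep);
  }
#endif
  /* Scalar loop for the tail, or for the whole input without AVX-512. */
  for (; i < howmany; i++) {
    if (bytes[i] > 32) bytes[pos++] = bytes[i];
  }
  return pos;
}
```

Writing in place is safe here because the write position never passes the read position, and each 64-byte block is loaded before anything is stored over it.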

Written by Charlie
