String processing is quite common in software. This can be simple operations like searching for characters in a string, more complex subset and sub string searches to complex transformations. Most string processing code processes strings byte by byte (or 16bit word by word for 16bit unicode). That is fine for code that is not time critical.
A modern CPU can process 32bit-256bit at a time inside a core, if it does only 8 bit in each operation then large parts of the computing capacity are unused. If the string code processes a lot of data and slows down the program it’s better to find some way to use this unused computing capability. I’m only talking about a single core here (thread level parallelism), not multi-core.
For simple standard algorithms like strlen there are algorithms available that allow to process the string in 32bit or 64bit units using various bit manipulation tricks. (Hacker’s delight has all the details) Typically C libraries already use optimized algorithms (although the compiler may not always use the standard library if you just call strlen). But they are usually too difficult to use for custom string processing.
Also they usually need special cases to handle the end of the string (the length of a string is often not known in advance) and are complicated to validate and write. Also typically is only a win for relatively simple algorithms, and come at a heavy cost of programer time.
On modern x86 CPUs an alternative is to use the SSE 4.2 string instructions PCMP*STR*. They are essentially programmable comparison engines to compare two 128bit vectors at a time. They can handle string end detection and various other cases directly. They are very powerful, but since they are so configurable also hard to use (2^9 possible combinations). The hardware can do all these operations in parallel, so it’s more efficient than a explicit loop.
These instructions are much easier to use than the old bit twiddling tricks, but they are still hard.
They operate on two 128 bit vector inputs and compare all the 8bit (or 16bit) pairs based on a 8bit configuration flag word. The input can be 0 terminated strings or explicit lengths for both vectors. The output is either a vector mask or a index, plus some conditional codes.
The instructions can be either used from assembler, or using C intrinsics.
To make this all easier to use I did the PCMPSTR calculator that allows to interactively configure and explore these instructions. It generates snippets that can be used in programs.
Feedback welcome and let me know if you find a new interesting use with this.
Some tips for tuning
- Always profile first to see if something is time critical. Don’t bother with code that does not show up in the profiler
- Understand what’s common and uncommon in typical input data
- Always have a reference version of the algorithm that does things in a straight forward matter and is optimized for clarity, not speed. First write the program using the reference code and profile before tuning. Have a debug build that always compares the output of reference and optimized.
- Make sure the reference version stays up-to-date when the program changes.