PCMPSTR calculator

Generate code snippets for the SSE4.2 vector string instructions.

These instructions are flexible, programmable parallel comparison engines on modern x86 CPUs that can be used for faster string processing and other tasks. Their flexibility makes them difficult to use, so explore them interactively. There is a longer introduction on Andi's blog.
Happy vectorization!

src1 and src2 are 16-byte SSE values. When loaded from memory, they must not cross into an unmapped page. The intrinsics should work with gcc, icc, and msvc.

Unit width:
Result Masking:
String end check:

Result check:

AT&T Asm

Intel Asm

C intrinsics

Pseudo code


Building Block Examples

Thanks to Shihjong Kuo

Find substring (strstr)

Loop through a buffer containing a null-terminated string to find a prefix matching the first 16 bytes of the needle, which is a null-terminated string at least 16 bytes long.

        int cmp_c = 0, cmp, cmp_z = 0;
        __m128i frag2, haystack;  // frag2: first 16 bytes of the needle
        while (!cmp_c) {
           if (cmp_z)
             return NULL;  // EOS and matching prefix not found
           buf += 16;
           haystack = strloadu (buf); // load 16 bytes safely across 4KB page boundary.
           cmp_c = _mm_cmpistrc (frag2, haystack, 0x0c); // indicates a prefix match if not 0
           cmp = _mm_cmpistri (frag2, haystack, 0x0c);   // offset into haystack of the matching prefix
           cmp_z = _mm_cmpistrz (frag2, haystack, 0x0c);  // indicates end of string encountered in haystack if not 0
        }
Details of a strstr() implementation using SSE4.2 can be found in the GLIBC 2.16 source tarball. See sysdeps/x86_64/multiarch/strstr.c.
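The immediate 0x0c used in the snippet above packs several fields into one control byte. As a rough sketch of how to read it, the helper below splits an imm8 into its element format, aggregation, polarity, and output-selection fields. The enum names and decode_imm8() are hypothetical and made up for illustration; the numeric values are meant to mirror the _SIDD_* constants from the compiler's SSE4.2 intrinsics header.

```c
/* Hypothetical names; the values follow the imm8 layout of PCMPxSTRx
   (same numbers as the _SIDD_* constants in the intrinsics header). */
enum {
    FMT_UBYTE = 0x00, FMT_UWORD = 0x01, FMT_SBYTE = 0x02, FMT_SWORD = 0x03,
    AGG_EQUAL_ANY = 0x00, AGG_RANGES = 0x04,
    AGG_EQUAL_EACH = 0x08, AGG_EQUAL_ORDERED = 0x0c,
    POL_POSITIVE = 0x00, POL_NEGATIVE = 0x10,
    POL_MASKED_POSITIVE = 0x20, POL_MASKED_NEGATIVE = 0x30,
    OUT_LEAST_SIGNIFICANT = 0x00, OUT_MOST_SIGNIFICANT = 0x40
};

struct imm8_fields {
    int format;       /* bits 1:0, element size and signedness */
    int aggregation;  /* bits 3:2, comparison mode */
    int polarity;     /* bits 5:4, result negation / masking */
    int output;       /* bit 6, which index (or mask form) is returned */
};

/* Split a control byte into its fields without shifting, so each field
   can be compared directly against the constants above. */
static struct imm8_fields decode_imm8 (unsigned imm8)
{
    struct imm8_fields f;
    f.format      = imm8 & 0x03;
    f.aggregation = imm8 & 0x0c;
    f.polarity    = imm8 & 0x30;
    f.output      = imm8 & 0x40;
    return f;
}
```

Read this way, 0x0c selects unsigned bytes, equal-ordered (substring) comparison, positive polarity, and the least significant index.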

Set search (strspn/strcspn)

Loop through a buffer to find the first character that is not in the set of characters given by "accept", assuming that "accept" contains no more than 16 characters.

   __m128i mask; // the "accept" string representing a list of inclusive characters loaded into an XMM register.
   while (1) {
         __m128i value = _mm_load_si128 ((__m128i *) aligned); // continue from aligned buffer address to load 16 bytes of unprocessed data
         int index = _mm_cmpistri (mask, value, 0x12); // where is the first non-matching character?
         int cflag = _mm_cmpistrc (mask, value, 0x12); // Is there a non-matching character in the current 16-byte chunk?
         if (cflag)   return (size_t) (aligned + index - s);
         aligned += 16;
   }
Details of the strspn(), strcspn(), and strpbrk() implementations using SSE4.2 can be found in the GLIBC 2.16 source tarball. See sysdeps/x86_64/multiarch.
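To make the 0x12 control byte in the loop above less opaque, here is a scalar sketch of what _mm_cmpistri is understood to compute in that mode (byte elements, equal-any aggregation, negative polarity, least significant index). The function names are hypothetical, and this is a reference model for reasoning about the result, not a replacement for the instruction.

```c
/* Length of an implicit-length operand: bytes before the first NUL,
   capped at 16. */
static int implicit_len (const unsigned char v[16])
{
    int n = 0;
    while (n < 16 && v[n] != 0)
        n++;
    return n;
}

/* Scalar model (hypothetical helper) of _mm_cmpistri (set, data, 0x12):
   returns the index of the first byte in data that is not in set,
   or 16 if all 16 bytes are members of set. */
static int model_cmpistri_0x12 (const unsigned char set[16],
                                const unsigned char data[16])
{
    int set_len = implicit_len (set);
    int data_len = implicit_len (data);
    for (int j = 0; j < 16; j++) {
        int match = 0;
        /* Bytes at or past the terminator of data never match; with
           negative polarity they are therefore reported as mismatches. */
        if (j < data_len)
            for (int i = 0; i < set_len; i++)
                if (set[i] == data[j])
                    match = 1;
        if (!match)
            return j;
    }
    return 16;
}
```

Because the terminator itself counts as a non-matching position, the index is also correct when the string ends inside the chunk, which is what lets the loop return aligned + index - s directly.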

Load string vector safely at page boundary

When your code cannot guarantee that the string is followed by at least 15 bytes of padding before the next 4K page boundary, or that the next page is always mapped, you can use this function to safely load vectors for string processing. This adds some overhead to the inner loop, but is typically still faster than byte-by-byte processing. This is useful for libraries that cannot control their input values.

Various examples floating around the interwebs get this wrong and are unsafe.

It is also possible to combine the checks for multiple pointers by ORing them together first (at some risk of false positives). This can lower the overhead of the fast-path check in the loop.
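One way to picture the combined check is the sketch below. It assumes 4 KiB pages and 16-byte loads; the constant and function names are hypothetical. A 16-byte load crosses a page only if its page offset is greater than 4096 - 16, and since the OR of two offsets is at least as large as either one, a single comparison on the ORed value never misses a real crossing but can occasionally report one that does not exist.

```c
#include <stdint.h>

/* Assumptions for this sketch: 4 KiB pages, 16-byte vector loads. */
enum { PAGE_SIZE_ASSUMED = 4096, VEC_SIZE = 16 };

/* Nonzero if a 16-byte load starting at p would touch the next page. */
static int load16_may_cross_page (const void *p)
{
    return ((uintptr_t) p & (PAGE_SIZE_ASSUMED - 1))
           > PAGE_SIZE_ASSUMED - VEC_SIZE;
}

/* Combined check for two pointers: OR the addresses first, test once.
   No false negatives, but false positives are possible. */
static int either_may_cross_page (const void *a, const void *b)
{
    return (((uintptr_t) a | (uintptr_t) b) & (PAGE_SIZE_ASSUMED - 1))
           > PAGE_SIZE_ASSUMED - VEC_SIZE;
}
```

For example, page offsets 0xf00 and 0x0f1 are both far from the end of a page, yet their OR is 0xff1, so the combined check sends the loop down the slow path. That costs time but not correctness.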

/* Load string at S near page boundary safely. */
inline __m128i _strloadu (const char *s)
{
  int offset = ((size_t) s & (16 - 1));
  if (offset) {
    /* The aligned load never crosses a page; check it for a terminator at or after S. */
    __m128i v = _mm_load_si128 ((__m128i *) (s - offset));
    __m128i zero = _mm_setzero_si128 ();
    int bmsk = _mm_movemask_epi8 (_mm_cmpeq_epi8 (v, zero));
    if ( (bmsk >> offset) != 0 ) return __m128i_shift_right (v, offset);
  }
  /* String continues into the next 16-byte block, so the unaligned load is safe. */
  return _mm_loadu_si128 ((__m128i *) s);
}
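The function relies on a helper, __m128i_shift_right, that is not defined here. The sketch below is a scalar model of what that helper is assumed to do: move the vector's bytes down by offset positions and fill the top with zeros, so that the byte at s ends up in element 0. The name model_shift_right and the byte-array interface are hypothetical, chosen only to illustrate the data movement.

```c
/* Scalar model (assumption) of the byte shift used above:
   out[i] = in[i + offset], with zeros shifted in at the top. */
static void model_shift_right (const unsigned char in[16], unsigned offset,
                               unsigned char out[16])
{
    for (unsigned i = 0; i < 16; i++)
        out[i] = (i + offset < 16) ? in[i + offset] : 0;
}
```

The zero fill is harmless in this context: the shifted path is only taken when a terminator was found at or after s within the aligned block, so the implicit-length instructions stop before reaching the filled bytes.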
The Intel Optimization Manual has more examples.

Feedback to pcmpstr at halobates.de.