
Answer by Peter Cordes for AVX512BW: handle 64-bit mask in 32-bit code with bsf / tzcnt?

First of all, if your program depends much on strlen performance for large buffers, you're probably doing it wrong. Use explicit-length strings (pointer + length) like std::string so you don't have to scan the data to find the end.

Still, some APIs use implicit-length strings so you can't always avoid it. Being fast for short to medium buffers is usually important. A version that's allowed to over-read its buffer makes startup much more convenient.


Avoid 32-bit mode in the first place if you can; are you sure it's worth the effort to hand-write 32-bit AVX512 asm?

Also, are you sure you want to use 64-byte vectors at all? On Skylake-Xeon, that limits max turbo (for a long time after the last 512-bit uop) and also shuts down port 1 for vector ALU uops (at least while 512-bit uops are in flight). But if you're already using 512-bit vectors in the rest of your code then go for it, especially if you have a sufficient alignment guarantee. But it seems odd to use AVX512 and then not unroll your loop at all, unless that balance of small code footprint but good large-case handling is what you need.

You might be better off just using AVX2 for strlen even if AVX512BW is available, with some loop unrolling. Or AVX512BW + VL to still compare into mask regs, but with 32-bit masks. Or maybe not; Skylake-X can only run vpcmpeqb k0, ymm, ymm/mem on port 5, and can't micro-fuse a memory operand (note retire_slots: 2.0 in the uops.info results; it decodes to 2 separate uops even with a simple addressing mode). But AVX2 vpcmpeqb ymm, ymm, ymm/mem is 1 uop for p01, and can micro-fuse. So it could load+compare 2x ymm per clock cycle if L1d can keep up, using only 2 fused-domain uops out of the 4/clock front-end bandwidth. (But then checking the result will cost more than a kortest.)
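
A rough, untested sketch of what that AVX2 inner loop could look like (assuming ymm0 is zeroed, the pointer in ecx is 64-byte aligned, and over-reading up to the next 64-byte boundary is OK; unroll more to amortize the check):

; untested sketch: AVX2-only inner loop, 64 bytes (2 YMM loads) per iteration
.avx2_loop:
   vpcmpeqb  ymm1, ymm0, [ecx]        ; micro-fuses (non-indexed addressing): 1 fused-domain uop for p01
   vpcmpeqb  ymm2, ymm0, [ecx+32]
   vpor      ymm3, ymm1, ymm2         ; combine so one check covers both halves
   add       ecx, 64
   vpmovmskb eax, ymm3
   test      eax, eax                 ; this check is more uops than a single kortest
   jz        .avx2_loop
   ; then vpmovmskb ymm1 / ymm2 separately to sort out which half had the zero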

AVX512 integer compare takes the comparison predicate as an immediate (not part of the opcode like SSE/AVX pcmpeq/pcmpgt), so that might be what's stopping it from micro-fusing a load. But no, vptestmb k1,zmm0,[ebx] can't micro-fuse either, otherwise you could use it or vptestnmb with an all-ones vector to check for zeros in memory.

(Note that micro-fusion only works on Intel Skylake CPUs with non-indexed addressing modes. Like vpcmpeqb ymm1, ymm0, [ebx], not [ebx+eax]. See Micro fusion and addressing modes. So use a pointer-increment and subtract at the end.)


If you want to optimize for large strings, you can check two cache lines at once. Align your pointer by 128 bytes (i.e. checking normally up to a 128-byte boundary). kortestq k0,k1 Just Works for no extra cost after comparing into 2 separate mask registers.
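
For example (untested; assumes zmm0 is zeroed and ecx was already brought up to a 128-byte boundary):

; untested sketch: 2 cache lines (128 bytes) per iteration with AVX512BW
.loop128:
   vpcmpeqb  k0, zmm0, [ecx]          ; first cache line
   vpcmpeqb  k1, zmm0, [ecx+64]       ; second cache line
   add       ecx, 128
   kortestq  k0, k1                   ; ZF=1 only if neither cache line had a zero byte
   jz        .loop128
   ; fall through with both bitmaps still in k0 / k1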

You might want to have a look at how glibc's AVX2 strlen works: https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/strlen-avx2.S.html. Its main loop (after short-string startup) uses vpminub (min of unsigned bytes) to combine 4 YMM vectors (128 bytes = 2 cache lines) down to one and checks that for a zero. After breaking out of the loop, it sorts out where the first zero actually was. (It still has the vectors in registers because it used separate vmovdqa loads; using memory source operands for vpminub instead would let the main loop micro-fuse the loads and be more HT-friendly, but would require reloading the data after breaking out.)
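
The core trick looks roughly like this (a simplified, untested sketch of the idea, not glibc's actual code; assumes ymm0 is zeroed and ecx is 32-byte aligned):

.min_loop:
   vmovdqa   ymm1, [ecx]
   vmovdqa   ymm2, [ecx+32]
   vmovdqa   ymm3, [ecx+64]
   vmovdqa   ymm4, [ecx+96]
   vpminub   ymm5, ymm1, ymm2         ; min of unsigned bytes is 0 iff either byte was 0
   vpminub   ymm6, ymm3, ymm4
   vpminub   ymm5, ymm5, ymm6         ; 0 somewhere iff any of the 128 bytes was 0
   vpcmpeqb  ymm5, ymm5, ymm0
   add       ecx, 128
   vpmovmskb eax, ymm5
   test      eax, eax
   jz        .min_loop
   ; ymm1..ymm4 still hold the data, so the cleanup can re-test them without reloading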

On SKX, vpminub zmm runs on port 0 but can micro-fuse a memory operand, while vpcmpeqb zmm runs on p5 only. Or if data is in registers, use vptestmb k0, zmm0, zmm0 so you don't need a zeroed register to compare against. Combining those could get lots of checking done with very few uops, allowing the out-of-order execution window to "see" very far ahead and maybe help with memory-level parallelism. (Data prefetch across 4k page boundaries isn't perfect.)
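
Something like this untested sketch, which checks 256 bytes per iteration in about 9 fused-domain uops (assumes 64-byte alignment and that over-reading past the terminator is OK):

.big_loop:
   vmovdqa64 zmm1, [ecx]
   vpminub   zmm1, zmm1, [ecx+64]     ; p0, micro-fused load
   vmovdqa64 zmm2, [ecx+128]
   vpminub   zmm2, zmm2, [ecx+192]
   vpminub   zmm1, zmm1, zmm2         ; a 0 byte anywhere in the 256 bytes?
   add       ecx, 256
   vptestmb  k0, zmm1, zmm1           ; bit set for every non-zero byte
   kortestq  k0, k0
   jc        .big_loop                ; CF=1 means k0 was all-ones: no zero byte yet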

But this kind of optimization probably just makes the loop more hyperthreading-friendly without improving its own throughput much, and increases the amount of data to sort through when you do break out of the loop. Especially if you're using memory source operands so the original data isn't still there in vector regs. So if you care about medium-length strings (hundreds or thousands of bytes), not just large multi-megabyte strings, limiting the inner loop to look at only a couple cache lines per check sounds reasonable.


But anyway, in 32-bit code, you could simply re-check the candidate region using 32-byte vectors -> 32-bit bitmaps. Perhaps vextracti64x4 to grab the high half of a ZMM into a YMM for an AVX2 vpcmpeqb / vpmovmskb into an integer register.
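
For example (untested; assumes the address of the 64-byte block is still in a register, ymm0 is zeroed, and reloading the data is acceptable; .found is a made-up label):

; ecx = address of the 64-byte block known to contain a zero byte
   vpcmpeqb   ymm1, ymm0, [ecx]       ; low 32 bytes
   vpmovmskb  eax, ymm1
   test       eax, eax
   jnz        .found                  ; zero byte is in the low half
   vpcmpeqb   ymm1, ymm0, [ecx+32]    ; high 32 bytes (or vextracti64x4 if the data is still in a ZMM)
   vpmovmskb  eax, ymm1
   add        ecx, 32                 ; account for skipping the low half
.found:
   tzcnt      eax, eax                ; the zero byte is at ecx + eax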

But it's small so you'd want to fully unroll and optimize, which is what you're asking about.

Actual answer to the question asked:

kshift + kmov is the obvious way to get the high half of a k register into a 32-bit GP register. Store/reload is extra latency (like maybe 5 or 6 cycles for store-forwarding) but avoids port 5 ALU uops. Or maybe worse, like <= 10 cycles: uops.info's test for that couples store/reload into a loop-carried dep chain by making the store address dependent on the load, so IDK whether it would be faster with the store address ready early.
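
A store/reload version could look like this (untested; burns 8 bytes of stack and trades port-5 uops for store-forwarding latency):

   sub       esp, 8
   kmovq     [esp], k0            ; one 64-bit store of the whole mask (the m64 form works in 32-bit mode, unlike kmovq to/from a GPR)
   mov       edx, [esp]           ; low 32 bits, store-forwarded
   mov       eax, [esp+4]         ; high 32 bits; the offset reload may forward more slowly
   add       esp, 8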

Redoing the compare with a 256-bit vector would also work as an alternative to kshift + kmov for the high half, like AVX2 vpcmpeqb ymm1, ymm0, [ebx+32] / vpmovmskb eax, ymm1. That's 2 fused-domain uops (the load micro-fuses with the AVX2 compare), and has no data dependency on k0 so out-of-order exec can run it in parallel with kmov. But both kmov eax, k0 and vpcmpeqb ymm need port 0 so it might not actually be great. (Assuming the vector ALU on port 1 is still shut down because of running 512-bit uops recently.)

kmov eax, k0 has 3 cycle latency on SKX. kshiftrq has 4 cycle latency, on a different port. So kmov + kshift + kmov could get the high half ready in an integer register 7 cycles after the kmov and kshift start executing (when k0 is ready, or after they issue following a branch mispredict on leaving the loop). The loop-branch typically does mispredict when leaving the loop (definitely for large loop trip counts, but maybe not for repeated use on strings of similar length), so optimizing to avoid the data dependency (e.g. with a separate 256-bit compare) may not be that helpful.

IDK if branchless cleanup is the best bet or not. If the first non-zero byte is in the low half, avoiding a data dependency on extracting the high half is very good. But only if it predicts well!

;; UNTESTED
; input pointer in ecx, e.g. MS Windows fastcall
strlen_simple_aligned64_avx512_32bit:
   vpxor     xmm0, xmm0, xmm0       ; ZMM0 = _mm512_setzero_si512()
   lea       eax, [ecx+64]          ; do this now to shorten the loop-exit critical path
.loop:
   vpcmpeqb  k0, zmm0, [ecx]     ; can't micro-fuse anyway, could use an indexed load I guess
   add       ecx, 64
   kortestq  k0, k0 
   jz    .loop                   ; loop = 5 uops total :(
    ;;; ecx - 64 is the 64-byte block that contains a zero byte

; to branch: kortestd k0,k0 to only look at the low 32 bits, or kmovd / test/jnz to be optimistic that it's in the low half

   kmovd     edx, k0              ; low bitmap
   kshiftrq  k0, k0, 32
    sub       ecx, eax            ; ecx = end_base+64 - (start+64) = end_base
   kmovd     eax, k0              ; high bitmap

   tzcnt     eax, eax             ; high half offset
   bsf       edx, edx             ; low half offset, sets ZF if low==0
   lea       eax, [ecx + eax + 32]  ; high half length = base + (32+high_offset)
       ;; 3-component LEA has 3 cycle latency
       ;; with more registers we could have just an add on the critical path here
   lea       ecx, [ecx + edx]       ; ecx = low half length not touching flags

    ; flags still set from BSF(low)
   cmovnz    eax, ecx             ; return low half if its bitmap was non-zero
   vzeroupper                 ; or use ZMM16 to maybe avoid needing this?
   ret

Note that bsf sets flags based on its input (ZF=1 if the source was zero), while tzcnt sets ZF based on its result (and CF when the input was zero). bsf is a single uop with 3 cycle latency on Intel, same as tzcnt. AMD has slow bsf but doesn't support AVX512 on any current CPUs. I'm assuming Skylake-AVX512 / Cascade Lake here as the uarch to optimize for (and Ice Lake). KNL / KNM have slow bsf, but Xeon Phi doesn't have AVX512BW anyway.
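
In other words, the natural branch to pair with each one is different (untested; assumes edx holds a bitmap and .no_bit_found is a made-up label):

   bsf     eax, edx               ; ZF=1 if edx (the input) was zero; eax is left unmodified in that case
   jz      .no_bit_found

   ; vs.
   tzcnt   eax, edx               ; CF=1 if the input was zero (eax = 32); ZF=1 if bit 0 was set
   jc      .no_bit_found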

Using more instructions could shorten the critical path, e.g. creating base+32 in parallel with the tzcnt / bsf so we could avoid a 3-component LEA between that and cmov. I think I would have had to push/pop a call-preserved register like EBX or EDI to keep all the temporaries.
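
An untested sketch of that tail, at the cost of a push/pop of ebx around the function:

; same loop as above; after falling out, ecx = block+64, eax = start+64, k0 = bitmap
   kmovd     edx, k0              ; low bitmap
   kshiftrq  k0, k0, 32
   sub       ecx, eax             ; ecx = offset of the 64-byte block from the start
   kmovd     eax, k0              ; high bitmap
   lea       ebx, [ecx+32]        ; base+32, computed in parallel with the bit-scans
   tzcnt     eax, eax
   add       eax, ebx             ; high length: 1c add instead of a 3c 3-component LEA
   bsf       edx, edx             ; low offset; ZF=1 if the low bitmap was 0 (last flag-writer)
   lea       ecx, [ecx+edx]       ; low length, without touching flags
   cmovnz    eax, ecx             ; take the low-half result if it found a zero byte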

Simple lea runs on p15 on Skylake, complex lea (3 component) runs on p1. So it doesn't compete with any of the kmov and kshift stuff, and with 512-bit uops in flight port 1 is shut down for SIMD. But tzcnt/bsf runs on port 1 so there is competition there. Still, with LEA dependent on the output of tzcnt, resource conflicts are probably not a problem. And Ice Lake puts LEA units on every port which can handle 3-component LEA in a single cycle (InstLatx64).

If you were using kortest k0, k1 with 2 separate masks, you'd probably want to use kortest k0,k0 to figure out if there was a zero in just the first mask or not, and only then pick apart k0 or k1 with 32-bit GP integer registers.
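
i.e. something like this (untested; made-up label):

   kortestq  k0, k0               ; ZF=1 means no zero byte in the first cache line
   jz        .zero_in_second_line ; pick apart k1 there; otherwise fall through and pick apart k0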


bsf leaves its destination unmodified when its input is all zero. This property is documented by AMD but not Intel. Intel CPUs do implement it. You might want to take advantage of it, especially if you include a unit-test to make sure it works on the CPU you're running on.
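
A startup self-test for that could be as simple as this (untested sketch; check_bsf_keeps_dest is a made-up name):

check_bsf_keeps_dest:             ; returns eax=1 if bsf left its destination unmodified for a zero input
   mov     eax, 0x12345678        ; sentinel in the destination
   xor     edx, edx               ; all-zero source
   bsf     eax, edx               ; documented by AMD; true but undocumented on Intel
   cmp     eax, 0x12345678
   sete    al
   movzx   eax, al
   ret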

But maybe you don't want to rely on it, because it couples the dependency chains together, making the bsf of the low half dependent on the tzcnt+add on the high half. It does look like it saves uops, though. Still, depending on the use-case, latency might not be very important: if you're just calculating a loop bound for some other loop, the result isn't needed right away and there will be later work that's independent of it. OTOH if you're about to loop over the string again, you can often do strlen on the fly instead.

(I also changed from pointer-increment to indexed addressing, in a way that saves 1 more uop because it doesn't micro-fuse anyway. It does introduce an extra add of address latency before the first load.)

;; untested, uses BSF's zero-input behaviour instead of CMOV
;; BAD FOR LATENCY
strlen_aligned64_throughput:
   vpxor     xmm0, xmm0, xmm0       ; ZMM0 = _mm512_setzero_si512()
   mov       edx, -64
.loop:
   add       edx, 64
   vpcmpeqb  k0, zmm0, [ecx+edx]     ; can't micro-fuse anyway on SKX, might as well use an indexed
   kortestq  k0, k0 
   jz    .loop                   ; loop = 5 uops total :(
    ;;; edx is the lowest index of the 64-byte block

   kshiftrq  k1, k0, 32
   kmovd     eax, k1              ; high bitmap
   tzcnt     eax, eax              ; could also be bsf, it's just as fast on Skylake
   add       eax, 32              ; high index = tzcnt(high) + 32

   kmovd     ecx, k0              ; low bitmap
   bsf       eax, ecx             ; index = low if non-zero, else high+32

   add       eax, edx             ; pos = base + offset
   vzeroupper
   ret

Note using kshift into a separate register so we can get the high half first (in program order), avoiding the need to save/restore any extra registers. With only 3 architectural registers (without saving/restoring more), we can let register renaming + OoO exec take care of things.

Critical path latency is not great. From k0 being ready, kmovd can get the low-half bitmap out, but bsf eax, ecx can't begin until eax is ready. That depends on kshift (4) -> kmov (3) -> tzcnt (3), add (1) = 11 cycles, then bsf is another 3 cycles on top of that.

If we did the bsf operations in parallel, best case we could have tzcnt(hi) + add feeding into a CMOV (1 extra cycle) which has 2 integer inputs from the two BSF chains, and flags input from something on the low half. (So the critical path would just come from the high half, the low half doesn't involve kshift and can be ready sooner).

In the previous version of this, I used a 3-component lea on the high-half dep chain which isn't great either.


Related: AVX512CD has SIMD vplzcntq

But you can't use it for tzcnt because we don't have an efficient bit-reverse.

Also, you'd need the 64-bit mask back into a vector element, and then vmovd to an integer reg.

There are instructions for exploding a bitmask into a vector mask (like VPMOVM2B), and there's also VPBROADCASTMW2D xmm1, k1 to just copy a mask to vector elements. Unfortunately the mask-broadcast forms only exist for 8- and 16-bit mask widths (VPBROADCASTMB2Q / VPBROADCASTMW2D), not the 64-bit masks that AVX512BW byte compares produce. So that doesn't solve the problem. In 64-bit mode obviously you could kmovq to an integer reg and vmovq to a vector, but then you'd just use scalar lzcnt or tzcnt.

