First of all, if your program depends much on `strlen` performance for large buffers, you're probably doing it wrong. Use explicit-length strings (pointer + length) like `std::string` so you don't have to scan the data to find the end.
Still, some APIs use implicit-length strings, so you can't always avoid it. Being fast for short to medium buffers is usually important, and a version that's allowed to over-read its buffer makes startup much more convenient.
Avoid 32-bit mode in the first place if you can; are you sure it's worth the effort to hand-write 32-bit AVX512 asm?
Also, are you sure you want to use 64-byte vectors at all? On Skylake-Xeon, that limits max turbo (for a long time after the last 512-bit uop) and also shuts down port 1 for vector ALU uops (at least while 512-bit uops are in flight). But if you're already using 512-bit vectors in the rest of your code then go for it, especially if you have a sufficient alignment guarantee. But it seems odd to use AVX512 and then not unroll your loop at all, unless that balance of small code footprint but good large-case handling is what you need.
You might be better off just using AVX2 for `strlen` even if AVX512BW is available, with some loop unrolling. Or AVX512BW + VL to still compare into mask regs, but with 32-bit masks. Or maybe not: Skylake-X can only run `vpcmpeqb k0, ymm, ymm/mem` on port 5, and can't micro-fuse a memory operand (note retire_slots: 2.0 in the uops.info results; it decodes to 2 separate uops even with a simple addressing mode). But AVX2 `vpcmpeqb ymm, ymm, ymm/mem` is a single uop for p01 and can micro-fuse, so it could load+compare 2x ymm per clock cycle if L1d can keep up, using only 2 fused-domain uops out of the 4/clock front-end bandwidth. (But then checking the result will cost more than `kortest`.)
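For example, an AVX2-only inner loop along these lines (untested sketch; assumes `ecx` is 64-byte aligned, `ymm0` is zeroed, and over-reading up to the alignment boundary is allowed):

;; UNTESTED sketch: AVX2-only inner loop, 2 YMM vectors (64 bytes) per iteration
.loop:
vpcmpeqb ymm1, ymm0, [ecx]      ; 1 micro-fused uop for p01
vpcmpeqb ymm2, ymm0, [ecx+32]
vpor ymm3, ymm1, ymm2           ; non-zero iff either vector saw a 0 byte
add ecx, 64
vpmovmskb eax, ymm3
test eax, eax
jz .loop                        ; keep going while no zero byte found

The `vpor` / `vpmovmskb` / `test` is the extra checking cost relative to `kortest`; after breaking out, `vpmovmskb` on `ymm1` (and `ymm2` if needed) sorts out which half and which byte.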
AVX512 integer compares take the comparison predicate as an immediate (rather than it being part of the opcode like SSE/AVX `pcmpeq`/`pcmpgt`), so that might be what's stopping them from micro-fusing a load. But no, `vptestmb k1, zmm0, [ebx]` can't micro-fuse either, otherwise you could use it (or `vptestnmb`) with an all-ones vector to check for zeros in memory.
(Note that micro-fusion only works on Intel Skylake CPUs with non-indexed addressing modes, like `vpcmpeqb ymm1, ymm0, [ebx]`, not `[ebx+eax]`. See Micro fusion and addressing modes. So use a pointer increment and subtract at the end.)
If you want to optimize for large strings, you can check two cache lines at once. Align your pointer by 128 bytes (i.e. check normally up to a 128-byte boundary). After comparing into 2 separate mask registers, `kortestq k0, k1` Just Works at no extra cost.
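A minimal sketch of that inner loop (untested; assumes `ecx` is already 128-byte aligned and `zmm0` is zeroed):

;; UNTESTED sketch: 2 cache lines (128 bytes) per iteration
.loop:
vpcmpeqb k0, zmm0, [ecx]        ; zero-byte positions in the first cache line
vpcmpeqb k1, zmm0, [ecx+64]     ; ... and in the second
add ecx, 128
kortestq k0, k1                 ; ZF=1 iff both masks are all-zero
jz .loop                        ; keep going while no zero byte in either line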
You might want to have a look at how glibc's AVX2 strlen works: https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/strlen-avx2.S.html. Its main loop (after short-string startup) uses `vpminub` (min of unsigned bytes) to combine 4 YMM vectors (128 bytes = 2 cache lines) down to one, and checks that for a zero. After breaking out of the loop, it sorts out where the first zero actually was. (It still has the vectors in registers because it used separate `vmovdqa` loads; reloading them would let the main loop micro-fuse the loads to be more HT-friendly, but would require reloads after breaking out.)
On SKX, `vpminub zmm` runs on port 0 but can micro-fuse a memory operand, while `vpcmpeqb zmm` runs on p5 only. Or if the data is in registers, use `vptestmb k0, zmm0, zmm0` so you don't need a zeroed register to compare against. Combining those could get lots of checking done with very few uops, allowing the out-of-order execution window to "see" very far ahead and maybe help with memory-level parallelism. (Data prefetch across 4k page boundaries isn't perfect.)
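For instance, a glibc-style combining loop with ZMM vectors might look roughly like this (untested sketch; assumes `ecx` is 128-byte aligned):

;; UNTESTED sketch: vpminub to combine 2 cache lines, vptestmb to check for a 0
.loop:
vmovdqa64 zmm1, [ecx]           ; keep one vector in a reg for the cleanup
vpminub zmm2, zmm1, [ecx+64]    ; p0, micro-fuses; min==0 iff either had a 0 byte
add ecx, 128
vptestmb k0, zmm2, zmm2         ; mask bit set for each non-zero byte
kortestq k0, k0
jc .loop                        ; CF=1 means the mask is all-ones: no zero byte yet

(Using `vptestnmb` instead would set mask bits at the zero bytes, letting you use the same `kortest` / `jz` idiom as the other loops.)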
But this kind of optimization probably just makes the loop more hyperthreading-friendly without improving its own throughput much, and increases the amount of data to sort through when you do break out of the loop. Especially if you're using memory source operands so the original data isn't still there in vector regs. So if you care about medium-length strings (hundreds or thousands of bytes), not just large multi-megabyte strings, limiting the inner loop to look at only a couple cache lines per check sounds reasonable.
But anyway, in 32-bit code, you could simply re-check the candidate region using 32-byte vectors -> 32-bit bitmaps. Perhaps `vextracti64x4` to grab the high half of a ZMM into a YMM for an AVX2 `vpcmpeqb` / `vpmovmskb` -> integer register. But it's a small amount of work, so you'd want to fully unroll and optimize it, which is what you're asking about.
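A sketch of that re-check (untested; assumes the candidate 64-byte block is still in `zmm1` from the main loop, and `ymm0` is zeroed):

;; UNTESTED sketch: split one ZMM candidate block into two 32-bit bitmaps
vpcmpeqb ymm2, ymm0, ymm1       ; low 32 bytes (ymm1 = low half of zmm1)
vextracti64x4 ymm3, zmm1, 1     ; high 32 bytes of the ZMM
vpcmpeqb ymm3, ymm0, ymm3
vpmovmskb edx, ymm2             ; low bitmap
vpmovmskb eax, ymm3             ; high bitmap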
Actual answer to the question asked:
`kshift` + `kmov` is the obvious way to get the high half of a k register into a 32-bit GP register. Store/reload is extra latency (like maybe 5 or 6 cycles for store-forwarding) but avoids port-5 ALU uops. Or maybe worse, like <= 10 cycles: uops.info's latency test couples the store/reload into a loop-carried dep chain by making the store address dependent on the load, so IDK whether it would be faster with the address ready early.
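The store/reload alternative could be as simple as this (untested sketch; assumes it's fine to use 8 bytes of stack scratch space):

;; UNTESTED sketch: spill the 64-bit mask, reload both halves as 32-bit GP regs
sub esp, 8
kmovq [esp], k0                 ; AVX512BW kmovq m64, k: one 64-bit store
mov edx, [esp]                  ; low 32 bits of the bitmap
mov eax, [esp+4]                ; high 32 bits
add esp, 8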
Redoing the compare with a 256-bit vector would also work as an alternative to `kmov`, like AVX2 `vpcmpeqb ymm1, ymm0, [ebx+32]` / `vpmovmskb eax, ymm1`. That's only 2 fused-domain uops, and has no data dependency on `k0`, so out-of-order exec can run it in parallel with the `kmovd` for the low half. Both `kmovd eax, k0` and `vpcmpeqb` need port 0, though, so it might not actually be great. (Assuming the vector ALU on port 1 is still shut down because of recently running 512-bit uops.)
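Concretely, with a pointer-increment loop like the one below (which exits with the pointer 64 bytes past the block containing the terminator), that cleanup could start like this (untested sketch; assumes `ymm0` is zeroed):

;; UNTESTED sketch: low half from the mask reg, high half from an AVX2 re-compare
kmovd edx, k0                   ; low 32 bytes' bitmap (port 0)
vpcmpeqb ymm1, ymm0, [ecx-32]   ; redo the compare on the high 32 bytes of the block
vpmovmskb eax, ymm1             ; high bitmap, no data dependency on k0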
`kmovd eax, k0` has 3-cycle latency on SKX. `kshiftrq` has 4-cycle latency, on a different port. So kmov + kshift + kmov could get the high half ready in an integer register in 7 cycles from when the kmov and kshift start executing (when `k0` is ready, or after they issue following a branch mispredict on leaving the loop). The loop branch typically does mispredict when leaving the loop (definitely for large loop trip counts, but maybe not for repeated use on strings of similar length), so the mask is usually ready before the cleanup uops can even issue, and optimizing to avoid a data dependency (e.g. with a separate 256-bit compare) may not be helpful.
IDK if branchless cleanup is the best bet or not. If the first zero byte is in the low half, avoiding a data dependency on extracting the high half is very good. But only if it predicts well!
;; UNTESTED
; input pointer in ecx, e.g. MS Windows fastcall
strlen_simple_aligned64_avx512_32bit:
vpxor xmm0, xmm0, xmm0 ; ZMM0 = _mm512_setzero_si512()
lea eax, [ecx+64] ; do this now to shorten the loop-exit critical path
.loop:
vpcmpeqb k0, zmm0, [ecx] ; can't micro-fuse anyway, could use an indexed load I guess
add ecx, 64
kortestq k0, k0
jz .loop ; loop while no zero byte found; loop = 5 uops total :(
;;; ecx - 64 is the 64-byte block that contains a zero byte
; to branch: `kortestd k0,k0` to only look at the low 32 bits, or kmovd / test/jnz to be optimistic that it's in the low half
kmovd edx, k0 ; low bitmap
kshiftrq k0, k0, 32
sub ecx, eax ; ecx = end_base+64 - (start+64) = end_base
kmovd eax, k0 ; high bitmap
tzcnt eax, eax ; high half offset
bsf edx, edx ; low half offset, sets ZF if low==0
lea eax, [ecx + eax + 32] ; high half length = base + (32+high_offset)
;; 3-component LEA has 3 cycle latency
;; with more registers we could have just an add on the critical path here
lea ecx, [ecx + edx] ; ecx = low half length not touching flags
; flags still set from BSF(low)
cmovnz eax, ecx ; return low half if its bitmap was non-zero
vzeroupper ; (using only ZMM16..31 would avoid needing this, but they're not encodable in 32-bit mode)
ret
Note that `bsf` sets flags based on its input, while `tzcnt` sets flags based on its result. `bsf` is a single uop with 3-cycle latency on Intel, same as `tzcnt`. AMD has slow `bsf` but doesn't support AVX512 on any current CPUs; I'm assuming Skylake-AVX512 / Cascade Lake here as the uarch to optimize for (and Ice Lake). KNL / KNM have slow `bsf`, but Xeon Phi doesn't have AVX512BW anyway.
Using more instructions could shorten the critical path, e.g. creating `base+32` in parallel with the tzcnt / bsf so we could avoid the 3-component LEA between that and the cmov. I think I'd have had to push/pop a call-preserved register like EBX or EDI to keep all the temporaries. Something like the sketch below.
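An untested sketch of that cleanup, as a drop-in replacement for everything after the loop in the cmov version above (same state: `eax` = start+64, `ecx` = block base + 64, bitmap in `k0`; assumes a `push ebx` was added at function entry):

;; UNTESTED: shorter critical path, at the cost of push/pop ebx around the function
kmovd edx, k0               ; low bitmap
kshiftrq k0, k0, 32
sub ecx, eax                ; ecx = block base offset from the start
lea ebx, [ecx+32]           ; base+32 ready early, off the critical path
kmovd eax, k0               ; high bitmap
tzcnt eax, eax
add eax, ebx                ; high-half length: simple ADD instead of 3-component LEA
bsf edx, edx                ; sets ZF if the low bitmap was 0
lea ecx, [ecx+edx]          ; low-half length, doesn't touch flags
cmovnz eax, ecx             ; return the low-half result if its bitmap was non-zero
vzeroupper
pop ebx                     ; matches the push ebx at function entry
ret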
Simple `lea` runs on p15 on Skylake; complex (3-component) `lea` runs on p1 only. So it doesn't compete with any of the `kmov` and `kshift` stuff, and with 512-bit uops in flight, port 1 is only shut down for SIMD uops, not scalar ones. But `tzcnt`/`bsf` also run on port 1, so there is competition there. Still, with the LEA dependent on the output of `tzcnt`, resource conflicts are probably not a problem. And Ice Lake puts LEA units on every port, which can handle 3-component LEA in a single cycle (InstLatx64).
If you were using `kortest k0, k1` with 2 separate masks, you'd probably want a `kortest k0, k0` first to figure out whether there was a zero in just the first mask or not, and only then pick apart k0 or k1 into 32-bit chunks in GP-integer registers.
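That branchy cleanup could start like this (untested sketch; `.zero_in_first_line` is just a made-up label, picking up right after falling out of a 2-mask loop like the 128-byte-per-iteration sketch earlier):

;; UNTESTED sketch: figure out which cache line holds the terminator first
kortestq k0, k0
jnz .zero_in_first_line     ; ZF=0: k0 is non-zero, the zero byte is in the first 64 bytes
;; else it's in k1: same cleanup as for k0, but add 64 to the final offset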
`bsf` leaves its destination unmodified when its input is all-zero. (This property is documented by AMD but not Intel; Intel CPUs do implement it, though.) You might want to take advantage of it, especially if you include a unit test to make sure it works on the CPU you're running on, for example something like the check below.
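An untested sketch of such a check (the function name is made up for illustration); it returns 1 in EAX if `bsf` with an all-zero source left the destination untouched:

;; UNTESTED: sanity-check BSF's behaviour with an all-zero source
check_bsf_keeps_dest:
mov eax, 0x12345678         ; sentinel value in the destination
xor edx, edx                ; all-zero source
bsf eax, edx                ; dest is officially "undefined" on Intel for src==0
xor ecx, ecx
cmp eax, 0x12345678
sete cl                     ; cl = 1 if the sentinel survived
mov eax, ecx
ret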
But maybe you don't want to rely on it anyway, because it couples the dependency chains together, making the `bsf` of the low half dependent on the `tzcnt`+`add` of the high half. It does save uops, though. Still, depending on the use-case, latency might not be very important: if you're just calculating a loop bound for some other loop, the result isn't needed right away and there will be later work that's independent of the strlen result. OTOH if you're about to loop over the string again, you can often do strlen on the fly instead.
(I also changed from a pointer-increment to indexed addressing, in a way that saves 1 more uop because the load doesn't micro-fuse anyway. It does introduce an extra `add` of address latency before the first load.)
;; untested, uses BSF's zero-input behaviour instead of CMOV
;; BAD FOR LATENCY
strlen_aligned64_throughput:
vpxor xmm0, xmm0, xmm0 ; ZMM0 = _mm512_setzero_si512()
mov edx, -64
.loop:
add edx, 64
vpcmpeqb k0, zmm0, [ecx+edx] ; can't micro-fuse anyway on SKX, so might as well use an indexed addressing mode
kortestq k0, k0
jz .loop ; loop while no zero byte found; loop = 5 uops total :(
;;; edx is the lowest index of the 64-byte block
kshiftrq k1, k0, 32
kmovd eax, k1 ; high bitmap
tzcnt eax, eax ; could also be bsf, it's just as fast on Skylake
add eax, 32 ; high index = tzcnt(high) + 32
kmovd ecx, k0 ; low bitmap
bsf eax, ecx ; index = low if non-zero, else high+32
add eax, edx ; pos = base + offset
vzeroupper
ret
Note the use of `kshift` into a separate register so we can get the high half first (in program order), avoiding the need to save/restore any extra registers. With only 3 architectural registers in play (and without saving/restoring more), we can let register renaming + OoO exec take care of things.
Critical path latency is not great. From `k0` being ready, `kmovd` can get the low-half bitmap out, but `bsf eax, ecx` can't begin until `eax` is ready. That depends on kshift (4) -> kmov (3) -> tzcnt (3) -> add (1) = 11 cycles, and then the `bsf` is another 3 cycles on top of that.
If we did the bit-scans in parallel, best case we could have tzcnt(hi) + `add` feeding into a CMOV (1 extra cycle) which has 2 integer inputs from the two bit-scan chains, and a flags input from something on the low half. (So the critical path would come just from the high half; the low half doesn't involve kshift and can be ready sooner.)
In the previous version of this, I used a 3-component `lea` on the high-half dep chain, which isn't great either.
Related: AVX512CD has SIMD `vplzcntq`. But you can't use it for a tzcnt because there's no efficient bit-reverse, and you'd also need to get the 64-bit mask back into a vector element and then `vmovd` the result to an integer reg.
There are instructions for exploding a bitmask into a vector mask (like `VPMOVM2B`), and there's also `VPBROADCASTMW2D xmm1, k1` to just copy a mask into vector elements. Unfortunately those mask-broadcast forms only exist for 8 and 16-bit mask widths, not the 64-bit masks that AVX512BW byte compares produce, so that doesn't solve the problem. In 64-bit mode you could of course `kmovq` to an integer reg and `vmovq` that to a vector, but then you'd just use scalar `lzcnt` or `tzcnt` directly.