In the middle of my holidays, I was browsing my Mastodon feed and found an interesting challenge: small enough that I wouldn't have to spend days figuring it out, but still interesting because even such a small problem leads to some "tricky" usages of Intel SSE/AVX instructions that I keep rediscovering (as I get older, I tend to forget these details!).
The bulk of the challenge can be summarized as:
How to efficiently find a specified int in an int[] using x86-64 SIMD instructions?
I won't claim that the version described in this blog post is the fastest on earth; the goal is more educational and about SIMD empowerment: if you are not familiar with SIMD code or feel intimidated about using it, please don't be! These instructions are fun to use and can often give you that satisfying 10x performance boost!
So let's have a look at the challenge...
The Scalar version
The scalar code of this version is pretty simple:
static int FindSimple(ReadOnlySpan<int> data, int value)
{
for (var i = 0; i < data.Length; i++)
if (data[i] == value)
return i;
return -1;
}
I'm using the excellent Disasmo Visual Studio extension from Egor Bogatov to quickly grab the output assembly for such functions, just an ALT+SHIFT+D keystroke away, and boom!
; Assembly listing for method BenchFind.BenchmarkFinder:FindSimple_(System.ReadOnlySpan`1[int],int):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
G_M000_IG02: ;; offset=0000H
488B01 mov rax, bword ptr [rcx]
8B4908 mov ecx, dword ptr [rcx+08H]
4533C0 xor r8d, r8d
85C9 test ecx, ecx
7E11 jle SHORT G_M000_IG04
align [0 bytes for IG03]
; <<<<<<<<<<<<<< Loop - Starts here
G_M000_IG03: ;; offset=000DH
458BC8 mov r9d, r8d
42391488 cmp dword ptr [rax+4*r9], edx
740E je SHORT G_M000_IG06
41FFC0 inc r8d
443BC1 cmp r8d, ecx
7CEF jl SHORT G_M000_IG03
; <<<<<<<<<<<<<< Loop - Ends here
G_M000_IG04: ;; offset=001EH
B8FFFFFFFF mov eax, -1
G_M000_IG05: ;; offset=0023H
C3 ret
G_M000_IG06: ;; offset=0024H
418BC0 mov eax, r8d
G_M000_IG07: ;; offset=0027H
C3 ret
; Total bytes of code 40
Firstly, you will notice that I have slightly changed the challenge from an int[] to a ReadOnlySpan<int>. This is mainly to make the method usable with any runtime data layout: native buffer, managed buffer or stackalloc.
All .NET compute-intensive code these days should use ReadOnlySpan/Span instead of arrays in order to allow more data input layout scenarios.
Secondly, you can also see that the ASM code above, generated by the .NET JIT (net7.0), is quite good, almost identical to what a C++ compiler would generate. This is expected, as the .NET JIT nowadays generates competitive code as long as 1) inlining of critical functions is performed well and 2) the register pressure is low (e.g. you don't see registers being spilled to the stack via the rsp register in critical loops).
Thirdly, an attentive reader might suggest that this loop could be slightly more optimized with scalar CPU instructions by testing e.g. 4 elements instead of only 1 per loop iteration, and that would be correct! But let's keep the scalar version as simple as it is; we will have more fun with the SIMD version.
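For the curious, here is a minimal sketch (not benchmarked, and the method name is mine) of what such a 4-at-a-time scalar unrolling could look like:
// Illustrative only: test 4 elements per loop iteration in the scalar version.
static int FindSimpleUnrolled(ReadOnlySpan<int> data, int value)
{
    var i = 0;
    var bound4 = data.Length & ~3; // largest multiple of 4 <= Length
    for (; i < bound4; i += 4)
    {
        if (data[i] == value) return i;
        if (data[i + 1] == value) return i + 1;
        if (data[i + 2] == value) return i + 2;
        if (data[i + 3] == value) return i + 3;
    }
    // Remaining 0 to 3 elements
    for (; i < data.Length; i++)
        if (data[i] == value) return i;
    return -1;
}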
Generic SIMD Version
One of the cool additions to .NET 7 is generic SIMD support via the System.Runtime.Intrinsics namespace, which allows you to write SIMD code without dealing with CPU-specific instructions, as long as the code stays simple and you can rely on the various existing methods available.
This is pretty cool because we can translate the previous scalar loop into a vectorized/SIMD loop with generic SIMD code that is fairly simple and easy to follow.
Here is the generic Vector128<int> version that is able to process 4 ints per loop iteration:
static int Find_Generic_128_(ReadOnlySpan<int> data, int value)
{
// In theory we should check for Vector128.IsHardwareAccelerated and dispatch
// accordingly; in practice we don't here, to keep the code simple.
var vInts = MemoryMarshal.Cast<int, Vector128<int>>(data);
var compareValue = Vector128.Create(value);
var vectorLength = Vector128<int>.Count;
// Batch <4 x int> per loop
for (var i = 0; i < vInts.Length; i++)
{
var result = Vector128.Equals(vInts[i], compareValue);
if (result == Vector128<int>.Zero) continue;
for (var k = 0; k < vectorLength; k++)
if (result.GetElement(k) != 0)
return i * vectorLength + k;
}
// Scalar process of the remaining
for (var i = vInts.Length * vectorLength; i < data.Length; i++)
if (data[i] == value)
return i;
return -1;
}
And the generic Vector256<int> version that is able to process 8 ints per loop iteration:
static int Find_Generic_256_(ReadOnlySpan<int> data, int value)
{
// In theory we should check for Vector256.IsHardwareAccelerated and dispatch
// accordingly; in practice we don't here, to keep the code simple.
var vInts = MemoryMarshal.Cast<int, Vector256<int>>(data);
var compareValue = Vector256.Create(value);
var vectorLength = Vector256<int>.Count;
// Batch <8 x int> per loop
for (var i = 0; i < vInts.Length; i++)
{
var result = Vector256.Equals(vInts[i], compareValue);
if (result == Vector256<int>.Zero) continue;
for (var k = 0; k < vectorLength; k++)
if (result.GetElement(k) != 0)
return i * vectorLength + k;
}
// Scalar process of the remaining
for (var i = vInts.Length * vectorLength; i < data.Length; i++)
if (data[i] == value)
return i;
return -1;
}
The results of the benchmark are:
Method | N | Mean | Error | StdDev | Median | Ratio | RatioSD |
---|---|---|---|---|---|---|---|
Find_Simple | 32 | 9.497 ns | 0.2087 ns | 0.2993 ns | 9.555 ns | 1.00 | 0.00 |
Find_Generic_128 | 32 | 4.572 ns | 0.0025 ns | 0.0020 ns | 4.572 ns | 0.48 | 0.01 |
Find_Generic_256 | 32 | 7.308 ns | 0.5040 ns | 1.4861 ns | 7.455 ns | 0.78 | 0.17 |
Find_Simple | 64 | 16.557 ns | 0.3431 ns | 0.4580 ns | 16.622 ns | 1.00 | 0.00 |
Find_Generic_128 | 64 | 8.531 ns | 0.0269 ns | 0.0238 ns | 8.543 ns | 0.52 | 0.01 |
Find_Generic_256 | 64 | 6.626 ns | 0.0900 ns | 0.0752 ns | 6.589 ns | 0.41 | 0.01 |
Find_Simple | 128 | 35.024 ns | 0.3709 ns | 0.3097 ns | 35.064 ns | 1.00 | 0.00 |
Find_Generic_128 | 128 | 15.533 ns | 0.0437 ns | 0.0341 ns | 15.546 ns | 0.44 | 0.00 |
Find_Generic_256 | 128 | 10.098 ns | 0.0235 ns | 0.0208 ns | 10.096 ns | 0.29 | 0.00 |
Find_Simple | 256 | 64.626 ns | 1.1894 ns | 1.1126 ns | 64.496 ns | 1.00 | 0.00 |
Find_Generic_128 | 256 | 35.388 ns | 0.0965 ns | 0.0855 ns | 35.392 ns | 0.55 | 0.01 |
Find_Generic_256 | 256 | 16.866 ns | 0.0433 ns | 0.0384 ns | 16.881 ns | 0.26 | 0.00 |
Find_Simple | 512 | 120.302 ns | 1.6310 ns | 1.5256 ns | 119.891 ns | 1.00 | 0.00 |
Find_Generic_128 | 512 | 63.086 ns | 0.1117 ns | 0.1044 ns | 63.058 ns | 0.52 | 0.01 |
Find_Generic_256 | 512 | 39.328 ns | 0.8087 ns | 2.3845 ns | 38.056 ns | 0.33 | 0.02 |
Find_Simple | 1024 | 232.160 ns | 1.9791 ns | 1.8512 ns | 232.436 ns | 1.00 | 0.00 |
Find_Generic_128 | 1024 | 119.290 ns | 0.2275 ns | 0.2017 ns | 119.350 ns | 0.51 | 0.00 |
Find_Generic_256 | 1024 | 65.283 ns | 0.1176 ns | 0.1100 ns | 65.236 ns | 0.28 | 0.00 |
Find_Simple | 4096 | 894.287 ns | 3.6022 ns | 3.0080 ns | 894.541 ns | 1.00 | 0.00 |
Find_Generic_128 | 4096 | 454.083 ns | 0.3423 ns | 0.3035 ns | 454.020 ns | 0.51 | 0.00 |
Find_Generic_256 | 4096 | 234.391 ns | 2.5310 ns | 2.1135 ns | 234.057 ns | 0.26 | 0.00 |
Find_Simple | 8192 | 1,796.290 ns | 13.4828 ns | 12.6118 ns | 1,792.636 ns | 1.00 | 0.00 |
Find_Generic_128 | 8192 | 901.999 ns | 1.8796 ns | 1.7582 ns | 902.707 ns | 0.50 | 0.00 |
Find_Generic_256 | 8192 | 465.352 ns | 5.0166 ns | 4.6925 ns | 462.971 ns | 0.26 | 0.00 |
The Vector128.Equals(vInts[i], compareValue) and Vector256.Equals(vInts[i], compareValue) calls generate a per-element mask (an int -1, i.e. 0xFFFF_FFFF) when the values are equal, or zero otherwise.
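For instance, here is a tiny illustration with made-up values of what the 128-bit comparison produces:
// Illustration only: each matching lane becomes -1 (0xFFFF_FFFF), the others 0.
var v = Vector128.Create(1, 5, 3, 5);
var mask = Vector128.Equals(v, Vector128.Create(5));
Console.WriteLine(mask); // <0, -1, 0, -1>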
We haven't optimized anything beyond using plain generic SIMD instructions, and this already gives a nice performance boost of 2x to 4x.
CPU SIMD Optimized version
That's great, but if you are looking for more performance, we can still go a bit further by using CPU-specific SIMD intrinsics for Intel x86-64. Let's have a look!
But first, what is wrong with the previous generic SIMD versions? If we look at the generated assembly for the 128-bit SIMD version, we quickly find something suspicious:
; Assembly listing for method BenchFind.BenchmarkFinder:Find_Generic_128_(System.ReadOnlySpan`1[int],int):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 8 single block inlinees; 1 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
4883EC38 sub rsp, 56
C5F877 vzeroupper
33C0 xor eax, eax
4889442428 mov qword ptr [rsp+28H], rax
4889442430 mov qword ptr [rsp+30H], rax
G_M000_IG02: ;; offset=0013H
488B01 mov rax, bword ptr [rcx]
8B4908 mov ecx, dword ptr [rcx+08H]
448BC1 mov r8d, ecx
49C1E002 shl r8, 2
49C1E804 shr r8, 4
4981F8FFFFFF7F cmp r8, 0x7FFFFFFF
0F8790000000 ja G_M000_IG15
C4E1796EC2 vmovd xmm0, edx
C4E27958C0 vpbroadcastd xmm0, xmm0
4533C9 xor r9d, r9d
4585C0 test r8d, r8d
7E3F jle SHORT G_M000_IG07
G_M000_IG03: ;; offset=0043H
458BD1 mov r10d, r9d
49C1E204 shl r10, 4
C4A179760C10 vpcmpeqd xmm1, xmm0, xmmword ptr [rax+r10]
C4E27917C9 vptest xmm1, xmm1
7423 je SHORT G_M000_IG06
G_M000_IG04: ;; offset=0057H
4533D2 xor r10d, r10d
660F1F440000 align [6 bytes for IG05]
G_M000_IG05: ;; offset=0060H
C4E179114C2428 vmovupd xmmword ptr [rsp+28H], xmm1
468B5C9428 mov r11d, dword ptr [rsp+4*r10+28H]
4585DB test r11d, r11d
7547 jne SHORT G_M000_IG13
41FFC2 inc r10d
4183FA04 cmp r10d, 4
7CE6 jl SHORT G_M000_IG05
G_M000_IG06: ;; offset=007AH
41FFC1 inc r9d
453BC8 cmp r9d, r8d
7CC1 jl SHORT G_M000_IG03
G_M000_IG07: ;; offset=0082H
41C1E002 shl r8d, 2
443BC1 cmp r8d, ecx
7D1B jge SHORT G_M000_IG09
0F1F440000 align [5 bytes for IG08]
G_M000_IG08: ;; offset=0090H
443BC1 cmp r8d, ecx
7332 jae SHORT G_M000_IG16
458BC8 mov r9d, r8d
42391488 cmp dword ptr [rax+4*r9], edx
7412 je SHORT G_M000_IG11
41FFC0 inc r8d
443BC1 cmp r8d, ecx
7CEA jl SHORT G_M000_IG08
G_M000_IG09: ;; offset=00A6H
B8FFFFFFFF mov eax, -1
G_M000_IG10: ;; offset=00ABH
4883C438 add rsp, 56
C3 ret
G_M000_IG11: ;; offset=00B0H
418BC0 mov eax, r8d
G_M000_IG12: ;; offset=00B3H
4883C438 add rsp, 56
C3 ret
G_M000_IG13: ;; offset=00B8H
438D048A lea eax, [r10+4*r9]
G_M000_IG14: ;; offset=00BCH
4883C438 add rsp, 56
C3 ret
G_M000_IG15: ;; offset=00C1H
E8AA7BC35F call CORINFO_HELP_OVERFLOW
CC int3
G_M000_IG16: ;; offset=00C7H
E81480C35F call CORINFO_HELP_RNGCHKFAIL
CC int3
; Total bytes of code 205
The ASM code starts with sub rsp, 56, which indicates that there is some stack spilling going on. That's not always a bad thing for a long method, but for such short compute-intensive code it can be a red flag, especially if the stack spilling occurs within a tight loop.
What is generating this stack spill?
We can see it in the middle of the ASM code:
G_M000_IG05: ;; offset=0060H
C4E179114C2428 vmovupd xmmword ptr [rsp+28H], xmm1
468B5C9428 mov r11d, dword ptr [rsp+4*r10+28H]
4585DB test r11d, r11d
7547 jne SHORT G_M000_IG13
41FFC2 inc r10d
4183FA04 cmp r10d, 4
7CE6 jl SHORT G_M000_IG05
and this code is actually associated with the following C# code:
// Will generate stack spilling!
for (var k = 0; k < vectorLength; k++)
if (result.GetElement(k) != 0)
return i * vectorLength + k;
D'oh! So this very innocuous code generates a round trip through memory in order to check which int component in the Vector128<int> is actually matching!
The stack spilling is not that terrible because this code only runs once we have a match, but for small inputs it can add up.
Ok, so let's optimize the SIMD version with some unsafe tricks, MOAR SIMD per loop and CPU-specific intrinsics:
public static int Find_AVX_256_Optimized(ReadOnlySpan<int> data, int value)
{
// Our data cursor
ref var pv = ref MemoryMarshal.GetReference(data);
var compareValue = Vector256.Create(value);
nint length = data.Length;
// Process 32 int (8 * 4) per loop (128 bytes)
nint bound1024 = length & ~(Vector256<int>.Count * 4 - 1);
nint i = 0;
for (; i < bound1024; i += Vector256<int>.Count * 4)
{
var r1 = Vector256.Equals(Unsafe.As<int, Vector256<int>>(ref Unsafe.Add(ref pv, i)), compareValue);
var r2 = Vector256.Equals(Unsafe.As<int, Vector256<int>>(ref Unsafe.Add(ref pv, i + Vector256<int>.Count)), compareValue);
var r3 = Vector256.Equals(Unsafe.As<int, Vector256<int>>(ref Unsafe.Add(ref pv, i + Vector256<int>.Count * 2)), compareValue);
var r4 = Vector256.Equals(Unsafe.As<int, Vector256<int>>(ref Unsafe.Add(ref pv, i + Vector256<int>.Count * 3)), compareValue);
var r5 = r1 | r2 | r3 | r4;
if (r5 != Vector256<int>.Zero)
{
// r12 = pack 32 to 16 of r1/r2
var t = Avx2.Permute2x128(r1, r2, 0b_0010_0000);
r2 = Avx2.Permute2x128(r1, r2, 0b_0011_0001);
r1 = t;
Vector256<short> r12 = Avx2.PackSignedSaturate(r1, r2).AsInt16();
// r34 = pack 32 to 16 of r3/r4
t = Avx2.Permute2x128(r3, r4, 0b_0010_0000);
r4 = Avx2.Permute2x128(r3, r4, 0b_0011_0001);
r3 = t;
Vector256<short> r34 = Avx2.PackSignedSaturate(r3, r4).AsInt16();
// pack 16 to 8 of r12/r34
var t1 = Avx2.Permute2x128(r12, r34, 0b_0010_0000);
r34 = Avx2.Permute2x128(r12, r34, 0b_0011_0001);
r12 = t1;
Vector256<sbyte> r = Avx2.PackSignedSaturate(r12, r34);
// Get the mask from the <32 x sbyte> result
var idx = Avx2.MoveMask(r);
return (int)(i + BitOperations.TrailingZeroCount(idx));
}
}
// Process 8 int per loop (32 bytes)
nint bound256 = length & ~(Vector256<int>.Count - 1);
for (; i < bound256; i += Vector256<int>.Count)
{
var r1 = Vector256.Equals(Unsafe.As<int, Vector256<int>>(ref Unsafe.Add(ref pv, i)), compareValue);
if (r1 != Vector256<int>.Zero)
{
// Get an 8-bit mask from the <8 x int> comparison result
var rByte = Avx.MoveMask(r1.AsSingle());
// And get the local index
var idx = BitOperations.TrailingZeroCount((uint)rByte);
return (int)(i + idx);
}
}
// Process remaining
for (; i < length; i++)
{
if (Unsafe.Add(ref pv, i) == value)
return (int)i;
}
return -1;
}
And this code gets translated to the following optimized ASM code:
; Assembly listing for method BenchFind.BatchFinder:Find_AVX2_256_Optimized(System.ReadOnlySpan`1[int],int):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 2 inlinees with PGO data; 3 single block inlinees; 0 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
C5F877 vzeroupper
G_M000_IG02: ;; offset=0003H
488B01 mov rax, bword ptr [rcx]
C4E1796EC2 vmovd xmm0, edx
C4E27D58C0 vpbroadcastd ymm0, ymm0
8B4908 mov ecx, dword ptr [rcx+08H]
4863C9 movsxd rcx, ecx
4C8BC1 mov r8, rcx
4983E0E0 and r8, -32
4533C9 xor r9d, r9d
4D85C0 test r8, r8
7E45 jle SHORT G_M000_IG04
align [0 bytes for IG03]
G_M000_IG03: ;; offset=0025H
4D8BD1 mov r10, r9
49C1E202 shl r10, 2
C4A17D760C10 vpcmpeqd ymm1, ymm0, ymmword ptr[rax+r10]
C4A17D76541020 vpcmpeqd ymm2, ymm0, ymmword ptr[rax+r10+20H]
C4A17D765C1040 vpcmpeqd ymm3, ymm0, ymmword ptr[rax+r10+40H]
C4A17D76641060 vpcmpeqd ymm4, ymm0, ymmword ptr[rax+r10+60H]
C4E175EBEA vpor ymm5, ymm1, ymm2
C4E155EBEB vpor ymm5, ymm5, ymm3
C4E155EBEC vpor ymm5, ymm5, ymm4
C4E27D17ED vptest ymm5, ymm5
7571 jne SHORT G_M000_IG14
4983C120 add r9, 32
4D3BC8 cmp r9, r8
7CBB jl SHORT G_M000_IG03
G_M000_IG04: ;; offset=006AH
4C8BC1 mov r8, rcx
4983E0F8 and r8, -8
4D3BC8 cmp r9, r8
7D20 jge SHORT G_M000_IG06
66660F1F840000000000 align [10 bytes for IG05]
G_M000_IG05: ;; offset=0080H
C4A17D760C88 vpcmpeqd ymm1, ymm0, ymmword ptr[rax+4*r9]
C4E27D17C9 vptest ymm1, ymm1
7531 jne SHORT G_M000_IG12
4983C108 add r9, 8
4D3BC8 cmp r9, r8
7CEA jl SHORT G_M000_IG05
G_M000_IG06: ;; offset=0096H
4C3BC9 cmp r9, rcx
7D13 jge SHORT G_M000_IG08
0F1F440000 align [5 bytes for IG07]
G_M000_IG07: ;; offset=00A0H
42391488 cmp dword ptr [rax+4*r9], edx
7411 je SHORT G_M000_IG10
49FFC1 inc r9
4C3BC9 cmp r9, rcx
7CF2 jl SHORT G_M000_IG07
G_M000_IG08: ;; offset=00AEH
B8FFFFFFFF mov eax, -1
G_M000_IG09: ;; offset=00B3H
C5F877 vzeroupper
C3 ret
G_M000_IG10: ;; offset=00B7H
418BC1 mov eax, r9d
G_M000_IG11: ;; offset=00BAH
C5F877 vzeroupper
C3 ret
G_M000_IG12: ;; offset=00BEH
C5FC50C1 vmovmskps yrax, ymm1
F30FBCC0 tzcnt eax, eax
4103C1 add eax, r9d
G_M000_IG13: ;; offset=00C9H
C5F877 vzeroupper
C3 ret
G_M000_IG14: ;; offset=00CDH
C4E37546C220 vperm2i128 ymm0, ymm1, ymm2, 32
C4E37546D231 vperm2i128 ymm2, ymm1, ymm2, 49
C5FC28C8 vmovaps ymm1, ymm0
C4E36546C420 vperm2i128 ymm0, ymm3, ymm4, 32
C4E36546E431 vperm2i128 ymm4, ymm3, ymm4, 49
C5FD6BC4 vpackssdw ymm0, ymm0, ymm4
C5F56BCA vpackssdw ymm1, ymm1, ymm2
C4E37546D020 vperm2i128 ymm2, ymm1, ymm0, 32
C4E37546C031 vperm2i128 ymm0, ymm1, ymm0, 49
C5ED63C0 vpacksswb ymm0, ymm2, ymm0
C5FDD7C0 vpmovmskb eax, ymm0
F30FBCC0 tzcnt eax, eax
4103C1 add eax, r9d
G_M000_IG15: ;; offset=010CH
C5F877 vzeroupper
C3 ret
; Total bytes of code 272
As you can see, the generated code is now heavily optimized and very compact, using only registers. I wouldn't have been able to write much better ASM code if I had written it by hand!
So, in order to further optimize our code, we have followed a few very simple principles and used some specific Intel instructions:
- Using a ref data cursor instead of ReadOnlySpan.
- Using nint instead of int for the indexer.
- Using more batches.
- Using specific Intel intrinsics tricks.
Let's have a look at them.
Using ref
Firstly, we are going to use a ref data cursor instead of indexing the ReadOnlySpan<int> directly. This is done with the first statement:
// Our data cursor
ref var pv = ref MemoryMarshal.GetReference(data);
The benefit of doing this is that it removes all the bounds checks that you can still see here and there in the generic version (because some patterns are not fully detected). We are using a ref instead of an unsafe * pointer because we still want our code to be GC friendly. A ref allows the GC to update this pointer if necessary (if the pointed data is moved by the GC), while a fixed/pinned statement would prevent the GC from moving this data around and could cause heap fragmentation.
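For comparison, here is a sketch of the pinned alternative that we deliberately avoid (illustrative only):
// Sketch only: pinning blocks the GC from moving the buffer for the whole loop.
static unsafe int FindPinned(ReadOnlySpan<int> data, int value)
{
    fixed (int* p = data)
    {
        for (var i = 0; i < data.Length; i++)
            if (p[i] == value)
                return i;
    }
    return -1;
}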
You might then wonder how we advance a ref. Simply by using the Unsafe.Add<T>(ref T, nint elementOffset) method (and other related methods):
// Load a Vector256<int> from a ref int at the index i + Vector256<int>.Count * 3
// equivalent of:
// Unsafe.As<int, Vector256<int>>(ref data[i + Vector256<int>.Count * 3])
Unsafe.As<int, Vector256<int>>(ref Unsafe.Add(ref pv, i + Vector256<int>.Count * 3))
Using nint
Secondly, we need to return an index into the data, so we declare the variable nint i = 0; at the beginning and keep it across the different batch groups. Now, you might wonder why we use nint instead of an int? nint is a shorthand for System.IntPtr and matches the CPU register size. It helps the compiler avoid generating many up-casts from int32 to int64 (movsxd instructions) in various places. In the generic ReadOnlySpan<int> SIMD version you won't see these movsxd instructions, because the compiler is able to detect such patterns, but since in our case we are keeping a single index variable across the different batch groups (next section), the compiler won't notice this and would generate these spurious movsxd with an int index.
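Here is a small illustrative sketch of that difference (the exact codegen depends on the JIT, so treat it as an assumption rather than a guarantee):
// Illustrative only: both loops do the same work; the second keeps the index in a
// register-sized nint, which avoids an int32 -> int64 widening (movsxd) before
// each address computation when the JIT cannot prove it is redundant.
static int SumWithIntIndex(ref int pv, int length)
{
    var sum = 0;
    for (int i = 0; i < length; i++)
        sum += Unsafe.Add(ref pv, i);  // i may need to be sign-extended to 64-bit
    return sum;
}

static int SumWithNIntIndex(ref int pv, nint length)
{
    var sum = 0;
    for (nint i = 0; i < length; i++)
        sum += Unsafe.Add(ref pv, i);  // i already matches the pointer size
    return sum;
}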
Using more batches
Thirdly, we want to increase the batch size so that the cost of the loop machinery is reduced compared to the code doing the actual computation. In the previous generic Vector128<int> version, the compute-intensive loop looks like this, and there is almost as much code to run the loop as to perform the actual computation:
G_M000_IG03: ;; offset=0043H
458BD1 mov r10d, r9d
49C1E204 shl r10, 4
; <<<<<<< Actual computation code - starts
C4A179760C10 vpcmpeqd xmm1, xmm0, xmmword ptr [rax+r10]
C4E27917C9 vptest xmm1, xmm1
7423 je SHORT G_M000_IG06
; >>>>>>> Actual computation code - ends
; [...] skipping some lines
G_M000_IG06: ;; offset=007AH
41FFC1 inc r9d
453BC8 cmp r9d, r8d
7CC1 jl SHORT G_M000_IG03
So, we are breaking our code into 3 loops, from a coarser batch to a scalar loop:
- A first loop that processes 4 x Vector256<int>, so 128 bytes per iteration.
- A remaining loop that processes 1 x Vector256<int>, so 32 bytes per iteration.
- And the remaining scalar loop, so 4 bytes per iteration.
Another trick is to minimize the work when we haven't found an item, even in the case of the 4 x Vector256<int> batch. The trick here is to generate a single branch for checking whether we have a match, by OR-ing (|) the 4 comparisons and only checking the result of that OR:
var r1 = Vector256.Equals(Unsafe.As<int, Vector256<int>>(ref Unsafe.Add(ref pv, i)), compareValue);
var r2 = Vector256.Equals(Unsafe.As<int, Vector256<int>>(ref Unsafe.Add(ref pv, i + Vector256<int>.Count)), compareValue);
var r3 = Vector256.Equals(Unsafe.As<int, Vector256<int>>(ref Unsafe.Add(ref pv, i + Vector256<int>.Count * 2)), compareValue);
var r4 = Vector256.Equals(Unsafe.As<int, Vector256<int>>(ref Unsafe.Add(ref pv, i + Vector256<int>.Count * 3)), compareValue);
var r5 = r1 | r2 | r3 | r4;
if (r5 != Vector256<int>.Zero)
{
// .... We have found something!
}
And here is how our loop looks now:
G_M000_IG03: ;; offset=0025H
4D8BD1 mov r10, r9
49C1E202 shl r10, 2
C4A17D760C10 vpcmpeqd ymm1, ymm0, ymmword ptr[rax+r10]
C4A17D76541020 vpcmpeqd ymm2, ymm0, ymmword ptr[rax+r10+20H]
C4A17D765C1040 vpcmpeqd ymm3, ymm0, ymmword ptr[rax+r10+40H]
C4A17D76641060 vpcmpeqd ymm4, ymm0, ymmword ptr[rax+r10+60H]
C4E175EBEA vpor ymm5, ymm1, ymm2
C4E155EBEB vpor ymm5, ymm5, ymm3
C4E155EBEC vpor ymm5, ymm5, ymm4
C4E27D17ED vptest ymm5, ymm5
7571 jne SHORT G_M000_IG14
4983C120 add r9, 32
4D3BC8 cmp r9, r8
7CBB jl SHORT G_M000_IG03
It shows that we are now doing more compute work than loop machinery (add r9, 32, cmp r9, r8, etc.), which helps lower its relative cost. Compared to the generic version, we now have multiple comparison instructions involved, processing 4 times more data per iteration!
Using CPU intrinsics
Now, how can we avoid the stack spilling by not using Vector256<T>.GetElement(index)
? By using specific SSE/AVX instructions!
For the first 4 x Vector256<int> batch, we are using the following code to compute the index of the first element found, without using any branches:
if (r5 != Vector256<int>.Zero)
{
// r12 = pack 32 to 16 of r1/r2
var t = Avx2.Permute2x128(r1, r2, 0b_0010_0000);
r2 = Avx2.Permute2x128(r1, r2, 0b_0011_0001);
r1 = t;
Vector256<short> r12 = Avx2.PackSignedSaturate(r1, r2).AsInt16();
// r34 = pack 32 to 16 of r3/r4
t = Avx2.Permute2x128(r3, r4, 0b_0010_0000);
r4 = Avx2.Permute2x128(r3, r4, 0b_0011_0001);
r3 = t;
Vector256<short> r34 = Avx2.PackSignedSaturate(r3, r4).AsInt16();
// pack 16 to 8 of r12/r34
var t1 = Avx2.Permute2x128(r12, r34, 0b_0010_0000);
r34 = Avx2.Permute2x128(r12, r34, 0b_0011_0001);
r12 = t1;
Vector256<sbyte> r = Avx2.PackSignedSaturate(r12, r34);
// Get the mask from the <32 x sbyte> result
var idx = Avx2.MoveMask(r);
return (int)(i + BitOperations.TrailingZeroCount(idx));
}
So we have 4 x Vector256<int> registers in r1, r2, r3, r4, and what we are going to do is downcast these 32 x int to 32 x byte, i.e. a single Vector256<sbyte> register; from there, a single instruction (movemask) generates a bitmask with one bit per byte.
I'm always using the Intel Intrinsics Guide to look these functions up. If you look at the doc comment of the function, you will see the CPU instruction name VPERM2I128:
/// <summary>
/// <para>__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)</para>
/// <para>VPERM2I128 ymm, ymm, ymm/m256, imm8</para>
/// </summary>
/// <param name="left" />
/// <param name="right" />
/// <param name="control" />
public new static Vector256<int> Permute2x128(
Vector256<int> left,
Vector256<int> right,
byte control)
This expands to the manual entry here.
It gives the description "Shuffle 128-bits (composed of integer data) selected by imm8 from a and b, and store the results in dst.", but this is too vague to really understand how the instruction works. The good thing is that the details of the instruction are defined in pseudo-code just below the description, and that's super useful for understanding such functions!
__m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)
DEFINE SELECT4(src1, src2, control) {
CASE(control[1:0]) OF
0: tmp[127:0] := src1[127:0]
1: tmp[127:0] := src1[255:128]
2: tmp[127:0] := src2[127:0]
3: tmp[127:0] := src2[255:128]
ESAC
IF control[3]
tmp[127:0] := 0
FI
RETURN tmp[127:0]
}
dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0])
dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4])
dst[MAX:256] := 0
So we are using it like this:
var t = Avx2.Permute2x128(r1, r2, 0b_0010_0000);
r2 = Avx2.Permute2x128(r1, r2, 0b_0011_0001);
r1 = t;
The first control byte, 0b_0010_0000, translates to:
imm8[3:0] = 0
imm8[7:4] = 2
while the second one, 0b_0011_0001, translates to:
imm8[3:0] = 1
imm8[7:4] = 3
So basically, this instruction with these two control bytes packs together the low 128-bit halves and the high 128-bit halves of the two r1/r2 registers:
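To make this concrete, here is a small illustration with made-up values (assuming AVX2 is available), where r1 holds elements 0-7 and r2 holds elements 8-15:
// Illustration only.
var r1 = Vector256.Create(0, 1, 2, 3, 4, 5, 6, 7);
var r2 = Vector256.Create(8, 9, 10, 11, 12, 13, 14, 15);
var lows  = Avx2.Permute2x128(r1, r2, 0b_0010_0000); // low 128-bit half of r1, then of r2
var highs = Avx2.Permute2x128(r1, r2, 0b_0011_0001); // high 128-bit half of r1, then of r2
Console.WriteLine(lows);  // <0, 1, 2, 3, 8, 9, 10, 11>
Console.WriteLine(highs); // <4, 5, 6, 7, 12, 13, 14, 15>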
Why are we performing such a swap? Because most AVX2 SIMD instructions work on the two 128-bit lanes independently, and the instruction PackSignedSaturate(Vector256<int> left, Vector256<int> right), which is the intrinsic VPACKSSDW/_mm256_packs_epi32, is defined like this:
__m256i _mm256_packs_epi32 (__m256i a, __m256i b)
dst[15:0] := Saturate16(a[31:0]) ; a[127:0]
dst[31:16] := Saturate16(a[63:32])
dst[47:32] := Saturate16(a[95:64])
dst[63:48] := Saturate16(a[127:96])
dst[79:64] := Saturate16(b[31:0]) ; b[127:0]
dst[95:80] := Saturate16(b[63:32])
dst[111:96] := Saturate16(b[95:64])
dst[127:112] := Saturate16(b[127:96])
dst[143:128] := Saturate16(a[159:128]) ; a[255:128]
dst[159:144] := Saturate16(a[191:160])
dst[175:160] := Saturate16(a[223:192])
dst[191:176] := Saturate16(a[255:224])
dst[207:192] := Saturate16(b[159:128]) ; b[255:128]
dst[223:208] := Saturate16(b[191:160])
dst[239:224] := Saturate16(b[223:192])
dst[255:240] := Saturate16(b[255:224])
dst[MAX:256] := 0
Notice that the function first processes a[127:0], then b[127:0], then a[255:128] and then b[255:128]!
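Here is a small illustration of this lane-wise behavior with made-up values (assuming AVX2 is available): without the swap above, the packed result interleaves the two 128-bit lanes instead of keeping the element order.
// Illustration only: a holds elements 0-7, b holds elements 8-15.
var a = Vector256.Create(0, 1, 2, 3, 4, 5, 6, 7);
var b = Vector256.Create(8, 9, 10, 11, 12, 13, 14, 15);
var packed = Avx2.PackSignedSaturate(a, b); // Vector256<short>
Console.WriteLine(packed); // <0, 1, 2, 3, 8, 9, 10, 11, 4, 5, 6, 7, 12, 13, 14, 15>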
That's why we need to swap the 128-bit halves around in order to keep the order of the data. We then further reduce from int to short, and from short to byte.
The last SIMD instruction, Avx2.MoveMask(Vector256<sbyte>) (intrinsic _mm256_movemask_epi8), compresses the 32 bytes into 32 bits:
int _mm256_movemask_epi8 (__m256i a)
FOR j := 0 to 31
i := j*8
dst[j] := a[i+7]
ENDFOR
We can then extract the local position within these 32 bytes by using BitOperations.TrailingZeroCount(int).
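For example, with a made-up value: if the match is at element 13 of the 32-int batch, byte 13 of the packed result is 0xFF, so bit 13 of the mask is set:
// Illustration only.
int mask = 1 << 13;                                // MoveMask result with bit 13 set
int local = BitOperations.TrailingZeroCount(mask); // 13
Console.WriteLine(local);                          // the offset added to i in the return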
But as suggested by Nietras here, instead of performing the permutation before PackSignedSaturate, we could do it after the packing and save one permutation per pack, which is quite nice:
if (r5 != Vector256<int>.Zero)
{
// r12 = pack 32 to 16 of r1/r2
// r34 = pack 32 to 16 of r3/r4
// but it's working on 128 bit lanes, so we need to reorder them
Vector256<short> r12 = Avx2.PackSignedSaturate(r1, r2).AsInt16();
Vector256<short> r34 = Avx2.PackSignedSaturate(r3, r4).AsInt16();
// Reorder r12 & r34 correctly
r12 = Avx2.Permute4x64(r12.AsInt64(), 0b_11_01_10_00).AsInt16();
r34 = Avx2.Permute4x64(r34.AsInt64(), 0b_11_01_10_00).AsInt16();
// pack 16 to 8 of r12/r34
Vector256<sbyte> r = Avx2.PackSignedSaturate(r12, r34);
// Reorder r correctly
r = Avx2.Permute4x64(r.AsInt64(), 0b_11_01_10_00).AsSByte();
// Get the mask from the <32 x sbyte> result
var idx = Avx2.MoveMask(r);
return (int)(i + BitOperations.TrailingZeroCount(idx));
}
and generates the following assembly:
G_M000_IG14: ;; offset=00CDH
C5E56BC4 vpackssdw ymm0, ymm3, ymm4
C4E3FD00C0D8 vpermq ymm0, ymm0, -40
C5F56BCA vpackssdw ymm1, ymm1, ymm2
C4E3FD00C9D8 vpermq ymm1, ymm1, -40
C5F563C0 vpacksswb ymm0, ymm1, ymm0
C4E3FD00C0D8 vpermq ymm0, ymm0, -40
C5FDD7C0 vpmovmskb eax, ymm0
F30FBCC0 tzcnt eax, eax
4103C1 add eax, r9d
Similarly, the loop processing 1 x Vector256<int> uses Avx.MoveMask(Vector256<float>) to compute this index directly from the Vector256<int> result of the comparison. As you can see, the signature takes floats (!) while it works just as well on 32-bit ints, which is super convenient. There are plenty of instructions like this in SSE/AVX that look like they are only relevant for floating point while they also work fine for integer types.
Results
And the benchmark results show a significant boost for the optimized version, from 4x to 10x! (Oops, actually, it looks closer to 8x, but let's keep 10x for the marketing.)
Method | N | Mean | Error | StdDev | Median | Ratio | RatioSD |
---|---|---|---|---|---|---|---|
Find_Simple | 32 | 9.481 ns | 0.2092 ns | 0.3066 ns | 9.491 ns | 1.00 | 0.00 |
Find_Generic_128 | 32 | 6.420 ns | 0.0166 ns | 0.0129 ns | 6.416 ns | 0.67 | 0.03 |
Find_Generic_256 | 32 | 4.663 ns | 0.0309 ns | 0.0258 ns | 4.656 ns | 0.49 | 0.02 |
Find_AVX_256_Optimized | 32 | 2.275 ns | 0.0156 ns | 0.0146 ns | 2.278 ns | 0.24 | 0.01 |
Find_Simple | 64 | 16.442 ns | 0.3419 ns | 0.4324 ns | 16.264 ns | 1.00 | 0.00 |
Find_Generic_128 | 64 | 13.991 ns | 0.6312 ns | 1.8411 ns | 12.551 ns | 0.86 | 0.12 |
Find_Generic_256 | 64 | 6.509 ns | 0.0365 ns | 0.0285 ns | 6.523 ns | 0.39 | 0.01 |
Find_AVX_256_Optimized | 64 | 2.855 ns | 0.0116 ns | 0.0103 ns | 2.857 ns | 0.17 | 0.00 |
Find_Simple | 128 | 35.323 ns | 0.5072 ns | 0.4496 ns | 35.470 ns | 1.00 | 0.00 |
Find_Generic_128 | 128 | 22.573 ns | 0.0498 ns | 0.0441 ns | 22.581 ns | 0.64 | 0.01 |
Find_Generic_256 | 128 | 9.979 ns | 0.0276 ns | 0.0245 ns | 9.977 ns | 0.28 | 0.00 |
Find_AVX_256_Optimized | 128 | 5.196 ns | 0.0095 ns | 0.0089 ns | 5.199 ns | 0.15 | 0.00 |
Find_Simple | 256 | 64.433 ns | 1.2814 ns | 1.1986 ns | 63.995 ns | 1.00 | 0.00 |
Find_Generic_128 | 256 | 48.901 ns | 0.0357 ns | 0.0334 ns | 48.907 ns | 0.76 | 0.01 |
Find_Generic_256 | 256 | 16.789 ns | 0.0683 ns | 0.0638 ns | 16.815 ns | 0.26 | 0.00 |
Find_AVX_256_Optimized | 256 | 9.886 ns | 0.0140 ns | 0.0124 ns | 9.886 ns | 0.15 | 0.00 |
Find_Simple | 512 | 120.161 ns | 1.3342 ns | 1.1828 ns | 120.274 ns | 1.00 | 0.00 |
Find_Generic_128 | 512 | 89.880 ns | 0.1807 ns | 0.1690 ns | 89.867 ns | 0.75 | 0.01 |
Find_Generic_256 | 512 | 37.246 ns | 0.3284 ns | 0.3225 ns | 37.262 ns | 0.31 | 0.00 |
Find_AVX_256_Optimized | 512 | 15.867 ns | 0.0183 ns | 0.0171 ns | 15.863 ns | 0.13 | 0.00 |
Find_Simple | 1024 | 232.331 ns | 2.8660 ns | 2.6808 ns | 232.469 ns | 1.00 | 0.00 |
Find_Generic_128 | 1024 | 173.437 ns | 1.2228 ns | 1.0840 ns | 172.894 ns | 0.75 | 0.01 |
Find_Generic_256 | 1024 | 64.866 ns | 0.0929 ns | 0.0726 ns | 64.850 ns | 0.28 | 0.00 |
Find_AVX_256_Optimized | 1024 | 28.875 ns | 0.0765 ns | 0.0715 ns | 28.891 ns | 0.12 | 0.00 |
Find_Simple | 4096 | 898.684 ns | 8.5403 ns | 7.9886 ns | 896.829 ns | 1.00 | 0.00 |
Find_Generic_128 | 4096 | 673.922 ns | 1.8541 ns | 1.6436 ns | 673.437 ns | 0.75 | 0.01 |
Find_Generic_256 | 4096 | 232.696 ns | 0.9915 ns | 0.8790 ns | 232.518 ns | 0.26 | 0.00 |
Find_AVX_256_Optimized | 4096 | 115.904 ns | 0.5819 ns | 0.5158 ns | 115.905 ns | 0.13 | 0.00 |
Find_Simple | 8192 | 1,781.139 ns | 6.3148 ns | 5.2731 ns | 1,779.763 ns | 1.00 | 0.00 |
Find_Generic_128 | 8192 | 1,337.138 ns | 3.3049 ns | 3.0914 ns | 1,336.492 ns | 0.75 | 0.00 |
Find_Generic_256 | 8192 | 457.116 ns | 0.7556 ns | 0.7068 ns | 457.027 ns | 0.26 | 0.00 |
Find_AVX_256_Optimized | 8192 | 223.426 ns | 0.1103 ns | 0.0978 ns | 223.445 ns | 0.13 | 0.00 |
Final words
Nowadays, processing data, whether large or small, requires going full width on the CPU in order to achieve optimal performance, before going wider on the CPU cores. It is no surprise that the .NET team has been optimizing the .NET Base Class Libraries (BCL) with such intrinsics for several years now. See the .NET performance blog posts from Stephen Toub for .NET 5, .NET 6 and .NET 7; they will give you a - huge! - glimpse of what is happening there. Simple functions like string.IndexOf(char) use these SIMD intrinsics under the hood and are implemented entirely and very efficiently in C#, without the need for a C++ fallback.
Not only is the BCL benefiting from such intrinsics: the whole .NET ecosystem, with plenty of OSS projects, is joining the effort. For example, Nietras recently shared a cool library, Sep - Possibly the World's Fastest .NET CSV Parser, which uses SIMD intrinsics extensively to dramatically boost CSV parsing speed.
The simple initial implementation in this post also shows that a C++ compiler wouldn't be able to optimize it much better (no auto-vectorization) without specific compiler pattern matching (and I haven't seen any compiler implementing this particular one), so it is worth implementing such optimized loops with .NET Vector intrinsics and CPU intrinsics to deliver the best performance.
More specifically, the zoo of Intel SIMD intrinsics can be involved in more algorithm tricks than their ARM counterparts, and that's the cool and fun part of this: figuring out how to best use them!
Finally, if you need to micro-optimize such functions, one rule of thumb: never guess! Always profile, benchmark and check the generated ASM to find and understand optimization opportunities.
Happy coding!
PS: The code is available as a gist here