Add a SearchValues implementation for values with unique low nibbles #106900

MihaZupan · 2024-08-23T20:33:00Z

Based on http://0x80.pl/articles/simd-byte-lookup.html#special-case-3-unique-lower-and-higher-nibbles

For comparison, the current core lookup for an ASCII set on AVX2 uses 2 ANDs, 1 shift, and 2 shuffles:

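Vector256<byte> Lookup(Vector256<byte> source, Vector256<byte> bitmapLookup)
{
    // High nibble of every byte (shift as int, then mask off the bits shifted in from the neighboring byte).
    Vector256<byte> highNibbles = (source.AsInt32() >>> 4).AsByte() & Vector256.Create((byte)0xF);
    // Bitmap row selected by the low nibble of every byte.
    Vector256<byte> bitMask = Avx2.Shuffle(bitmapLookup, source);
    // Single bit selected by the high nibble of every byte.
    Vector256<byte> bitPositions = Avx2.Shuffle(Vector256.Create(0x8040201008040201).AsByte(), highNibbles);
    // Non-zero where the byte is part of the set.
    return bitMask & bitPositions;
}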
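To make the two shuffles easier to follow, here is a scalar sketch of a single lane of that lookup. It is only an illustration: `BuildBitmap` and `LookupLane` are hypothetical names, and the bitmap layout (one bit per high nibble, stored in the slot of the low nibble) is inferred from the `Lookup` code above rather than copied from the runtime.

```csharp
using System;

// Hypothetical scalar builder: bitmap[lowNibble] has bit number highNibble set for every value.
static byte[] BuildBitmap(ReadOnlySpan<byte> asciiValues)
{
    byte[] bitmap = new byte[16];
    foreach (byte value in asciiValues)
    {
        if (value > 127) throw new ArgumentException("ASCII only", nameof(asciiValues));
        bitmap[value & 0xF] |= (byte)(1 << (value >> 4));
    }
    return bitmap;
}

// Scalar stand-in for one lane of the vectorized Lookup: non-zero means "source is in the set".
static byte LookupLane(byte[] bitmap, byte source)
{
    // First shuffle: pshufb yields 0 when the index has its high bit set,
    // otherwise the low 4 bits of the index select the table entry.
    byte bitMask = (source & 0x80) != 0 ? (byte)0 : bitmap[source & 0xF];
    // Second shuffle: the 0x8040201008040201 table maps high nibble h to 1 << (h & 7).
    byte bitPosition = (byte)(1 << ((source >> 4) & 7));
    return (byte)(bitMask & bitPosition);
}

byte[] htmlBitmap = BuildBitmap("\0\r&<"u8);
Console.WriteLine(LookupLane(htmlBitmap, (byte)'&') != 0); // True
Console.WriteLine(LookupLane(htmlBitmap, (byte)'6') != 0); // False
```

Note that '&' (0x26) and '6' (0x36) share a low nibble. The bitmap tells them apart by the high-nibble bit, which is what lets this approach handle arbitrary ASCII sets at the cost of the extra instructions.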

The core lookup for values with unique low nibbles uses only 1 shuffle and 1 comparison:

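Vector256<byte> Lookup(Vector256<byte> source, Vector256<byte> valuesByLowNibble)
{
    // Candidate value selected by the low nibble of every byte.
    Vector256<byte> values = Avx2.Shuffle(valuesByLowNibble, source);
    // All-ones where the candidate is the byte itself.
    return Vector256.Equals(source, values);
}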

(code-wise, most of the implementation in this PR is a copy-paste of the existing ASCII logic, swapping out this core lookup routine)
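As a rough illustration of the unique-low-nibble lookup, the following scalar sketch models one lane of the shuffle and the comparison for the HTML set used in the benchmark below. `ShuffleLane`, `IsMatch` and `IndexOfAnyScalar` are hypothetical helpers written for this example. They are not part of the PR.

```csharp
using System;

// Table indexed by low nibble for the set { '\0', '\r', '&', '<' } (0x00, 0x0D, 0x26, 0x3C).
// Their low nibbles (0x0, 0xD, 0x6, 0xC) are all different, so each value gets its own slot.
byte[] valuesByLowNibble = new byte[16];
valuesByLowNibble[0x0] = 0x00; // '\0'
valuesByLowNibble[0xD] = 0x0D; // '\r'
valuesByLowNibble[0x6] = 0x26; // '&'
valuesByLowNibble[0xC] = 0x3C; // '<'

// Scalar stand-in for one lane of Avx2.Shuffle (pshufb): index bytes with the high bit set
// produce 0, otherwise the low 4 bits select the table entry.
static byte ShuffleLane(byte[] table, byte index) =>
    (index & 0x80) != 0 ? (byte)0 : table[index & 0xF];

// Scalar stand-in for the two-instruction lookup: pick a candidate by low nibble, then confirm it.
static bool IsMatch(byte[] table, byte source) =>
    ShuffleLane(table, source) == source;

// Hypothetical scalar IndexOfAny built on top of the model.
static int IndexOfAnyScalar(byte[] table, ReadOnlySpan<byte> input)
{
    for (int i = 0; i < input.Length; i++)
    {
        if (IsMatch(table, input[i])) return i;
    }
    return -1;
}

Console.WriteLine(IndexOfAnyScalar(valuesByLowNibble, "hello <b>"u8)); // 6
Console.WriteLine(IsMatch(valuesByLowNibble, (byte)'6'));              // False
```

The second call shows why the comparison is needed: '6' (0x36) selects the same slot as '&' (0x26), but the stored candidate differs from the input, so it is rejected.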

Consider a benchmark inspired by @lemire's https://lemire.me/blog/2024/07/05/scan-html-faster-with-simd-instructions-net-c-edition/
In this case, we're scanning UTF8 input for bytes relevant to HTML (<, &, \r and \0).
Previously, SearchValues would pick the same implementation as span.IndexOfAny(4 values).
The blog post highlights that a hand-written approach can beat SearchValues in this case -- not anymore :)

using System;
using System.Buffers;
using System.Text;
using BenchmarkDotNet.Attributes;

public class Bench
{
    private static readonly SearchValues<byte> s_searchValues = SearchValues.Create("\0\r&<"u8);
    private static byte[] s_bytes = Encoding.ASCII.GetBytes(new string('x', 10_000));

    [Benchmark] public int FindHtmlChar() => s_bytes.AsSpan().IndexOfAny(s_searchValues);
}

This approach doubles the searching performance on my AVX2 CPU (Ryzen 1700).
On ARM, it's a 1.6X improvement.

Method	Toolchain	Mean	Ratio
FindHtmlChar	main	605.7 ns	1.00
FindHtmlChar	pr	304.5 ns	0.50

Compared to the implementation for an arbitrary ASCII set, this improves throughput between 1.2x and 1.5x depending on the hardware (see more numbers below).

The UniqueLowNibble approach could be used a lot more aggressively (see benchmarks below).
I conservatively placed it between the 3-value and 4-value implementations in the order of preference to minimize the risk of regressions for now.
In practice, we're currently only using SearchValues with 4 or more values across runtime/aspnet.

As a follow up, I plan on changing our heuristics around which approach we pick in SearchValues depending on the platform.
After that, we may want to consider using it even with fewer values (e.g. 2 or 3).

We should also consider using PackedSpanHelpers on ARM.
Searching for any subset of ASCII is currently faster than a basic IndexOf('a') on M1 hardware because we don't use PackedSpanHelpers there.

Timings for scanning through 10k elements (10k bytes or 10k chars); lower is better.
Rows are roughly ordered from fastest to slowest.

ARM (Apple M1)

Method	Mean	Error
IndexOfAny1Byte	233.2 ns	0.24 ns
IndexOfAnyByteInRange	253.7 ns	0.24 ns
IndexOfAny2Byte	274.4 ns	0.14 ns
IndexOfAnyUniqueLowNibbleByte	275.7 ns	0.83 ns
IndexOfAnyAsciiByte	346.5 ns	0.02 ns
IndexOfAny3Byte	346.8 ns	0.11 ns
IndexOfAny4Byte	444.5 ns	0.03 ns
IndexOfAnyByte	541.7 ns	0.05 ns
IndexOfAny5Byte	542.2 ns	0.05 ns

IndexOfAnyUniqueLowNibbleChar	351.4 ns	0.49 ns
IndexOfAnyAsciiChar	448.2 ns	0.41 ns
IndexOfAny1Char	453.2 ns	0.23 ns
IndexOfAnyInRange	497.6 ns	0.31 ns
IndexOfAny2Chars	543.2 ns	0.37 ns
IndexOfAny3Chars	688.8 ns	0.19 ns
IndexOfAny4Chars	884.2 ns	0.21 ns
IndexOfAny5Chars	1,079.5 ns	0.10 ns

ARM (Azure D8plsv5 VM)

Method	Mean	Error
IndexOfAny1Byte	493.0 ns	0.04 ns
IndexOfAnyByteInRange	544.2 ns	2.92 ns
IndexOfAny2Byte	636.7 ns	6.03 ns
IndexOfAnyUniqueLowNibbleByte	664.5 ns	4.58 ns
IndexOfAny3Byte	851.6 ns	7.29 ns
IndexOfAnyAsciiByte	853.6 ns	5.32 ns
IndexOfAny4Byte	1,067.7 ns	8.85 ns
IndexOfAny5Byte	1,292.5 ns	11.32 ns
IndexOfAnyByte	1,309.2 ns	10.58 ns

IndexOfAny1Char	979.7 ns	0.08 ns
IndexOfAnyInRange	1,075.4 ns	4.08 ns
IndexOfAnyUniqueLowNibbleChar	1,088.8 ns	53.19 ns
IndexOfAny2Chars	1,279.2 ns	13.17 ns
IndexOfAnyAsciiChar	1,316.2 ns	0.91 ns
IndexOfAny3Chars	1,702.1 ns	14.53 ns
IndexOfAny4Chars	2,135.7 ns	17.84 ns
IndexOfAny5Chars	2,578.2 ns	21.88 ns

X64 with Vector256 (i9-10900X - no full Avx512)

Method	Mean	Error
IndexOfAny1Byte	164.1 ns	2.56 ns
IndexOfAnyUniqueLowNibbleByte	163.8 ns	0.53 ns
IndexOfAnyByteInRange	200.0 ns	1.26 ns
IndexOfAny2Byte	214.8 ns	2.16 ns
IndexOfAny3Byte	216.4 ns	1.80 ns
IndexOfAny4Byte	227.1 ns	1.27 ns
IndexOfAnyAsciiByte	248.0 ns	2.47 ns
IndexOfAny5Byte	252.0 ns	0.75 ns
IndexOfAnyByte	361.8 ns	1.34 ns

IndexOfAny1PackedChar	209.1 ns	0.23 ns
IndexOfLetterIgnoreCase	199.4 ns	1.92 ns
IndexOfAnyUniqueLowNibbleChar	218.3 ns	0.25 ns
IndexOfAny2PackedChars	231.7 ns	2.57 ns
IndexOfTwoLettersIgnoreCase	243.4 ns	2.00 ns
IndexOfAny3PackedChars	248.4 ns	2.82 ns
IndexOfAnyInRangePacked	248.7 ns	2.49 ns
IndexOfAnyAsciiChar	287.2 ns	0.38 ns
IndexOfAny1Char	304.3 ns	3.55 ns
IndexOfAnyInRange	395.7 ns	3.20 ns
IndexOfAny2Chars	416.4 ns	5.64 ns
IndexOfAny3Chars	410.5 ns	3.66 ns
IndexOfAny4Chars	440.0 ns	3.07 ns
IndexOfAny5Chars	496.1 ns	1.85 ns

X64 with Vector256 (Ryzen 1700)

Method	Mean	Error
IndexOfAny1Byte	241.3 ns	1.51 ns
IndexOfAnyUniqueLowNibbleByte	279.0 ns	1.54 ns
IndexOfAnyByteInRange	368.5 ns	1.80 ns
IndexOfAny2Byte	369.9 ns	1.89 ns
IndexOfAny3Byte	447.2 ns	2.03 ns
IndexOfAnyAsciiByte	455.7 ns	2.62 ns
IndexOfAny4Byte	557.9 ns	1.79 ns
IndexOfAny5Byte	640.4 ns	3.19 ns
IndexOfAnyByte	655.3 ns	3.58 ns

IndexOfAny1PackedChar	280.7 ns	1.48 ns
IndexOfAnyUniqueLowNibbleChar	363.0 ns	1.94 ns
IndexOfAny2PackedChars	365.7 ns	1.99 ns
IndexOfLetterIgnoreCase	369.2 ns	1.98 ns
IndexOfAnyInRangePacked	375.1 ns	1.27 ns
IndexOfAny3PackedChars	448.3 ns	2.02 ns
IndexOfTwoLettersIgnoreCase	459.5 ns	2.24 ns
IndexOfAny1Char	461.1 ns	1.85 ns
IndexOfAnyAsciiChar	545.5 ns	18.02 ns
IndexOfAnyInRange	718.8 ns	4.14 ns
IndexOfAny2Chars	734.8 ns	2.81 ns
IndexOfAny3Chars	922.0 ns	2.75 ns
IndexOfAny4Chars	1,091.1 ns	5.75 ns
IndexOfAny5Chars	1,254.2 ns	7.16 ns

X64 with Vector512 (Xeon Platinum 8370C)

Method	Mean	Error
IndexOfAny1Byte	99.20 ns	0.811 ns
IndexOfAny2Byte	186.23 ns	0.157 ns
IndexOfAny3Byte	236.63 ns	0.228 ns
IndexOfAnyByteInRange	253.85 ns	0.279 ns
IndexOfAnyUniqueLowNibbleByte	273.06 ns	3.011 ns
IndexOfAny4Byte	312.36 ns	0.102 ns
IndexOfAnyAsciiByte	346.18 ns	2.557 ns
IndexOfAny5Byte	363.69 ns	0.160 ns
IndexOfAnyByte	422.75 ns	1.270 ns

IndexOfAny1PackedChar	165.53 ns	3.280 ns
IndexOfAnyInRangePacked	168.52 ns	2.998 ns
IndexOfLetterIgnoreCase	170.30 ns	2.769 ns
IndexOfAny1Char	194.79 ns	0.097 ns
IndexOfAny2PackedChars	217.14 ns	0.150 ns
IndexOfTwoLettersIgnoreCase	239.79 ns	0.205 ns
IndexOfAny3PackedChars	268.56 ns	0.217 ns
IndexOfAnyUniqueLowNibbleChar	271.63 ns	1.392 ns
IndexOfAnyAsciiChar	327.17 ns	1.021 ns
IndexOfAny2Chars	366.79 ns	0.087 ns
IndexOfAny3Chars	468.80 ns	0.093 ns
IndexOfAnyInRange	500.94 ns	0.521 ns
IndexOfAny4Chars	621.20 ns	0.209 ns
IndexOfAny5Chars	723.78 ns	0.212 ns

dotnet-policy-service · 2024-08-23T20:33:25Z

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

stephentoub · 2024-09-05T17:34:44Z

src/libraries/System.Private.CoreLib/src/System/SearchValues/IndexOfAnyAsciiSearcher.cs

+            {
+                // Avoid false positives for the zero character if no other character has a low nibble of zero.
+                // We can replace it with any other byte that has a non-zero low nibble.
+                valuesByLowNibble.SetElementUnsafe(0, (byte)1);


I didn't fully grok this. Why don't we need to check if 1 is already being used?

MihaZupan · 2024-09-05

All vector elements start out as 0, and not all of them may be initialized.

We map every input character to an element based on its lower nibble.

0, 16, 32 ... => valuesByLowNibble[0]
1, 17, 33 ... => valuesByLowNibble[1]
15, 31, 47 ... => valuesByLowNibble[15]

The search works by first picking a potential match based on the low nibble (Shuffle) and then confirming it (Equals).

This means that input characters with a given low nibble only care about the element of valuesByLowNibble for that nibble. Values like 1 or 2 don't care about what the value of valuesByLowNibble[7] is since they'll never be mapped to it.

This also means that it's okay for valuesByLowNibble to be left uninitialized at 0.
The Equals could only match for an input character 0, but those will always be mapped to valuesByLowNibble[0] by the shuffle instead.

The edge case is the 0th nibble since the character 0 could be a false positive there.
But it'll only be a false positive if we don't have the character 0 in our values.
That's the valuesByLowNibble.GetElement(0) == 0 && !lookup.Contains(0) check above.

To avoid false positives for 0, we can use the same trick of setting the element to some "unreachable" value.
We can use any value with a non-zero nibble, as the shuffle will map any inputs with those values to a different element. 1 is just an arbitrary choice.

Edit: I tweaked the comment a bit, hopefully, it's decipherable.
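The zero edge case described above can be reproduced with a small scalar sketch. This is an illustration under stated assumptions, not the code from the PR: `TryBuildTable` and its `applyZeroFix` flag are hypothetical, the builder assumes ASCII values, and `IsMatch` models one lane of the shuffle plus the comparison.

```csharp
using System;

// Hypothetical builder (an illustration, not the runtime's code). It assumes the values are
// ASCII and returns false if two different values share a low nibble.
static bool TryBuildTable(ReadOnlySpan<byte> values, byte[] valuesByLowNibble, bool applyZeroFix)
{
    Array.Clear(valuesByLowNibble, 0, valuesByLowNibble.Length);
    int seen = 0; // bit i is set once a value with low nibble i has been stored
    foreach (byte value in values)
    {
        int bit = 1 << (value & 0xF);
        if (value > 127 || ((seen & bit) != 0 && valuesByLowNibble[value & 0xF] != value))
        {
            return false;
        }
        seen |= bit;
        valuesByLowNibble[value & 0xF] = value;
    }
    // Slot 0 still holds 0 although no value maps to it: an input byte 0 would be picked by
    // the shuffle and then confirmed by the comparison. Park an unreachable value there.
    if (applyZeroFix && (seen & 1) == 0)
    {
        valuesByLowNibble[0] = 1;
    }
    return true;
}

// One lane of the lookup: pshufb-style selection by low nibble, then an equality check.
static bool IsMatch(byte[] table, byte source) =>
    ((source & 0x80) != 0 ? (byte)0 : table[source & 0xF]) == source;

byte[] naive = new byte[16];
byte[] patched = new byte[16];
TryBuildTable("&<"u8, naive, applyZeroFix: false);
TryBuildTable("&<"u8, patched, applyZeroFix: true);

Console.WriteLine(IsMatch(naive, 0));           // True: false positive, 0 is not in the set
Console.WriteLine(IsMatch(patched, 0));         // False
Console.WriteLine(IsMatch(patched, 1));         // False: 1 is compared against slot 1, never slot 0
Console.WriteLine(IsMatch(patched, (byte)'<')); // True
```

Storing 1 in slot 0 is safe because an input byte of 1 has a low nibble of 1, so it is only ever compared against slot 1.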

MihaZupan added the area-System.Memory label Aug 23, 2024

MihaZupan added this to the 10.0.0 milestone Aug 23, 2024

MihaZupan requested a review from stephentoub August 23, 2024 20:33

MihaZupan self-assigned this Aug 23, 2024

This was referenced Aug 24, 2024

SIGKILL (OOM?) while running LibraryImportGenerator.Tests w/o actionable log messages or artifacts dotnet/dnceng#2496

Open

XmlSerializerTests.Xml_DerivedIXmlSerializable is failing on linux-musl legs #106865

Closed

stephentoub reviewed Sep 5, 2024

src/libraries/System.Private.CoreLib/src/System/SearchValues/IndexOfAnyAsciiSearcher.cs

stephentoub approved these changes Sep 5, 2024

MihaZupan added 3 commits September 6, 2024 17:14

Add SearchValues implementation for values with unique low nibbles

cdf4d7f

More generics

8ea4ecc

Tweak comment

fe3ae67

MihaZupan force-pushed the searchvalues-uniqueLowNibble2 branch from d2ae610 to fe3ae67 Compare September 6, 2024 17:22

Remove extra empty line

74e1d6e

MihaZupan commented Sep 9, 2024

Update comment

2dfd56b

MihaZupan mentioned this pull request Sep 9, 2024

Extend the list of recognized SearchValues<char> field names in Regex #107402

Merged

MihaZupan merged commit b06d5e2 into dotnet:main Sep 10, 2024
146 of 148 checks passed

This was referenced Sep 11, 2024

"We stopped hearing from agent Azure Pipelines 32. Verify the agent machine is running and has a healthy network connection" dotnet/dnceng#1886

Open

restarted. Azure DevOps can't recover from restarts. dotnet/dnceng#3879

Open

LoopedBard3 mentioned this pull request Sep 19, 2024

[Perf] Linux/arm64: 2 Improvements on 9/10/2024 10:21:56 PM dotnet/perf-autofiling-issues#41493

Closed
