-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[API Proposal]: MemoryExtensions.IndexOfAny{Except}WhiteSpace #77959
Comments
The proposal LGTM. I assume we are going to provide a vectorized implementation for the most frequent cases ( |
At least experiment with one. I think the methods would be worth adding even if not vectorized; I listed some examples of places code is doing this same operation, but there are many more... it's surprisingly common. That said, I think it'd be worth trying to vectorize it and see what the perf looks like, in particular when there's non-ASCII input. My initial thought was we'd basically vectorize the search for all the ASCII whitespace chars (0x9 through 0xD and 0x20) as well as anything > 0x7F, and if we find anything > 0x7F, then fall back to a scalar loop just using char.IsWhiteSpace (I don't know if there'd be an efficient way to vectorize the search for the Unicode whitespace characters). My main concern with this would be that if the input wasn't ASCII, we'd likely frequently be paying the overhead of an initial vector on top of then taking the scalar path. |
#80963 means that this could be vectorized when using |
Looks good as proposed. When looking for consistency, we saw that the Notably, MemoryExtensions.IsWhiteSpace does not have a namespace System;
public static partial class MemoryExtensions
{
public static int IndexOfAnyWhiteSpace(this ReadOnlySpan<char> span);
public static int IndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span);
public static int LastIndexOfAnyWhiteSpace(this ReadOnlySpan<char> span);
public static int LastIndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span);
// Edit: APIs approved as part of #86528
public static bool ContainsAnyWhiteSpace(this ReadOnlySpan<char> span);
public static bool ContainsAnyExceptWhiteSpace(this ReadOnlySpan<char> span);
} |
As of #83992, the regex source generator will now emit its own variants of these, e.g. if it needs to search for /// <summary>Finds the next index of any character that matches a whitespace character.</summary>
internal static int IndexOfAnyWhiteSpace(this ReadOnlySpan<char> span)
{
int i = span.IndexOfAnyExcept(Utilities.s_asciiExceptWhiteSpace);
if ((uint)i < (uint)span.Length)
{
if (span[i] <= 0x7f)
{
return i;
}
do
{
if (char.IsWhiteSpace(span[i]))
{
return i;
}
i++;
}
while ((uint)i < (uint)span.Length);
}
return -1;
}
/// <summary>Supports searching for characters in or not in "\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007f".</summary>
internal static readonly IndexOfAnyValues<char> s_asciiExceptWhiteSpace = IndexOfAnyValues.Create("\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007f"); and for /// <summary>Finds the next index of any character that matches any character other than a space character.</summary>
internal static int IndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span)
{
int i = span.IndexOfAnyExcept(Utilities.s_asciiWhiteSpace);
if ((uint)i < (uint)span.Length)
{
if (span[i] <= 0x7f)
{
return i;
}
do
{
if (!char.IsWhiteSpace(span[i]))
{
return i;
}
i++;
}
while ((uint)i < (uint)span.Length);
}
return -1;
}
/// <summary>Supports searching for characters in or not in "\t\n\v\f\r ".</summary>
internal static readonly IndexOfAnyValues<char> s_asciiWhiteSpace = IndexOfAnyValues.Create("\t\n\v\f\r "); (It's capable now of emitting many variations of these... it just chooses good names for the methods for these known sets, but will generate a similarly shaped method for any set it doesn't have a better strategy for.) I suggest we put this issue on hold until we get some more experience with those implementations and how they fair. Once we're happy with them, we can effectively promote them up to MemoryExtensions per this issue, and the regex source generator / compiler can be changed to use those instead. |
A nit - the names I'm assuming the respective definitions are "return the index of the first whitespace char you see; or -1 if you find no whitespace chars or the string is empty" and "return the index of the first non-whitespace char you see; or -1 if the string contains only whitespace chars or is empty." Some proposals:
|
The capitalization of space in WhiteSpace is jarring for me given we consistently write whitespace as a single word even in this very issue. |
I don't disagree, but we have |
It was contentious enough that Krzysztof wrote it in the table of correct spellings in Framework Design Guidelines 2nd edition (or maybe even first).
Which is in opposition to Timestamp:
|
Background and motivation
It turns out it's quite common to want to search for whitespace (
char.IsWhiteSpace
) or things other than whitespace (!char.IsWhiteSpace
). This is true not only in regex (\s
and\S
) (according to our nuget regex database there are ~13,000 occurrences of a regex that's simply\s
) but also in many open-coded loops, e.g.runtime/src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.Globalization.cs
Lines 17 to 25 in 264d739
runtime/src/libraries/System.Linq.Expressions/src/System/Linq/Expressions/DebugViewWriter.cs
Lines 1194 to 1204 in 264d739
runtime/src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.Trim.cs
Lines 568 to 589 in 264d739
Etc. We should expose these as dedicated helpers, whether or not we're able to improve performance over a simple loop (we might be able to, for at least some kinds of input).
API Proposal
ReadOnlySpan<char>
and not alsoSpan<char>
, since the most common case by far is expected to be spans derived from strings. The existing MemoryExtensions.IsWhiteSpace is also only exposed forReadOnlySpan<char>
.API Usage
e.g. MemoryExtensions.IsWhiteSpace could be rewritten as simply:
Alternative Designs
If we want to expose these but don't want them to be so prominent, once #68328 is implemented (assuming it sticks with the proposed design), this could instead be exposed as a static property on
IndexOfAnyValues
:public static class IndexOfAnyValues { + public static IndexOfAnyValues<char> WhiteSpace { get; } }
in which case the same functionality could be achieved with:
The WhiteSpace property would cache a specialized concrete implementation that does what the proposed IndexOfAnyWhiteSpace would do.
Risks
No response
The text was updated successfully, but these errors were encountered: