Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API Proposal]: MemoryExtensions.IndexOfAny{Except}WhiteSpace #77959

Open
stephentoub opened this issue Nov 7, 2022 · 10 comments
Open

[API Proposal]: MemoryExtensions.IndexOfAny{Except}WhiteSpace #77959

stephentoub opened this issue Nov 7, 2022 · 10 comments
Labels
api-approved API was approved in API review, it can be implemented area-System.Memory
Milestone

Comments

@stephentoub
Copy link
Member

stephentoub commented Nov 7, 2022

Background and motivation

It turns out it's quite common to want to search for whitespace (char.IsWhiteSpace) or things other than whitespace (!char.IsWhiteSpace). This is true not only in regex (\s and \S) (according to our nuget regex database there are ~13,000 occurrences of a regex that's simply \s) but also in many open-coded loops, e.g.

Etc. We should expose these as dedicated helpers, whether or not we're able to improve performance over a simple loop (we might be able to, for at least some kinds of input).

API Proposal

namespace System;

public static class MemoryExtensions
{
+   public static int IndexOfAnyWhiteSpace(this ReadOnlySpan<char> span);
+   public static int IndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span);
+   public static int LastIndexOfAnyWhiteSpace(this ReadOnlySpan<char> span);
+   public static int LastIndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span);
}
  • This is only proposed for ReadOnlySpan<char> and not also Span<char>, since the most common case by far is expected to be spans derived from strings. The existing MemoryExtensions.IsWhiteSpace is also only exposed for ReadOnlySpan<char>.

API Usage

e.g. MemoryExtensions.IsWhiteSpace could be rewritten as simply:

public static bool IsWhiteSpace(this ReadOnlySpan<char> span) => span.IndexOfAnyExceptWhiteSpace() < 0;

Alternative Designs

If we want to expose these but don't want them to be so prominent, once #68328 is implemented (assuming it sticks with the proposed design), this could instead be exposed as a static property on IndexOfAnyValues:

public static class IndexOfAnyValues
{
+   public static IndexOfAnyValues<char> WhiteSpace { get; }
}

in which case the same functionality could be achieved with:

int wsIndex = span.IndexOfAny(IndexOfAnyValues.WhiteSpace); // or IndexOfAnyExcept

The WhiteSpace property would cache a specialized concrete implementation that does what the proposed IndexOfAnyWhiteSpace would do.

Risks

No response

@stephentoub stephentoub added api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Memory labels Nov 7, 2022
@stephentoub stephentoub added this to the 8.0.0 milestone Nov 7, 2022
@ghost
Copy link

ghost commented Nov 7, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

It turns out it's quite common to want to search for whitespace (char.IsWhiteSpace) or things other than whitespace (!char.IsWhiteSpace). This is true not only in regex (\s and \S) (according to our nuget regex database there are ~13,000 occurrences of a regex that's simply \s) but also in many open-coded loops, e.g.

Etc. We should expose these as dedicated helpers, whether or not we're able to improve performance over a simple loop (we might be able to, for at least some kinds of input).

API Proposal

namespace System;

public static class MemoryExtensions
{
+   public static int IndexOfAnyWhiteSpace(this ReadOnlySpan<char> span);
+   public static int IndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span);
}
  • This is only proposed for ReadOnlySpan<char> and not also Span<char>, since the most common case by far is expected to be spans derived from strings. The existing MemoryExtensions.IsWhiteSpace is also only exposed for ReadOnlySpan<char>.

API Usage

e.g. MemoryExtensions.IsWhiteSpace could be rewritten as simply:

public static bool IsWhiteSpace(this ReadOnlySpan<char> span) => span.IndexOfAnyExceptWhiteSpace() < 0;

Alternative Designs

If we want to expose these but don't want them to be so prominent, once #68328 is implemented (assuming it sticks with the proposed design), this could instead be exposed as a static property on IndexOfAnyValues:

public static class IndexOfAnyValues
{
+   public static IndexOfAnyValues<char> WhiteSpace { get; }
}

in which case the same functionality could be achieved with:

int wsIndex = span.IndexOfAny(IndexOfAnyValues.WhiteSpace); // or IndexOfAnyExcept

The WhiteSpace property would cache a specialized concrete implementation that does what the proposed IndexOfAnyWhiteSpace would do.

Risks

No response

Author: stephentoub
Assignees: -
Labels:

api-suggestion, area-System.Memory

Milestone: 8.0.0

@adamsitnik
Copy link
Member

The proposal LGTM. I assume we are going to provide a vectorized implementation for the most frequent cases (Latin1)?

@stephentoub
Copy link
Member Author

stephentoub commented Nov 7, 2022

I assume we are going to provide a vectorized implementation for the most frequent cases (Latin1)?

At least experiment with one. I think the methods would be worth adding even if not vectorized; I listed some examples of places code is doing this same operation, but there are many more... it's surprisingly common. That said, I think it'd be worth trying to vectorize it and see what the perf looks like, in particular when there's non-ASCII input. My initial thought was we'd basically vectorize the search for all the ASCII whitespace chars (0x9 through 0xD and 0x20) as well as anything > 0x7F, and if we find anything > 0x7F, then fall back to a scalar loop just using char.IsWhiteSpace (I don't know if there'd be an efficient way to vectorize the search for the Unicode whitespace characters). My main concern with this would be that if the input wasn't ASCII, we'd likely frequently be paying the overhead of an initial vector on top of then taking the scalar path.

@stephentoub stephentoub added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Nov 7, 2022
@MihaZupan
Copy link
Member

#80963 means that this could be vectorized when using IndexOfAnyValues (at least for non-except overloads), but if we expect matches to commonly occur at the start of the input (e.g. Trim), we may not want to use it anyway due to the added overhead.

@bartonjs
Copy link
Member

bartonjs commented Mar 2, 2023

Video

Looks good as proposed.

When looking for consistency, we saw that the ReadOnlySpan<char> methods on MemoryExtensions are not entirely consistent with whether there's also a Span<char> overload. In the event that the Span<char> members are desired, they're approved by pattern-matching.

Notably, MemoryExtensions.IsWhiteSpace does not have a Span<char> overload.

namespace System;

public static partial class MemoryExtensions
{
    public static int IndexOfAnyWhiteSpace(this ReadOnlySpan<char> span);
    public static int IndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span);
    public static int LastIndexOfAnyWhiteSpace(this ReadOnlySpan<char> span);
    public static int LastIndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span);

    // Edit: APIs approved as part of #86528
    public static bool ContainsAnyWhiteSpace(this ReadOnlySpan<char> span);
    public static bool ContainsAnyExceptWhiteSpace(this ReadOnlySpan<char> span);
}

@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Mar 2, 2023
@stephentoub stephentoub self-assigned this Mar 15, 2023
@stephentoub
Copy link
Member Author

As of #83992, the regex source generator will now emit its own variants of these, e.g. if it needs to search for \s it'll emit:

        /// <summary>Finds the next index of any character that matches a whitespace character.</summary>
        internal static int IndexOfAnyWhiteSpace(this ReadOnlySpan<char> span)
        {
            int i = span.IndexOfAnyExcept(Utilities.s_asciiExceptWhiteSpace);
            if ((uint)i < (uint)span.Length)
            {
                if (span[i] <= 0x7f)
                {
                    return i;
                }
        
                do
                {
                    if (char.IsWhiteSpace(span[i]))
                    {
                        return i;
                    }
                    i++;
                }
                while ((uint)i < (uint)span.Length);
            }
        
            return -1;
        }
        
        /// <summary>Supports searching for characters in or not in "\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f!\"#$%&amp;'()*+,-./0123456789:;&lt;=&gt;?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007f".</summary>
        internal static readonly IndexOfAnyValues<char> s_asciiExceptWhiteSpace = IndexOfAnyValues.Create("\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\u007f");

and for \S:

        /// <summary>Finds the next index of any character that matches any character other than a space character.</summary>
        internal static int IndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span)
        {
            int i = span.IndexOfAnyExcept(Utilities.s_asciiWhiteSpace);
            if ((uint)i < (uint)span.Length)
            {
                if (span[i] <= 0x7f)
                {
                    return i;
                }
        
                do
                {
                    if (!char.IsWhiteSpace(span[i]))
                    {
                        return i;
                    }
                    i++;
                }
                while ((uint)i < (uint)span.Length);
            }
        
            return -1;
        }
        
        /// <summary>Supports searching for characters in or not in "\t\n\v\f\r ".</summary>
        internal static readonly IndexOfAnyValues<char> s_asciiWhiteSpace = IndexOfAnyValues.Create("\t\n\v\f\r ");

(It's capable now of emitting many variations of these... it just chooses good names for the methods for these known sets, but will generate a similarly shaped method for any set it doesn't have a better strategy for.)

I suggest we put this issue on hold until we get some more experience with those implementations and how they fair. Once we're happy with them, we can effectively promote them up to MemoryExtensions per this issue, and the regex source generator / compiler can be changed to use those instead.

@GrabYourPitchforks
Copy link
Member

A nit - the names IndexOfAnyWhiteSpace and IndexOfAnyExceptWhiteSpace read very strangely to me. I know we're establishing a pattern around Any and AnyExcept, but is it possible to tweak these names to make them a little more human? Breaking the naming pattern in this instance might help improve readability and understandability.

I'm assuming the respective definitions are "return the index of the first whitespace char you see; or -1 if you find no whitespace chars or the string is empty" and "return the index of the first non-whitespace char you see; or -1 if the string contains only whitespace chars or is empty."

Some proposals:

  • IndexOfWhiteSpace and IndexOfNonWhiteSpace (remove the "any", since the term whitespace already implies a set of possible matching characters rather than a single possible char match)
  • IndexOfAnyWhiteSpace and IndexOfAnyNonWhiteSpace (keep the "any"; change the "except" to "non")

@danmoseley
Copy link
Member

The capitalization of space in WhiteSpace is jarring for me given we consistently write whitespace as a single word even in this very issue.

@stephentoub
Copy link
Member Author

The capitalization of space in WhiteSpace is jarring for me given we consistently write whitespace as a single word even in this very issue.

I don't disagree, but we have char.IsWhiteSpace, Rune.IsWhiteSpace, string.IsNullOrWhiteSpace, MemoryExtensions.IsWhiteSpace, ArgumentException.ThrowIfNullOrWhiteSpace, FromBase64TransformMode.IgnoreWhiteSpaces, ... outside of System.Xml and one enum value in regex, we've pretty consistently used "WhiteSpace" rather than "Whitespace" and shouldn't change now.

@bartonjs
Copy link
Member

It was contentious enough that Krzysztof wrote it in the table of correct spellings in Framework Design Guidelines 2nd edition (or maybe even first).

Pascal: WhiteSpace, Camel: whiteSpace, Not: Whitespace

Which is in opposition to Timestamp:

Pascal: Timestamp, Camel: timestamp, Not: TimeStamp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-approved API was approved in API review, it can be implemented area-System.Memory
Projects
None yet
Development

No branches or pull requests

6 participants