[API Proposal]: ISpanScannable, IUtf8SpanScannable #93339

Open
epeshk opened this issue Oct 11, 2023 · 3 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Runtime needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration
@epeshk
Contributor

epeshk commented Oct 11, 2023

Background and motivation

This is a successor to the IUtf8SpanParsable<TSelf> API proposal, where this scenario was already raised but deferred. (IUtf8SpanParsable was designed to stay in sync with the existing ISpanParsable.) It is proposed not as a replacement for IUtf8SpanParsable<TSelf>, but as an alternative for advanced scenarios.

Currently, the ISpanParsable<TSelf> and IUtf8SpanParsable<TSelf> interfaces can't be used to parse a value when the size of the serialized data is unknown. When a caller wants to deserialize text from a buffer, it must first scan the buffer up to the delimiter, and only then pass that slice to the Parse/TryParse methods. Thus, each byte/character is processed twice: once to find the size and once to parse the value.

Today this can be worked around with the static TryParse methods in the Utf8Parser class. There is also another API proposal to add an option that stops parsing after the first invalid character.

These approaches have some disadvantages:

  • Both the Utf8Parser static methods and the NumberStyles.AllowTrailingInvalidCharacters option are limited to BCL types. NumberStyles also does not cover non-numeric types such as TimeSpan, DateTime, DateTimeOffset, DateOnly, TimeOnly, ...
  • These APIs return bool, which may lead to quirks when parsing, e.g., floating-point values.

Utf8Parser.TryParse("1."u8, out double value, out int bytesConsumed) successfully parses the double value 1, and the caller must then perform additional, complicated checks to determine whether this is the final result or whether the TryParse call should be retried with more data (fetched from a Stream), such as Utf8Parser.TryParse("1.2"u8, ...).
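To illustrate the kind of extra check a caller has to do today, here is a minimal sketch. Utf8Parser.TryParse is the existing BCL API; the MightBeIncomplete helper is a hypothetical name introduced only for this example, and the rule it applies (the parse ran to the end of the buffer, so more data could change the result) is an assumption about what such a check might look like, not something the BCL provides.

```csharp
using System;
using System.Buffers.Text;

// Hypothetical helper (not a BCL API): if parsing consumed the whole buffer,
// the value may be a prefix of a longer token (e.g. "1." could continue as "1.2").
static bool MightBeIncomplete(ReadOnlySpan<byte> buffer, int bytesConsumed)
  => bytesConsumed == buffer.Length;

ReadOnlySpan<byte> chunk = "1."u8;
if (Utf8Parser.TryParse(chunk, out double value, out int bytesConsumed))
{
  // The bool result alone cannot tell these two situations apart.
  Console.WriteLine(MightBeIncomplete(chunk, bytesConsumed)
    ? $"Parsed {value}, but the token may continue; fetch more data and retry."
    : $"Parsed {value}; unconsumed input follows, so the result is final.");
}
```

With the proposed status-returning API, this bookkeeping would move into the parser itself instead of being repeated by every caller.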

.NET already has an OperationStatus enum that expresses this behavior through the OperationStatus.NeedMoreData value. For this purpose, however, it may be useful to extend the OperationStatus enum with a new value that distinguishes "data was parsed, but the parse may be retried with more data" from "data was not parsed, but may be parsed with more data". For example, Utf8Parser.TryParse("1.e"u8, ...) returns false, but Utf8Parser.TryParse("1e7"u8, ...) succeeds.

If adding a new value to this enum is a breaking change, the new method could instead return bool with an OperationStatus out parameter, or return a new enum with similar semantics.

/// <summary>
/// The input is partially processed, up to the last valid chunk of the input that could be consumed.
/// The caller can stitch the remaining unprocessed input with more data, slice the buffers appropriately, and retry.
/// </summary>
NeedMoreData,
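The distinction between the two "retry" cases can be sketched with a toy scanner for unsigned decimal integers. This is only an illustration under stated assumptions: ScanUInt32 is a hypothetical function, not part of the proposal or the BCL, and its mayContinue out parameter stands in for the proposed PartiallyDone status, which does not exist in OperationStatus today.

```csharp
using System;
using System.Buffers;

// Hypothetical scanner: parses leading decimal digits without knowing the token length.
// `mayContinue` plays the role of the proposed PartiallyDone status.
static OperationStatus ScanUInt32(
  ReadOnlySpan<byte> utf8Text, out int bytesConsumed, out uint result, out bool mayContinue)
{
  bytesConsumed = 0;
  result = 0;
  mayContinue = false;

  if (utf8Text.IsEmpty)
    return OperationStatus.NeedMoreData; // nothing parsed yet; more data may help

  int i = 0;
  ulong acc = 0;
  while (i < utf8Text.Length && (uint)(utf8Text[i] - '0') <= 9)
  {
    acc = acc * 10 + (uint)(utf8Text[i] - '0');
    if (acc > uint.MaxValue)
      return OperationStatus.InvalidData; // value does not fit in uint
    i++;
  }

  if (i == 0)
    return OperationStatus.InvalidData; // first byte is not a digit

  bytesConsumed = i;
  result = (uint)acc;
  mayContinue = i == utf8Text.Length; // digits ran to the end of the buffer
  return OperationStatus.Done;
}

var status = ScanUInt32("123"u8, out int consumed, out uint number, out bool more);
Console.WriteLine($"{status}: value={number}, consumed={consumed}, mayContinue={more}");
```

Each byte is inspected once: the scanner finds the end of the token and computes the value in the same pass, and the caller learns from mayContinue whether a later chunk could extend the number.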

API Proposal

namespace System.Buffers
{
  public interface ISpanScannable<TSelf>
    where TSelf : ISpanScannable<TSelf>?
  {
    static abstract OperationStatus TryParse(
      ReadOnlySpan<char> text,
      IFormatProvider? provider,
      out int bytesConsumed,
      out TSelf result);
  }

  public interface IUtf8SpanScannable<TSelf>
    where TSelf : IUtf8SpanScannable<TSelf>?
  {
    static abstract OperationStatus TryParse(
      ReadOnlySpan<byte> utf8Text,
      IFormatProvider? provider,
      out int bytesConsumed,
      out TSelf result);
  }
}

namespace System.Buffers
{
  public enum OperationStatus
  {
    ...,

    /// <summary>
    /// The entire input is processed, up to the end of the buffer.
    /// However, the caller can stitch the remaining unprocessed input with more data, retry, and get a different result.
    /// </summary>
    PartiallyDone
  }
}

API Usage

  
class Utf8Reader(Stream stream)
{
  byte[] buffer = new byte[16];
  int position = 0; // start of the unconsumed data
  int end = 0;      // end of the valid data in the buffer

  public TValue Read<TValue>() where TValue : IUtf8SpanScannable<TValue>
  {
    SkipDelimiters();
    while (true)
    {
      // Parse only the valid, unconsumed part of the buffer.
      OperationStatus status = TValue.TryParse(
        buffer.AsSpan(position, end - position), null, out int bytesConsumed, out TValue value);

      switch (status)
      {
        case OperationStatus.Done:
          position += bytesConsumed;
          return value;
        case OperationStatus.PartiallyDone:
          // A value was parsed, but more input could change it: retry if more data arrives.
          if (FetchMoreData())
            continue;
          position += bytesConsumed;
          return value;
        case OperationStatus.NeedMoreData:
          if (FetchMoreData())
            continue;
          throw new FormatException();
        case OperationStatus.InvalidData:
        default:
          throw new FormatException();
      }
    }
  }

  bool FetchMoreData()
  {
    // Move the unconsumed bytes to the start of the buffer.
    var unconsumedLength = end - position;
    buffer.AsSpan(position, unconsumedLength).CopyTo(buffer);
    position = 0;
    end = unconsumedLength;
    if (end == buffer.Length)
      GrowBuffer();
    var bytesRead = stream.Read(buffer.AsSpan(end));
    if (bytesRead <= 0)
      return false;
    end += bytesRead;
    return true;
  }

  void SkipDelimiters() { ... }
  void GrowBuffer() { ... }
}

Alternative Designs

Without OperationStatus: on failure, the caller must analyze the buffer data itself.

namespace System
{
  public interface ISpanScannable<TSelf>
    where TSelf : ISpanScannable<TSelf>?
  {
    static abstract bool TryParse(
      ReadOnlySpan<char> text,
      IFormatProvider? provider,
      out int bytesConsumed,
      out TSelf result);
  }

  public interface IUtf8SpanScannable<TSelf>
    where TSelf : IUtf8SpanScannable<TSelf>?
  {
    static abstract bool TryParse(
      ReadOnlySpan<byte> utf8Text,
      IFormatProvider? provider,
      out int bytesConsumed,
      out TSelf result);
  }
}

Risks

  1. This API is only for advanced scenarios and may not be useful for regular users; TryParse may be too complex to implement correctly.

  2. The nullability of the out TSelf result parameter may require a new nullability attribute. This is not needed for the bool-returning alternative.

sealed class MaybeNullWhenAttribute<TEnum>(params TEnum[] returnValues) : Attribute where TEnum : Enum
{
  public TEnum[] ReturnValues { get; } = returnValues;
}
  3. Alternative approaches to parsing primitive values, e.g. with vectorization, may be faster than relying on IUtf8SpanScannable<TSelf>.
@epeshk epeshk added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Oct 11, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 11, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Oct 11, 2023
@epeshk
Contributor Author

epeshk commented Oct 11, 2023

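using System;
using System.Buffers.Text;
using System.Globalization;
using System.Linq;
using System.Text;
using BenchmarkDotNet.Attributes;

public class ParsableVsScannable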
{
  public ParsableVsScannable()
  {
    if (Parsable() != Scannable())
      throw new Exception();
  }

  readonly static byte[] utf8Text =
    Encoding.UTF8.GetBytes(string.Join(" ", Enumerable.Range(100000, 100000).Select(x => x.ToString())) + " ");
  
  [Benchmark]
  public int Parsable()
  {
    var span = utf8Text.AsSpan();
    var xor = 0;

    do
    {
      var delim = span.IndexOf((byte)' ');
      if (int.TryParse(span.Slice(0, delim), CultureInfo.InvariantCulture, out var x))
        xor ^= x;
      span = span.Slice(delim + 1);
    } while (!span.IsEmpty);

    return xor;
  }
  
  [Benchmark]
  public int Scannable()
  {
    var span = utf8Text.AsSpan();
    var xor = 0;

    do
    {
      if (Utf8Parser.TryParse(span, out int x, out var bytesConsumed))
      {
        xor ^= x;
        span = span.Slice(bytesConsumed + 1);
      }
    } while (!span.IsEmpty);

    return xor;
  }
}
| Method    | Mean     | Error   | StdDev  |
|---------- |---------:|--------:|--------:|
| Parsable  | 854.2 us | 2.17 us | 1.82 us |
| Scannable | 427.0 us | 1.65 us | 1.46 us |

@ghost

ghost commented Oct 12, 2023

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.


@MihaZupan MihaZupan removed the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 12, 2023
@tannergooding
Member

Would appreciate some additional input from @stephentoub and @GrabYourPitchforks here.

@jeffhandley jeffhandley added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Jul 19, 2024
@jeffhandley jeffhandley added this to the Future milestone Jul 19, 2024