[API Proposal]: ISpanScannable, IUtf8SpanScannable #93339

epeshk · 2023-10-11T12:51:38Z

Background and motivation

This is a follow-up to the IUtf8SpanParsable<TSelf> API proposal, where this scenario was already raised but deferred (IUtf8SpanParsable was designed to stay in sync with the existing ISpanParsable). This proposal is not a replacement for IUtf8SpanParsable<TSelf>, but an alternative for advanced scenarios.

Currently, the ISpanParsable<TSelf> and IUtf8SpanParsable<TSelf> interfaces can't be used to parse a value when the size of the serialized data is unknown. When a caller wants to deserialize text from a buffer, it must first scan the buffer up to the delimiter and only then pass that slice to the Parse/TryParse methods. Thus, each byte/character is processed twice: once to find the size and once to parse the value.

Today, this can be worked around with the static TryParse methods in the Utf8Parser class, which report the number of bytes consumed. There is also another API proposal to add an option to stop parsing after the first invalid character.

These approaches have some disadvantages:

Both the Utf8Parser static methods and the NumberStyles.AllowTrailingInvalidCharacters option are limited to BCL types. NumberStyles also does not cover non-numeric types such as TimeSpan, DateTime, DateTimeOffset, DateOnly, TimeOnly, ...
These APIs return bool, which may lead to quirks when parsing, for example, floating-point values.

Utf8Parser.TryParse("1."u8, out double value, out int bytesConsumed) successfully parses the double value 1, and the caller must perform additional, non-trivial checks to determine whether this is the final result or whether TryParse should be retried with more data (fetched from a Stream), such as Utf8Parser.TryParse("1.2"u8, ...).

.NET already has an OperationStatus enum whose OperationStatus.NeedMoreData value expresses part of this behavior. For this purpose, however, it may be useful to extend OperationStatus with a new value that distinguishes "data was parsed, but the result may change if retried with more data" from "data was not parsed, but may parse with more data". For example, Utf8Parser.TryParse("1.e"u8, ...) returns false, but Utf8Parser.TryParse("1e7"u8, ...) succeeds.

If adding a new value to this enum is a breaking change, the new method could instead return bool with an OperationStatus out parameter, or return a new enum with similar semantics. For reference, the existing NeedMoreData value is documented as:

/// <summary>
/// The input is partially processed, up to the last valid chunk of the input that could be consumed.
/// The caller can stitch the remaining unprocessed input with more data, slice the buffers appropriately, and retry.
/// </summary>
NeedMoreData,
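
The Utf8Parser behavior described above can be checked with a small console program. This is only a sketch: the inputs are the ones quoted in this issue, and the results are printed rather than assumed, since the exact bytesConsumed values depend on the runtime's implementation.

```csharp
using System;
using System.Buffers.Text;
using System.Text;

// Inputs quoted in the discussion above.
string[] inputs = { "1.", "1.2", "1.e", "1e7" };

foreach (string input in inputs)
{
    byte[] utf8 = Encoding.UTF8.GetBytes(input);

    // The bool result alone cannot tell the caller whether appending more
    // bytes (e.g. the next chunk of a stream) would change the parsed value.
    bool ok = Utf8Parser.TryParse(utf8, out double value, out int bytesConsumed);

    Console.WriteLine(
        $"\"{input}\": success={ok}, value={value}, " +
        $"bytesConsumed={bytesConsumed}, length={utf8.Length}");
}
```

Comparing bytesConsumed with the input length is the kind of extra check a caller currently has to make to guess whether a result is final.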

API Proposal

namespace System.Buffers
{
  public interface ISpanScannable<TSelf>
    where TSelf : ISpanScannable<TSelf>?
  {
    static abstract OperationStatus TryParse(
      ReadOnlySpan<char> text,
      IFormatProvider? provider,
      out int bytesConsumed,
      out TSelf result);
  }

  public interface IUtf8SpanScannable<TSelf>
    where TSelf : IUtf8SpanScannable<TSelf>?
  {
    static abstract OperationStatus TryParse(
      ReadOnlySpan<byte> utf8Text,
      IFormatProvider? provider,
      out int bytesConsumed,
      out TSelf result);
  }
}

namespace System.Buffers
{
  public enum OperationStatus
  {
    ...,

    /// <summary>
    /// The entire input is processed, up to the end of the buffer.
    /// However, the caller can stitch the input with more data, retry, and get a different result.
    /// </summary>
    PartiallyDone
  }
}
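
To make the intended semantics concrete, here is a minimal sketch of a type implementing the UTF-8 interface. Everything in it is an assumption for illustration: ScanStatus is a hypothetical stand-in for OperationStatus extended with PartiallyDone (which does not exist today), the interface is declared locally with the proposed shape, and DecimalUInt32 is an invented example type that accepts only unsigned decimal digits.

```csharp
using System;

// Usage: a trailing delimiter makes the value final; a bare "123" may continue.
ScanStatus complete = DecimalUInt32.TryParse("123 "u8, null, out int consumedA, out DecimalUInt32 a);
ScanStatus open = DecimalUInt32.TryParse("123"u8, null, out int consumedB, out DecimalUInt32 b);
Console.WriteLine($"{complete} {consumedA} {a.Value}"); // Done 3 123
Console.WriteLine($"{open} {consumedB} {b.Value}");     // PartiallyDone 3 123

// Hypothetical stand-in for OperationStatus plus the proposed PartiallyDone value.
public enum ScanStatus { Done, PartiallyDone, NeedMoreData, InvalidData }

// Local copy of the proposed interface shape, returning the stand-in enum.
public interface IUtf8SpanScannable<TSelf> where TSelf : IUtf8SpanScannable<TSelf>?
{
    static abstract ScanStatus TryParse(
        ReadOnlySpan<byte> utf8Text, IFormatProvider? provider,
        out int bytesConsumed, out TSelf result);
}

// Hypothetical example type: an unsigned decimal integer.
public readonly record struct DecimalUInt32(uint Value) : IUtf8SpanScannable<DecimalUInt32>
{
    public static ScanStatus TryParse(
        ReadOnlySpan<byte> utf8Text, IFormatProvider? provider,
        out int bytesConsumed, out DecimalUInt32 result)
    {
        bytesConsumed = 0;
        result = default;

        if (utf8Text.IsEmpty)
            return ScanStatus.NeedMoreData;      // nothing to look at yet

        ulong value = 0;
        int i = 0;
        while (i < utf8Text.Length && (uint)(utf8Text[i] - '0') <= 9)
        {
            value = value * 10 + (uint)(utf8Text[i] - '0');
            if (value > uint.MaxValue)
                return ScanStatus.InvalidData;   // does not fit in a uint
            i++;
        }

        if (i == 0)
            return ScanStatus.InvalidData;       // first byte is not a digit

        bytesConsumed = i;
        result = new DecimalUInt32((uint)value);

        // Digits ran to the end of the buffer: the value is usable, but more
        // digits may follow in the next chunk, so the result is not final.
        return i == utf8Text.Length ? ScanStatus.PartiallyDone : ScanStatus.Done;
    }
}
```

In this sketch the parser stops at the first non-digit byte without needing the caller to locate the delimiter first, which is the single-pass behavior the proposal is after.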

API Usage

  
class Utf8Reader(Stream stream)
{
  byte[] buffer = new byte[16];
  int position = 0; // start of the unconsumed data
  int end = 0;      // end of the valid data in the buffer

  public TValue Read<TValue>() where TValue : IUtf8SpanScannable<TValue>
  {
    SkipDelimiters();
    while (true)
    {
      // Only the valid, unconsumed part of the buffer is passed to the parser.
      OperationStatus status = TValue.TryParse(
        buffer.AsSpan(position, end - position), null, out int bytesConsumed, out TValue value);

      switch (status)
      {
        case OperationStatus.Done:
          position += bytesConsumed;
          return value;
        case OperationStatus.PartiallyDone:
          // A value was parsed, but more input could change it (e.g. "1." -> "1.2").
          if (FetchMoreData())
            continue;
          // End of stream: accept the value parsed so far.
          position += bytesConsumed;
          return value;
        case OperationStatus.NeedMoreData:
          if (FetchMoreData())
            continue;
          throw new FormatException();
        case OperationStatus.InvalidData:
        default:
          throw new FormatException();
      }
    }
  }

  bool FetchMoreData()
  {
    // Move the unconsumed bytes to the start of the buffer.
    var unconsumedLength = end - position;
    buffer.AsSpan(position, unconsumedLength).CopyTo(buffer);
    position = 0;
    end = unconsumedLength;
    if (end == buffer.Length)
      GrowBuffer();
    var bytesRead = stream.Read(buffer.AsSpan(end));
    if (bytesRead <= 0)
      return false;
    end += bytesRead;
    return true;
  }

  void SkipDelimiters() { ... }
  void GrowBuffer() { ... }
}

Alternative Designs

Without OperationStatus: on failure, the caller must analyze the buffer contents itself.

namespace System
{
  public interface ISpanScannable<TSelf>
    where TSelf : ISpanScannable<TSelf>?
  {
    static abstract bool TryParse(
      ReadOnlySpan<char> text,
      IFormatProvider? provider,
      out int bytesConsumed,
      out TSelf result);
  }

  public interface IUtf8SpanScannable<TSelf>
    where TSelf : IUtf8SpanScannable<TSelf>?
  {
    static abstract bool TryParse(
      ReadOnlySpan<byte> utf8Text,
      IFormatProvider? provider,
      out int bytesConsumed,
      out TSelf result);
  }
}

Risks

This API targets advanced scenarios only and may not be useful for regular users; TryParse may be too complex to implement correctly.
Nullability of the out TSelf result parameter may require a new nullability attribute, such as the one sketched below. This is not needed for the bool-returning alternative.

sealed class MaybeNullWhenAttribute<TEnum>(params TEnum[] returnValues) : Attribute where TEnum : Enum
{
  public TEnum[] ReturnValues { get; } = returnValues;
}

Alternative approaches to parsing primitive values, e.g. vectorization, may be faster than relying on IUtf8SpanScannable<TSelf>.

epeshk · 2023-10-11T15:21:55Z

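using System;
using System.Buffers.Text;
using System.Globalization;
using System.Linq;
using System.Text;
using BenchmarkDotNet.Attributes;

public class ParsableVsScannable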
{
  public ParsableVsScannable()
  {
    if (Parsable() != Scannable())
      throw new Exception();
  }

  readonly static byte[] utf8Text =
    Encoding.UTF8.GetBytes(string.Join(" ", Enumerable.Range(100000, 100000).Select(x => x.ToString())) + " ");
  
  [Benchmark]
  public int Parsable()
  {
    var span = utf8Text.AsSpan();
    var xor = 0;

    do
    {
      var delim = span.IndexOf((byte)' ');
      if (int.TryParse(span.Slice(0, delim), CultureInfo.InvariantCulture, out var x))
        xor ^= x;
      span = span.Slice(delim + 1);
    } while (!span.IsEmpty);

    return xor;
  }
  
  [Benchmark]
  public int Scannable()
  {
    var span = utf8Text.AsSpan();
    var xor = 0;

    do
    {
      if (Utf8Parser.TryParse(span, out int x, out var bytesConsumed))
      {
        xor ^= x;
        span = span.Slice(bytesConsumed + 1);
      }
    } while (!span.IsEmpty);

    return xor;
  }
}

| Method    | Mean     | Error   | StdDev  |
|---------- |---------:|--------:|--------:|
| Parsable  | 854.2 us | 2.17 us | 1.82 us |
| Scannable | 427.0 us | 1.65 us | 1.46 us |

ghost · 2023-10-12T16:38:56Z

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Author:	epeshk
Assignees:	-
Labels:	`api-suggestion`, `area-System.Runtime`, `untriaged`, `needs-area-label`
Milestone:	-

tannergooding · 2023-10-24T16:12:44Z

Would appreciate some additional input from @stephentoub and @GrabYourPitchforks here.

epeshk added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Oct 11, 2023

dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 11, 2023

ghost added the untriaged New issue has not been triaged by the area owner label Oct 11, 2023

jeffschwMSFT added the area-System.Runtime label Oct 12, 2023

MihaZupan removed the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 12, 2023

jeffhandley added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Jul 19, 2024

jeffhandley added this to the Future milestone Jul 19, 2024
