Expose IUtf8SpanParsable and implement it on the primitive numeric types #86875

tannergooding · 2023-05-29T18:49:11Z

This makes progress towards #81500

Still not covered are:

BigInteger
Boolean
Complex
Char (explicit API surface to avoid boxing)
DateOnly
DateTime
DateTimeOffset
Enum
Guid
IPAddress
IPNetwork
TimeOnly
TimeSpan
Version

dotnet-issue-labeler · 2023-05-29T18:49:18Z

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

ghost · 2023-05-29T18:49:26Z

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	tannergooding
Assignees:	tannergooding
Labels:	`area-System.Numerics`, `new-api-needs-documentation`
Milestone:	-

tannergooding · 2023-05-30T14:20:57Z

summary:
better: 6, geomean: 1.067
total diff: 6

No Slower results for the provided threshold = 2% and noise filter = 0.3 ns.

| Faster                                                           | base/diff | Base Median (ns) | Diff Median (ns) | Modality|
| ---------------------------------------------------------------- | ---------:| ----------------:| ----------------:| --------:|
| System.Tests.Perf_Int64.Parse(value: "9223372036854775807")      |      1.09 |            12.27 |            11.30 |         |
| System.Tests.Perf_Int64.TryParse(value: "-9223372036854775808")  |      1.08 |            12.40 |            11.45 |         |
| System.Tests.Perf_UInt64.Parse(value: "18446744073709551615")    |      1.08 |            12.62 |            11.73 |         |
| System.Tests.Perf_UInt64.TryParse(value: "18446744073709551615") |      1.07 |            12.04 |            11.23 |         |
| System.Tests.Perf_Int64.TryParse(value: "9223372036854775807")   |      1.05 |            11.80 |            11.20 |         |
| System.Tests.Perf_Int64.Parse(value: "-9223372036854775808")     |      1.03 |            12.20 |            11.83 |         |

tannergooding · 2023-05-30T17:10:59Z

WASM Interpreter reports

No differences found between the benchmark results with threshold 0%.

tannergooding · 2023-05-30T19:27:34Z

CC. @stephentoub

src/libraries/System.Private.CoreLib/src/System/Globalization/CompareInfo.Utf8.cs

src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.Utf8.cs

src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.Globalization.Utf8.cs

src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.Trim.Utf8.cs

src/libraries/System.Private.CoreLib/src/System/Numerics/INumberBase.cs

src/libraries/System.Private.CoreLib/src/System/Globalization/CompareInfo.Utf8.cs

src/libraries/System.Private.CoreLib/src/System/Number.Parsing.cs

src/libraries/System.Private.CoreLib/src/System/Decimal.DecCalc.cs

GrabYourPitchforks · 2023-07-10T20:44:15Z

src/libraries/System.Private.CoreLib/src/System/Enum.cs

@@ -1023,7 +1023,7 @@ static bool TryParseRareTypes(RuntimeType rt, ReadOnlySpan<char> value, bool ign

                if (throwOnFailure)
                {
-                    Number.ThrowOverflowException(Type.GetTypeCode(typeof(TUnderlying)));
+                    ThrowHelper.ThrowOverflowException();


Line 959 above was modified to read Number.ThrowOverflowException<TUnderlying>();, but this line was modified to remove the type code / name entirely. Was this intentional?

The underlying type here only ever hits the "rare types" (float, double, nint, nuint, or char). None of these were handled by Number.ThrowOverflowException(TypeCode), so the old call would have triggered asserts in debug builds (since they aren't decimal) and used an incorrect message string otherwise.

Directly throwing OverflowException without a string is both simpler and more accurate than what we had before.

GrabYourPitchforks · 2023-07-10T23:07:08Z

src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.cs

+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        internal static bool Vector128OrdinalIgnoreCaseAscii(Vector128<byte> vec1, Vector128<byte> vec2)
+        {
+            // ASSUMPTION: Caller has validated that input values are ASCII.


A possibly more optimized (completely untested!) implementation, based solely on codegen size:

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static bool Vector128OrdinalIgnoreCaseAscii(Vector128<byte> vec1, Vector128<byte> vec2)
    {
        // ASSUMPTION: Caller has validated that input values are ASCII.

        Vector128<sbyte> vector0x20 = Vector128.Create((sbyte)0x20);

        // Convert all characters to lowercase.
        // Some non-letter chars will also be changed; we'll deal with them later.
        Vector128<sbyte> asLower = vec1.AsSByte() | vector0x20;

        // Create a vector where all letter characters [a-z] are normalized to 0x20
        // and all non-letter characters are normalized to 0x00.
        Vector128<sbyte> letterChars = Vector128.GreaterThan(asLower + Vector128.Create((sbyte)(0x7F - 'z')), Vector128.Create((sbyte)(0x7F - 26))) & vector0x20;

        // Compute the ones-complement diff between vec1 and vec2.
        Vector128<sbyte> onesCompDiff = vec1.AsSByte() ^ vec2.AsSByte();

        // There must be no bits set in 'onesCompDiff' that aren't also set in 'letterChars'.
        // Vector128.AndNot(left, right) computes left & ~right, so the diff is the left operand.
        return Vector128.AndNot(onesCompDiff, letterChars) == default;
    }

Not going to worry about optimizing further at this point. It's more important that we get the feature in for .NET 8.

As is, this generally matches the algorithm used by the UTF-16 path and is correct.

An interested party can always submit this as an optimization after the main support goes in.
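For anyone picking up that optimization, a scalar version of the same per-byte rule may be useful as a reference when testing a vectorized implementation. This is only a sketch: `AsciiCaseSketch` and `EqualsIgnoreCaseAscii` are hypothetical names that do not appear in the PR, and the code assumes both inputs are already known to be ASCII.

```csharp
using System;

static class AsciiCaseSketch
{
    // Hypothetical helper, not part of the PR: scalar form of the per-byte rule.
    // ASSUMPTION: both bytes are ASCII (< 0x80), as in the vectorized code.
    public static bool EqualsIgnoreCaseAscii(byte a, byte b)
    {
        // Setting bit 0x20 maps 'A'-'Z' onto 'a'-'z'; some non-letters change too.
        int lower = a | 0x20;

        // Only bytes that land in ['a', 'z'] may differ, and only in bit 0x20.
        bool isLetter = (uint)(lower - 'a') <= (uint)('z' - 'a');
        int allowedDiff = isLetter ? 0x20 : 0x00;

        // Any differing bit outside the allowed mask means the bytes do not match.
        return ((a ^ b) & ~allowedDiff) == 0;
    }

    // Span overload: lengths must match, then every byte pair must match.
    public static bool EqualsIgnoreCaseAscii(ReadOnlySpan<byte> left, ReadOnlySpan<byte> right)
    {
        if (left.Length != right.Length)
        {
            return false;
        }

        for (int i = 0; i < left.Length; i++)
        {
            if (!EqualsIgnoreCaseAscii(left[i], right[i]))
            {
                return false;
            }
        }

        return true;
    }
}
```

The pair '@' (0x40) and '`' (0x60) differs only in bit 0x20 but is not a letter pair, so it is a good edge case for comparing a vectorized version against this one.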

GrabYourPitchforks · 2023-07-10T23:17:23Z

src/libraries/System.Private.CoreLib/src/System/Numerics/INumberBase.cs

+                utf16Text = utf16TextArray.AsSpan(0, textMaxCharCount);
+            }
+
+            int utf16TextLength = Encoding.UTF8.GetChars(utf8Text, utf16Text);


Potential DoS vector here. This logic cannot safely coexist with approved API #87171. It's reasonable that somebody might write code like this:

    while (!utf8Span.IsEmpty)
    {
        TNumber parsed = TNumber.Parse(utf8Span, NumberStyles.AllowTrailingInvalidCharacters, out int bytesConsumed);
        UseValue(parsed);
        utf8Span = utf8Span.Slice(bytesConsumed).TrimStart((byte)','); // or whatever the expected delimiter char is
    }

The code above appears to have O(n) runtime, where n is the length (in bytes) of the input span. However, since there's a transcoding step taking place under the covers, and since the transcoding step consumes the entire remainder of the input span on every call, the actual runtime is O(n^2). Since ASP.NET allows a 4 MB request by default, n can be roughly 4 million, which means a loop like this can result in <some constant factor> * 16 trillion units of work being performed.

(Same comment elsewhere in this file where similar code appears.)

Fixed to use Utf8.ToUtf16(utf8Text, utf16Text, out _, out int utf16TextLength, replaceInvalidSequences: false), as with other paths. It throws FormatException on Parse and returns false for TryParse, indicating the input was not in a correct format.
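To make the quadratic concern in this thread concrete, here is a small cost model. It is a sketch and not a measurement: `TranscodeCostSketch` and `BytesTranscoded` are hypothetical names, and the "bounded" mode models an idealized implementation that only touches the bytes it consumes, which is an assumption rather than a description of what the PR does.

```csharp
using System;

static class TranscodeCostSketch
{
    // Counts how many bytes would be transcoded in total when a loop parses
    // `valueCount` delimited values of `bytesPerValue` bytes each (delimiter included).
    // wholeRemainder == true models transcoding everything left in the span on every call;
    // wholeRemainder == false models touching only the bytes that are consumed.
    public static long BytesTranscoded(int valueCount, int bytesPerValue, bool wholeRemainder)
    {
        long remaining = (long)valueCount * bytesPerValue;
        long total = 0;

        while (remaining > 0)
        {
            total += wholeRemainder ? remaining : bytesPerValue;
            remaining -= bytesPerValue;
        }

        return total;
    }
}
```

For 1,000 values of 2 bytes each (2,000 bytes of input), the whole-remainder model transcodes 1,001,000 bytes in total while the bounded model transcodes 2,000, and the gap grows with the square of the input length.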

src/libraries/System.Private.CoreLib/src/System/Numerics/INumberBase.cs

GrabYourPitchforks · 2023-07-10T23:30:59Z

src/libraries/System.Private.CoreLib/src/System/Numerics/INumberBase.cs

+                utf16Text = utf16TextArray.AsSpan(0, textMaxCharCount);
+            }
+
+            int utf16TextLength = Encoding.UTF8.GetChars(utf8Text, utf16Text);


Potential correctness issue here. What behavior should this routine have when the input is not well-formed UTF-8? The current behavior here is: lossily replace invalid UTF-8 sequences with well-formed UTF-16 replacement characters, then call the UTF-16 parse routine. Does this change the desired correctness of the parse routine? What about if the replacement character has special meaning to the specified provider?

Fixed to use Utf8.ToUtf16(utf8Text, utf16Text, out _, out int utf16TextLength, replaceInvalidSequences: false), as with other paths. It throws FormatException on Parse and returns false for TryParse, indicating the input was not in a correct format.

Any developer who needs different behavior can and should override it in their derived type.
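The difference between the two transcoding calls discussed in this thread can be sketched as follows. `StrictTranscodeSketch`, `TryTranscode`, and `LossyTranscode` are hypothetical names used only for illustration; the calls to `Utf8.ToUtf16` and `Encoding.UTF8` are the public APIs, but how the PR wires the result into `Parse` and `TryParse` is not shown here.

```csharp
using System;
using System.Buffers;
using System.Text;
using System.Text.Unicode;

static class StrictTranscodeSketch
{
    // Hypothetical helper: reports failure instead of substituting U+FFFD
    // when the input is not well-formed UTF-8. It also returns false when
    // the destination buffer is too small.
    public static bool TryTranscode(ReadOnlySpan<byte> utf8Text, Span<char> utf16Text, out int charsWritten)
    {
        OperationStatus status = Utf8.ToUtf16(
            utf8Text, utf16Text, out _, out charsWritten, replaceInvalidSequences: false);

        return status == OperationStatus.Done;
    }

    // The lossy behavior the review comment warns about: ill-formed bytes
    // are silently replaced with U+FFFD before any parsing happens.
    public static string LossyTranscode(ReadOnlySpan<byte> utf8Text)
        => Encoding.UTF8.GetString(utf8Text);
}
```

With the bytes 0x31 0xFF, the strict helper returns false, while the lossy path produces the two-character string "1\uFFFD" and hands it to the UTF-16 parser.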

GrabYourPitchforks · 2023-07-11T00:01:57Z

src/libraries/System.Private.CoreLib/src/System/Globalization/CompareInfo.Utf8.cs

+
+            if (sourceStatus != OperationStatus.Done)
+            {
+                return false;


This gets a bit complicated. The code as-is definitively returns "not a prefix match!" - even though the inputs are invalid to the point that the question is nonsensical. It's a bit like asking "is 'dog' less than 'cat'?" It's not true and it's not false. It's just... huh?

Throwing a wrench into this even more is that ICU actually does have specialized handling for invalid UTF-16 sequences. (Basically, it treats them as opaque chars that don't match anything except for themselves.) So this means that, e.g., the sequence "\uD800\uD801".StartsWith("\uD800") would return true, even though both sequences are ill-formed. But in the logic you have here, "\xFF\xFE"u8.CultureAwareStartsWith("\xFF"u8) would return false.

Recommendation: Invalid UTF-8 sequences should throw (not return false) or should utilize existing ICU handling for invalid UTF-16 sequences.

The way I worked around this in the UTF-8 prototype was to use something akin to WTF-16 for the comparison. Basically, every time an invalid UTF-8 byte is encountered, append an invalid UTF-16 code point to the string we're building. Any invalid UTF-8 byte 0x?? would map to the char U+DF??. So for instance, the invalid byte 0x80 would map to the invalid char U+DF80, the invalid byte 0xC0 would map to the invalid char U+DFC0, etc. The end result is that you're building up something that is mostly UTF-16, with the invalid sequences specifically chosen not to conflict with one another, which allows ICU to handle the input string appropriately.

This does not have the same algorithmic complexity constraints that I call out in IUtf8SpanParsable, since ICU's implementation for culture-aware prefix matching is O(needle.Length + haystack.Length) with no early-exit optimization. It's lamentable, but it is what it is. So that means the transcoding logic is upping some constant factor but isn't otherwise changing the algorithmic complexity itself.

(NLS, I believe, does have an early exit consideration. But that should be enough of an edge case that I'm willing to hold my nose for any algorithmic complexity changes we introduce.)
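The byte-to-surrogate mapping described in this comment can be sketched roughly as below. This is an illustration and not the prototype's actual code: `InvalidUtf8MappingSketch` and `ToUtf16PreservingInvalidBytes` are hypothetical names, and mapping every byte of an ill-formed or truncated run individually is an assumption about how the scheme treats multi-byte invalid sequences.

```csharp
using System;
using System.Buffers;
using System.Text;

static class InvalidUtf8MappingSketch
{
    // Hypothetical helper: well-formed UTF-8 is transcoded normally, while each
    // byte 0x?? of an ill-formed sequence becomes the lone low surrogate U+DF??.
    public static string ToUtf16PreservingInvalidBytes(ReadOnlySpan<byte> utf8)
    {
        var builder = new StringBuilder(utf8.Length);
        Span<char> scratch = stackalloc char[2];

        while (!utf8.IsEmpty)
        {
            OperationStatus status = Rune.DecodeFromUtf8(utf8, out Rune rune, out int bytesConsumed);

            if (status == OperationStatus.Done)
            {
                // Well-formed scalar value: append its one or two UTF-16 code units.
                int written = rune.EncodeToUtf16(scratch);
                for (int i = 0; i < written; i++)
                {
                    builder.Append(scratch[i]);
                }
            }
            else
            {
                // Ill-formed or truncated sequence: keep each byte distinguishable.
                bytesConsumed = Math.Max(bytesConsumed, 1); // defensive: always make progress
                foreach (byte b in utf8.Slice(0, bytesConsumed))
                {
                    builder.Append((char)(0xDF00 | b));
                }
            }

            utf8 = utf8.Slice(bytesConsumed);
        }

        return builder.ToString();
    }
}
```

For the bytes 0x61 0x80 0xC0 this yields 'a', U+DF80, U+DFC0, matching the examples in the comment, so two different invalid bytes never collapse into the same char the way they would with U+FFFD replacement.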

EgorBo · 2023-07-18T16:41:22Z

Improvements:

[Perf] Linux/x64: 3 Improvements on 7/12/2023 4:12:15 PM perf-autofiling-issues#19874
[Perf] Windows/x64: 3 Improvements on 7/12/2023 7:15:00 AM perf-autofiling-issues#19879 (Perf_Byte)

markples · 2023-07-19T00:12:00Z

src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.Utf8.cs

+                byteOffset += 4;
+                length -= 4;
+            }
+
+            Debug.Assert(length == 0);


I am quite confused as to how testing was good before and is failing now, but this snippet looks suspicious. The length was 1-3, and byteOffset (though not used after this point) would also be increased by 1-3 (except for the 2 sometimes already added above).

tannergooding added 6 commits May 26, 2023 13:50

Expose IUtf8SpanParsable

f1d305b

Have INumberBase implement IUtf8SpanParsable

9b03507

Deduplicate some floating-point parsing logic

ca67504

Refactoring the primitive parsing logic to support UTF-8

c1f35f3

Updating the primitive numeric types to include UTF-8 parsing support

fe438ca

Adding tests covering the new UTF-8 parsing support

01066c8

dotnet-issue-labeler bot added area-System.Numerics new-api-needs-documentation labels May 29, 2023

ghost assigned tannergooding May 29, 2023

tannergooding added 3 commits May 29, 2023 11:54

Ensure that tests don't try to capture a span in lambda

ba28bab

Ensure that MatchChars does the right thing

0fdcd4e

Account for the switch from string to ROSpan<TChar> for currSymbol

9e1fe3e

This was referenced May 30, 2023

Tracking issue for CI build timeouts #76454

Closed

Failed USB connection via port 54050, error 61, in tvOS arm64 Release AllSubsets_Mono #82637

Open

tannergooding added 2 commits May 29, 2023 17:27

Ensure EqualsIgnoreCaseUtf8_Scalar handles the remaining elements correctly

6c3dfda

Merge remote-tracking branch 'dotnet/main' into utf8-parsing

0fc41ca

build-analysis bot mentioned this pull request May 30, 2023

Assert failure in GC/API/NoGCRegion/Callback_Svr test #86612

Closed

runfoapp bot mentioned this pull request May 30, 2023

Infra improvements for Helix #68176

Closed