SequenceReader<T>: Round 2 #69

terrajobst · 2018-10-18T21:19:04Z

No description provided.

2018/System.Buffers.SequenceReader/README.md

benaadams · 2018-10-18T22:17:06Z

2018/System.Buffers.SequenceReader/README.md

+
+## BufferExtensions
+
+* Can't we make this an instance method?


Why force a defensive copy span allocation if its not required?

// returns false if data is more than one span public bool TryAsSpan<T>(out ReadOnlySpan<T> span); // copyBuffer should be >= .Length public ReadOnlySpan<T> AsSpan<T>(Span<T> copyBuffer);

Also should there be a return flag to indicate if the copyBuffer was used?

Otherwise there is ambiguity. Is there is a writable version (or copy) in the copyBuffer? Can the copyBuffer be reused prior to consuming the ReadOnlySpan (i.e. is it part of the ReadOnlySpan or was it unused).

Does it throw an exception if the copyBuffer is too small (assume so); but what if the copy buffer is not needed because the data is a single span, but the copy buffer is too small; does that still throw an exception?

To resolve the ambiguity; especially around exceptions, maybe it should be:

// returns false if data is more than one span public bool TryAsSpan<T>(out ReadOnlySpan<T> span); // destination should be >= .Length or exception public void CopyTo(Span<T> destination);

Then the pattern would be something like:

T[] scratch = null; if (!seq.TryAsSpan(out ReadOnlySpan<T> roSpan) { scratch = ArrayPool<T>.Shared.Rent(seq.Length) seq.CopyTo(scratch); roSpan = scratch; } // Do something with roSpan if (scratch != null) ArrayPool<T>.Shared.Return(scratch);

Also should there be a return flag to indicate if the copyBuffer was used? Otherwise there is ambiguity

It seems really rare that you'd care, though, right? Presumably you'd just assume you can't use the supplied scratch buffer again until you're done with the returned span. And in the rare case where you actually cared, you could use Span.Overlaps to determine whether the returned span is part of the scratch buffer or not.

That said, I agree in general that this pattern is confusing, and it'd be better not to force the caller to allocate until they know they have to.

Problem is if you want to modify the returned data (mask/unmask etc) then with

public ReadOnlySpan<T> AsSpan<T>(Span<T> copyBuffer);

You need one scratch buffer for the call (which may or may not be used); then a second scratch buffer to copy the returned span into; whereas it would be easier just to jump straight to .CopyTo

For just reading; then the scratch buffer is unnecessary for the fast-path; but you need a .Length sized buffer allocated just in case and if the whole ReadOnlySequence is being read then the scratch buffer is unlikely to be used for more than a single read (e.g. returned to pool prior to awaiting for more data).

Also question on what happens if .Length is greater than 2Bn elements.

Branchless made it worse:

BEFORE TwoStage_Array | 17.87 ns | 0.3825 ns | 0.6994 ns | 18.21 ns | AFTER TwoStage_Array | 18.73 ns | 0.3999 ns | 0.7894 ns | 19.18 ns |

Not terribly surprising- ripping the span out being costly is the only reason SequenceReader is ref after all. :) Flipping to return a span made it slightly better, but still slower than TryAsSpan and is pretty awkard imho:

TwoStage_Array | 17.03 ns | 0.3595 ns | 0.5597 ns | 17.24 ns |

ReadOnlySpan<byte> span = _arraySequence.TryAsSpan(out bool success); if (!success) {

why would we even need SliceOrCopy

I'm not sure what you're asking here? Having the SliceOrCopy methods period is due to the performance improvement in the normal single segment case. Having one without a delegate is again about performance- as crossing segments is so unlikely, new T[] is a good-enough solution for many. Given that it is expected to be common and is faster it makes sense to have an "even-faster" overload for "default" behavior.

Here are the results I get copying and pasting the benchmarks you shared:

Method | Mean | Error | StdDev | Median | ------------------ |---------:|----------:|----------:|---------:| SliceOrCopy_Array | 18.25 ns | 0.3942 ns | 0.6695 ns | 18.04 ns | TwoStage_Array | 17.58 ns | 0.1277 ns | 0.1132 ns | 17.57 ns |

I ran it multiple times, the results always come out relatively the same, with the TwoStage_Array being faster than SliceOrCopy_Array. Note that that's with:

private static ReadOnlySequence<byte> _arraySequence = new ReadOnlySequence<byte>(new byte[10_000]);

since it wasn't clear to me from what you wrote what _arraySequence was. If instead change it to be an instance, the versions end up being almost the same perf-wise for me:

Method | Mean | Error | StdDev | ------------------ |---------:|----------:|----------:| SliceOrCopy_Array | 17.47 ns | 0.3707 ns | 0.3468 ns | TwoStage_Array | 17.31 ns | 0.3631 ns | 0.4036 ns |

We're talking nanosecond differences here, in either direction. Maybe it's just minor differences on the machine. Maybe it's the processor involved. Maybe it's something else. But let's be very careful not to add lots of APIs because we found a microbenchmark showed an improvement on one machine.

I'm not sure what you're asking here?

I mean, why have these helpers at all? They're one-liners, using public APIs that anyone can use. When I run the same test with:

[Benchmark] public ReadOnlySpan<byte> OpenCoded() { ref ReadOnlySequence<byte> r = ref _arraySequence; return r.IsSingleSegment ? r.First.Span : r.ToArray(); }

for example, I get:

Method | Mean | Error | StdDev | ------------------ |---------:|----------:|----------:| SliceOrCopy_Array | 17.25 ns | 0.3703 ns | 0.6770 ns | OpenCoded | 16.41 ns | 0.1735 ns | 0.1538 ns |

So, if we're going for utmost in throughput, worried about developers concerned about this level of microoptimization, and if these numbers are to be believed, then why have the method at all? Let the developer use the ReadOnlySequence APIs however they need, rather than trying to come up with the various "helper" APIs on top of them that are all effectively one-liners.

Here is the complete code:

https://gist.github.com/JeremyKuhne/2eb803b80058fa8b108c81352bc31ba0

I ran it multiple times, the results always come out relatively the same,

Ditto:

Method | Mean | Error | StdDev | Median | --------------------------- |---------:|----------:|----------:|---------:| SliceOrCopy_Array | 14.87 ns | 0.3224 ns | 0.7343 ns | 15.21 ns | SliceOrCopy_AllocatorArray | 16.30 ns | 0.3458 ns | 0.4247 ns | 16.41 ns | TwoStage_Array | 17.82 ns | 0.3756 ns | 0.5736 ns | 18.05 ns |

But let's be very careful not to add lots of APIs because we found a microbenchmark showed an improvement on one machine.

Well, sure, let's not be blind. But lets also not throw this out because some machines don't get a boost. For the record, my run is on a 6 core i8700K with a 2.1 .NET Core console app and the current BenchmarkDotNet (CTRL-F5ed of course) on Windows.

They're one-liners, using public APIs that anyone can use.

They are simple because these are meant to validate the overhead involved. The whole point of the Allocator is to allow plugging in whatever you want- which may be significantly more complicated. The ArrayPool tracker I mentioned above is one such example. I plan to move onto

The simple SliceOrCopy_Array with no arguments as it makes writing the typical case easy. It also helps discoverability of the fact that .ToArray() is potentially inefficient. I wouldn't toss it simply because someone can easily rewrite the code.

For the record, here is what I get plugging in your OpenCoded:

Method | Mean | Error | StdDev | Median | --------------------------- |---------:|----------:|----------:|---------:| OpenCoded | 14.52 ns | 0.3108 ns | 0.5018 ns | 14.72 ns | SliceOrCopy_Array | 14.95 ns | 0.3185 ns | 0.5904 ns | 15.21 ns |

Not as much as an improvement for me.

Having these methods does not preclude you from writing your own logic. Having them, however, allows easy piping of the Allocator approach (e.g. I can take one as input to SequenceReader), facilitates reusability of solutions, and makes things significantly faster on some machines. All of that makes it pretty compelling to me.

If you could try to run again with the code in the gist and share your configuration perhaps we might find out what it is that makes your run slower than mine.

Having them, however

And is slower on some machines apparently, and clutters the surface area (which can actually make it harder for a consumer of the API to do the "right thing'"), and adds surface area that may encourage developers to allocate when they don't actually need to, and a developer who's really concerned about eeking out the best possible throughout arguably won't use them, anyway, and for everyone else a nanosecond difference doesn't matter, and they're barely shorter than just writing out the ternary manually. As for "piping the allocator", that could still be used by ToArray if it's desirable, so that doesn't seem to actually change anything.

I will try the shared code again this weekend on a few machines. But at the moment I'm not seeing the benefit here (for ToArray, yes, for SliceOrCopy, no). I would rather we teach developers about the few core APIs that are there and how to best use thm efficiently, saving additional APIs we add for when they're truly meaningful (like the general reader APIs you've been working on).

terrajobst added 3 commits October 18, 2018 14:15

Fixed typo

8232b90

Prepare section headers

b4aa05d

Add notes for round 2

ab6feeb

stephentoub reviewed Oct 18, 2018

View reviewed changes

2018/System.Buffers.SequenceReader/README.md Show resolved Hide resolved

stephentoub reviewed Oct 18, 2018

View reviewed changes

2018/System.Buffers.SequenceReader/README.md Show resolved Hide resolved

stephentoub reviewed Oct 18, 2018

View reviewed changes

2018/System.Buffers.SequenceReader/README.md Outdated Show resolved Hide resolved

benaadams reviewed Oct 18, 2018

View reviewed changes

2018/System.Buffers.SequenceReader/README.md Show resolved Hide resolved

benaadams reviewed Oct 18, 2018

View reviewed changes

Address Stephen's feedback on the notes

09bc6ec

terrajobst merged commit 69b2e67 into master Oct 23, 2018

terrajobst deleted the sequencereader-2 branch October 23, 2018 22:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SequenceReader<T>: Round 2 #69

SequenceReader<T>: Round 2 #69

terrajobst commented Oct 18, 2018

benaadams Oct 18, 2018 •

edited

Loading

benaadams Oct 19, 2018 •

edited

Loading

benaadams Oct 19, 2018 •

edited

Loading

stephentoub Oct 19, 2018 •

edited

Loading

benaadams Oct 19, 2018

JeremyKuhne Oct 19, 2018

stephentoub Oct 20, 2018

stephentoub Oct 20, 2018

JeremyKuhne Oct 20, 2018

stephentoub Oct 20, 2018

SequenceReader<T>: Round 2 #69

SequenceReader<T>: Round 2 #69

Conversation

terrajobst commented Oct 18, 2018

benaadams Oct 18, 2018 • edited Loading

Choose a reason for hiding this comment

benaadams Oct 19, 2018 • edited Loading

Choose a reason for hiding this comment

benaadams Oct 19, 2018 • edited Loading

Choose a reason for hiding this comment

stephentoub Oct 19, 2018 • edited Loading

Choose a reason for hiding this comment

benaadams Oct 19, 2018

Choose a reason for hiding this comment

JeremyKuhne Oct 19, 2018

Choose a reason for hiding this comment

stephentoub Oct 20, 2018

Choose a reason for hiding this comment

stephentoub Oct 20, 2018

Choose a reason for hiding this comment

JeremyKuhne Oct 20, 2018

Choose a reason for hiding this comment

stephentoub Oct 20, 2018

Choose a reason for hiding this comment

benaadams Oct 18, 2018 •

edited

Loading

benaadams Oct 19, 2018 •

edited

Loading

benaadams Oct 19, 2018 •

edited

Loading

stephentoub Oct 19, 2018 •

edited

Loading