Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unmanaged pooling MemoryAllocator #1730

Merged
merged 96 commits into from
Dec 9, 2021

Conversation

antonfirsov
Copy link
Member

@antonfirsov antonfirsov commented Aug 8, 2021

UPDATE 2: Ready for review!

  • Although later than used to, but 32 bit still fails with OutOfMemoryException. Consider retrying Marshal.AllocHGlobal on OutOfMemoryException after a short wait. DONE: We are blocking the thread on OOM to retry allocations. 32 bit is 2x slower with 20 Threads than 64 bit, but doesn't OOM. The retries alone are not responsible for the 2x slowdown, 32bit runtime seems to work 1.5x slower also with 10 threads, when there are no OOMs.
  • Implement PreferContiguousImageBuffers, remove MemoryAllocator.MinimumContiguousBlockSizeBytes.
  • Implement PixelAccessor<T> and other Pixel processing breaking changes & API discussion #1739 stuff
  • Remove ArrayPoolMemoryAllocator
  • Implement some other minor API changes, eg. MemoryAllocator.Default. -- Need to change MemoryAllocator.Default, it should be get-only.
  • Fix memory corruption / resource leak bugs

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

This PR introduces UniformUnmanagedMemoryPoolMemoryAllocator and sets it as default to fix #1596.

UniformUnmanagedMemoryPoolMemoryAllocator functional characteristics

  • For allocations below 1MB use ArrayPool<byte>.Shared
  • For allocations above 1MB use UniformUnmanagedMemoryPool to allocate 4 MB blocks of discontiguous unmanaged memory, of up to the pool's limit
  • Over pool limit, allocate unmanaged buffers split to discontiguous blocks of 32 MB. These will be freed immediately

Pool size

According to my benchmaks, the pool should scale to the maximum desired size to achieve the best througput. There is no point placing an artificial pool limit, unless there is a physical limitation. I decided to set the maximum pool size to 1/8 th of the available physical memory in 64 bit .NET Core processes. This means that on a 16GB machine the pool can grow as large as 2 GB.

On 32 bit, and other (non-testable) platforms the pool limit is 128 MB.

Trimming

The trimming of the pools is triggered by both Gen 2 GC collect and a timer. (We need the timer since unmanaged allocations don't trigger GC) On high load we trim the entire pool, on low load we trim 50% of the pool every minute.

Finalizers

With ArrayPoolMemoryAllocator, if an image is GC-d without being disposed, buffers are never returned to the pool. This means no hard memory leak, but the pools will be eventually exhausted, because the bucket's running index hitting the bucket limit.

To avoid this, MemoryGroup<T>.Owned and UniformUnmanagedMemoryPool.FinalizableBuffer<T> have finalizers returning the UnmanagedMemoryHandle to the pool. This can get tricky, since UnmanagedMemoryHandle is also finalizable:

/// <summary>
/// UnmanagedMemoryHandle's finalizer would release the underlying handle returning the memory to the OS.
/// We want to prevent this when a finalizable owner (buffer or MemoryGroup) is returning the handle to
/// <see cref="UniformUnmanagedMemoryPool"/> in it's finalizer.
/// Since UnmanagedMemoryHandle is CriticalFinalizable, it is guaranteed that the owner's finalizer is called first.
/// </summary>
internal void Resurrect()
{
GC.SuppressFinalize(this);
this.resurrected = true;
}
internal void AssignedToNewOwner()
{
if (this.resurrected)
{
// The handle has been resurrected
GC.ReRegisterForFinalize(this);
this.resurrected = false;
}
}

I'm moderately concerned about CA2015, but I don't think it applies to us. Dispose will also free the memory used by a span. Touching a span or a pointer to SkiaSharp image's memory would be also a bug if the image is finalized.

API changes

Resolves #1739
Fixes #1675

Edit: API changes implemented according to #1739.

Benchmarking methodology

To determine these defaults I compared results of LoadResizeSaveParallelMemoryStress runs systematically, typically running them for a varying parameter a couple of times, while fixing all other parameters. I have a bunch of Excel documents comparing the tables, including all of them would be TLDR, but I can present information on request.

Benchmark results

I was benchmarking on a 10 Core (20 Thread) 19-10900X with 64 GB RAM. This means I was able to stress highly parallel workload very extensive allocation pressure.

Here is an median processing time (seconds) of 40 runs of LoadResizeSaveParallelMemoryStress ("Classic" means ArrayPoolMemoryAllocator):

Classic Nopool Unmanaged Pool
9.038 8.988 8.306
100.56% 100.00% 92.41%

ImageSharp is about 8% faster with the new default memory allocator.

Results of LoadResizeSaveStressBenchmarks BDN benchmark also show 7.5% improvement:

Method maxDegreeOfParallelism Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated
SystemDrawing 20 3,007.7 ms 151.1 ms 8.28 ms 1.00 - - - 30 KB
ImageSharp_DefaultMemoryAllocator 20 682.6 ms 395.6 ms 21.69 ms 0.23 - - - 2,413 KB
ImageSharp_ArrayPoolMemoryAllocator 20 734.9 ms 174.4 ms 9.56 ms 0.24 1000.0000 1000.0000 1000.0000 1,237,298 KB
Magick 20 1,545.1 ms 624.1 ms 34.21 ms 0.51 - - - 106 KB
MagicScaler 20 389.5 ms 163.7 ms 8.97 ms 0.13 - - - 101 KB
SkiaBitmap 20 889.6 ms 182.9 ms 10.03 ms 0.30 - - - 118 KB
NetVips 20 441.4 ms 686.1 ms 37.61 ms 0.15 - - - 102 KB

VirtualAlloc commit lifetimes graph with ArrayPoolMemoryAllocator

image

VirtualAlloc commit lifetimes graph with the new allocator, demonstrating the trimming

image

VirtualAlloc commit lifetimes graph with pool size set to zero

image

I would be happy to see some expert feedback on this solution, especially for the finalizer tricks.

/cc @Sergio0694 @saucecontrol @br3aker

@br3aker
Copy link
Contributor

br3aker commented Aug 8, 2021

(We need the timer since unmanaged allocations don't trigger GC)

Might be worth testing trimming solely via GC.AddMemoryPressure & gen2 callback?

@antonfirsov
Copy link
Member Author

@br3aker I'm not sure if AddMemoryPressure / RemoveMemoryPressure contributes to Gen2 allocation budget or not. Might make sense to try out, but I think the timer is not that expensive, and trimming is configured to be done every 1 minute (mimicking ArrayPool.shared) anyways.

Directory.Build.props Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Aug 8, 2021

Codecov Report

Merging #1730 (5eaa632) into master (4587649) will increase coverage by 0%.
The diff coverage is 89%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #1730    +/-   ##
=======================================
  Coverage      87%     87%            
=======================================
  Files         935     944     +9     
  Lines       49300   49753   +453     
  Branches     6102    6165    +63     
=======================================
+ Hits        43175   43575   +400     
- Misses       5115    5157    +42     
- Partials     1010    1021    +11     
Flag Coverage Δ
unittests 87% <89%> (+<1%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/ImageSharp/Advanced/AdvancedImageExtensions.cs 84% <ø> (+24%) ⬆️
src/ImageSharp/Common/Helpers/DebugGuard.cs 0% <0%> (ø)
...eSharp/Common/Helpers/Shuffle/IComponentShuffle.cs 100% <ø> (ø)
src/ImageSharp/Common/Helpers/SimdUtils.Shuffle.cs 92% <ø> (ø)
...icInterpretation/WhiteIsZero24TiffColor{TPixel}.cs 0% <0%> (ø)
...rp/Memory/Allocators/Internals/BasicArrayBuffer.cs 100% <ø> (ø)
...Sharp/Memory/Allocators/SimpleGcMemoryAllocator.cs 66% <ø> (-14%) ⬇️
src/ImageSharp/Memory/UnmanagedMemoryManager{T}.cs 62% <0%> (ø)
...s/Convolution/Convolution2PassProcessor{TPixel}.cs 61% <40%> (ø)
...harp/Memory/Allocators/Internals/Gen2GcCallback.cs 44% <44%> (ø)
... and 118 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4587649...5eaa632. Read the comment docs.

@JimBobSquarePants
Copy link
Member

Cannot wait to get stuck into reading this! 🤩

@antonfirsov
Copy link
Member Author

@JimBobSquarePants been thinking a lot & chatting with folks on the C#/lowlevel discord channel, I'm no longer sure if using unsafe memory is the right thing to do.

The problem is that an update to 1.1 may turn minor bugs and performance issues into security errors for users of GetRowSpan(y) and TryGetSinglePixelSpan(). These are methods which look completely safe, but will become inherently unsafe with this PR. (See the warnings added to docs in c9d1396)

On the other hand, SkiaSharp merged a similar API without spending a single minute on security concerns: mono/SkiaSharp#1242

@JimBobSquarePants
Copy link
Member

@antonfirsov That's the IMemory<T> finalizer warning yeah? That would require some impressively bad code to trigger but you're right we should do a careful review.

I need to do some reading there and maybe see if we can get someone from the runtime team who worked in that area to comment.

@antonfirsov
Copy link
Member Author

antonfirsov commented Aug 9, 2021

Strictly speaking, the finalizer warning doesn't apply to us since the SafeHandle is finalizable and returns memory anyways.

That would require some impressively bad code to trigger

First I also thought so, but in fact it can be as simple as:

var image = new Image<Rgba32>(w, h); // no using
Span<Rgba32> span = image.GetPixelRowSpan(0); // last use of the object `image`, finalizers may run after this point

// some relatively long running code here to allow the finalizers to finish

span[0] = default;  // memory corruption

@JimBobSquarePants
Copy link
Member

I wonder if we could write an analyzer?

@antonfirsov
Copy link
Member Author

antonfirsov commented Aug 9, 2021

I need to do some reading there and maybe see if we can get someone from the runtime team who worked in that area to comment.

That would be awesome, but I'm afraid they would share the concerns around unsafe memory and push back.

I wonder if we could write an analyzer?

Would it work out of the box just by using the library? Note that it won't help existing users doing a package update without recompilation, and then running into a potential security issue.

@JimBobSquarePants
Copy link
Member

The trick, I think, would be to make the analyzer a dependency of the main library Like Xunit do.

Have we made any breaking changes that require recompilation? Maybe we should just to ensure people should rebuild. 👿

@antonfirsov
Copy link
Member Author

antonfirsov commented Aug 10, 2021

GrabYourPitchforks — Yesterday at 6:09 PM
The original design for Memory didn't have a span accessor; you had to provide a delegate and the span would be passed as a parameter to the delegate. This solved many (but not all) ownership problems.

There is a safe, breaking way to re-implement span accessors by using delegates, inspired by the comment above:

  public class Image<T>
  {
-     public Span<T> GetPixelRowSpan(int y);
-     public bool TryGetSinglePixelSpan(out Span<TPixel> span);
+     public void ProcessPixels(Action<PixelAccessor<T>> rowProcessor);
+     public bool TryProcessSinglePixelSpan(SpanAction<T> pixelProcessor);
  }

+ public ref struct PixelAcceessor<T>
+ {
+     Span<T> GetRowSpan(int y);
+ }

The simplest thing would be to go ahead with this breaking change and bump ImageSharp version number to 2.0. The improvements will justify the change.

@JimBobSquarePants
Copy link
Member

bump ImageSharp version number to 2.0

2.0 was to be my kill all old target frameworks release. I want to ship a working V1 of Fonts and Drawing before starting work on it.

@antonfirsov
Copy link
Member Author

antonfirsov commented Aug 10, 2021

That can be 3.0 then. We follow semantic versioning more or less, so no point to be afraid of major version jumps as breaking changes land IMO.

@antonfirsov
Copy link
Member Author

antonfirsov commented Aug 10, 2021

But I'm also fine with a hard-breaking 1.1, this is more about PR and communication than anything else,

However, renaming the milestones seems to be better thing to do for me, we can even benefit out of it.

@JimBobSquarePants
Copy link
Member

JimBobSquarePants commented Aug 10, 2021

My only issue with jumping from 2.0 to 3.0 would that in real terms it would probably occur over a short timespan which, in my opinion does reflect well on the quality. 1.1 would be, by far, my next desired target.

This is a massive breaking change though so I'm deeply conflicted. 🙁

@Sergio0694
Copy link
Member

Disclaimer: I'm not home and with barely functioning internet, so might've missed some bits here.

I have a question regarding PixelAcceessor<T>, isn't that quite similar to what we already offer with the ProcessPixelRows___ APIs already? Given that those also allow working on arbitrary pixel types and offer batched and vectorized pixel format conversion, should we consider just pushing devs currently accessing individual pixels towards those APIs instead? If not and if we still want to offer a way to access individual pixel specific values, should we align the design of these new APIs to follow those, for consistency? Just some random thoughts here 😄

Btw amazing work on all this @antonfirsov, I'll also need to find some time to carefully go through all this like James said and have a proper read, as the whole investigation seems super interesting! 🚀

@JimBobSquarePants
Copy link
Member

After careful consideration. I'm up for a V2 release. It's good opportunity to fix a few things plus we are already adding a significant amount of fixes/functionality to the release so let's make a show of it.

@JimBobSquarePants
Copy link
Member

JimBobSquarePants commented Dec 6, 2021

Some more detail on the failures on my machine. They are repeatable by running each theory group.

MemoryOwnerFinalizer_ReturnsToPool fails for 600.

Message: 
Microsoft.DotNet.RemoteExecutor.RemoteExecutionException : Remote process failed with an unhandled exception.

  Stack Trace: 
Child exception:
  Xunit.Sdk.EqualException: Assert.Equal() Failure
Expected: 42
Actual:   240
UniformUnmanagedPoolMemoryAllocatorTests.AllocateSingleAndForget(UniformUnmanagedMemoryPoolMemoryAllocator allocator, Int32 length, Boolean check) line 346
UniformUnmanagedPoolMemoryAllocatorTests.<MemoryOwnerFinalizer_ReturnsToPool>g__RunTest|14_0(String lengthStr) line 326
Child process:
  SixLabors.ImageSharp.Tests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=d998eea7b14cab13 SixLabors.ImageSharp.Tests.Memory.Allocators.UniformUnmanagedPoolMemoryAllocatorTests Void <MemoryOwnerFinalizer_ReturnsToPool>g__RunTest|14_0(System.String)
Child arguments:
  600

MemoryGroupFinalizer_ReturnsToPool fails for 600 and 1200.

Message: 
Microsoft.DotNet.RemoteExecutor.RemoteExecutionException : Remote process failed with an unhandled exception.

  Stack Trace: 
Child exception:
  Xunit.Sdk.EqualException: Assert.Equal() Failure
Expected: 42
Actual:   224
UniformUnmanagedPoolMemoryAllocatorTests.AllocateGroupAndForget(UniformUnmanagedMemoryPoolMemoryAllocator allocator, Int32 length, Boolean check) line 295
UniformUnmanagedPoolMemoryAllocatorTests.<MemoryGroupFinalizer_ReturnsToPool>g__RunTest|12_0(String lengthStr) line 277
Child process:
  SixLabors.ImageSharp.Tests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=d998eea7b14cab13 SixLabors.ImageSharp.Tests.Memory.Allocators.UniformUnmanagedPoolMemoryAllocatorTests Void <MemoryGroupFinalizer_ReturnsToPool>g__RunTest|12_0(System.String)
Child arguments:
  600
Message: 
Microsoft.DotNet.RemoteExecutor.RemoteExecutionException : Remote process failed with an unhandled exception.

  Stack Trace: 
Child exception:
  Xunit.Sdk.EqualException: Assert.Equal() Failure
Expected: 42
Actual:   0
UniformUnmanagedPoolMemoryAllocatorTests.AllocateGroupAndForget(UniformUnmanagedMemoryPoolMemoryAllocator allocator, Int32 length, Boolean check) line 295
UniformUnmanagedPoolMemoryAllocatorTests.<MemoryGroupFinalizer_ReturnsToPool>g__RunTest|12_0(String lengthStr) line 277
Child process:
  SixLabors.ImageSharp.Tests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=d998eea7b14cab13 SixLabors.ImageSharp.Tests.Memory.Allocators.UniformUnmanagedPoolMemoryAllocatorTests Void <MemoryGroupFinalizer_ReturnsToPool>g__RunTest|12_0(System.String)
Child arguments:
  1200

MultiplePoolInstances_TrimPeriodElapsed_AllAreTrimmed is also failing for me.

Message: 
Microsoft.DotNet.RemoteExecutor.RemoteExecutionException : Remote process failed with an unhandled exception.

  Stack Trace: 
Child exception:
  Xunit.Sdk.EqualException: Assert.Equal() Failure
Expected: 128
Actual:   0
NonParallel.<MultiplePoolInstances_TrimPeriodElapsed_AllAreTrimmed>g__RunTest|0_0() line 84
Child process:
  SixLabors.ImageSharp.Tests, Version=1.0.0.0, Culture=neutral, PublicKeyToken=d998eea7b14cab13 SixLabors.ImageSharp.Tests.Memory.Allocators.UniformUnmanagedMemoryPoolTests+Trim+NonParallel Void <MultiplePoolInstances_TrimPeriodElapsed_AllAreTrimmed>g__RunTest|0_0()

I've discovered that the issue also appears outside of Remote Executor so I can attempt to debug for you. Will let you know how I get on.

Question - UnmanagedBuffer<T> doesn't have a finalizer but MemoryOwnerFinalizer_ReturnsToPool seems to expect one. Is this correct? UniformUnmanagedMemoryPool finalizer doesn't return buffers only frees them.

@antonfirsov
Copy link
Member Author

antonfirsov commented Dec 7, 2021

@JimBobSquarePants the naming reflects the old buggy design that was depending on finalization order, will change it when everything else is fixed.

IMemoryOwner/MemoryGuard doesn't have a finalizer anymore, it's the associated Lifetime Guards that return things to pools:
UniformUnmanagedMemoryPool.LifetimeGuards.cs
SharedArrayPoolBuffer.LifetimeGuard

The parameters of the tests are exercising different cases:

  • 300 : Rented from / returned to ArrayPool<byte>.Shared
  • 600: Rented from / returned to UniformUnmanagedMemoryPool
  • 1200: over pool limit, guard frees the handle Forces discontigous group allocation of 2 buffers (Edited)

@JimBobSquarePants
Copy link
Member

JimBobSquarePants commented Dec 7, 2021

@antonfirsov The failing behavior I'm seeing is the memory pool being trimmed after the first allocation because it correctly thinks it's under high pressure (92% of my memory appears to be in use!!).

@antonfirsov
Copy link
Member Author

@JimBobSquarePants thanks that's super valuable info! I disabled these tests for local runs in 4986c52, that's the best idea I was able to come up with.

Copy link
Member

@JimBobSquarePants JimBobSquarePants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anton I'm happy with this if you are. My debugging suggested that everything was working as expected and I cannot see any issues.

@antonfirsov
Copy link
Member Author

I wonder whether we'd be better moving our high memory detection code into an internal utility class like the runtime does and use that as a condition instead?

I haven't seen this being done in test code, wanted to avoid complicating tests further. I don't expect them to fail unless someone touches the allocator logic.

I'm happy with this if you are.

Yeah. let's merge this and start dogfeeding it, that should detect potential issues faster than sitting further on the PR :)

@antonfirsov antonfirsov merged commit ee83e99 into master Dec 9, 2021
@antonfirsov antonfirsov deleted the af/UniformUnmanagedMemoryPoolMemoryAllocator-02 branch December 9, 2021 10:51
frenzibyte added a commit to frenzibyte/osu-framework that referenced this pull request Apr 1, 2022
Updates to use the new and default `ArrayPool`-based memory allocator
(see SixLabors/ImageSharp#1730).
frenzibyte added a commit to frenzibyte/osu-framework that referenced this pull request Apr 1, 2022
Updates to use the new and default `ArrayPool`-based memory allocator
(see SixLabors/ImageSharp#1730).
frenzibyte added a commit to frenzibyte/osu-framework that referenced this pull request Apr 1, 2022
Updates to use the new and default `ArrayPool`-based memory allocator
(see SixLabors/ImageSharp#1730).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants