dotnet/corefx (public archive)
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

[WIP] Pipelines - Reduce write lock contention #35267

Closed
wants to merge 8 commits

Conversation

benaadams
Member

@benaadams benaadams commented Feb 12, 2019

BufferSegment:ResetMemory() is not an inexpensive operation (it blanks 80 bytes and returns the backing data to its pool); however, it doesn't need to be done under the Pipe's reader/writer sync lock.

Also, the scheduling/marking of completion can happen before the Segments are reset and returned to the pool, allowing for lower latency.

This change resets the segments outside of the lock, then uses the pool itself as a separate lock to return the segments to the pool (and to get them from it), so it's a fast lock held only for the pooled segments and doesn't lock the whole Pipe.

It also reduces the scope of the lock on the write side when acquiring memory, to only when it modifies the readHead or needs to change state to Writing (so most write Advances will skip the lock).

It also shrinks the Flush lock when writing, since Flush can modify the write head; the lock is then re-acquired afterwards when signalling the Reader.
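
In sketch form (hedged: the helper and count-field names below are illustrative, not this PR's exact code; ResetMemory and NextSegment are existing BufferSegment members), the return path looks roughly like this, with the reader/writer continuation already scheduled before it runs:

// Sketch only: shows the ordering described above, not the actual diff.
private void ReturnSegments(BufferSegment from, BufferSegment toExclusive)
{
    BufferSegment segment = from;
    while (segment != toExclusive)
    {
        BufferSegment next = segment.NextSegment;

        // Expensive part: clears ~80 bytes of state and hands the backing memory
        // back to its ArrayPool/MemoryPool. No Pipe reader/writer lock is held here.
        segment.ResetMemory();

        // Cheap, narrowly scoped lock: only guards the pooled-segment array,
        // so the reader and writer are not blocked on the Pipe's sync object.
        lock (_bufferSegmentPool)
        {
            if (_pooledSegmentCount < SegmentPoolSize)
            {
                _bufferSegmentPool[_pooledSegmentCount++] = segment;
            }
        }

        segment = next;
    }
}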

/cc @davidfowl @pakrym @jkotalik

@stephentoub
Member

Doesn't this mean that a segment which might otherwise have been available for the continuation is now much less likely to be? Is that not a problem at scale?

@benaadams
Member Author

benaadams commented Feb 12, 2019

Is that not a problem at scale?

Pool size is 16 and segments are 4096 bytes, so that would only affect a run of many small writes without a Flush totalling more than ~60kB (15 segments * 4096B), or more if the ArrayPool is being used.

If you are doing single-shot large writes, then it would take 16 writes without a Flush (say 1MB x 16 = 16MB) to hit this.

On the flip side, holding the lock (which is shared by both reader and writer) and delaying the signalling of the continuation is more of an issue, as ResetMemory is pretty chunky and needs to be called for every block returned, followed by returning the data via ArrayPool.Return (or a ConcurrentQueue for Kestrel); so currently the writer wouldn't be able to make progress during this time.

e.g. (called for every block returned)

; Assembly listing for method BufferSegment:ResetMemory():this
; Lcl frame size = 40

G_M401_IG01:
       57                   push     rdi
       56                   push     rsi
       4883EC28             sub      rsp, 40
       488BF1               mov      rsi, rcx

G_M404_IG02:
       488B5628             mov      rdx, gword ptr [rsi+40]
       48B9B8D075C9FA7F0000 mov      rcx, 0x7FFAC975D0B8
       E8546A615F           call     CORINFO_HELP_ISINSTANCEOFINTERFACE
       4885C0               test     rax, rax
       7417                 je       SHORT G_M404_IG03
       488BC8               mov      rcx, rax
       49BB081536C9FA7F0000 mov      r11, 0x7FFAC9361508
       3909                 cmp      dword ptr [rcx], ecx
       FF1552B89AFF         call     [IDisposable:Dispose():this]
       EB49                 jmp      SHORT G_M404_IG04

G_M404_IG03:
       488B5628             mov      rdx, gword ptr [rsi+40]
       48B9020D50C9FA7F0000 mov      rcx, 0x7FFAC9500D02
       E845274F5F           call     CORINFO_HELP_ISINSTANCEOFARRAY
       488BF8               mov      rdi, rax
       4885FF               test     rdi, rdi
       742E                 je       SHORT G_M404_IG04
       48B9003F45C9FA7F0000 mov      rcx, 0x7FFAC9453F00
       BA77000000           mov      edx, 119
       E8B96C5F5F           call     CORINFO_HELP_CLASSINIT_SHARED_DYNAMICCLASS
       48B9E02E951095010000 mov      rcx, 0x19510952EE0
       488B09               mov      rcx, gword ptr [rcx]
       488BD7               mov      rdx, rdi
       4533C0               xor      r8d, r8d
       3909                 cmp      dword ptr [rcx], ecx
       E89795FFFF           call     TlsOverPerCoreLockedStacksArrayPool`1:Return(ref,bool):this

G_M405_IG04:
       33C0                 xor      rax, rax
       48894608             mov      gword ptr [rsi+8], rax
       33C0                 xor      rax, rax
       48894610             mov      qword ptr [rsi+16], rax
       33C0                 xor      rax, rax
       33D2                 xor      edx, edx
       488D4E18             lea      rcx, bword ptr [rsi+24]
       488901               mov      gword ptr [rcx], rax
       895108               mov      dword ptr [rcx+8], edx
       89510C               mov      dword ptr [rcx+12], edx
       33C0                 xor      rax, rax
       48894628             mov      gword ptr [rsi+40], rax
       48894630             mov      gword ptr [rsi+48], rax
       33C0                 xor      eax, eax
       894638               mov      dword ptr [rsi+56], eax
       33C0                 xor      rax, rax
       33D2                 xor      edx, edx
       4883C640             add      rsi, 64
       488906               mov      gword ptr [rsi], rax
       895608               mov      dword ptr [rsi+8], edx
       89560C               mov      dword ptr [rsi+12], edx

G_M407_IG05:
       4883C428             add      rsp, 40
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret      

; Total bytes of code 197, prolog size 6 for method BufferSegment:ResetMemory():this

@benaadams benaadams force-pushed the buffers branch 2 times, most recently from c175b05 to e778d5b on February 12, 2019 at 22:31
@pakrym

pakrym commented Feb 12, 2019

Is there a way to make the segment pool non-locking? It doesn't have to be strict; losing some segments to the GC is acceptable.
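
For illustration, one way such a "lossy" non-locking pool could look (a sketch under the relaxation suggested here; the type and members are made up, not from this PR): each slot is claimed with a compare-exchange, and a full pool simply drops the segment for the GC.

using System.Threading;

internal sealed class LossySegmentPool
{
    private readonly BufferSegment[] _slots = new BufferSegment[16];

    public void Return(BufferSegment segment)
    {
        for (int i = 0; i < _slots.Length; i++)
        {
            // Claim an empty slot atomically; if another thread wins the race, try the next one.
            if (_slots[i] == null &&
                Interlocked.CompareExchange(ref _slots[i], segment, null) == null)
            {
                return;
            }
        }
        // Every slot was occupied: drop the segment and let the GC reclaim it.
    }

    public BufferSegment TryRent()
    {
        for (int i = 0; i < _slots.Length; i++)
        {
            BufferSegment segment = _slots[i];
            if (segment != null &&
                Interlocked.CompareExchange(ref _slots[i], null, segment) == segment)
            {
                return segment;
            }
        }
        return null; // pool empty: the caller allocates a fresh BufferSegment
    }
}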

@davidfowl
Member

Before we merge this one, we need to see the impact on our benchmarks.

@benaadams
Member Author

benaadams commented Feb 12, 2019

Is there a way to make the segment pool non-locking?

The easiest way is to use ConcurrentQueueSegment, though it's a bit chunky.

Though it's single reader, single writer, so we might be able to do something simpler.

SingleProducerSingleConsumerQueue looks like the ticket?
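
Roughly what that could look like for the segment pool (a sketch, assuming the Pipe's single-reader/single-writer usage; not this PR's code and not the actual SingleProducerSingleConsumerQueue implementation): the reader is the only thread returning segments and the writer is the only thread renting them, so a fixed ring with volatile publishes needs no lock.

using System.Threading;

// Lamport-style single-producer/single-consumer ring used as a segment pool.
internal sealed class SpscSegmentPool
{
    private readonly BufferSegment[] _slots;
    private int _head; // only advanced by the renting (writer) thread
    private int _tail; // only advanced by the returning (reader) thread

    public SpscSegmentPool(int capacity) => _slots = new BufferSegment[capacity + 1];

    // Called only by the returning thread.
    public bool TryReturn(BufferSegment segment)
    {
        int tail = _tail;
        int next = (tail + 1) % _slots.Length;
        if (next == Volatile.Read(ref _head))
        {
            return false; // pool full: let the GC take the segment
        }

        _slots[tail] = segment;
        Volatile.Write(ref _tail, next); // publish only after the slot is written
        return true;
    }

    // Called only by the renting thread.
    public bool TryRent(out BufferSegment segment)
    {
        int head = _head;
        if (head == Volatile.Read(ref _tail))
        {
            segment = null;
            return false; // pool empty
        }

        segment = _slots[head];
        _slots[head] = null;
        Volatile.Write(ref _head, (head + 1) % _slots.Length);
        return true;
    }
}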

@benaadams
Member Author

Made it lock-free and interleaved the Resets with the Returns

@benaadams
Member Author

Updated summary

This change makes the segment pool lock-free so the whole Pipe doesn't need to be locked to return segments; it interleaves ResetMemory with the return, eagerly making the BufferSegments available, and triggers completion prior to returning the segments.

@benaadams
Member Author

@stephentoub does the conversion of SingleProducerSingleConsumerQueue to SingleProducerSingleConsumerPool look correct?

@benaadams benaadams changed the title Schedule completion prior to pooling segments [WIP] Schedule completion prior to pooling segments Feb 13, 2019
@benaadams benaadams changed the title [WIP] Schedule completion prior to pooling segments Pipelines - Reduce lock contention Feb 22, 2019
@benaadams
Member Author

Removed the lock-free segment pool as SingleProducerSingleConsumerQueue is kinda chunky and it can be a future optimization.

Reduced the lock in the write path further following discussion in #35484

@benaadams
Member Author

Summary change

This change resets the segments outside of the lock, then uses the pool itself as a separate lock to return the segments to the pool (and to get them from it), so it's a fast lock held only for the pooled segments and doesn't lock the whole Pipe.

It also reduces the scope of the lock on the write side when acquiring memory, to only when it modifies the readHead or needs to change state to Writing.
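
A sketch of that narrower write-side path (illustrative shape only; WritableBytes/AvailableMemory/End follow the existing BufferSegment members, while AllocateWriteHeadUnderLock is a made-up placeholder for the slow path):

internal Memory<byte> GetMemory(int sizeHint)
{
    // Fast path: a write head already exists with enough space, so no lock is taken.
    BufferSegment head = _writingHead;
    if (head != null && head.WritableBytes >= sizeHint)
    {
        return head.AvailableMemory.Slice(head.End);
    }

    // Slow path: a segment must be appended, the readHead touched, or the state
    // moved to Writing, so the Pipe's sync lock is required.
    lock (_sync)
    {
        AllocateWriteHeadUnderLock(sizeHint);
        return _writingHead.AvailableMemory.Slice(_writingHead.End);
    }
}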

@benaadams
Member Author

benaadams commented Feb 23, 2019

Fixed a race between setting writing active and _writingHead being set to null (i.e. it needs to be checked again after setting the state to Writing, as it may have been set to null in the meantime).
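
In sketch form (placeholders, not the actual diff), the fix amounts to re-reading _writingHead after the state transition, because the lock-free fast path can observe a head that the reader then returns before writing is marked active:

lock (_sync)
{
    BeginWrite(); // state is now Writing, so the reader can no longer return the write head

    // Re-check under the lock: the head seen on the lock-free fast path may have been
    // set to null (returned to the pool) before the state changed.
    BufferSegment head = _writingHead;
    if (head == null || head.WritableBytes < sizeHint)
    {
        AllocateWriteHeadUnderLock(sizeHint); // placeholder for the allocation path
    }
}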

@benaadams benaadams changed the title Pipelines - Reduce lock contention Pipelines - Reduce write lock contention Feb 23, 2019
@benaadams
Member Author

*Note: the race condition was in this PR, not in the current version.

@benaadams
Member Author

Now using a stack array for pooling rather than a heap array; this adds a chunk of zeroing, however it's now done after completion and outside the lock, so it shouldn't hit perf:

G_M48697_IG02:
       lea      rcx, bword ptr [rbp-B8H]
       vxorps   xmm0, xmm0
       vmovdqu  qword ptr [rcx], xmm0
       vmovdqu  qword ptr [rcx+16], xmm0
       vmovdqu  qword ptr [rcx+32], xmm0
       vmovdqu  qword ptr [rcx+48], xmm0
       vmovdqu  qword ptr [rcx+64], xmm0
       vmovdqu  qword ptr [rcx+80], xmm0
       vmovdqu  qword ptr [rcx+96], xmm0
       vmovdqu  qword ptr [rcx+112], xmm0
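
For context, that zeroing is the JIT clearing the stack slot before use; a struct of roughly this shape (hypothetical, not the PR's actual type) holding SegmentPoolSize (16) references would account for the 8 x 16-byte stores (128 bytes) above on x64:

// Hypothetical illustration only: a fixed-size, struct-based group of segment slots.
// Because it contains object references, the JIT zero-initializes the whole local up front.
private struct SegmentSlots
{
    public BufferSegment Slot0, Slot1, Slot2, Slot3,
                         Slot4, Slot5, Slot6, Slot7,
                         Slot8, Slot9, Slot10, Slot11,
                         Slot12, Slot13, Slot14, Slot15;
}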

@benaadams
Member Author

benaadams commented Feb 23, 2019

This uses a CORINFO_HELP_CHECKED_ASSIGN_REF to write to the ValueArray; I have added it as an example to the coreclr JIT issue: https://github.com/dotnet/coreclr/issues/15755#issuecomment-466683856

@benaadams
Member Author

Reverted the ValueArray as it didn't help perf and had issues (above)

@benaadams
Member Author

The Debian.8.Amd64.Open-x64-Release failures are in System.Net.NameResolution.Functional.Tests:
https://github.com/dotnet/corefx/issues/24355#issuecomment-466807679

@benaadams benaadams changed the title Pipelines - Reduce write lock contention [WIP] Pipelines - Reduce write lock contention Feb 24, 2019
@@ -97,6 +98,7 @@ public Pipe(PipeOptions options)
}

_bufferSegmentPool = new BufferSegment[SegmentPoolSize];
_bufferSegmentsToReturn = new BufferSegment[SegmentPoolSize];

Ugh, larger pipe and more allocations per pipe.

@pakrym

pakrym commented Feb 25, 2019

Any benchmark for the last iteration?

@benaadams
Member Author

Any benchmark for the last iteration?

Yea... Is why I added WIP 😉

@pakrym

pakrym commented Feb 25, 2019

Yea... Is why I added WIP 😉

Whoops, sorry.

@benaadams
Member Author

This has a mild regression in some scenarios on the Pipe throughput benchmark, and I don't fully understand why; I've dropped down to the asm and am tweaking to improve the asm code quality (CQ).

May fold them back in as a separate PR if I don't find the root cause.

@karelz
Member

karelz commented Mar 4, 2019

@benaadams any update?

@karelz
Member

karelz commented Mar 18, 2019

@benaadams any update / plans? (I assume you will be busy at MVP Summit this week)

@karelz
Member

karelz commented Apr 1, 2019

1 month no update, closing for now. Please feel free to reopen it once you're ready to push it forward. Thanks!

@karelz karelz closed this Apr 1, 2019
@karelz karelz added this to the 3.0 milestone Apr 1, 2019