Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1sec Long Pause (stop-the-world) on Gen1 GC #88426

Closed
dickens-code opened this issue Jul 5, 2023 · 6 comments · Fixed by #88772
Closed

1sec Long Pause (stop-the-world) on Gen1 GC #88426

dickens-code opened this issue Jul 5, 2023 · 6 comments · Fixed by #88772
Labels
area-GC-coreclr tenet-performance Performance related issue
Milestone

Comments

@dickens-code
Copy link
Contributor

dickens-code commented Jul 5, 2023

Description

Recently my team has been working on lowering down memory allocation to minimize GC pressure.
While we used different technique to reuse/recycle object, or use ArrayPool.Shared to rent/return memory, we found that there are periodic long pauses (~1s, stop-the-world) of Gen0/Gen1 GC few times per hour.
We used PerfView command to spot it out:

PerfView -NoGui -AcceptEULA -NoNGenRundown -BufferSizeMB:2048 -CircularMB:2048 -Zip:false -Merge:true -ClrEvents:GC+Stack -StopOnGCOverMSec:300 -KernelEvents:Memory,VirtualAlloc,Default -OnlyProviders:ClrPrivate:1:5,Clr:0x40000001:5 -NoRundown
-DelayAfterTriggerSec:0 -CollectMultiple:1 -DataFile:XXX-2307051821-48272-gcsg -FocusProcess:48272 -Process:48272 collect

GC Stats showing 1s Gen1 GC pause

Screenshot 2023-07-05 224639
[pic1]

The Gen0 Frag% is obvious very high (98.98%) ...
If I looked into CPU Stack with filter applied for GC period, looks like most CPU time is used for doing compact?

Screenshot 2023-07-06 010013

When looking into the CPU Stack of profile, we discovered that these few calls during GC causes most part of delay:

  1. coreclr!SyncBlockCache::InsertCleanupSyncBlock
  2. coreclr!SVR::GCHeap::IsPromoted
  3. ntoskrnl!MiRemoveLowestPriorityStandbyPage

Screenshot 2023-07-05 225019
[pic2]

When firing up Goto Source these lines are most time consuming ...
InsertCleanupSyncBlock

Screenshot 2023-07-05 231740
[pic3]

IsPromoted

Screenshot 2023-07-05 231840
[pic4]

Would you mind telling us that what would be the root cause of it, and anything we can do to prevent such long app pause (1N Gen1 non-concurrent GC)? Or how to solve fragmentation problem if that is the root cause?

Thanks a lot~
dickens

Configuration

  • .NET 7
  • App running in GC Server mode, with GC affinity set to 0:0-62
  • Windows Server 2012 x64, 128 cores, 2 NUMA nodes
  • Cloud server on AWS
@dickens-code dickens-code added the tenet-performance Performance related issue label Jul 5, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jul 5, 2023
@ghost
Copy link

ghost commented Jul 5, 2023

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Recently my team has been working on lowering down memory allocation to minimize GC pressure.
While we used different technique to reuse/recycle object, or use ArrayPool.Shared to rent/return memory, we found that there long periodic long pauses (~1s) of Gen0/Gen1 GC few times per hour.
We used PerfView command to spot it out:

PerfView -NoGui -AcceptEULA -NoNGenRundown -BufferSizeMB:2048 -CircularMB:2048 -Zip:false -Merge:true -ClrEvents:GC+Stack -StopOnGCOverMSec:300 -KernelEvents:Memory,VirtualAlloc,Default -OnlyProviders:ClrPrivate:1:5,Clr:0x40000001:5 -NoRundown
-DelayAfterTriggerSec:0 -CollectMultiple:1 -DataFile:XXX-2307051821-48272-gcsg -FocusProcess:48272 -Process:48272 collect

GC Stats showing 1s Gen1 GC pause
Screenshot 2023-07-05 225019
[pic1]

When looking into the CPU Stack of profile, we discovered that these few calls during GC causes most part of delay:

  1. coreclr!SyncBlockCache::InsertCleanupSyncBlock
  2. coreclr!SVR::GCHeap::IsPromoted
  3. ntoskrnl!MiRemoveLowestPriorityStandbyPage
    Screenshot 2023-07-05 224639
    [pic2]

When Goto Source these lines are most time consuming ...
InsertCleanupSyncBlock
Screenshot 2023-07-05 231740
[pic3]

IsPromoted
Screenshot 2023-07-05 231840
[pic4]

Configuration

.NET core 7
Windows Server 2012 x64

Author: dickens-code
Assignees: -
Labels:

tenet-performance, area-GC-coreclr, untriaged

Milestone: -

@dickens-code dickens-code changed the title Long Pause on Gen1 GC 1s Long Pause on Gen1 GC Jul 5, 2023
@dickens-code dickens-code changed the title 1s Long Pause on Gen1 GC 1sec Long Pause on Gen1 GC Jul 5, 2023
@dickens-code dickens-code changed the title 1sec Long Pause on Gen1 GC 1sec Long Pause (stop-the-world) on Gen1 GC Jul 5, 2023
@mangod9
Copy link
Member

mangod9 commented Jul 5, 2023

Hello @dickens-code , would it possible to securely share the traces you have collected? Also are you observing that this has regressed in .NET 7?

@dickens-code
Copy link
Contributor Author

yes, it is happening to net7, however I can't share the profile (etl file) to you due to security reason. Anything I should check?

@mangod9
Copy link
Member

mangod9 commented Jul 6, 2023

would it be possible to collect some GCCollectOnly traces? That shouldnt have much PII or other info.

@cshung
Copy link
Member

cshung commented Jul 6, 2023

@mangod9, I am working with @dickens-code offline to diagnose the issue. We will report findings here whenever we make progress.

@dickens-code
Copy link
Contributor Author

thanks both @mangod9 and @cshung ~

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Jul 19, 2023
@mangod9 mangod9 added this to the 9.0.0 milestone Jul 19, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Aug 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-GC-coreclr tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants