Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zgc pause time includes concurrent time #2372

Closed
linking12 opened this issue Dec 8, 2020 · 14 comments · Fixed by #2795
Closed

zgc pause time includes concurrent time #2372

linking12 opened this issue Dec 8, 2020 · 14 comments · Fixed by #2795
Assignees
Labels
bug A general bug module: micrometer-core An issue that is related to our core module
Milestone

Comments

@linking12
Copy link

linking12 commented Dec 8, 2020

image

formula:rate(jvm_gc_pause_seconds_sum{application="$application", instance="$instance"}[1m])/rate(jvm_gc_pause_seconds_count{application="$application", instance="$instance"}[1m])

question:
The STW time of zgc is less 10ms, but why jvm_gc_pause_time is 822ms? can not understand this phenomenon

@linking12
Copy link
Author

linking12 commented Dec 9, 2020

I think isConcurrentPhase is wrong for zgc

private boolean isConcurrentPhase(GarbageCollectionNotificationInfo info) {
    // So far the only indicator known is that the cause will be reported as "No GC"
    // when using CMS.
    //
    // For ZGC, the allocation stall seems to indicate that some thread or threads are
    // blocked trying to allocate. Even though it is not a true STW pause, counting it
    // as such seems to be less confusing.
    return "No GC".equals(info.getGcCause())            // CMS
        || "Shenandoah Cycles".equals(info.getGcName()) // Shenandoah
        || ("ZGC".equals(info.getGcName()) && !"Allocation Stall".equals(info.getGcCause()));
  }

https://github.com/Netflix/spectator/blob/3d428b3c69c80f7067772e27ce0032cf11d8ec9c/spectator-ext-gc/src/main/java/com/netflix/spectator/gc/GcLogger.java
Netflix is right?

@shakuzen shakuzen added the bug A general bug label Dec 9, 2020
@shakuzen shakuzen modified the milestones: 1.6.2, 1.6.x Dec 9, 2020
@shakuzen
Copy link
Member

shakuzen commented Dec 9, 2020

Thank you for the report. Things do seem probably out of sorts. I'd like to check on the ZGC devs mailing list what the correct way is to separate out concurrent and pause time from gc notifications.

@shakuzen
Copy link
Member

shakuzen commented Dec 9, 2020

I also just came across JDK-8255289. The comments seem to indicate it's no longer possible in the latest implementations to distinguish pause time from concurrent time using the gc notifications for ZGC. This is unfortunate, especially since there doesn't seem to be an alternative available currently either.

@linking12
Copy link
Author

linking12 commented Dec 9, 2020

I also just came across JDK-8255289. The comments seem to indicate it's no longer possible in the latest implementations to distinguish pause time from concurrent time using the gc notifications for ZGC. This is unfortunate, especially since there doesn't seem to be an alternative available currently either.

Make a suggestion,For ZGC, the allocation stall seems to indicate that some thread or threads are blocked trying to allocate. Even though it is not a true STW pause, counting it as such seems to be less confusing.

@shakuzen
Copy link
Member

shakuzen commented Dec 9, 2020

I think that comment from the Spectator code is based on the behavior before the change mentioned in a comment on JDK-8255289:

as part of JDK-8240679, the GarbageCollectorMXBean was changed to report data covering the complete GC cycle, instead of individual GC pauses. A complete GC cycle in ZGC includes three (very short) pauses plus all concurrent phases. So, in the examples above you are comparing individual GC pauses (JDK 14) with the time for a complete concurrent GC cycle (JDK 15).

If you run an application with gc logging on you can see the pause time and concurrent time separately. The gc notification is including the concurrent time for ZGC cycles, even when the cause Allocation Stall isn't involved. In the below instance, the cause is Metadata GC Threshold. I started up a Spring Boot application and can see the following GC log with the following metrics.

[2.406s][info][gc,phases] GC(1) Pause Mark Start 0.507ms
[2.458s][info][gc,phases] GC(1) Concurrent Mark 51.175ms
[2.458s][info][gc,phases] GC(1) Pause Mark End 0.125ms
[2.462s][info][gc,phases] GC(1) Concurrent Process Non-Strong References 3.647ms
[2.462s][info][gc,phases] GC(1) Concurrent Reset Relocation Set 0.549ms
[2.468s][info][gc,phases] GC(1) Concurrent Select Relocation Set 5.721ms
[2.468s][info][gc,phases] GC(1) Pause Relocate Start 0.213ms
[2.478s][info][gc,phases] GC(1) Concurrent Relocate 9.883ms
[2.478s][info][gc       ] GC(1) Garbage Collection (Metadata GC Threshold) 176M(2%)->84M(1%)
jvm_gc_pause_seconds_count{action="end of major GC",cause="Metadata GC Threshold",} 1.0
jvm_gc_pause_seconds_sum{action="end of major GC",cause="Metadata GC Threshold",} 0.073

Edit: To be clear, our metric is showing 73 milliseconds of pause time, but you can see from the GC logs that the 3 pauses add up to less than 1 millisecond - the rest is concurrent phases.

Given how wrong the pause time metric is for ZGC, I'm strongly considering removing support for ZGC unless we can find a way to get the information.

Edit: I should note that I'm running the above code with Java 15.0.1.

@linking12
Copy link
Author

I think that comment from the Spectator code is based on the behavior before the change mentioned in a comment on JDK-8255289:

as part of JDK-8240679, the GarbageCollectorMXBean was changed to report data covering the complete GC cycle, instead of individual GC pauses. A complete GC cycle in ZGC includes three (very short) pauses plus all concurrent phases. So, in the examples above you are comparing individual GC pauses (JDK 14) with the time for a complete concurrent GC cycle (JDK 15).

If you run an application with gc logging on you can see the pause time and concurrent time separately. The gc notification is including the concurrent time for ZGC cycles, even when the cause Allocation Stall isn't involved. In the below instance, the cause is Metadata GC Threshold. I started up a Spring Boot application and can see the following GC log with the following metrics.

[2.406s][info][gc,phases] GC(1) Pause Mark Start 0.507ms
[2.458s][info][gc,phases] GC(1) Concurrent Mark 51.175ms
[2.458s][info][gc,phases] GC(1) Pause Mark End 0.125ms
[2.462s][info][gc,phases] GC(1) Concurrent Process Non-Strong References 3.647ms
[2.462s][info][gc,phases] GC(1) Concurrent Reset Relocation Set 0.549ms
[2.468s][info][gc,phases] GC(1) Concurrent Select Relocation Set 5.721ms
[2.468s][info][gc,phases] GC(1) Pause Relocate Start 0.213ms
[2.478s][info][gc,phases] GC(1) Concurrent Relocate 9.883ms
[2.478s][info][gc       ] GC(1) Garbage Collection (Metadata GC Threshold) 176M(2%)->84M(1%)
jvm_gc_pause_seconds_count{action="end of major GC",cause="Metadata GC Threshold",} 1.0
jvm_gc_pause_seconds_sum{action="end of major GC",cause="Metadata GC Threshold",} 0.073

Edit: To be clear, our metric is showing 73 milliseconds of pause time, but you can see from the GC logs that the 3 pauses add up to less than 1 millisecond - the rest is concurrent phases.

Given how wrong the pause time metric is for ZGC, I'm strongly considering removing support for ZGC unless we can find a way to get the information.

Edit: I should note that I'm running the above code with Java 15.0.1.

I don't think so for removing support ZGC, JVM pause time is very important for the application's health;
Can we use GarbageCollectorMXBean like prometheus exporter agent?

public interface GarbageCollectorMXBean extends MemoryManagerMXBean {
    /**
     * Returns the total number of collections that have occurred.
     * This method returns <tt>-1</tt> if the collection count is undefined for
     * this collector.
     *
     * @return the total number of collections that have occurred.
     */
    public long getCollectionCount();

    /**
     * Returns the approximate accumulated collection elapsed time
     * in milliseconds.  This method returns <tt>-1</tt> if the collection
     * elapsed time is undefined for this collector.
     * <p>
     * The Java virtual machine implementation may use a high resolution
     * timer to measure the elapsed time.  This method may return the
     * same value even if the collection count has been incremented
     * if the collection elapsed time is very short.
     *
     * @return the approximate accumulated collection elapsed time
     * in milliseconds.
     */
    public long getCollectionTime();


}

Also I think the allocation stall is very import for a java application when it apply memory.

@shakuzen
Copy link
Member

shakuzen commented Dec 10, 2020

JVM pause time is very important for the application's health

I agree. The issue is we cannot get that information.

Can we use GarbageCollectorMXBean like prometheus exporter agent?

No because ZGC is also reporting there the total collection time, including pauses and concurrent phases.

[0.820s][info][gc,phases] GC(0) Pause Mark Start 0.503ms
[0.826s][info][gc,phases] GC(0) Concurrent Mark 5.658ms
[0.826s][info][gc,phases] GC(0) Pause Mark End 0.037ms
[0.827s][info][gc,phases] GC(0) Concurrent Process Non-Strong References 0.596ms
[0.827s][info][gc,phases] GC(0) Concurrent Reset Relocation Set 0.000ms
[0.840s][info][gc,phases] GC(0) Concurrent Select Relocation Set 13.007ms
[0.840s][info][gc,phases] GC(0) Pause Relocate Start 0.188ms
[0.844s][info][gc,phases] GC(0) Concurrent Relocate 3.514ms
[0.844s][info][gc       ] GC(0) Garbage Collection (System.gc()) 62M(1%)->50M(1%)
MeterId{name='jvm.gc.pause', tags=[tag(action=end of major GC),tag(cause=System.gc())]} [Measurement{statistic='COUNT', value=1.0}, Measurement{statistic='TOTAL_TIME', value=0.024}, Measurement{statistic='MAX', value=0.024}]
ZGC (1) 24ms

The last line is what is reported from the GarbageCollectorMXBean for ZGC. You can see from the gc logs that 24ms is the total collection time, including concurrent phases. I'm not seeing any way to get just the pause time.

@maciek-codes
Copy link

It appears that in JDK 17 the GarbageCollectorMXBean added information about collection time and pause time:
https://bugs.openjdk.java.net/browse/JDK-8265136
Commit 79798c65

@shakuzen
Copy link
Member

@macqm Thanks for sharing. That's good to know. I'll have to take another look at our gc metrics with zgc in 17.

@shakuzen shakuzen modified the milestones: 1.6.x, 1.6.12 Sep 27, 2021
@shakuzen shakuzen self-assigned this Sep 27, 2021
shakuzen added a commit to shakuzen/micrometer that referenced this issue Sep 27, 2021
Since Java 17, the GarbageCollectorMXBean for ZGC has been split into `ZGC Pauses` and `ZGC Cycles`, which allows us to properly measure the pause time and the concurrent collection time. Before we only had access to the overall time.

Resolves micrometer-metricsgh-2372
@jonatan-ivanov jonatan-ivanov linked a pull request Sep 27, 2021 that will close this issue
@shakuzen shakuzen added the module: micrometer-core An issue that is related to our core module label Sep 28, 2021
@shakuzen shakuzen changed the title zgc pause time is not right zgc pause time includes concurrent time Sep 28, 2021
@shakuzen
Copy link
Member

The fix will be in the upcoming 1.6.12, 1.7.5, and 1.8.0 releases. It is available in the corresponding snapshots for early testing. You need to be running Java 17 or later for the fix to have any effect, since that is the release from which the information we needed was made available via JMX.

@krystianzybala
Copy link

To be clear, finally GC pause metric for ZGC is not available now?
I looking for this metric with InfluxDb metric registry and there is not.

@shakuzen
Copy link
Member

It's available as long as you are using a recent supported version of Micrometer and Java 17 or later. It's available before that also, but the metric is not accurate, since it includes concurrent time as well as pause time.

@krystianzybala
Copy link

Yes, I see that. Thanks I found solution for that.

@RoySunnySean007
Copy link

Dear @krystianzybala ,

Could u please kindly let me know what's ur solution? We encounter the same issue in JDK11 with ZGC.

Thanks,
Roy

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A general bug module: micrometer-core An issue that is related to our core module
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants