Properly understanding CPU utilization metrics #787
There have been a number of reports of the data not lining up, but no one has done such a thorough investigation. Thanks for that!

windows_exporter/collector/cpu.go Line 261 in 74eac8f
It'd be interesting to see how that data looks; it might be a counter internally (it often is), so if you have time to have a look at that, it'd be great! Re processor_performance: I tried deciphering it a few years ago, but didn't really figure it out, sadly :(
I think we might have answered part of the processor_performance question.... If you do something like this:
On an AMD system, you will get a nice graph showing the effective P-state frequency of each hyperthread; if you're familiar with turbostat on Linux, it looks a lot like the Bzy_MHz column. On Intel systems, you need to replace that 2 denominator with 0.5 for it to make sense. Without knowing where the metric comes from, it's hard to speculate as to why this is the case. I'm still pretty sure that Windows doesn't provide any known interface into the APERF / MPERF MSRs, but I have no idea where else it could come from.

About the idle metrics, I've found that:
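For context, a hedged reconstruction of the kind of query being discussed; the metric name and the base-clock constant here are my assumptions, not taken from this thread:

```promql
# Hypothetical sketch: effective per-hyperthread frequency, assuming
# windows_cpu_processor_performance is scraped as a counter. The 2
# divisor is the AMD scaling factor discussed above; on Intel it would
# be 0.5. Multiplying by the base clock (MHz) gives an absolute value.
2000 * rate(windows_cpu_processor_performance[$__interval]) / 2
```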
and
differ by 5-15% depending on load. I would trust the former to be more reliable; it is also closer to the % Processor Utility number.

So, our graph of CPU utilisation is made up of the following (BYO labels & $__interval):
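The original queries were lost in extraction, but a minimal sketch of such a per-mode utilisation panel might look like this, assuming windows_cpu_time_total counts seconds spent per mode per logical CPU:

```promql
# Fraction of each core's time spent in non-idle modes; with the "old"
# Idle metric, adding the idle fraction back should approach 100%.
sum by (core) (rate(windows_cpu_time_total{mode!="idle"}[$__interval]))
```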
If you use the "old" Idle metric, it very neatly adds up to 100%, but I have little faith in it. I'll investigate the utility metric next week if time permits. Thanks to @tycho for his help in getting it this far.
Hi @carlpett,

Sorry for the huge delay in following up on this - it's been a busy year. I revisited this issue recently because some BIOS tuning failed to apply on a subset of servers, and it would have been really useful to have an alert to flag that a server wasn't going into boost properly.

Here's what I've found: as it stands, the
which makes sense since it's a 2.0GHz CPU which boosts to about 4.3GHz. When we dig into the counter a bit more like this:
The
If you take I dug around perflib_exporter/perflib/perflib.go and it looks like any of the
And we get something that is good enough, but still not exactly the same as the cooked value.
I'm not familiar enough with the perflib code (and my golang has atrophied a bit) to know if my approach is safe; perhaps @leoluk has some input on this. This would also enable the addition of an accurate processor utility gauge.
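As a sketch of what such a gauge could look like once the secondary counter values are exposed — the metric names here are hypothetical, loosely anticipating the naming used later in this thread:

```promql
# Utility as a ratio of utility ticks to real-time-clock ticks over the
# window; per-CPU-family scaling factors may still be needed.
rate(windows_cpu_processor_utility_total[$__interval])
/ rate(windows_cpu_processor_rtc_total[$__interval])
```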
Of course, I started testing this on some actually loaded production systems and got complete nonsense back. I investigated some more, and I'm pretty sure that

We know that

I'll let it bake for a few more days, but if this is useful for people other than me, it'll require changes in both this project and perflib-exporter.
Thanks @higels, I appreciate the time you've spent looking into this one 👍 If you need any assistance making changes just let me know.
Hi @breed808 - I've made a rough version of the proposed changes here: master...higels:windows_exporter:add_mperf_metric and to perflib_exporter here: leoluk/perflib_exporter@master...higels:perflib_exporter:add_secondvalue_plumbing

Basic summary is that we add a

I haven't tested this much yet, but just wanted to make sure I was on the right track. I still need to improve the metric descriptions, but that's the easy part. I have a few more questions:
I've been running an exporter based on my changes above for a day now on quite a large number of systems, and I have CPU metrics that very accurately match what taskmgr shows. I do a bit of creative promql to break out Privileged Utility Time into system, dpc and interrupt, and everything adds up to 100%.

It's a little concerning how different the results are. windows_cpu_time_total based metrics on the left, my newer metrics on the right:
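For illustration only — not the author's actual queries — one plausible piece of that "creative promql" is apportioning the privileged utility by each mode's share of privileged time, assuming the per-mode windows_cpu_time_total series are available:

```promql
# Interrupt handling's share of privileged time per core; ignoring(mode)
# lets the interrupt and privileged series match one-to-one. Multiply
# this share by the privileged-utility gauge, and repeat with
# mode="dpc" for the DPC slice.
rate(windows_cpu_time_total{mode="interrupt"}[$__interval])
/ ignoring(mode) rate(windows_cpu_time_total{mode="privileged"}[$__interval])
```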
perflib_exporter changes LGTM - just make a PR and I'll merge them and make a release.
Great work! Once the
Any news on this? Is this merged, or a PR created? Currently on version 0.20.0 there is the same issue with measuring CPU usage.
Here's a new perflib_exporter release: https://github.com/leoluk/perflib_exporter/releases/tag/v0.2.0
Apologies for the delay, I must have missed the notification for this. Dependency has been updated in #1084
I will hopefully have time to get my changes rebased and submitted this week. Thanks all for your work on this!
This change adds 4 new CPU-related metrics:

* process_mperf_total
* processor_rtc_total
* processor_utility_total
* processor_privileged_utility_total

and renames the existing process_performance to processor_performance_total, since it was previously misunderstood and was unlikely to have been useful without the above new metrics.

The data sources for these are not particularly well understood, and the examples show that in some cases, arbitrary scaling factors are required to actually make them useful. However, my testing on hundreds of systems with a broad range of CPUs and operating systems from 2012r2 through to 2019 has proved out that we can use them to accurately display actual CPU frequencies and CPU utilisation as it is represented in taskmgr.

Things I don't particularly like and would like input on:

* I would have preferred to do the scaling of processor_mperf_total in the code, but there isn't an elegant way of doing this right now.
* Maybe processor_mperf_total should be called processor_mperformance_total.

See prometheus-community#787 for discussion.

Signed-off-by: Steffen Higel <higels@valvesoftware.com>
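A hedged example of how the new metrics might be consumed once merged, assuming per-core series and that the utility and rtc counters tick in compatible units:

```promql
# Overall utilisation in the taskmgr sense, averaged across cores;
# the metric names are taken from the commit message above, and any
# CPU-family scaling factors discussed earlier still apply.
avg(
  rate(windows_cpu_processor_utility_total[5m])
  / rate(windows_cpu_processor_rtc_total[5m])
)
```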
This issue has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.
We noticed recently that on a reasonably loaded system (AMD 7702P, 128 hyperthreads, 3.35GHz at low load, coming down to 2.6GHz as things heat up), there was a large discrepancy in overall and per-hyperthread utilization when comparing taskmgr vs. our windows_exporter metrics plus the data we pull from WMI independently.
Specifically, taskmgr reported that the average utilization of the system was between 65 and 70%, with many individual cores completely utilized and their hyperthread siblings between 10 and 30% utilized.
Our data from windows_exporter showed that
1 - avg(rate(windows_cpu_time_total{mode='idle'}[2m]))
was between 50 and 55%. Our data from the WMI gauge \\Processor(_Total)\\% Processor Time
agreed with windows_exporter. The data for individual hyperthreads showed two bands, one between 70 and 85% utilization and the other between 20 and 50%.
We updated our independently gathered metrics to use
\\Processor Information(_Total)\\% Processor Utility
and this seemed to line up with what taskmgr was providing for overall utilization. % Processor Time
is seemingly very old and doesn't handle variance in CPU frequency. If we trust taskmgr, then with windows_exporter / perflib we have a 10-15% difference in overall system utilization at higher load as our CPU clocks down, and the per-core utilization is being misallocated across each pair of hyperthreads.
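The per-hyperthread form of the time-based query above makes the banding visible; sketched here for comparison, with the same 2m window:

```promql
# Per-hyperthread utilisation from the "old" idle-time metric; this is
# the view that showed one band at 70-85% and another at 20-50%.
1 - rate(windows_cpu_time_total{mode="idle"}[2m])
```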
This brings me to my questions: