Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discussion]system.cpu.cores metric definition #24942

Closed
narph opened this issue Apr 6, 2021 · 8 comments
Closed

[Discussion]system.cpu.cores metric definition #24942

narph opened this issue Apr 6, 2021 · 8 comments
Assignees
Labels
Team:Integrations Label for the Integrations team

Comments

@narph
Copy link
Contributor

narph commented Apr 6, 2021

Documentation provides the following information on the field:

system.cpu.cores
The number of CPU cores present on the host. The non-normalized percentages will have a maximum value of 100% * cores. The normalized percentages already take this value into account and have a maximum value of 100%.
type: long

From the definition above it looks like we are talking about the total number of physical cores on a machine (?) or are we interested in providing the number of logical cores/processors (logical cores are the number of physical cores times the number of threads that can run on each core through the use of hyperthreading) instead?

That said, looking at the Windows implementation:
We are calling the GetProcessAffinityMask win 32 api, if that returns 0 then we retrieve the value from GetSystemInfo win32 api.
The value returned represents the number of processors that a process is allowed to run on in the current processor group.
If there are not multiple groups, the result would be the total number of logical processors.

On the linux side there seems to call sched_getaffinity which looks to be similar to the Windows implementation.

Is the goal that system.cpu.cores returns the number of logical processor from the current processor group? Should the goal be returning the total number of logical processors or physical cores?
Should documentation be altered here to be more clear on the value returned?

cc: @sorantis , @fearful-symmetry

@narph narph self-assigned this Apr 6, 2021
@narph narph added the Team:Integrations Label for the Integrations team label Apr 6, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/integrations (Team:Integrations)

@fearful-symmetry
Copy link
Contributor

So, you're right that linux is doing something fundamentally similar when it comes to sched_getaffinity.

If there are not multiple groups, the result would be the total number of logical processors.

Again, similar on linux, at least at the conceptual level of how an OS can manage processes.

If we look at the actual godoc for NumCPUs() :

(emphasis mine)

NumCPU returns the number of logical CPUs usable by the current process.

This isn't really a great way to get actual hardware and OS state information. Compare this with something like lscpu, which gets fairly complex in an attempt to correctly identify the CPU config. Last time I poured over the source code in depth (and that was a while ago), it still does some creative math to distinguish between logical and physical cores. And this is just on linux. Determining hardware config is a weirdly complex and OS-dependent problem. We've run into similar problems and occasional user confusion with memory reporting: "what state is the entire OS in" vs. "what state is the OS in, from the perspective of information relevant to a running program" is an easy trap to fall into.

The real issue here is "we need more sophisticated means to determine actual CPU count," but if we want to talk about how we define system.cpu.cores, I suspect the best answer is "the number of CPU cores that metricbeat itself can run on"

@exekias
Copy link
Contributor

exekias commented Apr 6, 2021

Many of the system metricsets only return “correct” information when it runs on the host, without any kind of namespacing. I guess we are in the same situation here.

Is the goal that system.cpu.cores returns the number of logical processor from the current processor group? Should the goal be returning the total number of logical processors or physical cores?
Should documentation be altered here to be more clear on the value returned?

I would put it the other way around, do we have a way to get the total number of logical processors when running on a group? If there is a different API call we can make to get real numbers with a reasonable effort I think it would make sense to use it, but probably the kernel hides this info from you when you are running on a subset of the cores.

In any case I agree we can clarify the docs to make it clear that we report the number of cores that Metricbeat can see.

@narph
Copy link
Contributor Author

narph commented Apr 7, 2021

@fearful-symmetry , @exekias ,

It would be interesting to understand why we went for this approach from the beginning but I do not think the value we return reflects overall system metrics and that is the goal of this metricset. This metric value fits better in the process metricset at process level.
On the Windows side there are a few apis that can help return the number of logical and/or physical cores, even the affinity mask api we were calling seems to be also returning the system affinity mask.

I suspect the best answer is "the number of CPU cores that metricbeat itself can run on"

I don't think users are interested in this metric unless they are actively monitoring the Metricbeat process, most of the time this metric value matches the total number of logical cores (if we have a single process group) so maybe this is why it not was an issue before.

I think we should put some effort in finding out how tricky it is to provide a better calculation for this metric before we settle on updating documentation. What do you folks think?

@exekias
Copy link
Contributor

exekias commented Apr 7, 2021

Yes, I'm on board with that!

@narph
Copy link
Contributor Author

narph commented Apr 7, 2021

linking #22827 could be useful for the Linux implementation and just found #23523, seems like this was signaled before.

@fearful-symmetry
Copy link
Contributor

I don't think users are interested in this metric unless they are actively monitoring the Metricbeat process, most of the time this metric value matches the total number of logical cores (if we have a single process group) so maybe this is why it not was an issue before.

Agreed, we need to do this differently. The way #22827 does it is fundamentally correct, and it's also what lscpu does, among other things. @masci I'm not sure if we want to wrap this into our refactor of the system code, or just try to push this out a little more urgently.

@fearful-symmetry
Copy link
Contributor

I'm gonna err on the side of clean-up and close this, as now all of metricbeat as been refactored to better emulate how lscpu and other more sophisticated tools gather CPU config. If there's something I missed, we can reopen it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:Integrations Label for the Integrations team
Projects
None yet
Development

No branches or pull requests

4 participants