-
Notifications
You must be signed in to change notification settings - Fork 363
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-13153: Limit the value of GOMAXPROCS on node-exporter. #1996
OCPBUGS-13153: Limit the value of GOMAXPROCS on node-exporter. #1996
Conversation
@machine424: This pull request references Jira Issue OCPBUGS-13153, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@machine424: This pull request references Jira Issue OCPBUGS-13153, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
23c315a
to
3f721d7
Compare
/jira refresh |
@machine424: This pull request references Jira Issue OCPBUGS-13153, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
7d40749
to
08aa416
Compare
/jira refresh |
@machine424: This pull request references Jira Issue OCPBUGS-13153, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@machine424: This pull request references Jira Issue OCPBUGS-13153, which is valid. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@machine424: This pull request references Jira Issue OCPBUGS-13153, which is valid. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@machine424: This pull request references Jira Issue OCPBUGS-13153, which is valid. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
08aa416
to
d4f5ba5
Compare
I'll add a test that assures --runtime.gomaxproc precedence in another commit to ease backporting. this will be done in a separate PR #1998 |
…terGoMaxProcs tests Do it in a seperate PR to ease backporitng openshift#1996 as the maxprocs config wasn't present in < 4.14 Signed-off-by: Ayoub Mrini <amrini@redhat.com>
test/e2e/node_exporter_test.go
Outdated
t, 5*time.Minute, `max(go_sched_gomaxprocs_threads{job="node-exporter"})`, | ||
func(v float64) error { | ||
if v > 4 { | ||
return fmt.Errorf(`expecting max(go_sched_gomaxprocs_threads{job="node-exporter"}) <= 4 but got %v.`, v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
go_sched_gomaxprocs_threads seems to only be present on some prom_go versions? go versions >=1.19?), I'll adjust the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was removed by default in recent versions of prom go client (see prometheus/client_golang#1102), and node-exporter does not seem to add them (they didn't add them back since, see https://github.com/prometheus/node_exporter/blame/bdc430af2b63e0556b591625f86cff60ff88c242/node_exporter.go#L64 ), and we'll not ask for them just for a test.
I remove the test, we can add one that check the "Set GOMAXPROCS=N" in logs, I cannot think of another that wouldn't make the patch more complex.
d4f5ba5
to
8801479
Compare
This seems to be a recurring issue for golang based containers in k8s. While this fixes the customer issue reported I wonder if we should think about a different solution, one that is easier to test and that fixes this once and for all. Reading about this one often finds |
Yes, I still don't understand why We think about implementing the same for node-exporter, but it wouldn't help in this case as we don't set CPU limits, maybe we should think about setting them... (Actually, the CU runs a single node cluster (with 4 CPU reserved to platform components) and normally, the default value of GOMAXPROCS takes CPU affinity into account https://github.com/golang/go/blob/bce7aec3cdca8580585095007e9b7cea11a8812f/src/runtime/os_linux.go#L107, but the affinity doesn't seem to be set for node-exporter as its CPU requests is not an integer I think) |
8801479
to
95819e7
Compare
We are thinking about that already ;)
Not sure I understand this part. I'm a bit torn, I feel like we can improve our understanding of what's going on here but at the same time I don't have anything constructive and will be out for the rest of the week. Maybe @openshift/openshift-team-monitoring can chime in here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the desire to fix it once for all in node_exporter and the automaxproc
library + ability to configure CPU limits in CMO could be the right solution (btw we don't even need the automaxprocs lib as we could inject the CPU limits via the downward API). At the same time, this patch provides a tactical solution that should be relatively easy to backport...
We need to check the env var According to the comment in the runtime code, the main function of node exporter should be called after setting the process count by the env var But I suggest to give it a test to be sure :) |
95819e7
to
e177c02
Compare
On nodes with multiple CPU cores, this should help avoid lock contentions without having any side effects, see the ticket for more details. node-exporter versions that have "--runtime.gomaxprocs" can still override this: the flag has precedence over this automatic setting. Make the entrypoint set the env var before runnning the process. Get the CPU count from /proc/cpuinfo without taking CPU affinity (via cpuset) into account as the container's CPU requests is not an integer, thus no affinity is applied. Signed-off-by: Ayoub Mrini <amrini@redhat.com>
e177c02
to
58997b8
Compare
Yes, you're right, it's clear that the flag has precedence prometheus/node_exporter@fb26b9f, but I had done some manual tests and the flag does have precedence in >= 4.14 I even wanted to add the check to the flag e2e test https://github.com/openshift/cluster-monitoring-operator/pull/1998/files#diff-28d0ddad61ad014c06899788cff47f72864db5df662c49235c59cee301e74b57R243 but removed it as the |
@machine424: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/skip |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jan--f, machine424 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@machine424: Jira Issue OCPBUGS-13153: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-13153 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cherrypick release-4.13 |
@machine424: #1996 failed to apply on top of branch "release-4.13":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
xxxxx