[receiver/k8scluster] Consider adding metrics to get effective pod requests/limits #29860

Closed
jinja2 opened this issue Dec 13, 2023 · 3 comments

jinja2 commented Dec 13, 2023

Component(s)

receiver/k8scluster

Is your feature request related to a problem? Please describe.

The k8scluster receiver currently provides metrics for the resource requests and limits of containers. In most cases, the effective pod resource requirement equals the sum of the requests/limits of all main containers in the pod. However, k8s components such as the scheduler and the kubelet use a more involved calculation for the effective resource requirement of a pod. For k8s versions without the sidecar and in-place resize features, the effective request/limit for a resource is calculated as max(max(init containers), sum(containers)) + pod_overhead. The full algorithm, which also accounts for the additional features in the latest k8s versions, can be found here.

For example, the pod in the screenshot has an initContainer whose CPU request (150m) is greater than the CPU request of the main container (100m). The kubectl describe node output shows that the CPU reserved on the node for the pod is 150m.

Screenshot 2023-12-11 at 9 01 51 PM
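
The arithmetic behind the 150m figure can be sketched in Go as follows. This is only an illustration of the simplified formula quoted above (no sidecar containers, no in-place resize), not the scheduler's actual implementation. The function name `effectiveRequest` and the use of plain integer millicores are assumptions made for the example.

```go
package main

// effectiveRequest applies the simplified formula
//
//	max(max(init containers), sum(containers)) + pod overhead
//
// to a single resource. All values must share one unit (e.g. CPU millicores).
// Hypothetical helper: it ignores sidecar-type init containers and in-place resize.
func effectiveRequest(initContainers, containers []int64, overhead int64) int64 {
	// Init containers run one at a time, so only the largest one matters.
	var maxInit int64
	for _, v := range initContainers {
		if v > maxInit {
			maxInit = v
		}
	}

	// Main containers run concurrently, so their requests add up.
	var sum int64
	for _, v := range containers {
		sum += v
	}

	effective := sum
	if maxInit > effective {
		effective = maxInit
	}
	return effective + overhead
}
```

For the pod in the screenshot, and assuming it has no pod overhead, `effectiveRequest([]int64{150}, []int64{100}, 0)` returns 150, which matches the 150m reported by kubectl describe node. A plain sum over the main containers would report only 100m.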

An admin might want to monitor patterns like this, where pods reserve resources for initialization that then go unused for the rest of the pod's life. Tracking the effective pod request/limit is also useful when tracking the capacity of the node as seen by the scheduler.

Describe the solution you'd like

The easiest and most accurate way to get the effective pod request/limit is to scrape the kube_pod_resource_request and kube_pod_resource_limit metrics from kube-scheduler, but this might not be an option for users with managed clusters.

The receiver should have the option to collect request/limit for initContainers and the pod overhead.

We could additionally discuss the feasibility of computing the effective pod request/limit in the receiver the same way the scheduler does. This might be difficult to implement and maintain: unlike the scheduler, the receiver won’t have access to the enabled k8s feature gates, and we’ll need to keep the computation in the receiver in sync with changes to k8s.

Proposed new metrics for pod overhead:

k8s.pod.cpu_overhead, with additional attr k8s.pod.runtimeclass
k8s.pod.memory_overhead, with additional attr k8s.pod.runtimeclass
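
As a rough sketch of what these two metrics could carry, the Go snippet below builds one data point per resource and attaches the runtime class as an attribute. The `dataPoint` type and `overheadMetrics` function are hypothetical stand-ins, not the receiver's real types, and the units (millicores, bytes) are assumptions chosen for illustration. The inputs correspond to the pod spec's `overhead` and `runtimeClassName` fields.

```go
package main

// dataPoint is a simplified, hypothetical stand-in for a metric data point;
// a real implementation would use the collector's own metric types.
type dataPoint struct {
	Name       string
	Value      int64
	Attributes map[string]string
}

// overheadMetrics builds the two proposed pod overhead data points.
// cpuMillis and memoryBytes stand for the values in pod.spec.overhead,
// runtimeClass for pod.spec.runtimeClassName. Units are illustrative only.
func overheadMetrics(cpuMillis, memoryBytes int64, runtimeClass string) []dataPoint {
	// Each data point gets its own attribute map so they can be mutated independently.
	newAttrs := func() map[string]string {
		return map[string]string{"k8s.pod.runtimeclass": runtimeClass}
	}
	return []dataPoint{
		{Name: "k8s.pod.cpu_overhead", Value: cpuMillis, Attributes: newAttrs()},
		{Name: "k8s.pod.memory_overhead", Value: memoryBytes, Attributes: newAttrs()},
	}
}
```

Keeping the runtime class as an attribute lets users group overhead by runtime without needing a separate metric per runtime class.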

For the requests/limits of init containers, I think it makes sense to differentiate these metrics from those for main containers, since users might want to filter out init containers. We could either change the metric name to reflect the type of container in the pod, e.g. k8s.initcontainer.*, or add an attr like k8s.container.type. Separate metric names seem better for users because they can be enabled/disabled easily in the receiver config.
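
To make the two naming options concrete, here is a small Go sketch. The `containerType` values ("main", "init") and both function names are hypothetical; only the k8s.initcontainer.* prefix and the k8s.container.type attribute come from the proposal above.

```go
package main

// containerType distinguishes the kinds of container in a pod.
// The string values are assumptions; the issue does not fix them.
type containerType string

const (
	mainContainer containerType = "main"
	initContainer containerType = "init"
)

// nameByType implements option A: a separate metric name per container type,
// so init container metrics can be toggled independently in the config.
func nameByType(t containerType, suffix string) string {
	if t == initContainer {
		return "k8s.initcontainer." + suffix
	}
	return "k8s.container." + suffix
}

// nameWithTypeAttr implements option B: one metric name for all containers,
// with the type carried as an attribute that users filter on downstream.
func nameWithTypeAttr(t containerType, suffix string) (string, map[string]string) {
	return "k8s.container." + suffix, map[string]string{"k8s.container.type": string(t)}
}
```

With option A, `nameByType(initContainer, "cpu_request")` yields `k8s.initcontainer.cpu_request`. With option B, every container reports `k8s.container.cpu_request` and consumers must filter on the attribute, which cannot be switched off per metric in the receiver config.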

An additional consideration when naming the metrics is the new sidecar-type initContainer metrics being discussed in this issue.

Describe alternatives you've considered

No response

Additional context

No response

@jinja2 added the enhancement and needs triage labels on Dec 13, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@TylerHelmuth added the priority:p2 label and removed the needs triage label on Dec 13, 2023

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@github-actions bot added the Stale label on Feb 12, 2024

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions bot closed this as not planned on Apr 12, 2024