Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-2138: Add os_image_url_override metric from MCO to telemetry #1784

Merged
merged 1 commit into from
Oct 10, 2022

Conversation

sinnykumari
Copy link
Contributor

Recently we added os_image_url_override metric in MCO openshift/machine-config-operator#3343 which we would like to get added to telemetry as well. See https://github.com/openshift/enhancements/blob/master/enhancements/ocp-coreos-layering/ocp-coreos-layering.md#telemetry

@simonpasquier
Copy link
Contributor

What's the cardinality of the metric? In other words, how many different values can exist for the pool label?

@sinnykumari
Copy link
Contributor Author

cardinality is 2 per pool. Value can be either 0 or 1.

@sinnykumari
Copy link
Contributor Author

/test e2e-agnostic

@jan--f
Copy link
Contributor

jan--f commented Oct 4, 2022

Looking at https://docs.openshift.com/container-platform/4.11/nodes/nodes/nodes-nodes-managing.html it seems like cluster admins can create node pools. Iiuc this exposes telemetry to potentially unbounded cardinality if the name label is included.

Do you need the node pool name in telemetry? If not I'd recommend to aggregate that metric in a recording rule (sum(os_image_url_override)) and send that. Or maybe its an option to only send specific label values (i.e. pool names), like mater and worker?

@sinnykumari
Copy link
Contributor Author

sinnykumari commented Oct 4, 2022

Looking at https://docs.openshift.com/container-platform/4.11/nodes/nodes/nodes-nodes-managing.html it seems like cluster admins can create node pools. Iiuc this exposes telemetry to potentially unbounded cardinality if the name label is included.

Yes, cluster admin can create a custom MachineConfig pool. But this is usually on need basis when they want to logically do something on a set of nodes.

Do you need the node pool name in telemetry? If not I'd recommend to aggregate that metric in a recording rule (sum(os_image_url_override)) and send that.

We can do but that wouldn't be ideal as knowing details of which pool user is overriding would give better information.

Or maybe its an option to only send specific label values (i.e. pool names), like mater and worker?

We are sending MachineConfigPool name, see https://github.com/openshift/machine-config-operator/blob/master/pkg/controller/render/render_controller.go#L492 which would translate to master, worker or any custom pool created by cluster admin. Am I still missing something?

@simonpasquier
Copy link
Contributor

If the pool name is user-defined, the cardinality is virtually unbounded then since we have no control on the possible values...

@sinnykumari
Copy link
Contributor Author

So if I understand, even if we aggregate (sum(os_image_url_override)) and use that to send it to telemetry, it would still be unbound cardinality since number of MCP would change and which would cause aggregate value keep changing? What would be the recommendation here? One I can think of is go with boolean value (sets to 1 if any pool in the cluster i using osImage url override) or something better we can d here?

@jan--f
Copy link
Contributor

jan--f commented Oct 4, 2022

@sinnykumari aggregating the metric would indeed get rid of the cardinality problem. When we refer to cardinality we talk about how many series a particular metric will create. A series is defined by all distinct label value sets. So for this metric, we have only one label but potentially an unlimited set of label values. So for telemetry we have to consider that this metric creates an unbounded number of series. The metric value can be completely ignored for this discussion.

@sinnykumari
Copy link
Contributor Author

Ah, thanks. Makes sense to me now.

@mrunalp @sdodson @cgwalters Are we ok with aggregating osImangeURL override data for all MCPs to overcome unbounded cardinality problem?

@cgwalters
Copy link
Member

It might be interesting to distinguish between control-plane/master and workers, but on the other hand with hypershift that's irrelevant, so yes I'd say let's just squash to sum(os_image_url_override) indeed.

@sdodson
Copy link
Member

sdodson commented Oct 4, 2022

Do we believe that most custom pools would be derived from the worker pool where the override is set and inherited by all custom pools? If we limited ourselves to looking at just master, worker, infra, and a potential future control-plane (as the hopeful more inclusively named replacement for master) would that fit our cardinality constraints and provide any additional insight to the MCO/CoreOS teams?

I'm fine with squashing this entirely so we end up with a simple way to answer yes/no a cluster is using custom base image. I'm just poking around to see if there's an easy way to get greater fidelity, if the team would like that.

sinnykumari added a commit to sinnykumari/machine-config-operator that referenced this pull request Oct 4, 2022
@sinnykumari
Copy link
Contributor Author

sinnykumari commented Oct 4, 2022

Do we believe that most custom pools would be derived from the worker pool where the override is set and inherited by all custom pools?

we can't be sure as it is possible that worker pool doesn't contain override but some of the custom pools does and vice versa.

I'm fine with squashing this entirely so we end up with a simple way to answer yes/no a cluster is using custom base image.

+1

I'm just poking around to see if there's an easy way to get greater fidelity, if the team would like that.

At this point I am not sure what additional advantage we can get with getting seperate information on master(controleplane) and worker pool override. I think with aggregate value we should have enough signal on overrides for a cluster. Perhaps we can revisit in future if doing something more fine grained will help us with certain usecase.

@sdodson
Copy link
Member

sdodson commented Oct 4, 2022

At this point I am not sure what additional advantage we can get with getting seperate information on master(controleplane) and worker pool override. I think with aggregate value we should have enough signal on overrides for a cluster.

Ok, sounds good to me.

@sinnykumari
Copy link
Contributor Author

/retest

@sinnykumari
Copy link
Contributor Author

MCO PR openshift/machine-config-operator#3363 to collect aggregated data of os_image_url metric has been merged. I have updated this PR accordingly.
@jan--f @simonpasquier PTAL

@sinnykumari
Copy link
Contributor Author

these failing test doesn't look to me could have caused by this PR (I could be wrong though), giving it another run
/retest

@sinnykumari
Copy link
Contributor Author

/retest

@jan--f
Copy link
Contributor

jan--f commented Oct 7, 2022

/lgtm
Thanks!

@jan--f
Copy link
Contributor

jan--f commented Oct 7, 2022

/label docs-approved
/label px-approved
/label qe-approved
For a telemetry addition.

@openshift-ci openshift-ci bot added docs-approved Signifies that Docs has signed off on this PR px-approved Signifies that Product Support has signed off on this PR qe-approved Signifies that QE has signed off on this PR labels Oct 7, 2022
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 7, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 7, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jan--f, sinnykumari

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 7, 2022
@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD fed0616 and 2 for PR HEAD 11cdd5e in total

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 7cb5a85 and 1 for PR HEAD 11cdd5e in total

@sinnykumari
Copy link
Contributor Author

/retest

@sinnykumari sinnykumari changed the title MCO-371: Add os_image_url_override metric from MCO to telemetry OCPBUGS-2138: Add os_image_url_override metric from MCO to telemetry Oct 10, 2022
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Oct 10, 2022
@openshift-ci-robot
Copy link
Contributor

@sinnykumari: This pull request references Jira Issue OCPBUGS-2138, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.12.0) matches configured target version for branch (4.12.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @rioliu-rh

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Recently we added os_image_url_override metric in MCO openshift/machine-config-operator#3343 which we would like to get added to telemetry as well. See https://github.com/openshift/enhancements/blob/master/enhancements/ocp-coreos-layering/ocp-coreos-layering.md#telemetry

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from rioliu-rh October 10, 2022 08:21
@sinnykumari
Copy link
Contributor Author

/jira refresh

@openshift-ci-robot
Copy link
Contributor

@sinnykumari: This pull request references Jira Issue OCPBUGS-2138, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.12.0) matches configured target version for branch (4.12.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @rioliu-rh

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sinnykumari
Copy link
Contributor Author

/retest

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 10, 2022

@sinnykumari: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 08223fd into openshift:master Oct 10, 2022
@openshift-ci-robot
Copy link
Contributor

@sinnykumari: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-2138 has been moved to the MODIFIED state.

In response to this:

Recently we added os_image_url_override metric in MCO openshift/machine-config-operator#3343 which we would like to get added to telemetry as well. See https://github.com/openshift/enhancements/blob/master/enhancements/ocp-coreos-layering/ocp-coreos-layering.md#telemetry

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. docs-approved Signifies that Docs has signed off on this PR jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. px-approved Signifies that Product Support has signed off on this PR qe-approved Signifies that QE has signed off on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants