Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

USHIFT-3133: Skip cloud provider disruption monitors for MicroShift #28767

Merged
merged 3 commits into from
May 9, 2024

Conversation

pacevedom
Copy link
Contributor

@pacevedom pacevedom commented May 3, 2024

These monitors are permafailing in MicroShift as their requirements are not met in MicroShift:
https://prow.ci.openshift.org/job-history/gs/test-platform-results/logs/periodic-ci-openshift-microshift-main-e2e-aws-ovn-ocp-conformance
/cc @dgoodwin

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 3, 2024
@openshift-ci-robot
Copy link

openshift-ci-robot commented May 3, 2024

@pacevedom: This pull request references USHIFT-3133 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set.

In response to this:

/cc @dgoodwin

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from dgoodwin May 3, 2024 12:58
@openshift-ci-robot
Copy link

openshift-ci-robot commented May 3, 2024

@pacevedom: This pull request references USHIFT-3133 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set.

In response to this:

These monitors are permafailing in MicroShift as their requirements are not met in MicroShift:
https://prow.ci.openshift.org/job-history/gs/test-platform-results/logs/periodic-ci-openshift-microshift-main-e2e-aws-ovn-ocp-conformance
/cc @dgoodwin

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@dgoodwin
Copy link
Contributor

dgoodwin commented May 3, 2024

Why does microshift not have an Infrastructure resource? Slightly worried there could be a lot of plumbing out there using that to determine what the cluster is.

@pacevedom
Copy link
Contributor Author

pacevedom commented May 6, 2024

Why does microshift not have an Infrastructure resource? Slightly worried there could be a lot of plumbing out there using that to determine what the cluster is.

There are several reasons for doing the change this way:

  • MicroShift does not include the config apigroup, which makes any test checking on any of those resources a skip candidate.
  • There is one way to fake network and infrastructure resources, this is through the use of the config shim defined in extended utils package. WRKLDS-605: e2e: Config v1 client shim for static configuration manifests with read-only operations #27714
  • The config shim requires running the tests with a specific configuration and manually crafted yaml files.
  • Disruption tests use DiscoverClusterState which also requires ClusterVersion resource, which is not present either (and cant be faked by the config shim too, only Network and Infrastructure are supported).
  • Changing the disruption tests to use the config shim may be a much bigger change than this one, and we already had the pattern for skipping some of these tests in microshift.
  • Disruption tests (and even more so those based on cloud providers) do not apply in MicroShift. We wanted to make this as explicit as possible instead of relying on handcrafted files for the config shim, which change the outcome of the tests depending on local configuration.

However, if you think we should go with a deeper change for all the disruption tests thats also ok with me. I havent checked for any further issues though, so this was just a brief analysis.

@dgoodwin
Copy link
Contributor

dgoodwin commented May 7, 2024

Why does microshift not have an Infrastructure resource? Slightly worried there could be a lot of plumbing out there using that to determine what the cluster is.

There are several reasons for doing the change this way:

* MicroShift does not include the `config` apigroup, which makes any test checking on any of those resources a skip candidate.

* There is one way to fake `network` and `infrastructure` resources, this is through the use of the config shim defined in extended utils package. [WRKLDS-605: e2e: Config v1 client shim for static configuration manifests with read-only operations #27714](https://github.com/openshift/origin/pull/27714)

* The config shim requires running the tests with a specific configuration and manually crafted yaml files.

Does microshift deploy this shim by default? I'm not aware of all the history here, but all these APIs sound pretty core to OpenShift to me. It seems like the absence of these could cause a lot of problems with operator portability, things suddenly breaking when you deploy in Microshift. If I had to guess, I'd imagine their absence causes a lot of problems for you all, is that the case? Could they be present by default with static responses indicating that this is a Microshift cluster?

Is there another way to detect you're running in a Microshift cluster?

* Disruption tests use [DiscoverClusterState](https://github.com/openshift/origin/blob/master/pkg/clioptions/clusterdiscovery/cluster.go#L96C6-L96C26) which also requires ClusterVersion resource, which is not present either (and cant be faked by the config shim too, only `Network` and `Infrastructure` are [supported](https://github.com/openshift/origin/blob/master/test/extended/util/framework.go#L2239-L2266)).

ClusterVersion is also a broadly used and I would think very important API to have present in any OpenShift cluster, but in this specific case that error handling can probably just be adjusted in origin.

* Changing the disruption tests to use the config shim may be a much bigger change than this one, and we already had the pattern for skipping some of these tests in microshift.

* Disruption tests (and even more so those based on cloud providers) do not apply in MicroShift. We wanted to make this as explicit as possible instead of relying on handcrafted files for the config shim, which change the outcome of the tests depending on local configuration.

Disruption testing will tell you if you're losing connectivity to a variety of endpoints in the cluster during your test run. It's a framework capable of catching the very worst of bugs before they make it to customers. We consider it very high priority, and it seems like it would also be a very good signal that Microshift is healthy.

There are some cloud provider endpoint checks we use to gather broad data that we can reach cloud X, but these have no impact on the cluster under test, they run on the CI cluster and poll static endpoints in the clouds. I don't think any adjustments should be needed for Microshift jobs there afaict.

However, if you think we should go with a deeper change for all the disruption tests thats also ok with me. I havent checked for any further issues though, so this was just a brief analysis.

@pacevedom
Copy link
Contributor Author

Why does microshift not have an Infrastructure resource? Slightly worried there could be a lot of plumbing out there using that to determine what the cluster is.

There are several reasons for doing the change this way:

* MicroShift does not include the `config` apigroup, which makes any test checking on any of those resources a skip candidate.

* There is one way to fake `network` and `infrastructure` resources, this is through the use of the config shim defined in extended utils package. [WRKLDS-605: e2e: Config v1 client shim for static configuration manifests with read-only operations #27714](https://github.com/openshift/origin/pull/27714)

* The config shim requires running the tests with a specific configuration and manually crafted yaml files.

Does microshift deploy this shim by default? I'm not aware of all the history here, but all these APIs sound pretty core to OpenShift to me. It seems like the absence of these could cause a lot of problems with operator portability, things suddenly breaking when you deploy in Microshift. If I had to guess, I'd imagine their absence causes a lot of problems for you all, is that the case? Could they be present by default with static responses indicating that this is a Microshift cluster?

The shim is something you need to configure manually before running the tests, as openshift-tests will check locally whether these files exist or not. Right now we are working on a PR to include a simple shim with both Infrastructure and Network resources. These will help skip more tests as they filter by fields in both api objects. There is an example here.

The absence of these api groups dates back to the origins of MicroShift, where only the minimal amount of APIs where included on purpose (SCCs and routes, specifically). In regards to operators, we just recently added support for OLM, and several upstream operators seem to work (like certmanager). If any OCP operators need to be deployed in MicroShift they can not make any assumptions on these API groups being present.

Is there another way to detect you're running in a Microshift cluster?

The "official" way is to look for the ConfigMap in kube-public namespace, which is what the exutil function is checking.

* Disruption tests use [DiscoverClusterState](https://github.com/openshift/origin/blob/master/pkg/clioptions/clusterdiscovery/cluster.go#L96C6-L96C26) which also requires ClusterVersion resource, which is not present either (and cant be faked by the config shim too, only `Network` and `Infrastructure` are [supported](https://github.com/openshift/origin/blob/master/test/extended/util/framework.go#L2239-L2266)).

ClusterVersion is also a broadly used and I would think very important API to have present in any OpenShift cluster, but in this specific case that error handling can probably just be adjusted in origin.

I am afraid this is also what happens with the other config api group resources. In this case though, the version for MicroShift is taken from the ConfigMap or the binary itself, and there are no special capabilities the cluster may have besides what k8s provides and MicroShift's config allows. Upgrades are also handled in a different way, as it is based in ostree commits.

You mean handling errors from non-existing ClusterVersion resource in origin?

* Changing the disruption tests to use the config shim may be a much bigger change than this one, and we already had the pattern for skipping some of these tests in microshift.

* Disruption tests (and even more so those based on cloud providers) do not apply in MicroShift. We wanted to make this as explicit as possible instead of relying on handcrafted files for the config shim, which change the outcome of the tests depending on local configuration.

Disruption testing will tell you if you're losing connectivity to a variety of endpoints in the cluster during your test run. It's a framework capable of catching the very worst of bugs before they make it to customers. We consider it very high priority, and it seems like it would also be a very good signal that Microshift is healthy.

I agree with the importance of these tests, in fact we are considering introducing the same logic to our own e2e tests, but there are some significant differences.
MicroShift runs the control plane (apiserver, kubelet, controller manager, scheduler, etc) in a single binary as a privileged host application. In other words, the whole control plane goes down at once if there is something wrong and its "outside" of the cluster. Some of those components run as pods in OCP, meaning they have their own resources and ways to reach them, which dont exist in MicroShift.
If we want to have the same kind of tests for origin we may need a separate implementation of those.

There are some cloud provider endpoint checks we use to gather broad data that we can reach cloud X, but these have no impact on the cluster under test, they run on the CI cluster and poll static endpoints in the clouds. I don't think any adjustments should be needed for Microshift jobs there afaict.

MicroShift does not really use the cloud in the ways OpenShift may use it. In fact we are only using AWS as a way of getting VMs where we install our packages.

@dgoodwin
Copy link
Contributor

dgoodwin commented May 8, 2024

These disruption monitors are just polling from the CI build cluster to static cloud endpoint, they won't impact the microshift cluster under test, I'm surprised that's hitting infra API. However we can live with Microshift not running them, we've got lots of signal from all the other jobs.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 8, 2024
Copy link
Contributor

openshift-ci bot commented May 8, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, pacevedom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 8, 2024
@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 3f62412 and 2 for PR HEAD 5fe6dc8 in total

@pacevedom
Copy link
Contributor Author

/retest-required

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 9d46920 and 1 for PR HEAD 5fe6dc8 in total

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD c5782f9 and 0 for PR HEAD 5fe6dc8 in total

Copy link
Contributor

openshift-ci bot commented May 9, 2024

@pacevedom: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node 5fe6dc8 link false /test e2e-aws-ovn-single-node
ci/prow/e2e-aws-ovn-single-node-serial 5fe6dc8 link false /test e2e-aws-ovn-single-node-serial
ci/prow/e2e-aws-ovn-single-node-upgrade 5fe6dc8 link false /test e2e-aws-ovn-single-node-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-trt-bot
Copy link

Job Failure Risk Analysis for sha: 5fe6dc8

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-serial Medium
[bz-networking][invariant] alert/OVNKubernetesResourceRetryFailure should not be at or above info
This test has passed 97.83% of 46 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-serial'] in the last 14 days.

Open Bugs
e2e-vsphere-ovn-serial - alert/OVNKubernetesResourceRetryFailure should not be at or above info

@openshift-ci-robot
Copy link

/hold

Revision 5fe6dc8 was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 9, 2024
@pacevedom
Copy link
Contributor Author

/retest-required

@pacevedom
Copy link
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 9, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 8045e28 into openshift:master May 9, 2024
20 of 23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants