USHIFT-3133: Skip cloud provider disruption monitors for MicroShift #28767

pacevedom · 2024-05-03T12:58:21Z

These monitors are permafailing in MicroShift as their requirements are not met in MicroShift:
https://prow.ci.openshift.org/job-history/gs/test-platform-results/logs/periodic-ci-openshift-microshift-main-e2e-aws-ovn-ocp-conformance
/cc @dgoodwin

openshift-ci-robot · 2024-05-03T12:58:25Z

@pacevedom: This pull request references USHIFT-3133 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set.

In response to this:

/cc @dgoodwin

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2024-05-03T12:58:58Z

@pacevedom: This pull request references USHIFT-3133 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set.

In response to this:

These monitors are permafailing in MicroShift as their requirements are not met in MicroShift:
https://prow.ci.openshift.org/job-history/gs/test-platform-results/logs/periodic-ci-openshift-microshift-main-e2e-aws-ovn-ocp-conformance
/cc @dgoodwin

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

dgoodwin · 2024-05-03T18:02:49Z

Why does microshift not have an Infrastructure resource? Slightly worried there could be a lot of plumbing out there using that to determine what the cluster is.

pacevedom · 2024-05-06T08:36:21Z

Why does microshift not have an Infrastructure resource? Slightly worried there could be a lot of plumbing out there using that to determine what the cluster is.

There are several reasons for doing the change this way:

MicroShift does not include the config apigroup, which makes any test checking on any of those resources a skip candidate.
There is one way to fake network and infrastructure resources, this is through the use of the config shim defined in extended utils package. WRKLDS-605: e2e: Config v1 client shim for static configuration manifests with read-only operations #27714
The config shim requires running the tests with a specific configuration and manually crafted yaml files.
Disruption tests use DiscoverClusterState which also requires ClusterVersion resource, which is not present either (and cant be faked by the config shim too, only Network and Infrastructure are supported).
Changing the disruption tests to use the config shim may be a much bigger change than this one, and we already had the pattern for skipping some of these tests in microshift.
Disruption tests (and even more so those based on cloud providers) do not apply in MicroShift. We wanted to make this as explicit as possible instead of relying on handcrafted files for the config shim, which change the outcome of the tests depending on local configuration.

However, if you think we should go with a deeper change for all the disruption tests thats also ok with me. I havent checked for any further issues though, so this was just a brief analysis.

dgoodwin · 2024-05-07T11:55:52Z

Why does microshift not have an Infrastructure resource? Slightly worried there could be a lot of plumbing out there using that to determine what the cluster is.

There are several reasons for doing the change this way:

* MicroShift does not include the `config` apigroup, which makes any test checking on any of those resources a skip candidate.

* There is one way to fake `network` and `infrastructure` resources, this is through the use of the config shim defined in extended utils package. [WRKLDS-605: e2e: Config v1 client shim for static configuration manifests with read-only operations #27714](https://github.com/openshift/origin/pull/27714)

* The config shim requires running the tests with a specific configuration and manually crafted yaml files.

Does microshift deploy this shim by default? I'm not aware of all the history here, but all these APIs sound pretty core to OpenShift to me. It seems like the absence of these could cause a lot of problems with operator portability, things suddenly breaking when you deploy in Microshift. If I had to guess, I'd imagine their absence causes a lot of problems for you all, is that the case? Could they be present by default with static responses indicating that this is a Microshift cluster?

Is there another way to detect you're running in a Microshift cluster?

* Disruption tests use [DiscoverClusterState](https://github.com/openshift/origin/blob/master/pkg/clioptions/clusterdiscovery/cluster.go#L96C6-L96C26) which also requires ClusterVersion resource, which is not present either (and cant be faked by the config shim too, only `Network` and `Infrastructure` are [supported](https://github.com/openshift/origin/blob/master/test/extended/util/framework.go#L2239-L2266)).

ClusterVersion is also a broadly used and I would think very important API to have present in any OpenShift cluster, but in this specific case that error handling can probably just be adjusted in origin.

* Changing the disruption tests to use the config shim may be a much bigger change than this one, and we already had the pattern for skipping some of these tests in microshift.

* Disruption tests (and even more so those based on cloud providers) do not apply in MicroShift. We wanted to make this as explicit as possible instead of relying on handcrafted files for the config shim, which change the outcome of the tests depending on local configuration.

Disruption testing will tell you if you're losing connectivity to a variety of endpoints in the cluster during your test run. It's a framework capable of catching the very worst of bugs before they make it to customers. We consider it very high priority, and it seems like it would also be a very good signal that Microshift is healthy.

There are some cloud provider endpoint checks we use to gather broad data that we can reach cloud X, but these have no impact on the cluster under test, they run on the CI cluster and poll static endpoints in the clouds. I don't think any adjustments should be needed for Microshift jobs there afaict.

However, if you think we should go with a deeper change for all the disruption tests thats also ok with me. I havent checked for any further issues though, so this was just a brief analysis.

pacevedom · 2024-05-07T12:48:09Z

Why does microshift not have an Infrastructure resource? Slightly worried there could be a lot of plumbing out there using that to determine what the cluster is.

There are several reasons for doing the change this way:
* MicroShift does not include the `config` apigroup, which makes any test checking on any of those resources a skip candidate.

* There is one way to fake `network` and `infrastructure` resources, this is through the use of the config shim defined in extended utils package. [WRKLDS-605: e2e: Config v1 client shim for static configuration manifests with read-only operations #27714](https://github.com/openshift/origin/pull/27714)

* The config shim requires running the tests with a specific configuration and manually crafted yaml files.
Does microshift deploy this shim by default? I'm not aware of all the history here, but all these APIs sound pretty core to OpenShift to me. It seems like the absence of these could cause a lot of problems with operator portability, things suddenly breaking when you deploy in Microshift. If I had to guess, I'd imagine their absence causes a lot of problems for you all, is that the case? Could they be present by default with static responses indicating that this is a Microshift cluster?

The shim is something you need to configure manually before running the tests, as openshift-tests will check locally whether these files exist or not. Right now we are working on a PR to include a simple shim with both Infrastructure and Network resources. These will help skip more tests as they filter by fields in both api objects. There is an example here.

The absence of these api groups dates back to the origins of MicroShift, where only the minimal amount of APIs where included on purpose (SCCs and routes, specifically). In regards to operators, we just recently added support for OLM, and several upstream operators seem to work (like certmanager). If any OCP operators need to be deployed in MicroShift they can not make any assumptions on these API groups being present.

Is there another way to detect you're running in a Microshift cluster?

The "official" way is to look for the ConfigMap in kube-public namespace, which is what the exutil function is checking.

* Disruption tests use [DiscoverClusterState](https://github.com/openshift/origin/blob/master/pkg/clioptions/clusterdiscovery/cluster.go#L96C6-L96C26) which also requires ClusterVersion resource, which is not present either (and cant be faked by the config shim too, only `Network` and `Infrastructure` are [supported](https://github.com/openshift/origin/blob/master/test/extended/util/framework.go#L2239-L2266)).
ClusterVersion is also a broadly used and I would think very important API to have present in any OpenShift cluster, but in this specific case that error handling can probably just be adjusted in origin.

I am afraid this is also what happens with the other config api group resources. In this case though, the version for MicroShift is taken from the ConfigMap or the binary itself, and there are no special capabilities the cluster may have besides what k8s provides and MicroShift's config allows. Upgrades are also handled in a different way, as it is based in ostree commits.

You mean handling errors from non-existing ClusterVersion resource in origin?

* Changing the disruption tests to use the config shim may be a much bigger change than this one, and we already had the pattern for skipping some of these tests in microshift.

* Disruption tests (and even more so those based on cloud providers) do not apply in MicroShift. We wanted to make this as explicit as possible instead of relying on handcrafted files for the config shim, which change the outcome of the tests depending on local configuration.
Disruption testing will tell you if you're losing connectivity to a variety of endpoints in the cluster during your test run. It's a framework capable of catching the very worst of bugs before they make it to customers. We consider it very high priority, and it seems like it would also be a very good signal that Microshift is healthy.

I agree with the importance of these tests, in fact we are considering introducing the same logic to our own e2e tests, but there are some significant differences.
MicroShift runs the control plane (apiserver, kubelet, controller manager, scheduler, etc) in a single binary as a privileged host application. In other words, the whole control plane goes down at once if there is something wrong and its "outside" of the cluster. Some of those components run as pods in OCP, meaning they have their own resources and ways to reach them, which dont exist in MicroShift.
If we want to have the same kind of tests for origin we may need a separate implementation of those.

There are some cloud provider endpoint checks we use to gather broad data that we can reach cloud X, but these have no impact on the cluster under test, they run on the CI cluster and poll static endpoints in the clouds. I don't think any adjustments should be needed for Microshift jobs there afaict.

MicroShift does not really use the cloud in the ways OpenShift may use it. In fact we are only using AWS as a way of getting VMs where we install our packages.

dgoodwin · 2024-05-08T11:43:35Z

These disruption monitors are just polling from the CI build cluster to static cloud endpoint, they won't impact the microshift cluster under test, I'm surprised that's hitting infra API. However we can live with Microshift not running them, we've got lots of signal from all the other jobs.

/lgtm

openshift-ci · 2024-05-08T11:44:51Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, pacevedom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [dgoodwin]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2024-05-08T12:37:07Z

/retest-required

Remaining retests: 0 against base HEAD 3f62412 and 2 for PR HEAD 5fe6dc8 in total

pacevedom · 2024-05-08T14:56:13Z

/retest-required

openshift-ci-robot · 2024-05-08T18:08:46Z

/retest-required

Remaining retests: 0 against base HEAD 9d46920 and 1 for PR HEAD 5fe6dc8 in total

openshift-ci-robot · 2024-05-08T21:08:40Z

/retest-required

Remaining retests: 0 against base HEAD c5782f9 and 0 for PR HEAD 5fe6dc8 in total

openshift-ci · 2024-05-09T03:25:07Z

@pacevedom: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-ovn-single-node	`5fe6dc8`	link	false	`/test e2e-aws-ovn-single-node`
ci/prow/e2e-aws-ovn-single-node-serial	`5fe6dc8`	link	false	`/test e2e-aws-ovn-single-node-serial`
ci/prow/e2e-aws-ovn-single-node-upgrade	`5fe6dc8`	link	false	`/test e2e-aws-ovn-single-node-upgrade`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-trt-bot · 2024-05-09T03:26:10Z

Job Failure Risk Analysis for sha: 5fe6dc8

Job Name	Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-serial	Medium [bz-networking][invariant] alert/OVNKubernetesResourceRetryFailure should not be at or above info This test has passed 97.83% of 46 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-serial'] in the last 14 days. Open Bugs e2e-vsphere-ovn-serial - alert/OVNKubernetesResourceRetryFailure should not be at or above info

openshift-ci-robot · 2024-05-09T04:08:40Z

/hold

Revision 5fe6dc8 was retested 3 times: holding

pacevedom · 2024-05-09T06:59:58Z

/retest-required

pacevedom · 2024-05-09T08:33:02Z

/hold cancel

pacevedom added 3 commits May 3, 2024 13:15

USHIFT-3133: Skip gcp disruption external monitor for microshift

29c4bd2

USHIFT-3133: Skip azure disruption external monitor for microshift

2074917

USHIFT-3133: Skip aws disruption external monitor for microshift

5fe6dc8

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 3, 2024

openshift-ci bot requested a review from dgoodwin May 3, 2024 12:58

openshift-ci bot assigned dgoodwin May 8, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 8, 2024

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 8, 2024

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 9, 2024

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 9, 2024

openshift-merge-bot bot merged commit 8045e28 into openshift:master May 9, 2024
20 of 23 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

USHIFT-3133: Skip cloud provider disruption monitors for MicroShift #28767

USHIFT-3133: Skip cloud provider disruption monitors for MicroShift #28767

pacevedom commented May 3, 2024 •

edited

Loading

openshift-ci-robot commented May 3, 2024 •

edited by openshift-ci bot

Loading

openshift-ci-robot commented May 3, 2024 •

edited by openshift-ci bot

Loading

dgoodwin commented May 3, 2024

pacevedom commented May 6, 2024 •

edited

Loading

dgoodwin commented May 7, 2024

pacevedom commented May 7, 2024

dgoodwin commented May 8, 2024

openshift-ci bot commented May 8, 2024

openshift-ci-robot commented May 8, 2024

pacevedom commented May 8, 2024

openshift-ci-robot commented May 8, 2024

openshift-ci-robot commented May 8, 2024

openshift-ci bot commented May 9, 2024 •

edited

Loading

openshift-trt-bot commented May 9, 2024

openshift-ci-robot commented May 9, 2024

pacevedom commented May 9, 2024

pacevedom commented May 9, 2024

USHIFT-3133: Skip cloud provider disruption monitors for MicroShift #28767

USHIFT-3133: Skip cloud provider disruption monitors for MicroShift #28767

Conversation

pacevedom commented May 3, 2024 • edited Loading

openshift-ci-robot commented May 3, 2024 • edited by openshift-ci bot Loading

openshift-ci-robot commented May 3, 2024 • edited by openshift-ci bot Loading

dgoodwin commented May 3, 2024

pacevedom commented May 6, 2024 • edited Loading

dgoodwin commented May 7, 2024

pacevedom commented May 7, 2024

dgoodwin commented May 8, 2024

openshift-ci bot commented May 8, 2024

openshift-ci-robot commented May 8, 2024

pacevedom commented May 8, 2024

openshift-ci-robot commented May 8, 2024

openshift-ci-robot commented May 8, 2024

openshift-ci bot commented May 9, 2024 • edited Loading

openshift-trt-bot commented May 9, 2024

openshift-ci-robot commented May 9, 2024

pacevedom commented May 9, 2024

pacevedom commented May 9, 2024

pacevedom commented May 3, 2024 •

edited

Loading

openshift-ci-robot commented May 3, 2024 •

edited by openshift-ci bot

Loading

openshift-ci-robot commented May 3, 2024 •

edited by openshift-ci bot

Loading

pacevedom commented May 6, 2024 •

edited

Loading

openshift-ci bot commented May 9, 2024 •

edited

Loading