OCP3 testing problems #4128

Closed
pebrc opened this issue Jan 15, 2021 · 13 comments

@pebrc
Collaborator

pebrc commented Jan 15, 2021

Assuming #4118 is fixed, here are a few observations on testing on OCP3, on CI or with our e2e test suite in general:

  • All Beats or Agent tests need a manual adjustment of the SELinux security context, with something like sudo chcon -Rt svirt_sandbox_file_t /var/lib/e2e-{random-string}-mercury/fb-{random-string}/filebeat-data, because they use hostPath volumes.
  • Beats and Agent also need an SCC for their privileged access, but we only install the SCC if monitoring is turned on. This should be separated.
@pebrc pebrc added :ci Things related to Continuous Integration, automation and releases >test Related to unit/integration/e2e tests labels Jan 15, 2021
@thbkrkr
Contributor

thbkrkr commented Mar 10, 2021

The last run of the job failed because of:

18:19:57  === FAIL: test/e2e/agent TestSystemIntegrationConfig/Agent_status_should_be_updated (1800.00s)
18:19:57  Retries (30m0s timeout): ..................................................................................
18:19:57      utils.go:85: 
18:19:57          	Error Trace:	utils.go:85
18:19:57          	Error:      	Received unexpected error:
18:19:57          	            	
   expected status {Version:7.11.0 ExpectedNodes:0 AvailableNodes:0 Health:green ElasticsearchAssociationsStatus:} 
           but got {Version:7.11.0 ExpectedNodes:0 AvailableNodes:0 Health:yellow ElasticsearchAssociationsStatus:}
18:19:57          	Test:       	TestSystemIntegrationConfig/Agent_status_should_be_updated

@pebrc
Collaborator Author

pebrc commented Mar 11, 2021

One thing to look into is whether it would be possible to run the Beats and Agent tests with emptyDir instead of hostPath volumes to step around the OCP3 SELinux restrictions.

Another workaround would be to adopt an approach similar to our current ARM testing strategy: use Go build constraints to run only the tests we know can work, that is Elasticsearch, Kibana and Enterprise Search.
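A minimal sketch of the build-constraint idea, under stated assumptions: the tag name ocp3 and the function beatAndAgentTestsEnabled are hypothetical and do not come from the repository. In practice the constraint would more likely sit at the top of the Beats and Agent _test.go files so that they are not compiled at all for an OCP3 run. Toolchains older than Go 1.17 use the `// +build !ocp3` form instead.

```go
//go:build !ocp3

// This file is compiled only when the hypothetical "ocp3" build tag is NOT
// set, i.e. for every build except one started with `-tags ocp3`.
package main

import "fmt"

// beatAndAgentTestsEnabled reports whether the Beats and Agent e2e tests are
// part of this build. A sibling file guarded by `//go:build ocp3` (not shown)
// would define the same function and return false.
func beatAndAgentTestsEnabled() bool {
	return true
}

func main() {
	fmt.Println("Beats/Agent tests enabled:", beatAndAgentTestsEnabled())
}
```

The trade-off is that excluded tests disappear silently from the OCP3 job, so the list of excluded files has to be revisited when the platform restrictions change.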

@thbkrkr
Contributor

thbkrkr commented Mar 11, 2021

Thanks for the hints.

I'm not yet sure about the root cause of the failure. As in CI, I can see that the agent has a yellow health and its daemonset gets only 3 replicas out of 5. I assume it's because there are 5 nodes in the k8s cluster (1 lb, 1 master and 3 workers) but only 3 accept our workload?

I'm trying to reproduce what is happening, but I don't get exactly the same behaviour as in CI.

When I run TESTS_MATCH=TestSystemIntegrationConfig make e2e-local (which may be the reason why something differs from CI), the container used to test the agent (ubuntu & while true; echo; sleep; done) is in the CreateContainerConfigError status with the message "container has runAsNonRoot and image will run as root". This does not happen on CI. I see that the two pods get different SCCs. The one in CI gets the restricted scc, mine gets the anyuid SCC. I'm looking for where this difference comes from.

@thbkrkr thbkrkr self-assigned this Mar 11, 2021
@barkbay
Contributor

barkbay commented Mar 15, 2021

The one in CI gets the restricted scc, mine gets the anyuid SCC. I'm looking for where this difference comes from.

I think this is because in your case the Pod is created with an admin account, which might belong to the cluster-admins group, while in CI the Pod is created with the account of the e2e job?

oc get scc anyuid -o yaml
kind: SecurityContextConstraints
metadata:
  annotations:
    kubernetes.io/description: anyuid provides all features of the restricted SCC but allows users to run with any UID and any GID.
  name: anyuid
priority: 10
readOnlyRootFilesystem: false
groups:
- system:cluster-admins <-- here
...

@thbkrkr
Contributor

thbkrkr commented Mar 15, 2021

I think this is because in your case the Pod is created with an admin account, which might belong to the cluster-admins group, while in the CI the Pod is created with the account of e2e job?

Yes, that's it!

@barkbay
Contributor

barkbay commented Mar 15, 2021

I assume it's because there are 5 nodes in the k8s cluster (1 lb, 1 master and 3 workers) but only 3 accept our workload?

Just found this in the master configuration:

projectConfig:
  defaultNodeSelector: node-role.kubernetes.io/compute=true

We need to add the openshift.io/node-selector annotation to the namespace to deploy the Pods on all the nodes:

oc annotate namespace e2e-xxxxx openshift.io/node-selector=""

I guess taints and tolerations were not considered stable back in the old days of K8s 1.11 😄

@thbkrkr
Contributor

thbkrkr commented Mar 15, 2021

Well found! 🙇

I'll prepare a fix.

@thbkrkr
Contributor

thbkrkr commented Mar 16, 2021

The new job execution failed, this time on TestSystemIntegrationRecipe/ES_data_should_pass_validations.

We disabled all agent tests for OCP but not for OCP3.

// OpenShift requires different securityContext than provided in the recipe.
// Skipping it altogether to reduce maintenance burden.
if test.Ctx().Provider == "ocp" {
	t.SkipNow()
}

I am not very convinced by the reason given in the comment to disable these tests: reduce maintenance burden.
So I'm not very keen on disabling testing on OCP3 as well right now.

Let's invest a little more time to see what is missing for the above test to pass on OCP3.

I didn't fully understand what it costs, in particular the need for a manual adjustment of the SELinux security context.
Let's disable the agent and beat recipe tests for OCP3 as we already do for OCP.
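One way to extend the existing skip to OCP3 is to keep the excluded providers in a single place. This is a sketch under assumptions: only the provider name "ocp" appears in the snippet above; the name "ocp3", the helper skipRecipeTests and the example provider "gke" are hypothetical.

```go
package main

import "fmt"

// providersWithoutRecipeTests lists providers on which the agent and beat
// recipe tests are skipped. "ocp" comes from the existing check; "ocp3" is an
// assumed provider name for the OCP3 job.
var providersWithoutRecipeTests = map[string]bool{
	"ocp":  true,
	"ocp3": true,
}

// skipRecipeTests reports whether the recipe tests should be skipped for the
// given provider.
func skipRecipeTests(provider string) bool {
	return providersWithoutRecipeTests[provider]
}

func main() {
	for _, provider := range []string{"ocp", "ocp3", "gke"} {
		fmt.Printf("%s: skip=%t\n", provider, skipRecipeTests(provider))
	}
}
```

In a test, the condition `test.Ctx().Provider == "ocp"` would then become a call such as `skipRecipeTests(test.Ctx().Provider)` ahead of `t.SkipNow()`.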

@thbkrkr
Contributor

thbkrkr commented Apr 2, 2021

Since #4388, TestVersionUpgradeOrdering/Beat_status_should_be_updated fails.

Error:      	Received unexpected error:
   expected status {Version:7.11.0 ExpectedNodes:0 AvailableNodes:0 Health:green ElasticsearchAssociationStatus: KibanaAssociationStatus:} 
   but got         {Version:7.11.0 ExpectedNodes:0 AvailableNodes:0 Health:red   ElasticsearchAssociationStatus: KibanaAssociationStatus:}

Filebeat health is red.
Filebeat pods are in CrashLoopBackOff state.
The log contains the error: Failed to create Beat meta file: open /usr/share/filebeat/data/meta.json.new: permission denied.

@thbkrkr
Contributor

thbkrkr commented Apr 30, 2021

https://devops-ci.elastic.co/view/cloud-on-k8s/job/cloud-on-k8s-e2e-tests-ocp3/17 failed during the initialization of the test run with:

14:55:23  Error: exec: "oc": executable file not found in $PATH

In #4348, we added a dependency on the oc binary to reset the node selector of managed namespaces. In #4387, we moved the oc binary from the base CI Docker image (.ci/Dockerfile) to a dedicated image for OCP (hack/deployer/clients/ocp/Dockerfile).

This breaks the test setup for OCP3.

@pebrc
Collaborator Author

pebrc commented Apr 30, 2021

In #4348, we added a dependency on the oc binary to reset the node selector of managed namespaces. In #4387, we moved the oc binary from the base CI Docker image (.ci/Dockerfile) to a dedicated image for OCP (hack/deployer/clients/ocp/Dockerfile).

#4348 just sets an annotation. We could just use kubectl which is still in the CI image.

@thbkrkr
Contributor

thbkrkr commented Apr 30, 2021

#4348 just sets an annotation. We could just use kubectl which is still in the CI image.

Yes, I will do that.

@thbkrkr
Contributor

thbkrkr commented Sep 27, 2021

cloud-on-k8s-e2e-tests-ocp3#37 triggered with ECK 1.8.0 succeeded :-)

@thbkrkr thbkrkr closed this as completed Sep 27, 2021