Enable defaults from upstream e2e framework, including logging #11001
Prints extended failure information
Wat:
@ncdc @derekwaynecarr our mystery failure? Pod terminating too fast and racing?
End result is the pod never leaves Pending.
[merge] since this "passed" in surfacing our necessary debug info.
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/9009/) (Image: devenv-rhel7_5042)
Evaluated for origin merge up to 99ba1e1
Namespace: extended-test-cli-deployment-21lru-slgxs
I don't see anything in the docker journal in between that looks suspicious.
I'm confused why the RC creates the pod deployment-test-2-v78ig, it gets scheduled to the node, the RC deletes the pod, and then the node continues to try to run it. Mostly I'm confused by the RC deleting the pod. Maybe this is a bug in our test?
Maaaybe. I'd be more inclined to leap to wild accusations that the kubelet is losing track of fast-exiting containers, given the history of that. Also, in OpenShift the single-process master+node+etcd stresses the heck out of anything performance-sensitive, so it could be that we're seeing a really rare race ~10% of the time rather than 0.001%. The node could be very far behind on its queue depth.
[test]
Evaluated for origin test up to 99ba1e1
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/9009/)
If kubelet queue depth is in the metrics, maybe we should grab that section of it during a failure in e2e. A rough sketch of what that grab could look like is below.
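A minimal sketch of that idea, assuming the kubelet still serves a plaintext Prometheus endpoint (historically on the read-only port 10255) and that the interesting series carry "queue" or "latency" in their names; the helper name, port, and filters are illustrative assumptions, not taken from this PR:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// dumpKubeletQueueMetrics fetches the kubelet metrics endpoint and prints any
// line that looks like a queue-depth or latency series. The port and the
// metric-name filters are assumptions for illustration.
func dumpKubeletQueueMetrics(nodeAddr string) error {
	resp, err := http.Get(fmt.Sprintf("http://%s:10255/metrics", nodeAddr))
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "kubelet_") &&
			(strings.Contains(line, "latency") || strings.Contains(line, "queue")) {
			fmt.Println(line)
		}
	}
	return scanner.Err()
}

func main() {
	if err := dumpKubeletQueueMetrics("127.0.0.1"); err != nil {
		fmt.Println("metrics grab failed:", err)
	}
}
```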
This is how our deployments work. The deployment config controller notices that a new version needs to be deployed, so it cancels the previously running RC. Once cancelled, the RC is scaled down and the new one is created (see the sketch below). This test specifically is probably the one that is stress-testing this exact scenario, and from the logs it seems to be working as expected.
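To make the ordering concrete, here is a toy sketch of that cancel-then-replace flow; every type and function name here is hypothetical, a stand-in for the real deployment config controller logic:

```go
package main

import "fmt"

// rc is a toy stand-in for a ReplicationController.
type rc struct {
	name      string
	replicas  int
	cancelled bool
}

// rollForward mirrors the flow described above: cancel the running RC, scale
// it to zero (its pods, like deployment-test-2-v78ig, get deleted here), and
// only then create the replacement RC.
func rollForward(old *rc, newName string) *rc {
	old.cancelled = true
	old.replicas = 0
	fmt.Printf("scaled %s to 0, creating %s\n", old.name, newName)
	return &rc{name: newName, replicas: 1}
}

func main() {
	old := &rc{name: "deployment-test-2", replicas: 1}
	_ = rollForward(old, "deployment-test-3")
}
```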
Ok, let's ignore deployment-test-2 as that's not what caused this test to fail directly. It was in fact deployment-test-3 that caused the failure, and that's because of this timeline:
So it looks like it's waiting ~20 seconds for the deployment to complete before it times out. Do we need to increase that value?
Maybe, but the fact is that we started seeing this flake in all deployment tests much more frequently after the rebase than before; Clayton's numbers reflect the reality.
I would like a test failure analysis that points at the place(s) in the code that result in this failing after only 20 seconds. So far I haven't found it. |
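For reference, such a ceiling usually takes the shape of a poll loop with a hard timeout wrapped around the "deployment complete" check. A sketch under that assumption follows, with the constant, package path, and helper name all assumed rather than taken from the test code being discussed:

```go
package e2e

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// deploymentTimeout is an assumed stand-in for the ~20s ceiling discussed
// above; the real constant would live in the extended test helpers.
const deploymentTimeout = 20 * time.Second

// waitForDeploymentComplete polls once per second until the supplied check
// succeeds or the timeout expires, returning wait.ErrWaitTimeout on expiry.
func waitForDeploymentComplete(isComplete func() (bool, error)) error {
	return wait.Poll(time.Second, deploymentTimeout, isComplete)
}
```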
It would have transitioned to Running if it had not timed out after 20s.
Thank you! Now at least I don't feel like I'm crazy :-) It sounds like the average latency for pod creation, scheduling, infra container creation+starting, and actual container creation+starting has possibly increased post-rebase. Would you agree?
Definitely. The increase, though, is substantial.
Ok, I'd recommend creating a new issue and having @derekwaynecarr assign it.
Opened #11016
[test] @kargakis if this passes I'll merge it; it re-enables the upstream kube debugging on events.
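In spirit, what that event debugging buys us on a failure is a dump of every event in the test namespace. A hedged sketch of such a dump, written against recent client-go signatures rather than the client vendored at the time, with the helper name hypothetical:

```go
package e2e

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// dumpEvents lists and prints every event in the given namespace, roughly
// what the upstream e2e framework logs when a test fails.
func dumpEvents(c kubernetes.Interface, ns string) error {
	events, err := c.CoreV1().Events(ns).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, e := range events.Items {
		fmt.Printf("%s %s/%s: %s\n", e.LastTimestamp,
			e.InvolvedObject.Kind, e.InvolvedObject.Name, e.Message)
	}
	return nil
}
```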