Enable defaults from upstream e2e framework, including logging #11001
Prints extended failure information
Wat:
@ncdc @derekwaynecarr our mystery failure? Pod terminating too fast and racing?
End result is the pod never leaves Pending.
[merge] since this "passed" in surfacing our necessary debug info.
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/9009/) (Image: devenv-rhel7_5042)
Evaluated for origin merge up to 99ba1e1
Namespace: extended-test-cli-deployment-21lru-slgxs
I don't see anything in the docker journal in between that looks suspicious.
I'm confused why the RC creates the pod deployment-test-2-v78ig, it gets scheduled to the node, the RC deletes the pod, and then the node continues to try to run it. Mostly I'm confused by the RC deleting the pod. Maybe this is a bug in our test?
Maaaybe. I'd be more inclined to leap to wild accusations that the kubelet is losing track of fast-exiting containers, given the history of that. Also, in OpenShift the single-process master+node+etcd stresses the heck out of anything performance-sensitive, so it could be that we're seeing a really rare race ~10% of the time rather than 0.001%. The node could be very far behind on its queue depth.
[test]
Evaluated for origin test up to 99ba1e1
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/9009/)
If kubelet queue depth is in the metrics, maybe we should grab that section of it during a failure in e2e. A rough sketch of what that grab could look like is below.
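A minimal sketch of that idea, assuming the kubelet still serves a plaintext Prometheus endpoint (historically on the read-only port 10255) and that the interesting series carry "queue" or "latency" in their names; the helper name, port, and filters are illustrative assumptions, not taken from this PR:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// dumpKubeletQueueMetrics fetches the kubelet metrics endpoint and prints any
// line that looks like a queue-depth or latency series. The port and the
// metric-name filters are assumptions for illustration.
func dumpKubeletQueueMetrics(nodeAddr string) error {
	resp, err := http.Get(fmt.Sprintf("http://%s:10255/metrics", nodeAddr))
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "kubelet_") &&
			(strings.Contains(line, "latency") || strings.Contains(line, "queue")) {
			fmt.Println(line)
		}
	}
	return scanner.Err()
}

func main() {
	if err := dumpKubeletQueueMetrics("127.0.0.1"); err != nil {
		fmt.Println("metrics grab failed:", err)
	}
}
```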
This is how our deployments work. The deployment config controller notices that a new version needs to be deployed, so it cancels the previously running RC. Once cancelled, the RC is scaled down and the new one is created (see the sketch below). This test specifically is probably the one that is stress-testing this exact scenario, and from the logs it seems to be working as expected.
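To make the ordering concrete, here is a toy sketch of that cancel-then-replace flow; every type and function name here is hypothetical, a stand-in for the real deployment config controller logic:

```go
package main

import "fmt"

// rc is a toy stand-in for a ReplicationController.
type rc struct {
	name      string
	replicas  int
	cancelled bool
}

// rollForward mirrors the flow described above: cancel the running RC, scale
// it to zero (its pods, like deployment-test-2-v78ig, get deleted here), and
// only then create the replacement RC.
func rollForward(old *rc, newName string) *rc {
	old.cancelled = true
	old.replicas = 0
	fmt.Printf("scaled %s to 0, creating %s\n", old.name, newName)
	return &rc{name: newName, replicas: 1}
}

func main() {
	old := &rc{name: "deployment-test-2", replicas: 1}
	_ = rollForward(old, "deployment-test-3")
}
```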
Ok, let's ignore deployment-test-2 as that's not what caused this test to fail directly. It was in fact deployment-test-3 that caused the failure, and that's because of this timeline:
So it looks like it's waiting ~20 seconds for the deployment to complete before it times out. Do we need to increase that value?
Maybe, but the fact is that we started seeing this flake in all deployment tests much more frequently after the rebase than before; Clayton's numbers reflect the reality.
I would like a test failure analysis that points at the place(s) in the code that result in this failing after only 20 seconds. So far I haven't found it. |
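For reference, such a ceiling usually takes the shape of a poll loop with a hard timeout wrapped around the "deployment complete" check. A sketch under that assumption follows, with the constant, package path, and helper name all assumed rather than taken from the test code being discussed:

```go
package e2e

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// deploymentTimeout is an assumed stand-in for the ~20s ceiling discussed
// above; the real constant would live in the extended test helpers.
const deploymentTimeout = 20 * time.Second

// waitForDeploymentComplete polls once per second until the supplied check
// succeeds or the timeout expires, returning wait.ErrWaitTimeout on expiry.
func waitForDeploymentComplete(isComplete func() (bool, error)) error {
	return wait.Poll(time.Second, deploymentTimeout, isComplete)
}
```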
It would have transitioned to Running if it had not timed out after 20s.
Thank you! Now at least I don't feel like I'm crazy :-) It sounds like the average latency for pod creation, scheduling, infra container creation+starting, and actual container creation+starting has possibly increased post-rebase. Would you agree?
Definitely. The increase, though, is substantial.
Ok, I'd recommend creating a new issue and having @derekwaynecarr assign it.
Opened #11016
[test] @kargakis if this passes I'll merge it; it re-enables the upstream kube debugging on events.
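In spirit, what that event debugging buys us on a failure is a dump of every event in the test namespace. A hedged sketch of such a dump, written against recent client-go signatures rather than the client vendored at the time, with the helper name hypothetical:

```go
package e2e

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// dumpEvents lists and prints every event in the given namespace, roughly
// what the upstream e2e framework logs when a test fails.
func dumpEvents(c kubernetes.Interface, ns string) error {
	events, err := c.CoreV1().Events(ns).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, e := range events.Items {
		fmt.Printf("%s %s/%s: %s\n", e.LastTimestamp,
			e.InvolvedObject.Kind, e.InvolvedObject.Name, e.Message)
	}
	return nil
}
```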