
Enable defaults from upstream e2e framework, including logging #11001

Merged 1 commit into openshift:master on Sep 20, 2016

Conversation

smarterclayton
Contributor

Prints extended failure information

[test] @Kargakis if this passes I'll merge it - it re-enables the upstream kube debugging on events

@smarterclayton
Contributor Author

Wat:

Sep 19 19:45:28.583: INFO: At 2016-09-19 19:45:10 -0400 EDT - event for deployment-test-2-v78ig: {kubelet 172.18.13.142} FailedSync: Error syncing pod, skipping: failed to "StartContainer" for "myapp" with RunContainerError: "runContainer: Error response from daemon: cannot join network of a non running container: 624ec0ab71b9c84f8ea3d28395b3302000321ea8b432c10af7c8429044049605

@ncdc @derekwaynecarr our mystery failure? Pod terminating too fast and racing?

@smarterclayton
Contributor Author

End result is the pod never leaves Pending

@smarterclayton
Contributor Author

[merge] since this "passed" in surfacing our necessary debug info.

@smarterclayton
Contributor Author

@Kargakis @mfojtik this is the cause of the deployment hangs

@ncdc
Contributor

ncdc commented Sep 20, 2016

624ec0ab71b9c84f8ea3d28395b3302000321ea8b432c10af7c8429044049605 should be the infra container - can you check its logs / look in journald for docker for it / docker inspect it?
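
For reference, a minimal sketch of those checks (the container id is the one quoted above; it assumes the Docker CLI and journald are available on the node):

  # State, exit code and network settings of the infra container
  docker inspect 624ec0ab71b9c84f8ea3d28395b3302000321ea8b432c10af7c8429044049605

  # Its logs (usually empty for the infra/pause container, but worth a look)
  docker logs 624ec0ab71b9c84f8ea3d28395b3302000321ea8b432c10af7c8429044049605

  # Whatever docker logged about it around the failure window
  journalctl -u docker --since "2016-09-19 19:44" --until "2016-09-19 19:46" | grep 624ec0ab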

@openshift-bot
Contributor

openshift-bot commented Sep 20, 2016

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/9009/) (Image: devenv-rhel7_5042)

@openshift-bot
Contributor

Evaluated for origin merge up to 99ba1e1

@ncdc
Contributor

ncdc commented Sep 20, 2016

Namespace: extended-test-cli-deployment-21lru-slgxs
Pod name: deployment-test-2-v78ig
Pod UID: 14302610-7ec3-11e6-aa6a-0ee1a41b165d
Infra container id: 624ec0ab71b9c84f8ea3d28395b3302000321ea8b432c10af7c8429044049605
Container name: myapp
Container id: c1eea4d1ea68fc218d46fd962cc665253d8f3ab33df0fe56a117a37501c69932

  1. 2016-09-19T19:45:01.874812891-04:00 - infra container started
  2. 2016-09-19T19:45:03 - oci-register-machine prestart
  3. 2016-09-19T19:45:05.692463621-04:00 - infra container stopped
  4. 2016-09-19T19:45:06 - oci-register-machine poststop
  5. 2016-09-19T19:45:08.961262973-04:00 - attempt to start myapp container
  6. 2016-09-19T19:45:10.260855633-04:00 - error recorded for failing to join network ns of non-running container (reported 2x, oddly)

I don't see anything in the docker journal in between that looks suspicious.

@ncdc
Contributor

ncdc commented Sep 20, 2016

Sep 19 19:45:28.583: INFO: At 2016-09-19 19:45:00 -0400 EDT - event for deployment-test-2: {replication-controller } SuccessfulCreate: Created pod: deployment-test-2-v78ig
Sep 19 19:45:28.583: INFO: At 2016-09-19 19:45:00 -0400 EDT - event for deployment-test-2-v78ig: {default-scheduler } Scheduled: Successfully assigned deployment-test-2-v78ig to 172.18.13.142
Sep 19 19:45:28.583: INFO: At 2016-09-19 19:45:04 -0400 EDT - event for deployment-test-2: {replication-controller } SuccessfulDelete: Deleted pod: deployment-test-2-v78ig
Sep 19 19:45:28.583: INFO: At 2016-09-19 19:45:04 -0400 EDT - event for deployment-test-2-v78ig: {kubelet 172.18.13.142} Pulled: Container image "docker.io/centos:centos7" already present on machine
Sep 19 19:45:28.583: INFO: At 2016-09-19 19:45:05 -0400 EDT - event for deployment-test-2-deploy: {kubelet 172.18.13.142} FailedSync: Error syncing pod, skipping: failed to "KillContainer" for "POD" with KillContainerError: "Error response from daemon: Cannot stop container 82cef74059463d3130b165b6bfb438564f267f9e9ada3156948f3c2df98f5034: [2] Container does not exist: container destroyed"

Sep 19 19:45:28.583: INFO: At 2016-09-19 19:45:06 -0400 EDT - event for deployment-test: {deploymentconfig-controller } DeploymentCreated: Created new deployment "deployment-test-3" for version 3
Sep 19 19:45:28.583: INFO: At 2016-09-19 19:45:06 -0400 EDT - event for deployment-test-3-deploy: {default-scheduler } Scheduled: Successfully assigned deployment-test-3-deploy to 172.18.13.142
Sep 19 19:45:28.583: INFO: At 2016-09-19 19:45:07 -0400 EDT - event for deployment-test-2-v78ig: {kubelet 172.18.13.142} Created: Created container with docker id c1eea4d1ea68; Security:[seccomp=unconfined]
Sep 19 19:45:28.583: INFO: At 2016-09-19 19:45:10 -0400 EDT - event for deployment-test-2-v78ig: {kubelet 172.18.13.142} Failed: Failed to start container with docker id c1eea4d1ea68 with error: Error response from daemon: cannot join network of a non running container: 624ec0ab71b9c84f8ea3d28395b3302000321ea8b432c10af7c8429044049605
Sep 19 19:45:28.583: INFO: At 2016-09-19 19:45:10 -0400 EDT - event for deployment-test-2-v78ig: {kubelet 172.18.13.142} FailedSync: Error syncing pod, skipping: failed to "StartContainer" for "myapp" with RunContainerError: "runContainer: Error response from daemon: cannot join network of a non running container: 624ec0ab71b9c84f8ea3d28395b3302000321ea8b432c10af7c8429044049605"

@ncdc
Contributor

ncdc commented Sep 20, 2016

I'm confused why the RC creates the pod deployment-test-2-v78ig, it gets scheduled to the node, the RC deletes the pod, and then the node continues to try to run it. Mostly I'm confused by the RC deleting the pod. Maybe this is a bug in our test?

@smarterclayton
Contributor Author

smarterclayton commented Sep 20, 2016 via email

@smarterclayton
Contributor Author

[test]

@openshift-bot
Contributor

Evaluated for origin test up to 99ba1e1

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/9009/)

@smarterclayton
Contributor Author

If kubelet queue depth is in the metrics, maybe we should grab that section of the metrics output during a failure in e2e
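
A sketch of what grabbing that could look like from the node under test (the kubelet read-only port 10255 and the metric names are assumptions, not confirmed in this thread):

  # Pull the kubelet metrics from the failing node and keep anything queue-related
  curl -s http://172.18.13.142:10255/metrics | grep -i queue | tee kubelet-queue-metrics.txt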

@openshift-bot merged commit 73667b5 into openshift:master on Sep 20, 2016
@0xmichalis
Contributor

Mostly I'm confused by the RC deleting the pod. Maybe this is a bug in our test?

This is how our deployments work. The deployment config controller notices that a new version needs to be deployed, so it cancels the previously running RC. Once cancelled, that RC is scaled down and the new one is created. This specific test is probably the one stress-testing exactly this scenario (and from the logs it seems to be working as expected).

@ncdc
Contributor

ncdc commented Sep 20, 2016

Ok, let's ignore deployment-test-2 as that's not what caused this test to fail directly. It was in fact deployment-test-3 that caused the failure, and that's because of this timeline:

  1. 19:45:05.153 - run oc deploy --latest
  2. 19:45:26.351 - timed out waiting for condition
  3. 19:45:29.832 - deployer container started (from docker journal)

So it looks like it's waiting ~ 20 seconds for the deployment to complete before it times out. Do we need to increase that value?

@0xmichalis
Contributor

So it looks like it's waiting ~ 20 seconds for the deployment to complete before it times out. Do we need to increase that value?

Maybe, but the fact is that we started seeing this flake in all deployment tests much more frequently after the rebase than before; Clayton's numbers reflect the reality.

@ncdc
Contributor

ncdc commented Sep 20, 2016

I would like a test failure analysis that points at the place(s) in the code that result in this failing after only 20 seconds. So far I haven't found it.

@mfojtik
Contributor

mfojtik commented Sep 20, 2016

@Kargakis, @ncdc has a suggestion to add context to timeout errors, so we actually know what times out

@0xmichalis
Contributor

@Kargakis, @ncdc has a suggestion to add context to timeout errors, so we actually know what times out

But we already know what times out, don't we? The deployer pod never transitions to Running. Is there anything more we can get out of it?

@ncdc
Contributor

ncdc commented Sep 20, 2016

It would have transitioned to running if it had not timed out after 20s. Where is that 20 coming from?

@0xmichalis
Contributor

It would have transitioned to running if it had not timed out after 20s. Where is that 20 coming from?

It's currently hardcoded in the deploylog registry. We have a couple of issues for making it configurable: #9572 and #5584
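
For anyone trying to track the value down, a quick search of the origin tree works; the path and constant spelling below are assumptions, not confirmed in this thread:

  # Look for the hardcoded deploy-log timeout in the deploylog registry
  git grep -nE 'defaultTimeout|20 \* time\.Second' -- pkg/deploy/registry/deploylog/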

@ncdc
Contributor

ncdc commented Sep 20, 2016

Thank you! Now at least I don't feel like I'm crazy :-)

It sounds like the average latency for pod creation, scheduling, infra container creation+starting, and actual container creation+starting has possibly increased post-rebase. Would you agree?

@0xmichalis
Contributor

Definitely. The increase though is substantial.

@ncdc
Contributor

ncdc commented Sep 20, 2016

Ok, I'd recommend creating a new issue and having @derekwaynecarr assign it

@0xmichalis
Contributor

Opened #11016
