
Wait for old pods to terminate before proceeding to Recreate #11917

Merged
merged 1 commit into openshift:master from 0xmichalis:fix-recreate-deployments on Dec 13, 2016

Conversation

@0xmichalis (Contributor):

@mfojtik should fix https://bugzilla.redhat.com/show_bug.cgi?id=1369644

cc @smarterclayton: opted for this instead of adding a TerminatingReplicas field to the replication controller.

Upstream equivalent PR is kubernetes/kubernetes#36748

@0xmichalis (Contributor Author):

[test]

}
return deletionsNeeded == 0, nil
}
// TODO: Timeout should be config.Spec.Strategy.RecreateParams.TimeoutSeconds - (time.Now - deployerPodCreationTime)
@0xmichalis (Contributor Author):

Opened kubernetes/kubernetes#36813 for this. There is another occurrence in the deployer code where we need the creation time of the deployer pod (when we set activeDeadlineSeconds for hooks).
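
For reference, a minimal sketch of the arithmetic the TODO describes, assuming the deployer pod's creation time is available to the strategy; the `remainingTimeout` helper and package name below are illustrative, not code from this PR:

```go
package recreate

import "time"

// remainingTimeout is an illustrative helper: it subtracts the time already
// spent since the deployer pod was created from the strategy's TimeoutSeconds,
// so waiting for old pods never runs past the overall deployment deadline.
func remainingTimeout(timeoutSeconds int64, deployerPodCreationTime time.Time) time.Duration {
	total := time.Duration(timeoutSeconds) * time.Second
	remaining := total - time.Since(deployerPodCreationTime)
	if remaining < 0 {
		return 0
	}
	return remaining
}
```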

// Watch from the resource version of the list and wait for all pods to be deleted
// before proceeding with the Recreate strategy.
options.ResourceVersion = podList.ResourceVersion
w, err := s.podClient.Pods(from.Namespace).Watch(options)
Contributor:

the watch will handle just 1000 events then it drops

Contributor:

i guess that is ok here

@0xmichalis (Contributor Author):

This only matters if you run 1000+ pods in one deployment: the deployer scales them down, the replication controller reaches rc.spec.replicas == rc.status.replicas == 0 (which means all of them have been marked for deletion), but >1000 pods are not deleted yet.

@liggitt (Contributor) commented Nov 15, 2016:

1000 watch events across all pods in all namespaces (or fewer if the watch cache is configured for a smaller number of events, or if the watch cache is disabled across all resources in all namespaces)... if dropping this watch fails the deployment, you need to handle re-establishment

@0xmichalis (Contributor Author):

This watch runs every time for a single deployment.

@0xmichalis (Contributor Author):

And watches only the pods of that replication controller

Contributor:

doesn't matter, 1000 event limit still applies across all pods, even when the watch is filtered down. what happens if the watch gets dropped here?

@0xmichalis (Contributor Author):

The rollout fails. We can ignore watch errors I guess (fallback to the old behavior).
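
To make the fallback concrete, here is a rough sketch (using the k8s.io/apimachinery watch types) of how the wait could tolerate a dropped watch instead of failing the rollout. `waitForDeletions`, the package name, and the error handling are illustrative, not the code in this PR:

```go
package recreate

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/watch"
)

// waitForDeletions consumes DELETED events until `remaining` reaches zero.
// If the watch channel closes early (for example because the apiserver drops
// the watch once its event window fills up), it returns an error the caller
// can choose to log and ignore, falling back to the old behavior of
// proceeding without waiting for the old pods to terminate.
func waitForDeletions(w watch.Interface, remaining int, timeout time.Duration) error {
	defer w.Stop()
	deadline := time.After(timeout)
	for remaining > 0 {
		select {
		case event, ok := <-w.ResultChan():
			if !ok {
				return fmt.Errorf("watch closed before all old pods were deleted")
			}
			if event.Type == watch.Deleted {
				remaining--
			}
		case <-deadline:
			return fmt.Errorf("timed out waiting for old pods to be deleted")
		}
	}
	return nil
}
```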

return deletionsNeeded == 0, nil
}
// TODO: Timeout should be config.Spec.Strategy.RecreateParams.TimeoutSeconds - (time.Now - deployerPodCreationTime)
config, err := deployutil.DecodeDeploymentConfig(from, s.decoder)
Contributor:

just pass the timeout in the function, no need to decode twice?

@0xmichalis (Contributor Author):

agreed

@openshift-bot added and then removed the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) on Nov 26, 2016
eventClient: client.Core(),
getUpdateAcceptor: func(timeout time.Duration, minReadySeconds int32) strat.UpdateAcceptor {
return stratsupport.NewAcceptNewlyObservedReadyPods(out, client.Core(), timeout, AcceptorInterval, minReadySeconds)
},
scaler: scaler,
decoder: decoder,
hookExecutor: stratsupport.NewHookExecutor(client.Core(), tagClient, client.Core(), os.Stdout, decoder),
// TODO: Should be config.Spec.Strategy.RecreateParams.TimeoutSeconds - (time.Now - deployerPodCreationTime)
Contributor:

do we have a follow-up issue created for this?

@0xmichalis (Contributor Author):

defer w.Stop()
// Observe as many deletions as the remaining pods and then return.
deletionsNeeded := len(podList.Items)
condition := func(event watch.Event) (bool, error) {
Contributor:

do we need to verify the object in the event?

@0xmichalis (Contributor Author):

I don't think so, we should always get pods back.
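
For completeness, a sketch of what the optional object check would look like. The helper below mirrors the shape of the quoted condition function but uses current client-go types (*v1.Pod) and a hypothetical constructor name for illustration, not the exact types in this PR:

```go
package recreate

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/watch"
)

// newDeletionCondition returns a condition in the same shape as the quoted
// snippet: it is satisfied once deletionsNeeded DELETED events have been
// observed. The type assertion on event.Object is the optional "verify the
// object" check discussed above; a watch filtered to the controller's pods
// should only ever deliver pods.
func newDeletionCondition(deletionsNeeded int) func(event watch.Event) (bool, error) {
	return func(event watch.Event) (bool, error) {
		if event.Type == watch.Deleted {
			if _, ok := event.Object.(*v1.Pod); !ok {
				return false, fmt.Errorf("unexpected object type %T in pod watch", event.Object)
			}
			deletionsNeeded--
		}
		return deletionsNeeded == 0, nil
	}
}
```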

}
return deletionsNeeded == 0, nil
}
// TODO: Timeout should be timeout - (time.Now - deployerPodStartTime)
Contributor:

issue pls and assign to me :-)

@0xmichalis (Contributor Author):

Basically can you handle both of these as part of #12154?

@mfojtik (Contributor) commented Dec 6, 2016:

@Kargakis rebase and LGTM

@openshift-bot added the needs-rebase label on Dec 6, 2016
@openshift-bot removed the needs-rebase label on Dec 6, 2016
@0xmichalis (Contributor Author):

#12157 [test]

@0xmichalis (Contributor Author):

flake is #12157

@0xmichalis (Contributor Author):

[merge]

@0xmichalis (Contributor Author):

yum [merge]

@0xmichalis (Contributor Author):

#8571 [merge]

@0xmichalis added the lgtm label (Indicates that a PR is ready to be merged.) on Dec 8, 2016
@0xmichalis (Contributor Author):

#8571 [merge]

@0xmichalis (Contributor Author):

#8502 [merge]

@0xmichalis (Contributor Author):

#8502 [merge]

@0xmichalis (Contributor Author):

[test]

@openshift-bot (Contributor):

Evaluated for origin test up to 2a434d1

@openshift-bot (Contributor):

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/12285/) (Base Commit: 9ee2ff6)

@0xmichalis (Contributor Author):

#10988 [merge]

@openshift-bot (Contributor):

Evaluated for origin merge up to 2a434d1

@openshift-bot (Contributor) commented Dec 13, 2016:

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/12318/) (Base Commit: eb304fd) (Image: devenv-rhel7_5531)

@openshift-bot merged commit f4e266a into openshift:master on Dec 13, 2016
@0xmichalis deleted the fix-recreate-deployments branch on December 13, 2016 at 11:32