Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingress controller should react to node scale down event from autoscaler #595

Closed
freehan opened this issue Jan 4, 2019 · 6 comments · Fixed by #792
Closed

Ingress controller should react to node scale down event from autoscaler #595

freehan opened this issue Jan 4, 2019 · 6 comments · Fixed by #792
Assignees
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug.

Comments

@freehan
Copy link
Contributor

freehan commented Jan 4, 2019

Cluster autoscaler adds the ToBeDeletedByClusterAutoscaler taint on the candidate node. Then it goes on to evict pods from the nodes. After everything settles, it will delete the node.

Ingress controller should observe the taint and react by removing instance from instance group so that connection draining is triggered. This help avoid new connection to be arrive at the removing nodes which causes 502s.

@freehan freehan added the kind/bug Categorizes issue or PR as related to a bug. label Jan 4, 2019
@rramkumar1 rramkumar1 added the good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. label Mar 28, 2019
@retpolanne
Copy link
Contributor

retpolanne commented May 30, 2019

Hi, I would like to grab this issue, if no one is working on it already.

@retpolanne
Copy link
Contributor

@freehan I might be looking at it wrong, but I think the problem is not on ingress-gce itself.

Over here, on GetNodeConditionPredicate (which is used to filter nodes during sync), all Unschedulable nodes are marked out of the filter.

However, over here the node is only tainted as NoSchedule.

IMO the autoscaler should also mark the node as Unschedulable. I'm writing a few tests on the autoscaler for that.

@freehan
Copy link
Contributor Author

freehan commented Jul 10, 2019

Thanks for the PR!

@freehan
Copy link
Contributor Author

freehan commented Jul 10, 2019

Yes. As you pointed out, the problem is the handshake between autoscaler and ingress-gce (or any other loadbalancer controller). This required a better design for synchronization between cluster node life cycle and load balancer controller.

But for now, as a stop gap, we just need to watch autoscaler's taint and react.

@philpearl
Copy link

Would really like this fix in GKE - can anyone comment on how long it will take before it's available?

@freehan
Copy link
Contributor Author

freehan commented Jul 18, 2019

We will cherry pick this into 1.6 branch. And it will follow the GKE release pipeline and possibly available in newer version of 1.13.8+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants