Add a graceful shutdown mechanism #1990
Comments
@mattklein123 it looks like …
It does, and starts the draining process. (It doesn't make sense to stop accepting connections: if your L4 LB is still sending you connections, you need to accept them.) So basically it fails the health check immediately and begins draining.

Follow this thread: https://github.com/envoyproxy/envoy/blob/master/source/server/server.cc#L151; the health check state is used by both the drain manager and the HC filter.
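For concreteness, a minimal sketch of flipping that health-check state through the admin interface, assuming the admin listener is on localhost:9901 (/healthcheck/fail is the admin endpoint; /healthcheck/ok reverses it):

```python
import urllib.request

# Tell this Envoy to start failing its health checks; per the comment above,
# the drain manager picks this state up and begins draining.
req = urllib.request.Request("http://localhost:9901/healthcheck/fail", method="POST")
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```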
Is …
At Lyft we just do a timed drain. Roughly, our process manager fails the Envoy's health check, waits a fixed drain period, and then terminates the process.
If we want something that is self-contained, we could definitely build that on top of the existing functionality easily.
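A sketch of that timed-drain sequence under stated assumptions (admin listener on localhost:9901; the PID handling and the 30-second drain window are illustrative, not Lyft's actual values):

```python
import os
import signal
import time
import urllib.request

ADMIN = "http://localhost:9901"  # assumed admin address

def timed_drain(envoy_pid: int, drain_seconds: int = 30) -> None:
    """Roughly what a process manager could do: fail HC, wait, then TERM."""
    # 1. Start failing health checks so load balancers stop sending new traffic.
    req = urllib.request.Request(ADMIN + "/healthcheck/fail", method="POST")
    urllib.request.urlopen(req).close()

    # 2. Give in-flight connections the drain window to finish.
    time.sleep(drain_seconds)

    # 3. Terminate the Envoy process.
    os.kill(envoy_pid, signal.SIGTERM)
```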
Thanks Matt, the process you described should work for us.
OK closing. Let's reopen if this doesn't work out. |
In Cloud Foundry we're currently running Envoys as a sidecar in each application container. They currently handle only ingress traffic from a downstream multi-tenant platform edge routing tier, and are configured with a TCP listener only; we use them solely for terminating TLS from the edge routers.

When our scheduler wants to delete a container (e.g. when a user scales down the number of app instances) and we send a TERM to the app instance in the container, well-behaved apps will stop accepting new connections and begin draining existing ones. However, Envoy continues to accept TCP connections, attempts to connect to the upstream app instance and fails, then closes the downstream connection with the platform edge router with an EOF.

We have tested that removing a listener port from Envoy does not cause Envoy to reject requests to that port. This was unexpected.

If Envoy stopped accepting new TCP connections during the drain period, while allowing existing ones to be drained by the upstream app instance, this would support passive health checks by our edge routing tier: if the routers can't establish a TCP connection with a backend, they will try another one. We wouldn't want to retry on receiving an EOF, as there are scenarios in which that would result in duplicate writes to the application.
I've opened #2920 as a follow-up.
@mattklein123 I tried looking through the source code to find this, but one issue we have run into with other LBs is that at shutdown they close new connections for both ingress and egress listeners, which makes processing in-flight requests difficult. I'm curious whether Envoy only shuts down ingress listeners and keeps egress ones alive to make this process easier. This would be another point in favor of modern, made-for-client-side-load-balancing load balancers over existing legacy ones.
Actually this seems like it might be the route for that:
@huggsboson yup, that was built for exactly the reason you specify.
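For later readers: current Envoy documents an admin endpoint for draining only inbound listeners, which appears to be what's being referenced here. A sketch of invoking it, with the admin address assumed and the endpoint/parameter names worth checking against your Envoy version's admin docs:

```python
import urllib.request

# Drain only inbound (ingress) listeners; outbound/egress listeners stay up,
# so the application can keep making calls through the sidecar while draining.
req = urllib.request.Request(
    "http://localhost:9901/drain_listeners?inboundonly", method="POST"
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)
```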
I should have asked this in the original question, but is it possible to get the process to exit once all of the ingress connections have drained/closed on their own? One of our major issues is that it's really hard to tell when the process using the proxy is done using it for egress, so we've had to do somewhat complicated coordination in kube to keep the sidecar available for egress even though the ingress should be closed due to shutting down. I know for you it's probably fine to just keep egress open until your timeout hits, but we unfortunately have some services with highly variable and potentially long (4 hr) drains, so being able to shut them down sooner would be nice. Other than process-level coordination using files on disk, the easiest way to know would be:
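Presumably something like polling the admin stats endpoint until active downstream connections reach zero. A sketch, with the admin address, poll interval, and listener stat prefix as illustrative assumptions (downstream_cx_active and the /stats filter parameter do exist in the admin API):

```python
import time
import urllib.request

ADMIN = "http://localhost:9901"              # assumed admin address
LISTENER_PREFIX = "listener.0.0.0.0_8443."   # illustrative ingress listener stat prefix

def ingress_cx_active() -> int:
    """Sum active downstream connections for the ingress listener."""
    url = ADMIN + "/stats?filter=downstream_cx_active"
    with urllib.request.urlopen(url) as resp:
        body = resp.read().decode()
    total = 0
    for line in body.splitlines():
        name, _, value = line.partition(": ")
        if name.startswith(LISTENER_PREFIX):
            total += int(value)
    return total

# Poll until ingress has fully drained; the process manager can then stop Envoy.
while ingress_cx_active() > 0:
    time.sleep(5)
```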
CC: @mattklein123 ^ I'm sure you're super busy; is there anyone else I should be at-mentioning?
This is not supported currently. There is no support for "shutdown when connections reach 0" or something like that. It's been discussed before and would be a useful feature to add, but someone would need to work on it.
Any update?
Add a graceful shutdown mechanism.
We would like to have a way to gracefully shutdown/lame-duck an Envoy, similar to how Envoy drains the old process' listeners during hot restart. Ideally we'd like to specify the drain time when initiating the graceful shutdown.
I discussed offline with @htuch and @mrice32, and we think it would be fairly straightforward to implement this using the existing drainListeners() method. Some options for the mechanism are creating a new admin URL for graceful shutdown or toggling the behavior of /quitquitquit via a new command line flag.
Suggestions are welcome :)
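As a strawman for the admin-URL option, initiating shutdown with a caller-specified drain time might look like this (the /graceful_shutdown endpoint and its drain_time_s parameter are hypothetical, purely for illustration; no such endpoint exists yet):

```python
import urllib.request

# Hypothetical admin endpoint: drain for 60 seconds, then exit the process.
req = urllib.request.Request(
    "http://localhost:9901/graceful_shutdown?drain_time_s=60", method="POST"
)
urllib.request.urlopen(req).close()
```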