
Add a graceful shutdown mechanism #1990

Closed
kmyerson opened this issue Nov 2, 2017 · 16 comments
Labels
question Questions that are neither investigations, bugs, nor enhancements

Comments

@kmyerson
Contributor

kmyerson commented Nov 2, 2017

Add a graceful shutdown mechanism.

We would like to have a way to gracefully shutdown/lame-duck an Envoy, similar to how Envoy drains the old process' listeners during hot restart. Ideally we'd like to specify the drain time when initiating the graceful shutdown.

I discussed offline with @htuch and @mrice32 , we think it would be fairly straightforward to implement this using the existing drainListeners() method. Some options for the mechanism are creating a new admin URL for graceful shutdown, or toggling the behavior of /quitquitquit via a new command line flag.

Suggestions are welcome :)

@mattklein123
Member

mattklein123 commented Nov 2, 2017

/healthcheck/fail already basically does this. Can you describe how what you want is different? (This is how we lame-duck / shut down at Lyft.)

mattklein123 added the question label on Nov 2, 2017
@htuch
Member

htuch commented Nov 2, 2017

@mattklein123 it looks like /healthcheck/fail just marks the Envoy as unhealthy; it doesn't prevent the listener from accepting new connections, does it?

@mattklein123
Member

It does, and it starts the draining process. (It doesn't make sense to stop accepting connections: if your L4 LB is still sending you connections, you need to accept them.) So basically it:

  1. Starts sending HC failures
  2. Starts drain closing connections so the server will drain

Follow this thread: https://github.com/envoyproxy/envoy/blob/master/source/server/server.cc#L151

The health check state is used by the drain manager and the HC filter.
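
(For concreteness, a minimal sketch of kicking off that drain from the admin interface; the admin address 127.0.0.1:9901, the traffic port 8080, and the health check filter path /healthz are illustrative assumptions, not values from this thread.)

  # Flip Envoy into lame-duck via the admin endpoint and observe that the
  # health check filter starts failing while regular listeners keep serving.
  import urllib.error
  import urllib.request

  ADMIN = "http://127.0.0.1:9901"  # assumed admin address

  def admin_post(path: str) -> None:
      req = urllib.request.Request(f"{ADMIN}{path}", method="POST")
      urllib.request.urlopen(req).close()

  admin_post("/healthcheck/fail")  # start failing HC and drain-closing connections

  # The health check filter should now answer 503 to the LB's probes, while
  # normal traffic is still accepted so in-flight work can finish draining.
  try:
      urllib.request.urlopen("http://127.0.0.1:8080/healthz")  # assumed HC path
  except urllib.error.HTTPError as err:
      print("health check now returns:", err.code)  # expect 503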

@kmyerson
Contributor Author

kmyerson commented Nov 2, 2017

Is /healthcheck/fail just the first step in the shutdown process at Lyft? Do you use /quitquitquit some amount of time after /healthcheck/fail? It seems like there's no indication of when the server becomes drained.

@mattklein123
Member

At Lyft we just do a timed drain. Roughly our process manager does:

  1. Get shutdown notice
  2. /healthcheck/fail
  3. Wait X time
  4. Shutdown

If we want something that is self-contained, we could definitely build that on top of the existing functionality easily.
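
(A self-contained sketch of that timed-drain sequence as a process manager might run it; the admin address, the 30-second drain window, and using /quitquitquit for the final stop are illustrative assumptions, not something prescribed in this thread.)

  # Timed lame-duck: fail health checks, wait a fixed drain window, then stop.
  import time
  import urllib.request

  ADMIN = "http://127.0.0.1:9901"  # assumed admin address
  DRAIN_SECONDS = 30               # "wait X time"; size this to cover the LB's HC interval

  def admin_post(path: str) -> None:
      req = urllib.request.Request(f"{ADMIN}{path}", method="POST")
      urllib.request.urlopen(req).close()

  def graceful_shutdown() -> None:
      admin_post("/healthcheck/fail")  # step 2: start failing HC / draining
      time.sleep(DRAIN_SECONDS)        # step 3: let the LB take us out of rotation
      admin_post("/quitquitquit")      # step 4: shut Envoy down

  if __name__ == "__main__":
      graceful_shutdown()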

@kmyerson
Contributor Author

kmyerson commented Nov 2, 2017

Thanks Matt, the process you described should work for us.

@mattklein123
Member

OK closing. Let's reopen if this doesn't work out.

@shalako

shalako commented Mar 21, 2018

In Cloud Foundry we're currently running Envoys as sidecars in each application container. They handle only ingress traffic from a downstream multi-tenant platform edge routing tier and are configured with a TCP listener only; we use them solely for terminating TLS from the edge routers.

When our scheduler wants to delete a container (as when a user scales down the number of app instances) and we send a TERM to the app instance in the container, well-behaved apps will stop accepting new connections and begin draining existing ones. However, Envoy continues to accept TCP connections, attempts to connect to the upstream app instance and fails, then closes the downstream connection to the platform edge router with an EOF, which causes clients to receive a 502.

We have tested that removing a listener port from Envoy does not cause Envoy to reject requests to that port. This was unexpected.

If Envoy stopped accepting new TCP connections during the drain period, while allowing existing ones to be drained by the upstream app instance, this would support passive health checks by our edge routing tier; if the routers can't establish a TCP connection with a backend, they will try another one. We wouldn't want to retry on receiving an EOF, as there are scenarios in which this would result in duplicate writes to the application.

@rosenhouse
Contributor

I've opened #2920 as a follow-up.

@huggsboson

huggsboson commented Jul 9, 2018

@mattklein123 I tried looking through the source code to find this, but one issue we have run into with other LBs is that at shutdown they close new connections for both ingress and egress listeners, which makes processing in-flight requests difficult. I'm curious whether Envoy only shuts down ingress listeners and keeps egress ones alive to make this process easier.

This would be another point in favor of modern load balancers built for client-side load balancing over existing legacy ones.

@huggsboson

Actually, this seems like it might be the route for that:

  enum DrainType {
    // Drain in response to calling /healthcheck/fail admin endpoint (along with the health check
    // filter), listener removal/modification, and hot restart.
    DEFAULT = 0;
    // Drain in response to listener removal/modification and hot restart. This setting does not
    // include /healthcheck/fail. This setting may be desirable if Envoy is hosting both ingress
    // and egress listeners.
    MODIFY_ONLY = 1;
  }
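
(To illustrate how that might be split across a sidecar's listeners, here is a sketch that renders the JSON form of two listener stubs; the names, addresses, and ports are placeholders and the filter chains are omitted.)

  # Ingress drains on /healthcheck/fail (DEFAULT); egress only drains on
  # listener modification / hot restart (MODIFY_ONLY), so it stays usable
  # while the app finishes its outbound work.
  import json

  listeners = [
      {
          "name": "ingress",
          "address": {"socket_address": {"address": "0.0.0.0", "port_value": 8080}},
          "drain_type": "DEFAULT",
      },
      {
          "name": "egress",
          "address": {"socket_address": {"address": "127.0.0.1", "port_value": 9001}},
          "drain_type": "MODIFY_ONLY",
      },
  ]

  print(json.dumps({"static_resources": {"listeners": listeners}}, indent=2))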

@mattklein123
Member

@huggsboson yup, that was built for exactly the reason you specify.

@huggsboson

I should have asked this in the original question, but is it possible to get the process to exit once all of the ingress connections have drained/closed on their own?

One of our major issues is that it's really hard to tell whether the process using the proxy is done using it for egress. So we've had to do somewhat complicated coordination in kube to keep the sidecar available for egress even though ingress should be closed since we're shutting down.

I know for you guys you probably just keep egress open until your timeout hits, but we unfortunately have some services with highly variable and potentially long (4hr) drains, so being able to shut them down sooner would be nice. Other than process-level coordination using files on disk, the easiest way to know would be:

  1. Unbind on ingress ports / Stay open on egress ports
  2. Once all ingress connections have closed, exit Envoy.
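
(A rough sketch of approximating that today from the outside: lame-duck, poll the admin /stats endpoint until the ingress listener's active-connection gauge reaches zero, then exit. The admin address and the exact stat name are assumptions for illustration.)

  # Lame-duck, then exit once the ingress listener has fully drained.
  import time
  import urllib.request

  ADMIN = "http://127.0.0.1:9901"  # assumed admin address
  INGRESS_STAT = "listener.0.0.0.0_8080.downstream_cx_active"  # assumed stat name

  def admin(path: str, method: str = "GET") -> str:
      req = urllib.request.Request(f"{ADMIN}{path}", method=method)
      with urllib.request.urlopen(req) as resp:
          return resp.read().decode()

  def active_ingress_connections() -> int:
      for line in admin(f"/stats?filter={INGRESS_STAT}").splitlines():
          name, _, value = line.partition(": ")
          if name == INGRESS_STAT:
              return int(value)
      return 0

  admin("/healthcheck/fail", method="POST")  # stop looking healthy, start draining
  while active_ingress_connections() > 0:    # egress listeners stay available meanwhile
      time.sleep(5)
  admin("/quitquitquit", method="POST")      # exit once ingress connections hit zero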

@huggsboson

CC: @mattklein123 ^ I'm sure you're super busy; is there anyone else I should be at-mentioning?

@mattklein123
Member

This is not supported currently. There is no support for "shutdown when connections reach 0" or anything like that. It's been discussed before and would be a useful feature to add, but someone would need to work on it.

@hzxuzhonghu
Contributor

Any update?
