
Add a graceful shutdown mechanism #1990

Closed
kmyerson opened this issue Nov 2, 2017 · 16 comments
Labels
question Questions that are neither investigations, bugs, nor enhancements

Comments

@kmyerson
Contributor

kmyerson commented Nov 2, 2017

Add a graceful shutdown mechanism.

We would like to have a way to gracefully shutdown/lame-duck an Envoy, similar to how Envoy drains the old process' listeners during hot restart. Ideally we'd like to specify the drain time when initiating the graceful shutdown.

I discussed offline with @htuch and @mrice32 , we think it would be fairly straightforward to implement this using the existing drainListeners() method. Some options for the mechanism are creating a new admin URL for graceful shutdown, or toggling the behavior of /quitquitquit via a new command line flag.

Suggestions are welcome :)

@mattklein123
Member

mattklein123 commented Nov 2, 2017

/healthcheck/fail already basically does this. Can you describe how what you want is different? (This is how we lame-duck / shut down at Lyft.)

mattklein123 added the question label on Nov 2, 2017
@htuch
Member

htuch commented Nov 2, 2017

@mattklein123 it looks like /healthcheck/fail just marks the Envoy as unhealthy; it doesn't prevent the listener from accepting new connections, does it?

@mattklein123
Member

It does, and it starts the draining process. (It doesn't make sense to stop accepting connections: if your L4 LB is still sending you connections, you need to accept them.) So basically it:

  1. Starts sending HC failures
  2. Starts drain closing connections so the server will drain

Follow this thread: https://github.com/envoyproxy/envoy/blob/master/source/server/server.cc#L151

The health check state is used by the drain manager and the HC filter.
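
(For concreteness, a minimal sketch of kicking off that drain from the admin interface; the admin address 127.0.0.1:9901, the traffic port 8080, and the health check filter path /healthz are illustrative assumptions, not values from this thread.)

  # Flip Envoy into lame-duck via the admin endpoint and observe that the
  # health check filter starts failing while regular listeners keep serving.
  import urllib.error
  import urllib.request

  ADMIN = "http://127.0.0.1:9901"  # assumed admin address

  def admin_post(path: str) -> None:
      req = urllib.request.Request(f"{ADMIN}{path}", method="POST")
      urllib.request.urlopen(req).close()

  admin_post("/healthcheck/fail")  # start failing HC and drain-closing connections

  # The health check filter should now answer 503 to the LB's probes, while
  # normal traffic is still accepted so in-flight work can finish draining.
  try:
      urllib.request.urlopen("http://127.0.0.1:8080/healthz")  # assumed HC path
  except urllib.error.HTTPError as err:
      print("health check now returns:", err.code)  # expect 503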

@kmyerson
Contributor Author

kmyerson commented Nov 2, 2017

Is /healthcheck/fail just the first step in the shutdown process at Lyft? Do you use /quitquitquit some amount of time after /healthcheck/fail? It seems like there's no indication of when the server becomes drained.

@mattklein123
Member

At Lyft we just do a timed drain. Roughly our process manager does:

  1. Get shutdown notice
  2. /healthcheck/fail
  3. Wait X time
  4. Shutdown

If we want something that is self-contained, we could definitely build that on top of the existing functionality easily.
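
(A self-contained sketch of that timed-drain sequence as a process manager might run it; the admin address, the 30-second drain window, and using /quitquitquit for the final stop are illustrative assumptions, not something prescribed in this thread.)

  # Timed lame-duck: fail health checks, wait a fixed drain window, then stop.
  import time
  import urllib.request

  ADMIN = "http://127.0.0.1:9901"  # assumed admin address
  DRAIN_SECONDS = 30               # "wait X time"; size this to cover the LB's HC interval

  def admin_post(path: str) -> None:
      req = urllib.request.Request(f"{ADMIN}{path}", method="POST")
      urllib.request.urlopen(req).close()

  def graceful_shutdown() -> None:
      admin_post("/healthcheck/fail")  # step 2: start failing HC / draining
      time.sleep(DRAIN_SECONDS)        # step 3: let the LB take us out of rotation
      admin_post("/quitquitquit")      # step 4: shut Envoy down

  if __name__ == "__main__":
      graceful_shutdown()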

@kmyerson
Contributor Author

kmyerson commented Nov 2, 2017

Thanks Matt, the process you described should work for us.

@mattklein123
Member

OK closing. Let's reopen if this doesn't work out.

@shalako

shalako commented Mar 21, 2018

In Cloud Foundry we're currently running Envoys as sidecars in each application container. They handle only ingress traffic from a downstream multi-tenant platform edge routing tier and are configured with a TCP listener only; we use them solely for terminating TLS from the edge routers.

When our scheduler wants to delete a container (as when a user scales down the number of app instances) and we send a TERM to the app instance in the container, well-behaved apps will stop accepting new connections and begin draining existing ones. However, Envoy continues to accept TCP connections, attempts to connect to the upstream app instance and fails, then closes the downstream connection to the platform edge router with an EOF, which causes clients to receive a 502.

We have tested that removing a listener port from Envoy does not cause Envoy to reject requests to that port. This was unexpected.

If Envoy stopped accepting new TCP connections during the drain period, while allowing existing ones to be drained by the upstream app instance, this would support passive health checks by our edge routing tier; if the routers can't establish a TCP connection with a backend, they will try another one. We wouldn't want to retry on receiving an EOF, as there are scenarios in which this would result in duplicate writes to the application.

@rosenhouse
Contributor

I've opened #2920 as a follow-up.

@huggsboson

huggsboson commented Jul 9, 2018

@mattklein123 I tried looking through the source code to find this, but one issue we have run into with other LBs is that at shutdown they close new connections for both ingress and egress listeners, which makes processing in-flight requests difficult. I'm curious whether Envoy only shuts down ingress listeners and keeps egress ones alive to make this process easier.

This would be another point in favor of modern load balancers built for client-side load balancing over existing legacy ones.

@huggsboson

Actually, this seems like it might be the route for that:

  enum DrainType {
    // Drain in response to calling /healthcheck/fail admin endpoint (along with the health check
    // filter), listener removal/modification, and hot restart.
    DEFAULT = 0;
    // Drain in response to listener removal/modification and hot restart. This setting does not
    // include /healthcheck/fail. This setting may be desirable if Envoy is hosting both ingress
    // and egress listeners.
    MODIFY_ONLY = 1;
  }
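
(To illustrate how that might be split across a sidecar's listeners, here is a sketch that renders the JSON form of two listener stubs; the names, addresses, and ports are placeholders and the filter chains are omitted.)

  # Ingress drains on /healthcheck/fail (DEFAULT); egress only drains on
  # listener modification / hot restart (MODIFY_ONLY), so it stays usable
  # while the app finishes its outbound work.
  import json

  listeners = [
      {
          "name": "ingress",
          "address": {"socket_address": {"address": "0.0.0.0", "port_value": 8080}},
          "drain_type": "DEFAULT",
      },
      {
          "name": "egress",
          "address": {"socket_address": {"address": "127.0.0.1", "port_value": 9001}},
          "drain_type": "MODIFY_ONLY",
      },
  ]

  print(json.dumps({"static_resources": {"listeners": listeners}}, indent=2))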

@mattklein123
Member

@huggsboson yup, that was built for exactly the reason you specify.

@huggsboson

I should have asked this in the original question, but is it possible to get the process to exit once all of the ingress connections have drained/closed on their own?

One of our major issues is that it's really hard to tell whether the process using the proxy is done using it for egress. So we've had to do somewhat complicated coordination in kube to keep the sidecar available for egress even though ingress should be closed since we're shutting down.

I know for you guys you probably just keep egress open until your timeout hits, but we unfortunately have some services with highly variable and potentially long (4hr) drains, so being able to shut them down sooner would be nice. Other than process-level coordination using files on disk, the easiest way to know would be:

  1. Unbind on ingress ports / Stay open on egress ports
  2. Once all ingress connections have closed, exit Envoy.
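
(A rough sketch of approximating that today from the outside: lame-duck, poll the admin /stats endpoint until the ingress listener's active-connection gauge reaches zero, then exit. The admin address and the exact stat name are assumptions for illustration.)

  # Lame-duck, then exit once the ingress listener has fully drained.
  import time
  import urllib.request

  ADMIN = "http://127.0.0.1:9901"  # assumed admin address
  INGRESS_STAT = "listener.0.0.0.0_8080.downstream_cx_active"  # assumed stat name

  def admin(path: str, method: str = "GET") -> str:
      req = urllib.request.Request(f"{ADMIN}{path}", method=method)
      with urllib.request.urlopen(req) as resp:
          return resp.read().decode()

  def active_ingress_connections() -> int:
      for line in admin(f"/stats?filter={INGRESS_STAT}").splitlines():
          name, _, value = line.partition(": ")
          if name == INGRESS_STAT:
              return int(value)
      return 0

  admin("/healthcheck/fail", method="POST")  # stop looking healthy, start draining
  while active_ingress_connections() > 0:    # egress listeners stay available meanwhile
      time.sleep(5)
  admin("/quitquitquit", method="POST")      # exit once ingress connections hit zero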

@huggsboson

CC: @mattklein123 ^ I'm sure you're super busy; is there anyone else I should be at-mentioning?

@mattklein123
Member

This is not supported currently. There is no support for "shutdown when connections reach 0" or anything like that. It's been discussed before and would be a useful feature to add, but someone would need to work on it.

@hzxuzhonghu
Contributor

Any update?
