outlier detection: distinguish upstream from internal errors. #3643

htuch · 2018-06-15T21:29:46Z

Today, the outlier detector doesn't distinguish between a 5xx from the backend or generated internally inside Envoy due to backend unreachability. We would like to be able to only eject hosts when they are unreachable, but not if they explicitly 5xx. Opening this issue to track.

stale · 2018-07-15T21:41:39Z

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

stale · 2018-07-22T22:13:19Z

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

cpakulski · 2018-09-25T15:12:27Z

@htuch, @ggreenway I can look at this issue.

cpakulski · 2018-09-27T19:57:25Z

@htuch. Would you mind explaining why

We would like to be able to only eject hosts when they are unreachable, but not if they explicitly 5xx. ?

outlierDetector has a specific handler for 5xx: https://github.com/envoyproxy/envoy/blob/master/source/common/upstream/outlier_detection_impl.cc#L70. This is called directly when upstream host returns 5xx.

But router also calls outlier detector with its internal code https://github.com/envoyproxy/envoy/blob/master/source/common/router/router.cc#L464-L466 which is mapped to 503 and eventually is handled by the same handler as for HTTP 5xx.

As you noted both cases are handled without distinction. What is exact logic you had in mind?

htuch · 2018-09-27T20:06:12Z

If the error comes from router.cc, then it should count towards outlier detection. If it comes from the backend, it should be ignored for this purpose.

The use case is when you only want to exclude those backends that have network connectivity issues, but where 5xx might be a normal part of their behavior that we don't want to be opinionated on.

cpakulski · 2018-09-27T20:18:31Z

@htuch Thanks, I assume that another handler for connectivity errors is needed plus configuration item, let us call it "consecutive_connection_failure", so you can distinguish those two cases and switch them on/off separately, correct?

mattklein123 · 2018-09-29T20:01:16Z

@htuch @cpakulski the intention is to make this change configurable, right? I can see a scenario in which the user wants to configure this but I also think that the current behavior is a more sensible default for the way most users intend to use outlier detection?

cpakulski · 2018-09-29T20:38:13Z

@htuch @mattklein123 I am implementing it by adding counters and config items parallel to consecutive5xx. They will be called consecutive_conn_failure..... The defaults will be as for 5xx. It means that if a user uses just defaults there will be no difference in outlier detector's behavior. After several consecutive attempts to connect, a host will be ejected. Just different statistics counter will be incremented. However, by changing "enforcing..." config item, it will be possible to shut down one detection mechanism and leave other one active.

htuch · 2018-09-30T11:34:08Z

@cpakulski SGTM

stale · 2018-10-30T11:55:14Z

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

cpakulski · 2018-10-30T16:30:26Z

Work in progress.

stale · 2018-12-16T19:19:42Z

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

repokitteh-read-only · 2018-12-16T19:19:44Z

🙀 Error while processing event:

get sha failed
error: get PR head sha failed, err=get pull request failed, err=GET https://api.github.com/repos/envoyproxy/envoy/pulls/3643: 404 Not Found []

🐱

Caused by: #3643 was labeled by @Stale[bot].

see: more, trace.

ramaraochavali · 2022-05-12T07:46:56Z

@mattklein123 Sorry to tag you on a really old comment of yours

I can see a scenario in which the user wants to configure this but I also think that the current behavior is a more sensible default for the way most users intend to use outlier detection?

This is more of question on this comment. Let us say Client -> Service A and Service A calls multiple services B,C,D in one particular flow. One of the service among B,C,D returns 5xx. A returns 5xx back to client. If we configure default outlier detection with consecutive 5xx, Client sees Service A being ejected though Service A is actually healthy and can process other flows. I think the above scenario is quite common and local origin errors makes more sense here so that A is not ejected from Client. Want to capture your thoughts on is there any better way to configure outlier detector other than using local origin errors here and if there is no other way - is n't local origin error a better default ?

mattklein123 · 2022-05-12T20:32:10Z

From my perspective, if A is failing because B/C/D are failing, it is still failing. It's irrelevant to the client whether service A itself is failing or its dependencies are failing. Personally, I would keep it simple and just eject A and let the max ejection % kick in to avoid ejecting too many hosts.

ramaraochavali · 2022-05-13T03:29:40Z

Thank you @mattklein123

htuch added the enhancement Feature requests. Not bugs or questions. label Jun 15, 2018

stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jul 15, 2018

stale bot closed this as completed Jul 22, 2018

ggreenway reopened this Jul 23, 2018

stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Jul 23, 2018

ggreenway added the help wanted Needs help! label Jul 23, 2018

ggreenway assigned cpakulski Sep 25, 2018

ggreenway removed the help wanted Needs help! label Sep 25, 2018

cpakulski mentioned this issue Oct 22, 2018

upstream: outlier detector - distinguish upstream from internal errors #4822

Merged

stale bot added the stale stalebot believes this issue/PR has not been touched recently label Oct 30, 2018

stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Oct 30, 2018

mattklein123 added this to the 1.9.0 milestone Nov 16, 2018

stale bot added the stale stalebot believes this issue/PR has not been touched recently label Dec 16, 2018

mattklein123 added the no stalebot Disables stalebot from closing an issue label Dec 16, 2018

stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Dec 16, 2018

mattklein123 modified the milestones: 1.9.0, 1.10.0 Dec 19, 2018

mattklein123 modified the milestones: 1.10.0, 1.11.0 Mar 20, 2019

mattklein123 closed this as completed in #4822 Jun 27, 2019

ramaraochavali mentioned this issue May 14, 2021

add support for local origin errors in outlier detection istio/istio#32645

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

outlier detection: distinguish upstream from internal errors. #3643

outlier detection: distinguish upstream from internal errors. #3643

htuch commented Jun 15, 2018

stale bot commented Jul 15, 2018

stale bot commented Jul 22, 2018

cpakulski commented Sep 25, 2018

cpakulski commented Sep 27, 2018

htuch commented Sep 27, 2018

cpakulski commented Sep 27, 2018

mattklein123 commented Sep 29, 2018

cpakulski commented Sep 29, 2018

htuch commented Sep 30, 2018

stale bot commented Oct 30, 2018

cpakulski commented Oct 30, 2018

stale bot commented Dec 16, 2018

repokitteh-read-only bot commented Dec 16, 2018

ramaraochavali commented May 12, 2022

mattklein123 commented May 12, 2022

ramaraochavali commented May 13, 2022

outlier detection: distinguish upstream from internal errors. #3643

outlier detection: distinguish upstream from internal errors. #3643

Comments

htuch commented Jun 15, 2018

stale bot commented Jul 15, 2018

stale bot commented Jul 22, 2018

cpakulski commented Sep 25, 2018

cpakulski commented Sep 27, 2018

htuch commented Sep 27, 2018

cpakulski commented Sep 27, 2018

mattklein123 commented Sep 29, 2018

cpakulski commented Sep 29, 2018

htuch commented Sep 30, 2018

stale bot commented Oct 30, 2018

cpakulski commented Oct 30, 2018

stale bot commented Dec 16, 2018

repokitteh-read-only bot commented Dec 16, 2018

ramaraochavali commented May 12, 2022

mattklein123 commented May 12, 2022

ramaraochavali commented May 13, 2022