Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

outlier detection: distinguish upstream from internal errors. #3643

Closed
htuch opened this issue Jun 15, 2018 · 16 comments · Fixed by #4822
Closed

outlier detection: distinguish upstream from internal errors. #3643

htuch opened this issue Jun 15, 2018 · 16 comments · Fixed by #4822
Assignees
Labels
enhancement Feature requests. Not bugs or questions. no stalebot Disables stalebot from closing an issue
Milestone

Comments

@htuch
Copy link
Member

htuch commented Jun 15, 2018

Today, the outlier detector doesn't distinguish between a 5xx from the backend or generated internally inside Envoy due to backend unreachability. We would like to be able to only eject hosts when they are unreachable, but not if they explicitly 5xx. Opening this issue to track.

@htuch htuch added the enhancement Feature requests. Not bugs or questions. label Jun 15, 2018
@stale
Copy link

stale bot commented Jul 15, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jul 15, 2018
@stale
Copy link

stale bot commented Jul 22, 2018

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

@stale stale bot closed this as completed Jul 22, 2018
@ggreenway ggreenway reopened this Jul 23, 2018
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Jul 23, 2018
@ggreenway ggreenway added the help wanted Needs help! label Jul 23, 2018
@cpakulski
Copy link
Contributor

@htuch, @ggreenway I can look at this issue.

@ggreenway ggreenway removed the help wanted Needs help! label Sep 25, 2018
@cpakulski
Copy link
Contributor

@htuch. Would you mind explaining why

We would like to be able to only eject hosts when they are unreachable, but not if they explicitly 5xx. ?

outlierDetector has a specific handler for 5xx: https://github.com/envoyproxy/envoy/blob/master/source/common/upstream/outlier_detection_impl.cc#L70. This is called directly when upstream host returns 5xx.

But router also calls outlier detector with its internal code https://github.com/envoyproxy/envoy/blob/master/source/common/router/router.cc#L464-L466 which is mapped to 503 and eventually is handled by the same handler as for HTTP 5xx.

As you noted both cases are handled without distinction. What is exact logic you had in mind?

@htuch
Copy link
Member Author

htuch commented Sep 27, 2018

If the error comes from router.cc, then it should count towards outlier detection. If it comes from the backend, it should be ignored for this purpose.

The use case is when you only want to exclude those backends that have network connectivity issues, but where 5xx might be a normal part of their behavior that we don't want to be opinionated on.

@cpakulski
Copy link
Contributor

@htuch Thanks, I assume that another handler for connectivity errors is needed plus configuration item, let us call it "consecutive_connection_failure", so you can distinguish those two cases and switch them on/off separately, correct?

@mattklein123
Copy link
Member

@htuch @cpakulski the intention is to make this change configurable, right? I can see a scenario in which the user wants to configure this but I also think that the current behavior is a more sensible default for the way most users intend to use outlier detection?

@cpakulski
Copy link
Contributor

@htuch @mattklein123 I am implementing it by adding counters and config items parallel to consecutive5xx. They will be called consecutive_conn_failure..... The defaults will be as for 5xx. It means that if a user uses just defaults there will be no difference in outlier detector's behavior. After several consecutive attempts to connect, a host will be ejected. Just different statistics counter will be incremented. However, by changing "enforcing..." config item, it will be possible to shut down one detection mechanism and leave other one active.

@htuch
Copy link
Member Author

htuch commented Sep 30, 2018

@cpakulski SGTM

@stale
Copy link

stale bot commented Oct 30, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Oct 30, 2018
@cpakulski
Copy link
Contributor

Work in progress.

@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Oct 30, 2018
@mattklein123 mattklein123 added this to the 1.9.0 milestone Nov 16, 2018
@stale
Copy link

stale bot commented Dec 16, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Dec 16, 2018
@repokitteh-read-only
Copy link

🙀 Error while processing event:

get sha failed
error: get PR head sha failed, err=get pull request failed, err=GET https://api.github.com/repos/envoyproxy/envoy/pulls/3643: 404 Not Found []
🐱

Caused by: #3643 was labeled by @Stale[bot].

see: more, trace.

@mattklein123 mattklein123 added the no stalebot Disables stalebot from closing an issue label Dec 16, 2018
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Dec 16, 2018
@mattklein123 mattklein123 modified the milestones: 1.9.0, 1.10.0 Dec 19, 2018
@ramaraochavali
Copy link
Contributor

@mattklein123 Sorry to tag you on a really old comment of yours

I can see a scenario in which the user wants to configure this but I also think that the current behavior is a more sensible default for the way most users intend to use outlier detection?

This is more of question on this comment. Let us say Client -> Service A and Service A calls multiple services B,C,D in one particular flow. One of the service among B,C,D returns 5xx. A returns 5xx back to client. If we configure default outlier detection with consecutive 5xx, Client sees Service A being ejected though Service A is actually healthy and can process other flows. I think the above scenario is quite common and local origin errors makes more sense here so that A is not ejected from Client. Want to capture your thoughts on is there any better way to configure outlier detector other than using local origin errors here and if there is no other way - is n't local origin error a better default ?

@mattklein123
Copy link
Member

From my perspective, if A is failing because B/C/D are failing, it is still failing. It's irrelevant to the client whether service A itself is failing or its dependencies are failing. Personally, I would keep it simple and just eject A and let the max ejection % kick in to avoid ejecting too many hosts.

@ramaraochavali
Copy link
Contributor

Thank you @mattklein123

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests. Not bugs or questions. no stalebot Disables stalebot from closing an issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants