-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
outlier detection: distinguish upstream from internal errors. #3643
Comments
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions. |
@htuch, @ggreenway I can look at this issue. |
@htuch. Would you mind explaining why
outlierDetector has a specific handler for 5xx: https://github.com/envoyproxy/envoy/blob/master/source/common/upstream/outlier_detection_impl.cc#L70. This is called directly when upstream host returns 5xx. But router also calls outlier detector with its internal code https://github.com/envoyproxy/envoy/blob/master/source/common/router/router.cc#L464-L466 which is mapped to 503 and eventually is handled by the same handler as for HTTP 5xx. As you noted both cases are handled without distinction. What is exact logic you had in mind? |
If the error comes from router.cc, then it should count towards outlier detection. If it comes from the backend, it should be ignored for this purpose. The use case is when you only want to exclude those backends that have network connectivity issues, but where 5xx might be a normal part of their behavior that we don't want to be opinionated on. |
@htuch Thanks, I assume that another handler for connectivity errors is needed plus configuration item, let us call it "consecutive_connection_failure", so you can distinguish those two cases and switch them on/off separately, correct? |
@htuch @cpakulski the intention is to make this change configurable, right? I can see a scenario in which the user wants to configure this but I also think that the current behavior is a more sensible default for the way most users intend to use outlier detection? |
@htuch @mattklein123 I am implementing it by adding counters and config items parallel to consecutive5xx. They will be called consecutive_conn_failure..... The defaults will be as for 5xx. It means that if a user uses just defaults there will be no difference in outlier detector's behavior. After several consecutive attempts to connect, a host will be ejected. Just different statistics counter will be incremented. However, by changing "enforcing..." config item, it will be possible to shut down one detection mechanism and leave other one active. |
@cpakulski SGTM |
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions. |
Work in progress. |
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions. |
@mattklein123 Sorry to tag you on a really old comment of yours
This is more of question on this comment. Let us say Client -> Service A and Service A calls multiple services B,C,D in one particular flow. One of the service among B,C,D returns 5xx. A returns 5xx back to client. If we configure default outlier detection with consecutive 5xx, Client sees Service A being ejected though Service A is actually healthy and can process other flows. I think the above scenario is quite common and local origin errors makes more sense here so that A is not ejected from Client. Want to capture your thoughts on is there any better way to configure outlier detector other than using local origin errors here and if there is no other way - is n't local origin error a better default ? |
From my perspective, if A is failing because B/C/D are failing, it is still failing. It's irrelevant to the client whether service A itself is failing or its dependencies are failing. Personally, I would keep it simple and just eject A and let the max ejection % kick in to avoid ejecting too many hosts. |
Thank you @mattklein123 |
Today, the outlier detector doesn't distinguish between a 5xx from the backend or generated internally inside Envoy due to backend unreachability. We would like to be able to only eject hosts when they are unreachable, but not if they explicitly 5xx. Opening this issue to track.
The text was updated successfully, but these errors were encountered: