-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
upstream: outlier detector - distinguish upstream from internal errors #4822
Conversation
… errors. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
I believe that I have implemented that feature correctly. In most places it was just copy and paste of outlier's methods for 5xx errors. Only in envoy/source/common/router/router.cc Line 478 in 09bba39
envoy/source/common/router/router.cc Line 1000 in 09bba39
All test cases are passing OK. However, there are few things I am not sure about:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. Some high level comments to get started. @htuch can you weigh in also since you opened the original issue? I want to make sure this satisfies Google's needs.
@@ -82,6 +82,9 @@ class DetectorHostMonitor { | |||
* or the cluster did not have enough hosts to run through success rate outlier ejection. | |||
*/ | |||
virtual double successRate() const PURE; | |||
|
|||
virtual void connectFailure() PURE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc comments please
source/common/router/router.cc
Outdated
upstream_host->outlierDetector().putHttpResponseCode( | ||
enumToInt(type == UpstreamResetType::Reset ? Http::Code::ServiceUnavailable | ||
: timeout_response_code_)); | ||
upstream_host->outlierDetector().connectFailure(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tcp_proxy also supports outlier detection. I think that should be modified with these new methods?
|
||
// The number of consecutive connect failures before ejection | ||
// occurs. Defaults to 5. | ||
google.protobuf.UInt32Value consecutive_connect_failure = 12; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I think this is more complicated than "connect failure" since some of the reasons for "connection failures" can be internal circuit breaking. We should make a more generic name here and clearly document how this is basically all errors that are internally generated by Envoy (failure to TCP connection, TLS handshake, overflow, timeout, etc.).
One special case that needs considering is when upstream actually does send an HTTP reset. How should this be handled? IMO it should count as an upstream real service unavailable error. Anyway needs thinking about.
Once we lock down the nomenclature here I think we will need to update the entire PR to use the new wording.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the Google perspective, our main goal is to distinguish "backend returned 5xx as an HTTP response" from everything else. We want to be able to treat backend returning 5xx as business as usual and not eject (since opaque application logic might be returning this). The bucket of the "the rest" is what we want to eject on. I think this PR is sufficient for us, but I think your suggestions sound good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I think this is more complicated than "connect failure" since some of the reasons for "connection failures" can be internal circuit breaking. We should make a more generic name here and clearly document how this is basically all errors that are internally generated by Envoy (failure to TCP connection, TLS handshake, overflow, timeout, etc.).
@mattklein123 I can definitely address those cases. Outlier detector works on the principle that errors are reported to an outlier implementation attached to a host. I guess that other errors you mentioned: TCP connection, TLS handshake, overflow, timeout are currently not reported to host's outlier detector, correct?
We should make a more generic name...
What is your suggestion for naming those types of errors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that other errors you mentioned: TCP connection, TLS handshake, overflow, timeout are currently not reported to host's outlier detector, correct?
Yes they are reported. These generally come through as "resets" of various types.
What is your suggestion for naming those types of errors?
I'm not sure of the best name, but something along the lines of "local origin error" vs. "remote origin error" or something like that is how I would think about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I will look at TCP connection and TLS handshake first and will come up with a proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are looking for names here, I would do something like: "local_origin" vs. "connect" and describe what that means. I think "local_origin" is a wrapper for circuit breaking, connect failure, etc.
Marking is waiting. /wait |
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
- removed mapping connect and server transactions errors to HTTP codes. - changed tcp_proxy and redis proxy to use connect errors and server transactions errors. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Summary of this PR push:
It is still not in the final form. Just want to get a feedback whether direction is good. The following changes are still required:
Just want to get a feedback, if I should continue with those changes or drop them. |
@cpakulski will you be making the success rate parameters (internal vs. connect/request derived failures) a tunable option? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I think this is heading in the right direction. Thanks for working on this. I add a few comments. Will respond also to some of your other questions.
|
||
// The number of consecutive connect failures before ejection | ||
// occurs. Defaults to 5. | ||
google.protobuf.UInt32Value consecutive_connect_failure = 12; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are looking for names here, I would do something like: "local_origin" vs. "connect" and describe what that means. I think "local_origin" is a wrapper for circuit breaking, connect failure, etc.
google.protobuf.UInt32Value enforcing_consecutive_connect_failure = 13 | ||
[(validate.rules).uint32.lte = 100]; | ||
|
||
// The number of consecutive server transaction failures before ejection occurs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just from reading this documentation, I'm not sure what a server transaction failure is. I will keep reading the PR but can we update that to make it more clear?
Edit: OK So the idea here is this is an upstream server failure for non-HTTP servers? Is that right? If so I think that makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct. Upstream server failure would be a situation when connection was OK, but for example redis DB transaction failed. Server could not understand what it received or parsing of reply failed.
@@ -10,6 +10,17 @@ such as consecutive failures, temporal success rate, temporal latency, etc. Outl | |||
form of *passive* health checking. Envoy also supports :ref:`active health checking | |||
<arch_overview_health_checking>`. *Passive* and *active* health checking can be enabled together or | |||
independently, and form the basis for an overall upstream health checking solution. | |||
Outlier detection is part of :ref:`cluster configuration <envoy_api_msg_cluster.OutlierDetection>`, | |||
but it needs network filters to report errors, timeouts, resets. Currently the following filters support |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"network or HTTP filters" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK only HTTP router, tcp_proxy and redis_proxy provide feedback to outlier detector. The latter two do not really talk http, so I called them "network".
@@ -113,7 +146,7 @@ action | |||
|
|||
type | |||
If ``action`` is ``eject``, specifies the type of ejection that took place. Currently type can | |||
be one of ``5xx``, ``GatewayFailure`` or ``SuccessRate``. | |||
be one of ``5xx``, ``GatewayFailure``, ``SuccessRate``, ``ConnectFailure`` or ``MalformedPayload``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is "MalformedPayload" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad. Thanks for catching it. This is how I originally called "upstream server failure", but later on changed naming.
Yes I think we definitely need it, though for my understanding, what was the issue with continuing to share the HTTP logic? Clarity or something else? |
@htuch I will try to do it such way that there is one config parameter set for success rate and those settings will apply to all kind of failures. I am still not sure whether to treat failure counters separately or in aggregate. So, for example it a host fails because of http 5xx errors and also occasionally fails because of internal errors, should Success Rate treat those errors separately or sum them together. Based on my understanding of #3643, they should be treated separately. |
There were few reasons. First was that in order to distinguish internal errors from upstream server replies, new way of reporting such events to outlier detector was required. Those events had separate counters for failures. Originally I added connectSuccess/Failure calls to router, which updated newly added counters. Then I added the new API calls to tcp_proxy and redis proxy. At that moment mapping to http code stopped being used. |
@htuch @mattklein123 I still struggle a bit with exact understanding what "internal" vs "upstream errors" mean. I understand that they should be distinguished. Here is my understanding how events should be handled. Please correct me if I am wrong:
This is what is more or less implemented now. In essence it distinguishes HTTP errors from layer 2/3 errors, but does not distinguish failures caused by external factors vs internal. I know I still used connect name to describe it, will change it to something else (internal?). |
Agreed separately.
OK that's fine. The use of HTTP codes previous was kind of a hack. With that said, we need to keep the functionality we had before. Meaning, we still must support SR outlier detection for redis and TCP.
Personally I would use that I already suggested which is "local_origin." If you trace through the code, there are many things that can cause this: timeouts, circuit breakers, connect errors, etc. I think as long as it's well documented, this will be clear to people. Regarding "server failure" vs. "HTTP success rate" I'm still not sure if it's worth configuring different options for this, since logically it's really the same thing. Can we somehow merge them so we only have two concepts: "local origin errors" and "upstream errors?" |
@mattklein123 Thanks for your patience and help in explaining this. I think we are on the same page now.
I will try to code it that way. |
+1 on local origin as name to distinguish from actual backends.
Yep, separately. |
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
Work in progress. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cpakulski for sticking with this. This is really coming together and I think we are very close. I would like to resolve the one API issue that I pointed out but otherwise I think we are on the right track. Very cool stuff!
/wait
@@ -81,8 +93,14 @@ class DetectorHostMonitor { | |||
* @return the success rate of the host in the last calculated interval, in the range 0-100. | |||
* -1 means that the host did not have enough request volume to calculate success rate | |||
* or the cluster did not have enough hosts to run through success rate outlier ejection. | |||
* @param type specifies for which Success Rate Monitor the success rate value should be returned. | |||
If the outlier detector is configured not to split external and local origin errors, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: missing '*' at beginning of block comment lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* @param type specifies for which Success Rate Monitor the success rate value should be returned. | ||
If the outlier detector is configured not to split external and local origin errors, | ||
ExternalOrigin type returns success rate for all types of errors: external and local origin and | ||
LocalOrigin type returns -1. If the outlier detector is configured to split external and local |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of returning -1 here, would it be better to use absl::optional<>? Maybe something to think about as a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That -1 is displayed on GUI screen by admin.cc. If nobody interprets it, I can change it (would prefer another PR).
source/common/router/router.cc
Outdated
@@ -771,7 +776,14 @@ void Filter::onUpstreamReset(Http::StreamResetReason reset_reason, | |||
ENVOY_STREAM_LOG(debug, "upstream reset: reset reason {}", *callbacks_, | |||
Http::Utility::resetReasonToString(reset_reason)); | |||
|
|||
updateOutlierDetection(Http::Code::ServiceUnavailable, upstream_request); | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use '//' comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
source/common/router/router.cc
Outdated
// If it was mapped to HTTP code it would interfere with code returned by HTTP server and two | ||
// codes would be reported to outlier detector for one request. | ||
host->outlierDetector().putResult(Upstream::Outlier::Result::LOCAL_ORIGIN_CONNECT_SUCCESS, | ||
absl::optional<uint64_t>(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to have a different putResult API that better captures this case? Sending 0 here seems like a hack to me. Why can't we just send nullopt, because that is the default? If so, maybe another API makes sense? It would be nice to match this up somehow w/ the LOCAL_ORIGIN_CONNECT_FAILED call above.
HostSharedPtr host) | ||
: detector_(detector), host_(host), | ||
// add Success Rate monitors | ||
external_origin_SR_monitor_(std::make_unique<SuccessRateMonitor>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can these just be member variables? Do they need to actually be pointers and allocated additionally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Changed them to be member variables.
// are not treated differently. All errors are mapped to HTTP codes. | ||
// Depending on the value of the parameter *code* the function behaves differently: | ||
// - if the *code* is not defined, mapping uses resultToHttpCode method to do mapping. | ||
// - if *code* is zero, no mapping takes place: result is dropped and not reported to outlier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my comment I would like to figure out a better API to avoid this 0 hack. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently putResult is used in 3 scenarios:
- report event with optional HTTP code. The specified HTTP code is directly fed to outlier detector without any conversions.
- report event without optional HTTP code. OutlierDetector uses legacy code https://github.com/cpakulski/envoy/blob/issue/3643/source/common/upstream/outlier_detection_impl.cc#L97-L117 to convert the event to HTTP code based on defaults, which is passed to outlier detector.
- report event, but do not convert it to any HTTP code. This is to catch scenario with router, which must not convert LOCAL_ORIGIN_CONNECT_SUCCESS to HTTP code 200. Otherwise it will collide with the response from the real HTTP server.
I somehow have to indicate not to convert to HTTP code to handle scenario 3. nullopt is already used for scenario 2, what means " I am not giving any code, use defaults for conversion".
I think that the cleanest way would be to always require to specify HTTP code in API. So instead of writing:
putResult(EXT_ORIGIN_REQUEST_SUCCESS)
one would have to write
putResult(EXT_ORIGIN_REQUEST_SUCCESS, absl::optional<uint64_t>(200))
Then specifying nullopt would mean "Do nothing with HTTP code" to handle scenario 3.
I would have to convert all instances of putResult to add the code - about 8-10 places.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fundamentally I think the issue we have is this API is broken and we are reaching the limits of what we can do with it. Why do we need HTTP codes for non-HTTP paths? Can we have a the following APIs:
putHttpResult(...)
(for this one just make the caller provide the code each time)
putExternalOriginResult(...)
(maybe this just takes success/failure like for redis)
putLocalOriginResult(...)
And then can't we just do the right thing with them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need HTTP codes for non-HTTP paths?
We need them because of backward compatibility. They were previously mapped to HTTP codes.
putHttpResult(...) (for this one just make the caller provide the code each time)
putExternalOriginResult(...) (maybe this just takes success/failure like for redis)
putLocalOriginResult(...)
There is no problem with implementing/using putHttpResult and putExternalOriginResult. They always eventually go to external origin counters (consecutive_5xx) regardless whether we work in split or non-split mode.
The problem is with putLocalOriginResult. I went through all the cases where LOCAL_ORIGIN event is reported:
- for LOCAL_ORIGIN_TIMEOUT, the code actually instructs to which HTTP code it should be translated: https://github.com/cpakulski/envoy/blob/issue/3643/source/common/router/router.cc#L612-L616, so HTTP code must be passed to outlier detector.
- for LOCAL_ORIGIN_CONNECT_SUCCESS: tcp_proxy wants to translate it to code 200 so it behaves as before my changes:https://github.com/cpakulski/envoy/blob/issue/3643/source/common/tcp_proxy/tcp_proxy.cc#L539-L540. However, router does not want to translate CONNECT_SUCCESS to HTTP code 200 because it will report the real code via putHttpResult. So here to indicate that no conversion should happen, I stick value of zero: https://github.com/cpakulski/envoy/blob/issue/3643/source/common/router/router.cc#L1414-L1419
I can implement
putHttpResult(...) (for this one just make the caller provide the code each time)
putExternalOriginResult(...) (maybe this just takes success/failure like for redis)
putLocalOriginResult(...)
but putLocalOriginResult must carry HTTP code and somehow I have to indicate not to convert to any code to handle router's case. One way would be to modify include/envoy/http/codes.h and add enum InvalidCode = 0, so we do not hack it explicitly with zero, but use valid enum.
If this sounds good I will implement this today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry this is making my head hurt. Can we somehow unify the TCP proxy and router local origin cases so we don't need the special case, even if it is not explicitly the exact same behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that this is mind boggling. It is hard to design nice and elegant API to serve all the cases. There is no way to simply unify TCP cases and router cases. I somehow have to distinguish them. In order to avoid a hack with passing value zero I added another event LOCAL_ORIGIN_CONNECT_SUCCESS_FINAL to indicate that there is no other protocol which will report final error/success.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also experimented with implementing
putHttpResult(...) (for this one just make the caller provide the code each time)
putExternalOriginResult(...) (maybe this just takes success/failure like for redis)
putLocalOriginResult(...)
but outlier detector's API is most often invoked through wrappers, so I would have to implement a wrapped for each of those methods. It very quickly became ugly.
put_result_func_(this, result, code); | ||
} | ||
|
||
void DetectorHostMonitorImpl::localOriginFailure() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to share this code somehow with the other failure function? There is some tricky logic here and I'm wondering if somehow there could be a shared function that is used in both places? Not sure if that is possible or not but just throwing it out there for thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean. I went through the code and does not see how they could be shared.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is this logic of locking the weak pointer, etc. is duplicated from putHttpResponseCode
. It would be nice to share that if possible but nbd if you can't.
other protocol which will report final transaction error. Refactored putResult method to avoid hacking some scenarios with non-enum values. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cpakulski for sticking with this and again sorry for the long review days. I look through again, and I think the way you have things now with final vs. not connect success works for me and is pretty clear from the filter perspective, so let's ship it. Can you merge master and I will give this a final pass? Thank you!
/wait
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
/retest |
🔨 rebuilding |
@mattklein123 rebase is completed. Thanks for reviewing my latest changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this LGTM. Let's ship. I have a few small comments to clean up. Can you also take a look at the coverage report and make sure we have 100% coverage on the new code? Also, needs another master merge. Thank you!
*/ | ||
virtual double successRate() const PURE; | ||
virtual double successRate(SuccessRateMonitorType type) PURE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did we lost const here?
http_code = Http::Code::InternalServerError; | ||
break; | ||
case Result::SERVER_FAILURE: | ||
http_code = Http::Code::ServiceUnavailable; | ||
/* LOCAL_ORIGIN_CONNECT_SUCCESS is used is 2-layer protocols, like HTTP. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: '//' comments
// are not treated differently. All errors are mapped to HTTP codes. | ||
// Depending on the value of the parameter *code* the function behaves differently: | ||
// - if the *code* is not defined, mapping uses resultToHttpCode method to do mapping. | ||
// - if *code* is non-zero, it is taken as HTTP code and reported as such to outlier detector. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment is out of date, right? We no longer look at zero?
case Result::EXT_ORIGIN_REQUEST_FAILED: | ||
// map it to http code and call http handler. | ||
return putHttpResponseCode(enumToInt(Http::Code::ServiceUnavailable)); | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: del break, not needed
put_result_func_(this, result, code); | ||
} | ||
|
||
void DetectorHostMonitorImpl::localOriginFailure() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is this logic of locking the weak pointer, etc. is duplicated from putHttpResponseCode
. It would be nice to share that if possible but nbd if you can't.
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Adjusted unit tests after rebase from master. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!!! Thanks @cpakulski for sticking with this for so long. I know it took forever but I think it was worth it to iterate until we got to the current state. 🚢
Thanks @mattklein123 for your patience and valuable comments. I added many lines of documentaion regarding outlier detector. Most likely issue #3730 can also be closed. |
Description: Outlier detector has specific handlers for 5xx errors. Internal errors, like connect failures, timeout, resets have been mapped to 5xx errors and handled by the same logic in outlier detector as upstream's 5xx errors. This PR introduces separate handlers for internal errors (connect failures, timeout, resets). New internal counters and new config items have been added.
Risk Level: Low
Testing: Unit tests.
Docs Changes: Yes - new config items have been added to config reference and section about outlier detector.
Release Notes: N/A
Fixes #3643