upstream: outlier detector - distinguish upstream from internal errors #4822

cpakulski · 2018-10-22T20:11:46Z

Description: Outlier detector has specific handlers for 5xx errors. Internal errors, like connect failures, timeout, resets have been mapped to 5xx errors and handled by the same logic in outlier detector as upstream's 5xx errors. This PR introduces separate handlers for internal errors (connect failures, timeout, resets). New internal counters and new config items have been added.

Risk Level: Low

Testing: Unit tests.

Docs Changes: Yes - new config items have been added to config reference and section about outlier detector.

Release Notes: N/A
Fixes #3643

… errors. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

cpakulski · 2018-10-23T18:49:45Z

I believe that I have implemented that feature correctly. In most places it was just copy and paste of outlier's methods for 5xx errors. Only in

envoy/source/common/router/router.cc

Line 478 in 09bba39

upstream_host->outlierDetector().connectFailure();

and

envoy/source/common/router/router.cc

Line 1000 in 09bba39

host->outlierDetector().connectSuccess();

I had to add calls to report failure and success.
All test cases are passing OK. However, there are few things I am not sure about:

I did not modify v1 config. It means that logic to differentiate connect errors and 5xx error is there, but works only with defaults and cannot be modified via v1 config.
I think that documentation is not clear enough to explain differences between gateway errors, 5xx errors and connect errors. I can extend docs, but I am not sure what cases are handled by gateway errors.

In router_test.cc I had to replace calls to outlier detector’s handler for 5xx with calls to handler for connect errors. However, in many places there are references to http headers, like "x-envoy-retry-on", "5xx" and I am not sure if they still make sense when connect errors are processes by a handler for connect failures, not 5xx handlers. Like it is done here:

envoy/test/common/router/router_test.cc

Lines 848 to 862 in 09bba39

    
           Http::TestHeaderMapImpl headers{{"x-envoy-retry-on", "5xx"}, {"x-envoy-internal", "true"}}; 
        
           HttpTestUtility::addDefaultHeaders(headers); 
        
           router_.decodeHeaders(headers, false); 
        
           Buffer::OwnedImpl data; 
        
           router_.decodeData(data, true); 
        
           EXPECT_CALL(callbacks_.stream_info_, 
        
                       setResponseFlag(StreamInfo::ResponseFlag::UpstreamRequestTimeout)); 
        
           EXPECT_CALL(encoder.stream_, resetStream(Http::StreamResetReason::LocalReset)); 
        
           Http::TestHeaderMapImpl response_headers{ 
        
               {":status", "504"}, {"content-length", "24"}, {"content-type", "text/plain"}}; 
        
           EXPECT_CALL(callbacks_, encodeHeaders_(HeaderMapEqualRef(&response_headers), false)); 
        
           EXPECT_CALL(callbacks_, encodeData(_, true)); 
        
           EXPECT_CALL(*router_.retry_state_, shouldRetry(_, _, _)).Times(0); 
        
           EXPECT_CALL(cm_.conn_pool_.host_->outlier_detector_, connectFailure());

. I would need some guidance here what is the purpose of "x-envoy-retry-on" and whether it’s use should be extended for “connect” type errors.

mattklein123

Thanks for working on this. Some high level comments to get started. @htuch can you weigh in also since you opened the original issue? I want to make sure this satisfies Google's needs.

mattklein123 · 2018-10-25T18:14:29Z

include/envoy/upstream/outlier_detection.h

@@ -82,6 +82,9 @@ class DetectorHostMonitor {
   *         or the cluster did not have enough hosts to run through success rate outlier ejection.
   */
  virtual double successRate() const PURE;
+
+  virtual void connectFailure() PURE;


doc comments please

mattklein123 · 2018-10-25T18:16:17Z

source/common/router/router.cc

-      upstream_host->outlierDetector().putHttpResponseCode(
-          enumToInt(type == UpstreamResetType::Reset ? Http::Code::ServiceUnavailable
-                                                     : timeout_response_code_));
+      upstream_host->outlierDetector().connectFailure();


tcp_proxy also supports outlier detection. I think that should be modified with these new methods?

mattklein123 · 2018-10-25T18:18:07Z

api/envoy/api/v2/cluster/outlier_detection.proto

+
+  // The number of consecutive connect failures before ejection
+  // occurs. Defaults to 5.
+  google.protobuf.UInt32Value consecutive_connect_failure = 12;


Unfortunately I think this is more complicated than "connect failure" since some of the reasons for "connection failures" can be internal circuit breaking. We should make a more generic name here and clearly document how this is basically all errors that are internally generated by Envoy (failure to TCP connection, TLS handshake, overflow, timeout, etc.).

One special case that needs considering is when upstream actually does send an HTTP reset. How should this be handled? IMO it should count as an upstream real service unavailable error. Anyway needs thinking about.

Once we lock down the nomenclature here I think we will need to update the entire PR to use the new wording.

From the Google perspective, our main goal is to distinguish "backend returned 5xx as an HTTP response" from everything else. We want to be able to treat backend returning 5xx as business as usual and not eject (since opaque application logic might be returning this). The bucket of the "the rest" is what we want to eject on. I think this PR is sufficient for us, but I think your suggestions sound good.

Unfortunately I think this is more complicated than "connect failure" since some of the reasons for "connection failures" can be internal circuit breaking. We should make a more generic name here and clearly document how this is basically all errors that are internally generated by Envoy (failure to TCP connection, TLS handshake, overflow, timeout, etc.).

@mattklein123 I can definitely address those cases. Outlier detector works on the principle that errors are reported to an outlier implementation attached to a host. I guess that other errors you mentioned: TCP connection, TLS handshake, overflow, timeout are currently not reported to host's outlier detector, correct?

We should make a more generic name...

What is your suggestion for naming those types of errors?

I guess that other errors you mentioned: TCP connection, TLS handshake, overflow, timeout are currently not reported to host's outlier detector, correct?

Yes they are reported. These generally come through as "resets" of various types.

What is your suggestion for naming those types of errors?

I'm not sure of the best name, but something along the lines of "local origin error" vs. "remote origin error" or something like that is how I would think about it.

Thanks. I will look at TCP connection and TLS handshake first and will come up with a proposal.

If we are looking for names here, I would do something like: "local_origin" vs. "connect" and describe what that means. I think "local_origin" is a wrapper for circuit breaking, connect failure, etc.

mattklein123 · 2018-11-03T23:45:54Z

Marking is waiting.

/wait

stale · 2018-11-11T01:44:01Z

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

- removed mapping connect and server transactions errors to HTTP codes. - changed tcp_proxy and redis proxy to use connect errors and server transactions errors. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

cpakulski · 2018-11-12T21:03:27Z

Summary of this PR push:

removed mapping of connection errors, server errors to HTTP codes. This change will affect reporting of connection problems by router, tcp_proxy and redis_proxy.
updated documentation to better explain how outlier detector uses its failure counters. I hope it will also fix Outlier detection docs are outdated #3730

It is still not in the final form. Just want to get a feedback whether direction is good. The following changes are still required:

I still used word connect to describe connection problems. As noted by @mattklein123 (above in this PR) better name should be used, but I could not come up with a better naming.
Outlier detector used to have only two counters: consecutive_5xx and consecutive_gateway_failures. Now there are 2 more: consecutive_connect_failure and consecutive_server_request_failure. Handling of those counters is almost identical and there is lots of code repetition. I think we should collapse the code to use a common class or template.
SuccessRate algorithm rides on top of HTTP error code. Because tcp_proxy and redis_proxy errors were mapped to HTTP code, SuccessRate logic automatically applied to those filters as well. Now, tcp_proxy, redis_proxy and router's connect failures have their own handlers for connect errrors and SuccessRate logic does not apply there. I can add it if it makes sense.

Just want to get a feedback, if I should continue with those changes or drop them.

htuch · 2018-11-13T22:18:43Z

@cpakulski will you be making the success rate parameters (internal vs. connect/request derived failures) a tunable option?

mattklein123

In general I think this is heading in the right direction. Thanks for working on this. I add a few comments. Will respond also to some of your other questions.

mattklein123 · 2018-11-13T23:30:43Z

api/envoy/api/v2/cluster/outlier_detection.proto

+
+  // The number of consecutive connect failures before ejection
+  // occurs. Defaults to 5.
+  google.protobuf.UInt32Value consecutive_connect_failure = 12;


If we are looking for names here, I would do something like: "local_origin" vs. "connect" and describe what that means. I think "local_origin" is a wrapper for circuit breaking, connect failure, etc.

mattklein123 · 2018-11-13T23:31:31Z

api/envoy/api/v2/cluster/outlier_detection.proto

+  google.protobuf.UInt32Value enforcing_consecutive_connect_failure = 13
+      [(validate.rules).uint32.lte = 100];
+
+  // The number of consecutive server transaction failures before ejection occurs.


Just from reading this documentation, I'm not sure what a server transaction failure is. I will keep reading the PR but can we update that to make it more clear?

Edit: OK So the idea here is this is an upstream server failure for non-HTTP servers? Is that right? If so I think that makes sense.

That is correct. Upstream server failure would be a situation when connection was OK, but for example redis DB transaction failed. Server could not understand what it received or parsing of reply failed.

mattklein123 · 2018-11-13T23:32:05Z

docs/root/intro/arch_overview/outlier.rst

@@ -10,6 +10,17 @@ such as consecutive failures, temporal success rate, temporal latency, etc. Outl
 form of *passive* health checking. Envoy also supports :ref:`active health checking
 <arch_overview_health_checking>`. *Passive* and *active* health checking can be enabled together or
 independently, and form the basis for an overall upstream health checking solution.
+Outlier detection is part of :ref:`cluster configuration <envoy_api_msg_cluster.OutlierDetection>`, 
+but it needs network filters to report errors, timeouts, resets. Currently the following filters support


"network or HTTP filters" ?

AFAIK only HTTP router, tcp_proxy and redis_proxy provide feedback to outlier detector. The latter two do not really talk http, so I called them "network".

mattklein123 · 2018-11-13T23:33:14Z

docs/root/intro/arch_overview/outlier.rst

@@ -113,7 +146,7 @@ action

 type
  If ``action`` is ``eject``, specifies the type of ejection that took place. Currently type can
-  be one of ``5xx``, ``GatewayFailure`` or ``SuccessRate``.
+  be one of ``5xx``, ``GatewayFailure``, ``SuccessRate``, ``ConnectFailure`` or ``MalformedPayload``.


What is "MalformedPayload" ?

My bad. Thanks for catching it. This is how I originally called "upstream server failure", but later on changed naming.

mattklein123 · 2018-11-13T23:38:06Z

SuccessRate algorithm rides on top of HTTP error code. Because tcp_proxy and redis_proxy errors were mapped to HTTP code, SuccessRate logic automatically applied to those filters as well. Now, tcp_proxy, redis_proxy and router's connect failures have their own handlers for connect errrors and SuccessRate logic does not apply there. I can add it if it makes sense.

Yes I think we definitely need it, though for my understanding, what was the issue with continuing to share the HTTP logic? Clarity or something else?

cpakulski · 2018-11-14T14:46:35Z

will you be making the success rate parameters (internal vs. connect/request derived failures) a tunable option?

@htuch I will try to do it such way that there is one config parameter set for success rate and those settings will apply to all kind of failures. I am still not sure whether to treat failure counters separately or in aggregate. So, for example it a host fails because of http 5xx errors and also occasionally fails because of internal errors, should Success Rate treat those errors separately or sum them together. Based on my understanding of #3643, they should be treated separately.

cpakulski · 2018-11-14T15:08:02Z

SuccessRate algorithm rides on top of HTTP error code. Because tcp_proxy and redis_proxy errors were mapped to HTTP code, SuccessRate logic automatically applied to those filters as well. Now, tcp_proxy, redis_proxy and router's connect failures have their own handlers for connect errrors and SuccessRate logic does not apply there. I can add it if it makes sense.

Yes I think we definitely need it, though for my understanding, what was the issue with continuing to share the HTTP logic? Clarity or something else?

There were few reasons. First was that in order to distinguish internal errors from upstream server replies, new way of reporting such events to outlier detector was required. Those events had separate counters for failures. Originally I added connectSuccess/Failure calls to router, which updated newly added counters. Then I added the new API calls to tcp_proxy and redis proxy. At that moment mapping to http code stopped being used.
The second reason was more of the clarity. By reading documentation I completely did not understand how the outlier detector's mechanism worked. Only by reading the code, the logic became clear. But from a user point of view, it is difficult to make a connection between tcp_connect failures or redis connect failures to 5xx http error codes, and configuration uses http error codes.

cpakulski · 2018-11-14T15:24:23Z

@htuch @mattklein123 I still struggle a bit with exact understanding what "internal" vs "upstream errors" mean. I understand that they should be distinguished. Here is my understanding how events should be handled. Please correct me if I am wrong:

5xx error from upstream server: count it towards 5xx errors. Controlled by "5xx and gateway failures config params"
failure connecting to upstream server because it cannot be reached (routing, server is not running at specified port, timeout, etc): count it towards "connect Errors". Controlled by "connect failures params".
failure connecting to upstream server because of internal limitations (circuit breaking for example): count it towards "connect Errors" Controlled by "connect failures params"

This is what is more or less implemented now. In essence it distinguishes HTTP errors from layer 2/3 errors, but does not distinguish failures caused by external factors vs internal.

I know I still used connect name to describe it, will change it to something else (internal?).

mattklein123 · 2018-11-14T21:04:11Z

Based on my understanding of #3643, they should be treated separately.

Agreed separately.

The second reason was more of the clarity.

OK that's fine. The use of HTTP codes previous was kind of a hack. With that said, we need to keep the functionality we had before. Meaning, we still must support SR outlier detection for redis and TCP.

I know I still used connect name to describe it, will change it to something else (internal?).

Personally I would use that I already suggested which is "local_origin." If you trace through the code, there are many things that can cause this: timeouts, circuit breakers, connect errors, etc. I think as long as it's well documented, this will be clear to people.

Regarding "server failure" vs. "HTTP success rate" I'm still not sure if it's worth configuring different options for this, since logically it's really the same thing. Can we somehow merge them so we only have two concepts: "local origin errors" and "upstream errors?"

cpakulski · 2018-11-15T15:10:14Z

@mattklein123 Thanks for your patience and help in explaining this. I think we are on the same page now.

Regarding "server failure" vs. "HTTP success rate" I'm still not sure if it's worth configuring different options for this, since logically it's really the same thing. Can we somehow merge them so we only have two concepts: "local origin errors" and "upstream errors?"

I will try to code it that way.

htuch · 2018-11-15T19:53:51Z

+1 on local origin as name to distinguish from actual backends.

@htuch I will try to do it such way that there is one config parameter set for success rate and those settings will apply to all kind of failures. I am still not sure whether to treat failure counters separately or in aggregate. So, for example it a host fails because of http 5xx errors and also occasionally fails because of internal errors, should Success Rate treat those errors separately or sum them together. Based on my understanding of #3643, they should be treated separately.

Yep, separately.

stale · 2018-11-22T20:09:33Z

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

cpakulski · 2018-11-22T20:52:31Z

Work in progress.

mattklein123

Thanks @cpakulski for sticking with this. This is really coming together and I think we are very close. I would like to resolve the one API issue that I pointed out but otherwise I think we are on the right track. Very cool stuff!

/wait

mattklein123 · 2019-06-04T18:18:14Z

include/envoy/upstream/outlier_detection.h

@@ -81,8 +93,14 @@ class DetectorHostMonitor {
   * @return the success rate of the host in the last calculated interval, in the range 0-100.
   *         -1 means that the host did not have enough request volume to calculate success rate
   *         or the cluster did not have enough hosts to run through success rate outlier ejection.
+   * @param type specifies for which Success Rate Monitor the success rate value should be returned.
+   If the outlier detector is configured not to split external and local origin errors,


nit: missing '*' at beginning of block comment lines.

mattklein123 · 2019-06-04T18:19:36Z

include/envoy/upstream/outlier_detection.h

+   * @param type specifies for which Success Rate Monitor the success rate value should be returned.
+   If the outlier detector is configured not to split external and local origin errors,
+   ExternalOrigin type returns success rate for all types of errors: external and local origin and
+   LocalOrigin type returns -1. If the outlier detector is configured to split external and local


Instead of returning -1 here, would it be better to use absl::optional<>? Maybe something to think about as a follow up.

That -1 is displayed on GUI screen by admin.cc. If nobody interprets it, I can change it (would prefer another PR).

mattklein123 · 2019-06-04T18:21:02Z

source/common/router/router.cc

@@ -771,7 +776,14 @@ void Filter::onUpstreamReset(Http::StreamResetReason reset_reason,
  ENVOY_STREAM_LOG(debug, "upstream reset: reset reason {}", *callbacks_,
                   Http::Utility::resetReasonToString(reset_reason));

-  updateOutlierDetection(Http::Code::ServiceUnavailable, upstream_request);
+  /*


nit: use '//' comments.

mattklein123 · 2019-06-04T18:22:44Z

source/common/router/router.cc

+  // If it was mapped to HTTP code it would interfere with code returned by HTTP server and two
+  // codes would be reported to outlier detector for one request.
+  host->outlierDetector().putResult(Upstream::Outlier::Result::LOCAL_ORIGIN_CONNECT_SUCCESS,
+                                    absl::optional<uint64_t>(0));


Would it be better to have a different putResult API that better captures this case? Sending 0 here seems like a hack to me. Why can't we just send nullopt, because that is the default? If so, maybe another API makes sense? It would be nice to match this up somehow w/ the LOCAL_ORIGIN_CONNECT_FAILED call above.

mattklein123 · 2019-06-04T18:25:12Z

source/common/upstream/outlier_detection_impl.cc

+                                                 HostSharedPtr host)
+    : detector_(detector), host_(host),
+      // add Success Rate monitors
+      external_origin_SR_monitor_(std::make_unique<SuccessRateMonitor>(


Can these just be member variables? Do they need to actually be pointers and allocated additionally?

Done. Changed them to be member variables.

mattklein123 · 2019-06-04T18:26:03Z

source/common/upstream/outlier_detection_impl.cc

+// are not treated differently. All errors are mapped to HTTP codes.
+// Depending on the value of the parameter *code* the function behaves differently:
+// - if the *code* is not defined, mapping uses resultToHttpCode method to do mapping.
+// - if *code* is zero, no mapping takes place: result is dropped and not reported to outlier


Per my comment I would like to figure out a better API to avoid this 0 hack. Thoughts?

Currently putResult is used in 3 scenarios:

report event with optional HTTP code. The specified HTTP code is directly fed to outlier detector without any conversions.

report event without optional HTTP code. OutlierDetector uses legacy code https://github.com/cpakulski/envoy/blob/issue/3643/source/common/upstream/outlier_detection_impl.cc#L97-L117 to convert the event to HTTP code based on defaults, which is passed to outlier detector.

report event, but do not convert it to any HTTP code. This is to catch scenario with router, which must not convert LOCAL_ORIGIN_CONNECT_SUCCESS to HTTP code 200. Otherwise it will collide with the response from the real HTTP server.

I somehow have to indicate not to convert to HTTP code to handle scenario 3. nullopt is already used for scenario 2, what means " I am not giving any code, use defaults for conversion".
I think that the cleanest way would be to always require to specify HTTP code in API. So instead of writing:
putResult(EXT_ORIGIN_REQUEST_SUCCESS)
one would have to write
putResult(EXT_ORIGIN_REQUEST_SUCCESS, absl::optional<uint64_t>(200))
Then specifying nullopt would mean "Do nothing with HTTP code" to handle scenario 3.

I would have to convert all instances of putResult to add the code - about 8-10 places.

WDYT?

Fundamentally I think the issue we have is this API is broken and we are reaching the limits of what we can do with it. Why do we need HTTP codes for non-HTTP paths? Can we have a the following APIs:

putHttpResult(...) (for this one just make the caller provide the code each time)
putExternalOriginResult(...) (maybe this just takes success/failure like for redis)
putLocalOriginResult(...)

And then can't we just do the right thing with them?

Why do we need HTTP codes for non-HTTP paths?

We need them because of backward compatibility. They were previously mapped to HTTP codes.

putHttpResult(...) (for this one just make the caller provide the code each time)
putExternalOriginResult(...) (maybe this just takes success/failure like for redis)
putLocalOriginResult(...)

There is no problem with implementing/using putHttpResult and putExternalOriginResult. They always eventually go to external origin counters (consecutive_5xx) regardless whether we work in split or non-split mode.

The problem is with putLocalOriginResult. I went through all the cases where LOCAL_ORIGIN event is reported:

for LOCAL_ORIGIN_TIMEOUT, the code actually instructs to which HTTP code it should be translated: https://github.com/cpakulski/envoy/blob/issue/3643/source/common/router/router.cc#L612-L616, so HTTP code must be passed to outlier detector.

for LOCAL_ORIGIN_CONNECT_SUCCESS: tcp_proxy wants to translate it to code 200 so it behaves as before my changes:https://github.com/cpakulski/envoy/blob/issue/3643/source/common/tcp_proxy/tcp_proxy.cc#L539-L540. However, router does not want to translate CONNECT_SUCCESS to HTTP code 200 because it will report the real code via putHttpResult. So here to indicate that no conversion should happen, I stick value of zero: https://github.com/cpakulski/envoy/blob/issue/3643/source/common/router/router.cc#L1414-L1419

I can implement

putHttpResult(...) (for this one just make the caller provide the code each time)
putExternalOriginResult(...) (maybe this just takes success/failure like for redis)
putLocalOriginResult(...)

but putLocalOriginResult must carry HTTP code and somehow I have to indicate not to convert to any code to handle router's case. One way would be to modify include/envoy/http/codes.h and add enum InvalidCode = 0, so we do not hack it explicitly with zero, but use valid enum.

If this sounds good I will implement this today.

Sorry this is making my head hurt. Can we somehow unify the TCP proxy and router local origin cases so we don't need the special case, even if it is not explicitly the exact same behavior?

I understand that this is mind boggling. It is hard to design nice and elegant API to serve all the cases. There is no way to simply unify TCP cases and router cases. I somehow have to distinguish them. In order to avoid a hack with passing value zero I added another event LOCAL_ORIGIN_CONNECT_SUCCESS_FINAL to indicate that there is no other protocol which will report final error/success.

I also experimented with implementing

putHttpResult(...) (for this one just make the caller provide the code each time) putExternalOriginResult(...) (maybe this just takes success/failure like for redis) putLocalOriginResult(...)

but outlier detector's API is most often invoked through wrappers, so I would have to implement a wrapped for each of those methods. It very quickly became ugly.

mattklein123 · 2019-06-04T18:27:23Z

source/common/upstream/outlier_detection_impl.cc

+  put_result_func_(this, result, code);
+}
+
+void DetectorHostMonitorImpl::localOriginFailure() {


Is it possible to share this code somehow with the other failure function? There is some tricky logic here and I'm wondering if somehow there could be a shared function that is used in both places? Not sure if that is possible or not but just throwing it out there for thought.

Not sure what you mean. I went through the code and does not see how they could be shared.

My point is this logic of locking the weak pointer, etc. is duplicated from putHttpResponseCode. It would be nice to share that if possible but nbd if you can't.

other protocol which will report final transaction error. Refactored putResult method to avoid hacking some scenarios with non-enum values. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

mattklein123

Thanks @cpakulski for sticking with this and again sorry for the long review days. I look through again, and I think the way you have things now with final vs. not connect success works for me and is pretty clear from the filter perspective, so let's ship it. Can you merge master and I will give this a final pass? Thank you!

/wait

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

cpakulski · 2019-06-25T07:12:36Z

/retest

repokitteh-read-only · 2019-06-25T07:12:40Z

🔨 rebuilding ci/circleci: tsan (failed build)

🐱

Caused by: a #4822 (comment) was created by @cpakulski.

see: more, trace.

cpakulski · 2019-06-25T08:06:27Z

@mattklein123 rebase is completed. Thanks for reviewing my latest changes.

mattklein123

Thanks, this LGTM. Let's ship. I have a few small comments to clean up. Can you also take a look at the coverage report and make sure we have 100% coverage on the new code? Also, needs another master merge. Thank you!

mattklein123 · 2019-06-26T18:18:23Z

include/envoy/upstream/outlier_detection.h

   */
-  virtual double successRate() const PURE;
+  virtual double successRate(SuccessRateMonitorType type) PURE;


why did we lost const here?

mattklein123 · 2019-06-26T18:19:58Z

source/common/upstream/outlier_detection_impl.cc

    http_code = Http::Code::InternalServerError;
    break;
-  case Result::SERVER_FAILURE:
-    http_code = Http::Code::ServiceUnavailable;
+  /* LOCAL_ORIGIN_CONNECT_SUCCESS  is used is 2-layer protocols, like HTTP.


nit: '//' comments

mattklein123 · 2019-06-26T18:20:53Z

source/common/upstream/outlier_detection_impl.cc

+// are not treated differently. All errors are mapped to HTTP codes.
+// Depending on the value of the parameter *code* the function behaves differently:
+// - if the *code* is not defined, mapping uses resultToHttpCode method to do mapping.
+// - if *code* is non-zero, it is taken as HTTP code and reported as such to outlier detector.


I think this comment is out of date, right? We no longer look at zero?

mattklein123 · 2019-06-26T18:21:38Z

source/common/upstream/outlier_detection_impl.cc

+  case Result::EXT_ORIGIN_REQUEST_FAILED:
+    // map it to http code and call http handler.
+    return putHttpResponseCode(enumToInt(Http::Code::ServiceUnavailable));
+    break;


nit: del break, not needed

mattklein123 · 2019-06-26T18:24:52Z

source/common/upstream/outlier_detection_impl.cc

+  put_result_func_(this, result, code);
+}
+
+void DetectorHostMonitorImpl::localOriginFailure() {


My point is this logic of locking the weak pointer, etc. is duplicated from putHttpResponseCode. It would be nice to share that if possible but nbd if you can't.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Adjusted unit tests after rebase from master. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

mattklein123

Nice!!! Thanks @cpakulski for sticking with this for so long. I know it took forever but I think it was worth it to iterate until we got to the current state. 🚢

cpakulski · 2019-06-27T19:16:03Z

Thanks @mattklein123 for your patience and valuable comments.

I added many lines of documentaion regarding outlier detector. Most likely issue #3730 can also be closed.

cpakulski added 3 commits October 22, 2018 09:34

Outlier detector: connection failures are handled separately from 5xx…

5530d11

… errors. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Merge branch 'master' into issue/3643

44d815b

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Fixed test directory compile problems after rebase.

09bba39

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

dnoe self-assigned this Oct 23, 2018

mattklein123 self-assigned this Oct 23, 2018

mattklein123 requested changes Oct 25, 2018

View reviewed changes

repokitteh-read-only bot added the waiting label Nov 3, 2018

stale bot added the stale stalebot believes this issue/PR has not been touched recently label Nov 11, 2018

Outlier detector:

9784ddd

- removed mapping connect and server transactions errors to HTTP codes. - changed tcp_proxy and redis proxy to use connect errors and server transactions errors. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Nov 12, 2018

repokitteh-read-only bot removed the waiting label Nov 12, 2018

cpakulski added 2 commits November 12, 2018 12:26

Corrected unit tests to use new API to outlier detector.

f0a8660

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Corrected tcp_proxy unit test to use new outlier detector API.

1954b03

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

mattklein123 reviewed Nov 13, 2018

View reviewed changes

mattklein123 added the waiting label Nov 14, 2018

stale bot added the stale stalebot believes this issue/PR has not been touched recently label Nov 22, 2018

repokitteh-read-only bot removed the waiting label Jun 4, 2019

mattklein123 requested changes Jun 4, 2019

View reviewed changes

repokitteh-read-only bot added the waiting label Jun 4, 2019

cpakulski changed the title ~~upstream: outlier detector - distinguish upstream from internal errors~~ WIP: upstream: outlier detector - distinguish upstream from internal errors Jun 7, 2019

Added new code for local origin errors to indicate that there is no

30fa5c7

other protocol which will report final transaction error. Refactored putResult method to avoid hacking some scenarios with non-enum values. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

repokitteh-read-only bot removed the waiting label Jun 10, 2019

cpakulski changed the title ~~WIP: upstream: outlier detector - distinguish upstream from internal errors~~ upstream: outlier detector - distinguish upstream from internal errors Jun 10, 2019

Merge branch 'master' into issue/3643

e8e9b1c

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

mattklein123 added the no stalebot Disables stalebot from closing an issue label Jun 17, 2019

mattklein123 requested changes Jun 24, 2019

View reviewed changes

repokitteh-read-only bot added the waiting label Jun 24, 2019

Merge branch 'master' into issue/3643

a5c251a

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

repokitteh-read-only bot removed the waiting label Jun 24, 2019

mattklein123 requested changes Jun 26, 2019

View reviewed changes

mattklein123 added the waiting label Jun 26, 2019

cpakulski added 3 commits June 27, 2019 03:20

Style corrections after code review.

e85f07a

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Merge branch 'master' into issue/3643

d91fb30

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

Changed successRate method to be const.

0049372

Adjusted unit tests after rebase from master. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

repokitteh-read-only bot removed the waiting label Jun 27, 2019

mattklein123 approved these changes Jun 27, 2019

View reviewed changes

mattklein123 merged commit 89d81e6 into envoyproxy:master Jun 27, 2019

mattklein123 mentioned this pull request Jul 1, 2019

Outlier detection docs are outdated #3730

Closed

cpakulski mentioned this pull request Sep 4, 2019

Outlier Detection: use gRPC status code for detecting failures #7942

Merged

kidiyoor mentioned this pull request Sep 29, 2020

Distinguish upstream errors from local error for outlier detection istio/api#1662

Closed

cpakulski deleted the issue/3643 branch December 14, 2020 19:27

cpakulski mentioned this pull request Dec 8, 2023

upstream: API proposal for extensions to outlier detection #31205

Merged

upstream: outlier detector - distinguish upstream from internal errors #4822

upstream: outlier detector - distinguish upstream from internal errors #4822

Conversation

cpakulski commented Oct 22, 2018

cpakulski commented Oct 23, 2018

mattklein123 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mattklein123 commented Nov 3, 2018

stale bot commented Nov 11, 2018

cpakulski commented Nov 12, 2018

htuch commented Nov 13, 2018

mattklein123 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mattklein123 commented Nov 13, 2018

cpakulski commented Nov 14, 2018

cpakulski commented Nov 14, 2018

cpakulski commented Nov 14, 2018

mattklein123 commented Nov 14, 2018

cpakulski commented Nov 15, 2018

htuch commented Nov 15, 2018

stale bot commented Nov 22, 2018

cpakulski commented Nov 22, 2018

mattklein123 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mattklein123 left a comment

Choose a reason for hiding this comment

cpakulski commented Jun 25, 2019

repokitteh-read-only bot commented Jun 25, 2019

cpakulski commented Jun 25, 2019

mattklein123 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mattklein123 left a comment

Choose a reason for hiding this comment

cpakulski commented Jun 27, 2019