Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upstream: outlier detector - distinguish upstream from internal errors #4822

Merged
merged 59 commits into from
Jun 27, 2019

Conversation

cpakulski
Copy link
Contributor

Description: Outlier detector has specific handlers for 5xx errors. Internal errors, like connect failures, timeout, resets have been mapped to 5xx errors and handled by the same logic in outlier detector as upstream's 5xx errors. This PR introduces separate handlers for internal errors (connect failures, timeout, resets). New internal counters and new config items have been added.

Risk Level: Low

Testing: Unit tests.

Docs Changes: Yes - new config items have been added to config reference and section about outlier detector.

Release Notes: N/A
Fixes #3643

… errors.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@dnoe dnoe self-assigned this Oct 23, 2018
@cpakulski
Copy link
Contributor Author

I believe that I have implemented that feature correctly. In most places it was just copy and paste of outlier's methods for 5xx errors. Only in

upstream_host->outlierDetector().connectFailure();
and
host->outlierDetector().connectSuccess();
I had to add calls to report failure and success.
All test cases are passing OK. However, there are few things I am not sure about:

  • I did not modify v1 config. It means that logic to differentiate connect errors and 5xx error is there, but works only with defaults and cannot be modified via v1 config.
  • I think that documentation is not clear enough to explain differences between gateway errors, 5xx errors and connect errors. I can extend docs, but I am not sure what cases are handled by gateway errors.
  • In router_test.cc I had to replace calls to outlier detector’s handler for 5xx with calls to handler for connect errors. However, in many places there are references to http headers, like "x-envoy-retry-on", "5xx" and I am not sure if they still make sense when connect errors are processes by a handler for connect failures, not 5xx handlers. Like it is done here:
    Http::TestHeaderMapImpl headers{{"x-envoy-retry-on", "5xx"}, {"x-envoy-internal", "true"}};
    HttpTestUtility::addDefaultHeaders(headers);
    router_.decodeHeaders(headers, false);
    Buffer::OwnedImpl data;
    router_.decodeData(data, true);
    EXPECT_CALL(callbacks_.stream_info_,
    setResponseFlag(StreamInfo::ResponseFlag::UpstreamRequestTimeout));
    EXPECT_CALL(encoder.stream_, resetStream(Http::StreamResetReason::LocalReset));
    Http::TestHeaderMapImpl response_headers{
    {":status", "504"}, {"content-length", "24"}, {"content-type", "text/plain"}};
    EXPECT_CALL(callbacks_, encodeHeaders_(HeaderMapEqualRef(&response_headers), false));
    EXPECT_CALL(callbacks_, encodeData(_, true));
    EXPECT_CALL(*router_.retry_state_, shouldRetry(_, _, _)).Times(0);
    EXPECT_CALL(cm_.conn_pool_.host_->outlier_detector_, connectFailure());
    . I would need some guidance here what is the purpose of "x-envoy-retry-on" and whether it’s use should be extended for “connect” type errors.

@mattklein123 mattklein123 self-assigned this Oct 23, 2018
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. Some high level comments to get started. @htuch can you weigh in also since you opened the original issue? I want to make sure this satisfies Google's needs.

@@ -82,6 +82,9 @@ class DetectorHostMonitor {
* or the cluster did not have enough hosts to run through success rate outlier ejection.
*/
virtual double successRate() const PURE;

virtual void connectFailure() PURE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc comments please

upstream_host->outlierDetector().putHttpResponseCode(
enumToInt(type == UpstreamResetType::Reset ? Http::Code::ServiceUnavailable
: timeout_response_code_));
upstream_host->outlierDetector().connectFailure();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tcp_proxy also supports outlier detection. I think that should be modified with these new methods?


// The number of consecutive connect failures before ejection
// occurs. Defaults to 5.
google.protobuf.UInt32Value consecutive_connect_failure = 12;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I think this is more complicated than "connect failure" since some of the reasons for "connection failures" can be internal circuit breaking. We should make a more generic name here and clearly document how this is basically all errors that are internally generated by Envoy (failure to TCP connection, TLS handshake, overflow, timeout, etc.).

One special case that needs considering is when upstream actually does send an HTTP reset. How should this be handled? IMO it should count as an upstream real service unavailable error. Anyway needs thinking about.

Once we lock down the nomenclature here I think we will need to update the entire PR to use the new wording.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the Google perspective, our main goal is to distinguish "backend returned 5xx as an HTTP response" from everything else. We want to be able to treat backend returning 5xx as business as usual and not eject (since opaque application logic might be returning this). The bucket of the "the rest" is what we want to eject on. I think this PR is sufficient for us, but I think your suggestions sound good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I think this is more complicated than "connect failure" since some of the reasons for "connection failures" can be internal circuit breaking. We should make a more generic name here and clearly document how this is basically all errors that are internally generated by Envoy (failure to TCP connection, TLS handshake, overflow, timeout, etc.).

@mattklein123 I can definitely address those cases. Outlier detector works on the principle that errors are reported to an outlier implementation attached to a host. I guess that other errors you mentioned: TCP connection, TLS handshake, overflow, timeout are currently not reported to host's outlier detector, correct?

We should make a more generic name...

What is your suggestion for naming those types of errors?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that other errors you mentioned: TCP connection, TLS handshake, overflow, timeout are currently not reported to host's outlier detector, correct?

Yes they are reported. These generally come through as "resets" of various types.

What is your suggestion for naming those types of errors?

I'm not sure of the best name, but something along the lines of "local origin error" vs. "remote origin error" or something like that is how I would think about it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I will look at TCP connection and TLS handshake first and will come up with a proposal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are looking for names here, I would do something like: "local_origin" vs. "connect" and describe what that means. I think "local_origin" is a wrapper for circuit breaking, connect failure, etc.

@mattklein123
Copy link
Member

Marking is waiting.

/wait

@stale
Copy link

stale bot commented Nov 11, 2018

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Nov 11, 2018
 - removed mapping connect and server transactions errors to HTTP codes.
 - changed tcp_proxy and redis proxy to use connect errors and
   server transactions errors.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Nov 12, 2018
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Contributor Author

Summary of this PR push:

  • removed mapping of connection errors, server errors to HTTP codes. This change will affect reporting of connection problems by router, tcp_proxy and redis_proxy.
  • updated documentation to better explain how outlier detector uses its failure counters. I hope it will also fix Outlier detection docs are outdated #3730

It is still not in the final form. Just want to get a feedback whether direction is good. The following changes are still required:

  • I still used word connect to describe connection problems. As noted by @mattklein123 (above in this PR) better name should be used, but I could not come up with a better naming.
  • Outlier detector used to have only two counters: consecutive_5xx and consecutive_gateway_failures. Now there are 2 more: consecutive_connect_failure and consecutive_server_request_failure. Handling of those counters is almost identical and there is lots of code repetition. I think we should collapse the code to use a common class or template.
  • SuccessRate algorithm rides on top of HTTP error code. Because tcp_proxy and redis_proxy errors were mapped to HTTP code, SuccessRate logic automatically applied to those filters as well. Now, tcp_proxy, redis_proxy and router's connect failures have their own handlers for connect errrors and SuccessRate logic does not apply there. I can add it if it makes sense.

Just want to get a feedback, if I should continue with those changes or drop them.

@htuch
Copy link
Member

htuch commented Nov 13, 2018

@cpakulski will you be making the success rate parameters (internal vs. connect/request derived failures) a tunable option?

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I think this is heading in the right direction. Thanks for working on this. I add a few comments. Will respond also to some of your other questions.


// The number of consecutive connect failures before ejection
// occurs. Defaults to 5.
google.protobuf.UInt32Value consecutive_connect_failure = 12;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are looking for names here, I would do something like: "local_origin" vs. "connect" and describe what that means. I think "local_origin" is a wrapper for circuit breaking, connect failure, etc.

google.protobuf.UInt32Value enforcing_consecutive_connect_failure = 13
[(validate.rules).uint32.lte = 100];

// The number of consecutive server transaction failures before ejection occurs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just from reading this documentation, I'm not sure what a server transaction failure is. I will keep reading the PR but can we update that to make it more clear?

Edit: OK So the idea here is this is an upstream server failure for non-HTTP servers? Is that right? If so I think that makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct. Upstream server failure would be a situation when connection was OK, but for example redis DB transaction failed. Server could not understand what it received or parsing of reply failed.

@@ -10,6 +10,17 @@ such as consecutive failures, temporal success rate, temporal latency, etc. Outl
form of *passive* health checking. Envoy also supports :ref:`active health checking
<arch_overview_health_checking>`. *Passive* and *active* health checking can be enabled together or
independently, and form the basis for an overall upstream health checking solution.
Outlier detection is part of :ref:`cluster configuration <envoy_api_msg_cluster.OutlierDetection>`,
but it needs network filters to report errors, timeouts, resets. Currently the following filters support
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"network or HTTP filters" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK only HTTP router, tcp_proxy and redis_proxy provide feedback to outlier detector. The latter two do not really talk http, so I called them "network".

@@ -113,7 +146,7 @@ action

type
If ``action`` is ``eject``, specifies the type of ejection that took place. Currently type can
be one of ``5xx``, ``GatewayFailure`` or ``SuccessRate``.
be one of ``5xx``, ``GatewayFailure``, ``SuccessRate``, ``ConnectFailure`` or ``MalformedPayload``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "MalformedPayload" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad. Thanks for catching it. This is how I originally called "upstream server failure", but later on changed naming.

@mattklein123
Copy link
Member

SuccessRate algorithm rides on top of HTTP error code. Because tcp_proxy and redis_proxy errors were mapped to HTTP code, SuccessRate logic automatically applied to those filters as well. Now, tcp_proxy, redis_proxy and router's connect failures have their own handlers for connect errrors and SuccessRate logic does not apply there. I can add it if it makes sense.

Yes I think we definitely need it, though for my understanding, what was the issue with continuing to share the HTTP logic? Clarity or something else?

@cpakulski
Copy link
Contributor Author

will you be making the success rate parameters (internal vs. connect/request derived failures) a tunable option?

@htuch I will try to do it such way that there is one config parameter set for success rate and those settings will apply to all kind of failures. I am still not sure whether to treat failure counters separately or in aggregate. So, for example it a host fails because of http 5xx errors and also occasionally fails because of internal errors, should Success Rate treat those errors separately or sum them together. Based on my understanding of #3643, they should be treated separately.

@cpakulski
Copy link
Contributor Author

SuccessRate algorithm rides on top of HTTP error code. Because tcp_proxy and redis_proxy errors were mapped to HTTP code, SuccessRate logic automatically applied to those filters as well. Now, tcp_proxy, redis_proxy and router's connect failures have their own handlers for connect errrors and SuccessRate logic does not apply there. I can add it if it makes sense.

Yes I think we definitely need it, though for my understanding, what was the issue with continuing to share the HTTP logic? Clarity or something else?

There were few reasons. First was that in order to distinguish internal errors from upstream server replies, new way of reporting such events to outlier detector was required. Those events had separate counters for failures. Originally I added connectSuccess/Failure calls to router, which updated newly added counters. Then I added the new API calls to tcp_proxy and redis proxy. At that moment mapping to http code stopped being used.
The second reason was more of the clarity. By reading documentation I completely did not understand how the outlier detector's mechanism worked. Only by reading the code, the logic became clear. But from a user point of view, it is difficult to make a connection between tcp_connect failures or redis connect failures to 5xx http error codes, and configuration uses http error codes.

@cpakulski
Copy link
Contributor Author

@htuch @mattklein123 I still struggle a bit with exact understanding what "internal" vs "upstream errors" mean. I understand that they should be distinguished. Here is my understanding how events should be handled. Please correct me if I am wrong:

  • 5xx error from upstream server: count it towards 5xx errors. Controlled by "5xx and gateway failures config params"
  • failure connecting to upstream server because it cannot be reached (routing, server is not running at specified port, timeout, etc): count it towards "connect Errors". Controlled by "connect failures params".
  • failure connecting to upstream server because of internal limitations (circuit breaking for example): count it towards "connect Errors" Controlled by "connect failures params"

This is what is more or less implemented now. In essence it distinguishes HTTP errors from layer 2/3 errors, but does not distinguish failures caused by external factors vs internal.

I know I still used connect name to describe it, will change it to something else (internal?).

@mattklein123
Copy link
Member

Based on my understanding of #3643, they should be treated separately.

Agreed separately.

The second reason was more of the clarity.

OK that's fine. The use of HTTP codes previous was kind of a hack. With that said, we need to keep the functionality we had before. Meaning, we still must support SR outlier detection for redis and TCP.

I know I still used connect name to describe it, will change it to something else (internal?).

Personally I would use that I already suggested which is "local_origin." If you trace through the code, there are many things that can cause this: timeouts, circuit breakers, connect errors, etc. I think as long as it's well documented, this will be clear to people.

Regarding "server failure" vs. "HTTP success rate" I'm still not sure if it's worth configuring different options for this, since logically it's really the same thing. Can we somehow merge them so we only have two concepts: "local origin errors" and "upstream errors?"

@cpakulski
Copy link
Contributor Author

@mattklein123 Thanks for your patience and help in explaining this. I think we are on the same page now.

Regarding "server failure" vs. "HTTP success rate" I'm still not sure if it's worth configuring different options for this, since logically it's really the same thing. Can we somehow merge them so we only have two concepts: "local origin errors" and "upstream errors?"

I will try to code it that way.

@htuch
Copy link
Member

htuch commented Nov 15, 2018

+1 on local origin as name to distinguish from actual backends.

@htuch I will try to do it such way that there is one config parameter set for success rate and those settings will apply to all kind of failures. I am still not sure whether to treat failure counters separately or in aggregate. So, for example it a host fails because of http 5xx errors and also occasionally fails because of internal errors, should Success Rate treat those errors separately or sum them together. Based on my understanding of #3643, they should be treated separately.

Yep, separately.

@stale
Copy link

stale bot commented Nov 22, 2018

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Nov 22, 2018
@cpakulski
Copy link
Contributor Author

Work in progress.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cpakulski for sticking with this. This is really coming together and I think we are very close. I would like to resolve the one API issue that I pointed out but otherwise I think we are on the right track. Very cool stuff!

/wait

@@ -81,8 +93,14 @@ class DetectorHostMonitor {
* @return the success rate of the host in the last calculated interval, in the range 0-100.
* -1 means that the host did not have enough request volume to calculate success rate
* or the cluster did not have enough hosts to run through success rate outlier ejection.
* @param type specifies for which Success Rate Monitor the success rate value should be returned.
If the outlier detector is configured not to split external and local origin errors,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: missing '*' at beginning of block comment lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* @param type specifies for which Success Rate Monitor the success rate value should be returned.
If the outlier detector is configured not to split external and local origin errors,
ExternalOrigin type returns success rate for all types of errors: external and local origin and
LocalOrigin type returns -1. If the outlier detector is configured to split external and local
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of returning -1 here, would it be better to use absl::optional<>? Maybe something to think about as a follow up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That -1 is displayed on GUI screen by admin.cc. If nobody interprets it, I can change it (would prefer another PR).

@@ -771,7 +776,14 @@ void Filter::onUpstreamReset(Http::StreamResetReason reset_reason,
ENVOY_STREAM_LOG(debug, "upstream reset: reset reason {}", *callbacks_,
Http::Utility::resetReasonToString(reset_reason));

updateOutlierDetection(Http::Code::ServiceUnavailable, upstream_request);
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use '//' comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// If it was mapped to HTTP code it would interfere with code returned by HTTP server and two
// codes would be reported to outlier detector for one request.
host->outlierDetector().putResult(Upstream::Outlier::Result::LOCAL_ORIGIN_CONNECT_SUCCESS,
absl::optional<uint64_t>(0));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to have a different putResult API that better captures this case? Sending 0 here seems like a hack to me. Why can't we just send nullopt, because that is the default? If so, maybe another API makes sense? It would be nice to match this up somehow w/ the LOCAL_ORIGIN_CONNECT_FAILED call above.

HostSharedPtr host)
: detector_(detector), host_(host),
// add Success Rate monitors
external_origin_SR_monitor_(std::make_unique<SuccessRateMonitor>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these just be member variables? Do they need to actually be pointers and allocated additionally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Changed them to be member variables.

// are not treated differently. All errors are mapped to HTTP codes.
// Depending on the value of the parameter *code* the function behaves differently:
// - if the *code* is not defined, mapping uses resultToHttpCode method to do mapping.
// - if *code* is zero, no mapping takes place: result is dropped and not reported to outlier
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per my comment I would like to figure out a better API to avoid this 0 hack. Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently putResult is used in 3 scenarios:

  • report event with optional HTTP code. The specified HTTP code is directly fed to outlier detector without any conversions.
  • report event without optional HTTP code. OutlierDetector uses legacy code https://github.com/cpakulski/envoy/blob/issue/3643/source/common/upstream/outlier_detection_impl.cc#L97-L117 to convert the event to HTTP code based on defaults, which is passed to outlier detector.
  • report event, but do not convert it to any HTTP code. This is to catch scenario with router, which must not convert LOCAL_ORIGIN_CONNECT_SUCCESS to HTTP code 200. Otherwise it will collide with the response from the real HTTP server.

I somehow have to indicate not to convert to HTTP code to handle scenario 3. nullopt is already used for scenario 2, what means " I am not giving any code, use defaults for conversion".
I think that the cleanest way would be to always require to specify HTTP code in API. So instead of writing:
putResult(EXT_ORIGIN_REQUEST_SUCCESS)
one would have to write
putResult(EXT_ORIGIN_REQUEST_SUCCESS, absl::optional<uint64_t>(200))
Then specifying nullopt would mean "Do nothing with HTTP code" to handle scenario 3.

I would have to convert all instances of putResult to add the code - about 8-10 places.

WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fundamentally I think the issue we have is this API is broken and we are reaching the limits of what we can do with it. Why do we need HTTP codes for non-HTTP paths? Can we have a the following APIs:

putHttpResult(...) (for this one just make the caller provide the code each time)
putExternalOriginResult(...) (maybe this just takes success/failure like for redis)
putLocalOriginResult(...)

And then can't we just do the right thing with them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need HTTP codes for non-HTTP paths?

We need them because of backward compatibility. They were previously mapped to HTTP codes.

putHttpResult(...) (for this one just make the caller provide the code each time)
putExternalOriginResult(...) (maybe this just takes success/failure like for redis)
putLocalOriginResult(...)

There is no problem with implementing/using putHttpResult and putExternalOriginResult. They always eventually go to external origin counters (consecutive_5xx) regardless whether we work in split or non-split mode.

The problem is with putLocalOriginResult. I went through all the cases where LOCAL_ORIGIN event is reported:

I can implement

putHttpResult(...) (for this one just make the caller provide the code each time)
putExternalOriginResult(...) (maybe this just takes success/failure like for redis)
putLocalOriginResult(...)

but putLocalOriginResult must carry HTTP code and somehow I have to indicate not to convert to any code to handle router's case. One way would be to modify include/envoy/http/codes.h and add enum InvalidCode = 0, so we do not hack it explicitly with zero, but use valid enum.

If this sounds good I will implement this today.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this is making my head hurt. Can we somehow unify the TCP proxy and router local origin cases so we don't need the special case, even if it is not explicitly the exact same behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that this is mind boggling. It is hard to design nice and elegant API to serve all the cases. There is no way to simply unify TCP cases and router cases. I somehow have to distinguish them. In order to avoid a hack with passing value zero I added another event LOCAL_ORIGIN_CONNECT_SUCCESS_FINAL to indicate that there is no other protocol which will report final error/success.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also experimented with implementing

putHttpResult(...) (for this one just make the caller provide the code each time)
putExternalOriginResult(...) (maybe this just takes success/failure like for redis)
putLocalOriginResult(...)

but outlier detector's API is most often invoked through wrappers, so I would have to implement a wrapped for each of those methods. It very quickly became ugly.

put_result_func_(this, result, code);
}

void DetectorHostMonitorImpl::localOriginFailure() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to share this code somehow with the other failure function? There is some tricky logic here and I'm wondering if somehow there could be a shared function that is used in both places? Not sure if that is possible or not but just throwing it out there for thought.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean. I went through the code and does not see how they could be shared.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is this logic of locking the weak pointer, etc. is duplicated from putHttpResponseCode. It would be nice to share that if possible but nbd if you can't.

@cpakulski cpakulski changed the title upstream: outlier detector - distinguish upstream from internal errors WIP: upstream: outlier detector - distinguish upstream from internal errors Jun 7, 2019
other protocol which will report final transaction error.

Refactored putResult method to avoid hacking some scenarios with
non-enum values.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski cpakulski changed the title WIP: upstream: outlier detector - distinguish upstream from internal errors upstream: outlier detector - distinguish upstream from internal errors Jun 10, 2019
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@mattklein123 mattklein123 added the no stalebot Disables stalebot from closing an issue label Jun 17, 2019
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cpakulski for sticking with this and again sorry for the long review days. I look through again, and I think the way you have things now with final vs. not connect success works for me and is pretty clear from the filter perspective, so let's ship it. Can you merge master and I will give this a final pass? Thank you!

/wait

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

🔨 rebuilding ci/circleci: tsan (failed build)

🐱

Caused by: a #4822 (comment) was created by @cpakulski.

see: more, trace.

@cpakulski
Copy link
Contributor Author

@mattklein123 rebase is completed. Thanks for reviewing my latest changes.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this LGTM. Let's ship. I have a few small comments to clean up. Can you also take a look at the coverage report and make sure we have 100% coverage on the new code? Also, needs another master merge. Thank you!

*/
virtual double successRate() const PURE;
virtual double successRate(SuccessRateMonitorType type) PURE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did we lost const here?

http_code = Http::Code::InternalServerError;
break;
case Result::SERVER_FAILURE:
http_code = Http::Code::ServiceUnavailable;
/* LOCAL_ORIGIN_CONNECT_SUCCESS is used is 2-layer protocols, like HTTP.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: '//' comments

// are not treated differently. All errors are mapped to HTTP codes.
// Depending on the value of the parameter *code* the function behaves differently:
// - if the *code* is not defined, mapping uses resultToHttpCode method to do mapping.
// - if *code* is non-zero, it is taken as HTTP code and reported as such to outlier detector.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is out of date, right? We no longer look at zero?

case Result::EXT_ORIGIN_REQUEST_FAILED:
// map it to http code and call http handler.
return putHttpResponseCode(enumToInt(Http::Code::ServiceUnavailable));
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: del break, not needed

put_result_func_(this, result, code);
}

void DetectorHostMonitorImpl::localOriginFailure() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is this logic of locking the weak pointer, etc. is duplicated from putHttpResponseCode. It would be nice to share that if possible but nbd if you can't.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Adjusted unit tests after rebase from master.

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!!! Thanks @cpakulski for sticking with this for so long. I know it took forever but I think it was worth it to iterate until we got to the current state. 🚢

@mattklein123 mattklein123 merged commit 89d81e6 into envoyproxy:master Jun 27, 2019
@cpakulski
Copy link
Contributor Author

Thanks @mattklein123 for your patience and valuable comments.

I added many lines of documentaion regarding outlier detector. Most likely issue #3730 can also be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no stalebot Disables stalebot from closing an issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

outlier detection: distinguish upstream from internal errors.
4 participants