upstream: outlier detector - distinguish upstream from internal errors #4822

Merged Jun 27, 2019 (59 commits)
Changes from 53 commits
Commits
5530d11
Outlier detector: connection failures are handled separately from 5xx…
cpakulski Oct 1, 2018
44d815b
Merge branch 'master' into issue/3643
cpakulski Oct 22, 2018
09bba39
Fixed test directory compile problems after rebase.
cpakulski Oct 23, 2018
9784ddd
Outlier detector:
cpakulski Nov 12, 2018
f0a8660
Corrected unit tests to use new API to outlier detector.
cpakulski Nov 12, 2018
1954b03
Corrected tcp_proxy unit test to use new outlier detector API.
cpakulski Nov 12, 2018
535cddc
- Added separate Success Rate monitor for local origin errors.
cpakulski Dec 13, 2018
cab57f3
Merge branch 'master' into issue/3643
cpakulski Dec 13, 2018
41fbe26
Updated docs to reflect separation of local origin and external origin
cpakulski Dec 14, 2018
af046e2
Corrected spelling errors and compile errors.
cpakulski Dec 14, 2018
9e6ca10
Corrected docs formatting.
cpakulski Dec 14, 2018
fcd5ec7
Merge branch 'master' into issue/3643
cpakulski Jan 3, 2019
fea1434
Changes after code review:
cpakulski Feb 1, 2019
0fe3d1b
Merge branch 'master' into issue/3643
cpakulski Feb 4, 2019
3c6a0f9
Run formatting tool.
cpakulski Feb 4, 2019
6ed3be2
Merge branch 'master' into issue/3643
cpakulski Feb 4, 2019
b820bf4
Merge branch 'master' into issue/3643
cpakulski Feb 4, 2019
551275f
Corrected spelling mistakes.
cpakulski Feb 4, 2019
7484049
Merge branch 'master' into issue/3643
cpakulski Feb 22, 2019
61a298c
Added split_external_local_origin_error config parameter. If enabled
cpakulski Mar 27, 2019
420eab0
Merge branch 'master' into issue/3643
cpakulski Apr 2, 2019
91c256f
Bring code to required functionality after the rebase from master.
cpakulski Apr 11, 2019
2edef1c
Corrected unit test to match code logic.
cpakulski Apr 11, 2019
18a42ed
Updated documentation after PR review.
cpakulski Apr 18, 2019
c41fb86
Merge branch 'master' into issue/3643
cpakulski Apr 18, 2019
93fa804
Added release note.
cpakulski Apr 18, 2019
0c09aa4
Corrected format in release notes.
cpakulski Apr 19, 2019
00c8289
Converted new redis unit test cases to use new API to outlier detector.
cpakulski Apr 19, 2019
e8cc31c
Merge branch 'master' into issue/3643
cpakulski Apr 24, 2019
6fb5367
After rebase:
cpakulski Apr 24, 2019
2217d99
Updated documentation after code review.
cpakulski Apr 25, 2019
8fa84b9
Changed back to use enum to distinguish localOrigin and externalOrigin
cpakulski May 3, 2019
10cdab0
Small style corrections after code review.
cpakulski May 3, 2019
6c3d55a
Removed hash tables storing success rate monitors and numbers and
cpakulski May 4, 2019
5d24bb0
Merge branch 'master' into issue/3643
cpakulski May 4, 2019
c0cc0e3
Changed DetectorHostMonitor::SuccessRateMonitorType enum to be scoped.
cpakulski May 7, 2019
9234892
Small style changes: use snake case for variable names.
cpakulski May 7, 2019
4ba5f89
Merge branch 'master' into issue/3643
cpakulski May 7, 2019
5fe4bc0
Corrections after code review:
cpakulski May 10, 2019
6301eab
Updated description for CONSECUTIVE_LOCAL_ORIGIN_FAILURE log type.
cpakulski May 10, 2019
8e0641b
Added comments after code review.
cpakulski May 14, 2019
9703454
Renamed outlier detector's non-http events to contain prefix indicating
cpakulski May 16, 2019
8e3aea0
Removed __attribute__((fallthrough)) statement as it was causing
cpakulski May 16, 2019
eaaff40
Fixed compile error.
cpakulski May 16, 2019
0707339
Merge branch 'master' into issue/3643
cpakulski May 17, 2019
9792f90
Added several tests to bring coverage to ~100%.
cpakulski May 17, 2019
3840f6d
Added comment to test case.
cpakulski May 17, 2019
ec44238
Removed complicated logic of not mapping LOCAL_ORIGIN_CONNECT_SUCCESS
cpakulski May 22, 2019
43f69b6
Fixed format issue.
cpakulski May 22, 2019
4216f8e
Merge branch 'master' into issue/3643
cpakulski May 28, 2019
cd69579
Merge branch 'master' into issue/3643
cpakulski May 29, 2019
08584ba
Merge branch 'master' into issue/3643
cpakulski May 29, 2019
feed828
Merge branch 'master' into issue/3643
cpakulski Jun 4, 2019
30fa5c7
Added new code for local origin errors to indicate that there is no
cpakulski Jun 10, 2019
e8e9b1c
Merge branch 'master' into issue/3643
cpakulski Jun 10, 2019
a5c251a
Merge branch 'master' into issue/3643
cpakulski Jun 24, 2019
e85f07a
Style corrections after code review.
cpakulski Jun 27, 2019
d91fb30
Merge branch 'master' into issue/3643
cpakulski Jun 27, 2019
0049372
Changed successRate method to be const.
cpakulski Jun 27, 2019
51 changes: 48 additions & 3 deletions api/envoy/admin/v2alpha/clusters.proto
@@ -28,9 +28,15 @@ message ClusterStatus {
// Denotes whether this cluster was added via API or configured statically.
bool added_via_api = 2;

- // The success rate threshold used in the last interval. The threshold is used to eject hosts
- // based on their success rate. See
- // :ref:`Cluster outlier detection <arch_overview_outlier_detection>` statistics
+ // The success rate threshold used in the last interval.
+ // If
+ // :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
+ // is *false*, all errors, both externally and locally generated, were used to calculate the
+ // threshold.
+ // If
+ // :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
+ // is *true*, only externally generated errors were used to calculate the threshold.
+ // The threshold is used to eject hosts based on their success rate. See
+ // :ref:`Cluster outlier detection <arch_overview_outlier_detection>` documentation for details.
//
// Note: this field may be omitted in any of the three following cases:
//
@@ -43,6 +49,23 @@ message ClusterStatus {

// Mapping from host address to the host's current status.
repeated HostStatus host_statuses = 4;

// The success rate threshold used in the last interval when only locally originated failures were
// taken into account and externally originated errors were treated as success.
// This field should be interpreted only when
// :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is *true*. The threshold is used to eject hosts based on their success rate.
// See :ref:`Cluster outlier detection <arch_overview_outlier_detection>` documentation for
// details.
//
// Note: this field may be omitted in any of the three following cases:
//
// 1. There were not enough hosts with enough request volume to proceed with success rate based
// outlier ejection.
// 2. The threshold is computed to be < 0 because a negative value implies that there was no
// threshold for that interval.
// 3. Outlier detection is not enabled for this cluster.
envoy.type.Percent local_origin_success_rate_ejection_threshold = 5;
}

// Current state of a particular host.
@@ -57,6 +80,14 @@ message HostStatus {
HostHealthStatus health_status = 3;

// Request success rate for this host over the last calculated interval.
// If
// :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is *false*, all errors: externally and locally generated were used in success rate
// calculation. If
// :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is *true*, only externally generated errors were used in success rate calculation.
// See :ref:`Cluster outlier detection <arch_overview_outlier_detection>` documentation for
// details.
//
// Note: the message will not be present if host did not have enough request volume to calculate
// success rate or the cluster did not have enough hosts to run through success rate outlier
@@ -65,6 +96,20 @@ message HostStatus {

// The host's weight. If not configured, the value defaults to 1.
uint32 weight = 5;

// Request success rate for this host over the last calculated
// interval when only locally originated errors are taken into account and externally originated
// errors were treated as success.
// This field should be interpreted only when
// :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is *true*.
// See :ref:`Cluster outlier detection <arch_overview_outlier_detection>` documentation for
// details.
//
// Note: the message will not be present if host did not have enough request volume to calculate
// success rate or the cluster did not have enough hosts to run through success rate outlier
// ejection.
envoy.type.Percent local_origin_success_rate = 6;
}

// Health status for a host.
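
Review note: the field comments above describe two success rates whose meaning depends on `split_external_local_origin_errors`. The following is a minimal Python sketch of that description, not Envoy's implementation. The `Outcome` enum, the `success_rates` function, and the choice to use one shared request count as the denominator for both rates are assumptions made for illustration.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical request outcomes; the names are for illustration only."""
    SUCCESS = 1
    EXTERNAL_ERROR = 2  # error reported by the upstream host, e.g. an HTTP 5xx
    LOCAL_ERROR = 3     # locally detected error, e.g. connect failure or timeout


def success_rates(outcomes, split_external_local_origin_errors):
    """Return (success_rate, local_origin_success_rate) as percentages.

    A value is None when it is not available or should not be interpreted.
    """
    total = len(outcomes)
    if total == 0:
        return None, None
    external = sum(1 for o in outcomes if o is Outcome.EXTERNAL_ERROR)
    local = sum(1 for o in outcomes if o is Outcome.LOCAL_ERROR)
    if not split_external_local_origin_errors:
        # All errors, external and local, count against a single success rate.
        return 100.0 * (total - external - local) / total, None
    # Split mode: each rate counts only its own kind of error as a failure,
    # so external errors are treated as success for the local origin rate.
    return (100.0 * (total - external) / total,
            100.0 * (total - local) / total)
```

With six successes, three external errors, and one local error, the sketch yields a single rate of 60% when the flag is false, and 70% (external) plus 90% (local origin) when it is true.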
41 changes: 37 additions & 4 deletions api/envoy/api/v2/cluster/outlier_detection.proto
@@ -20,7 +20,8 @@ option (gogoproto.equal_all) = true;
// See the :ref:`architecture overview <arch_overview_outlier_detection>` for
// more information on outlier detection.
message OutlierDetection {
- // The number of consecutive 5xx responses before a consecutive 5xx ejection
+ // The number of consecutive 5xx responses or local origin errors that are mapped
+ // to 5xx error codes before a consecutive 5xx ejection
// occurs. Defaults to 5.
google.protobuf.UInt32Value consecutive_5xx = 1;

@@ -70,14 +71,46 @@ message OutlierDetection {
// be 1900. Defaults to 1900.
google.protobuf.UInt32Value success_rate_stdev_factor = 9;

- // The number of consecutive gateway failures (502, 503, 504 status or
- // connection errors that are mapped to one of those status codes) before a
- // consecutive gateway failure ejection occurs. Defaults to 5.
+ // The number of consecutive gateway failures (502, 503, 504 status codes)
+ // before a consecutive gateway failure ejection occurs. Defaults to 5.
google.protobuf.UInt32Value consecutive_gateway_failure = 10;

// The % chance that a host will be actually ejected when an outlier status
// is detected through consecutive gateway failures. This setting can be
// used to disable ejection or to ramp it up slowly. Defaults to 0.
google.protobuf.UInt32Value enforcing_consecutive_gateway_failure = 11
[(validate.rules).uint32.lte = 100];

// Determines whether to distinguish local origin failures from external errors. If set to true
// the following configuration parameters are taken into account:
// :ref:`consecutive_local_origin_failure<envoy_api_field_cluster.OutlierDetection.consecutive_local_origin_failure>`,
// :ref:`enforcing_consecutive_local_origin_failure<envoy_api_field_cluster.OutlierDetection.enforcing_consecutive_local_origin_failure>`
// and
// :ref:`enforcing_local_origin_success_rate<envoy_api_field_cluster.OutlierDetection.enforcing_local_origin_success_rate>`.
// Defaults to false.
bool split_external_local_origin_errors = 12;

// The number of consecutive locally originated failures before ejection
// occurs. Defaults to 5. Parameter takes effect only when
// :ref:`split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is set to true.
google.protobuf.UInt32Value consecutive_local_origin_failure = 13;

// The % chance that a host will be actually ejected when an outlier status
// is detected through consecutive locally originated failures. This setting can be
// used to disable ejection or to ramp it up slowly. Defaults to 100.
// Parameter takes effect only when
// :ref:`split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is set to true.
google.protobuf.UInt32Value enforcing_consecutive_local_origin_failure = 14
[(validate.rules).uint32.lte = 100];

// The % chance that a host will be actually ejected when an outlier status
// is detected through success rate statistics for locally originated errors.
// This setting can be used to disable ejection or to ramp it up slowly. Defaults to 100.
// Parameter takes effect only when
// :ref:`split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is set to true.
google.protobuf.UInt32Value enforcing_local_origin_success_rate = 15
[(validate.rules).uint32.lte = 100];
}
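
Review note: a minimal Python sketch of how the consecutive-failure settings above could interact with `split_external_local_origin_errors`. It is not Envoy's code. The `OutlierConfig` and `ConsecutiveTracker` names and the string outcome labels are hypothetical. The reset rule (a run is reset by any outcome that is not a failure for that run) is an assumption; the defaults of 5 come from the field comments.

```python
from dataclasses import dataclass


@dataclass
class OutlierConfig:
    # Defaults taken from the field comments in the diff above.
    consecutive_5xx: int = 5
    split_external_local_origin_errors: bool = False
    consecutive_local_origin_failure: int = 5


class ConsecutiveTracker:
    """Tracks runs of failures for one host (illustrative only)."""

    def __init__(self, config):
        self.config = config
        self.external_run = 0
        self.local_run = 0

    def record(self, outcome):
        """Record 'success', 'external_error' or 'local_error'.

        Returns the name of the detected ejection type, or None.
        """
        if outcome not in ("success", "external_error", "local_error"):
            raise ValueError(f"unknown outcome: {outcome}")
        split = self.config.split_external_local_origin_errors
        if outcome == "local_error" and split:
            # Split mode: local errors feed their own counter.
            self.local_run += 1
            self.external_run = 0  # assumption: not a failure for this run
            if self.local_run >= self.config.consecutive_local_origin_failure:
                self.local_run = 0
                return "CONSECUTIVE_LOCAL_ORIGIN_FAILURE"
            return None
        if outcome != "success":
            # External errors, or local errors mapped to 5xx when not split.
            self.external_run += 1
            self.local_run = 0  # assumption: not a failure for this run
            if self.external_run >= self.config.consecutive_5xx:
                self.external_run = 0
                return "CONSECUTIVE_5XX"
            return None
        self.external_run = 0
        self.local_run = 0
        return None
```

With the default configuration, five local errors in a row are reported as `CONSECUTIVE_5XX`; with the split flag set, the same sequence is reported as `CONSECUTIVE_LOCAL_ORIGIN_FAILURE`. Whether a detected ejection is then enforced depends on the corresponding `enforcing_*` percentage, which this sketch does not model.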
29 changes: 27 additions & 2 deletions api/envoy/data/cluster/v2alpha/outlier_detection_event.proto
@@ -47,12 +47,37 @@ message OutlierDetectionEvent {

// Type of ejection that took place
enum OutlierEjectionType {
- // In case upstream host returns certain number of consecutive 5xx
+ // In case upstream host returns certain number of consecutive 5xx.
+ // If
+ // :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
+ // is *false*, all types of errors are treated as HTTP 5xx errors.
+ // See :ref:`Cluster outlier detection <arch_overview_outlier_detection>` documentation for
+ // details.
CONSECUTIVE_5XX = 0;
// In case upstream host returns certain number of consecutive gateway errors
CONSECUTIVE_GATEWAY_FAILURE = 1;
// Runs over aggregated success rate statistics from every host in cluster
// and selects hosts for which ratio of successful replies deviates from other hosts
// in the cluster.
// If
// :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is *false*, all errors (externally and locally generated) are used to calculate success rate
// statistics. See :ref:`Cluster outlier detection <arch_overview_outlier_detection>`
// documentation for details.
SUCCESS_RATE = 2;
// Consecutive local origin failures: connection failures, resets, timeouts, etc.
// This type of ejection happens only when
// :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is set to *true*.
// See :ref:`Cluster outlier detection <arch_overview_outlier_detection>` documentation for
// details.
CONSECUTIVE_LOCAL_ORIGIN_FAILURE = 3;
// Runs over aggregated success rate statistics for local origin failures
// for all hosts in the cluster and selects hosts for which success rate deviates from other
// hosts in the cluster. This type of ejection happens only when
// :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is set to *true*.
// See :ref:`Cluster outlier detection <arch_overview_outlier_detection>` documentation for
// details.
SUCCESS_RATE_LOCAL_ORIGIN = 4;
}

// Represents possible action applied to upstream host
@@ -74,4 +99,4 @@ message OutlierEjectSuccessRate {
}

message OutlierEjectConsecutive {
- }
+ }
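
Review note: the enum comments above say that two of the ejection types occur only when `split_external_local_origin_errors` is *true*. A small Python sketch of that rule follows. The enum values mirror the proto; the `LOCAL_ORIGIN_ONLY` set and the `possible_ejection_types` helper are hypothetical names introduced for illustration.

```python
from enum import IntEnum


class OutlierEjectionType(IntEnum):
    # Numeric values mirror the proto enum in the diff above.
    CONSECUTIVE_5XX = 0
    CONSECUTIVE_GATEWAY_FAILURE = 1
    SUCCESS_RATE = 2
    CONSECUTIVE_LOCAL_ORIGIN_FAILURE = 3
    SUCCESS_RATE_LOCAL_ORIGIN = 4


# Types documented as happening only when the split flag is set to true.
LOCAL_ORIGIN_ONLY = frozenset({
    OutlierEjectionType.CONSECUTIVE_LOCAL_ORIGIN_FAILURE,
    OutlierEjectionType.SUCCESS_RATE_LOCAL_ORIGIN,
})


def possible_ejection_types(split_external_local_origin_errors):
    """List the ejection types that can be reported for a given flag value."""
    return [
        t for t in OutlierEjectionType
        if split_external_local_origin_errors or t not in LOCAL_ORIGIN_ONLY
    ]
```

A consumer of the event log could use such a helper to flag unexpected events, for example a `SUCCESS_RATE_LOCAL_ORIGIN` event from a cluster that was configured with the flag unset.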
15 changes: 15 additions & 0 deletions docs/root/configuration/cluster_manager/cluster_runtime.rst
@@ -42,6 +42,11 @@ outlier_detection.consecutive_gateway_failure
<envoy_api_field_cluster.OutlierDetection.consecutive_gateway_failure>`
setting in outlier detection

outlier_detection.consecutive_local_origin_failure
:ref:`consecutive_local_origin_failure
<envoy_api_field_cluster.OutlierDetection.consecutive_local_origin_failure>`
setting in outlier detection

outlier_detection.interval_ms
:ref:`interval_ms
<envoy_api_field_cluster.OutlierDetection.interval>`
@@ -67,11 +72,21 @@ outlier_detection.enforcing_consecutive_gateway_failure
<envoy_api_field_cluster.OutlierDetection.enforcing_consecutive_gateway_failure>`
setting in outlier detection

outlier_detection.enforcing_consecutive_local_origin_failure
:ref:`enforcing_consecutive_local_origin_failure
<envoy_api_field_cluster.OutlierDetection.enforcing_consecutive_local_origin_failure>`
setting in outlier detection

outlier_detection.enforcing_success_rate
:ref:`enforcing_success_rate
<envoy_api_field_cluster.OutlierDetection.enforcing_success_rate>`
setting in outlier detection

outlier_detection.enforcing_local_origin_success_rate
:ref:`enforcing_local_origin_success_rate
<envoy_api_field_cluster.OutlierDetection.enforcing_local_origin_success_rate>`
setting in outlier detection

outlier_detection.success_rate_minimum_hosts
:ref:`success_rate_minimum_hosts
<envoy_api_field_cluster.OutlierDetection.success_rate_minimum_hosts>`
8 changes: 6 additions & 2 deletions docs/root/configuration/cluster_manager/cluster_stats.rst
@@ -133,10 +133,14 @@ statistics will be rooted at *cluster.<name>.outlier_detection.* and contain the
ejections_overflow, Counter, Number of ejections aborted due to the max ejection %
ejections_enforced_consecutive_5xx, Counter, Number of enforced consecutive 5xx ejections
ejections_detected_consecutive_5xx, Counter, Number of detected consecutive 5xx ejections (even if unenforced)
- ejections_enforced_success_rate, Counter, Number of enforced success rate outlier ejections
- ejections_detected_success_rate, Counter, Number of detected success rate outlier ejections (even if unenforced)
+ ejections_enforced_success_rate, Counter, Number of enforced success rate outlier ejections. Exact meaning of this counter depends on :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>` config item. Refer to :ref:`Outlier Detection documentation<arch_overview_outlier_detection>` for details.
+ ejections_detected_success_rate, Counter, Number of detected success rate outlier ejections (even if unenforced). Exact meaning of this counter depends on :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>` config item. Refer to :ref:`Outlier Detection documentation<arch_overview_outlier_detection>` for details.
ejections_enforced_consecutive_gateway_failure, Counter, Number of enforced consecutive gateway failure ejections
ejections_detected_consecutive_gateway_failure, Counter, Number of detected consecutive gateway failure ejections (even if unenforced)
ejections_enforced_consecutive_local_origin_failure, Counter, Number of enforced consecutive local origin failure ejections
ejections_detected_consecutive_local_origin_failure, Counter, Number of detected consecutive local origin failure ejections (even if unenforced)
ejections_enforced_local_origin_success_rate, Counter, Number of enforced success rate outlier ejections for locally originated failures
ejections_detected_local_origin_success_rate, Counter, Number of detected success rate outlier ejections for locally originated failures (even if unenforced)
ejections_total, Counter, Deprecated. Number of ejections due to any outlier type (even if unenforced)
ejections_consecutive_5xx, Counter, Deprecated. Number of consecutive 5xx ejections (even if unenforced)
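
Review note: each ejection type in the table has a `detected` and an `enforced` counter, and the detected counters include unenforced detections. A minimal Python sketch of pairing them follows, assuming the counters have already been read into a plain dict keyed by the stat name relative to *cluster.<name>.outlier_detection.*. The `unenforced_ejections` helper is hypothetical, and the sample values in the usage note are made up.

```python
PREFIX_DETECTED = "ejections_detected_"
PREFIX_ENFORCED = "ejections_enforced_"


def unenforced_ejections(counters):
    """Map each ejection kind to detected minus enforced ejections.

    `counters` maps stat names (relative to the outlier_detection root)
    to integer values. Stats without the detected prefix are ignored.
    """
    result = {}
    for name, detected in counters.items():
        if not name.startswith(PREFIX_DETECTED):
            continue
        kind = name[len(PREFIX_DETECTED):]
        # A missing enforced counter is treated as zero (assumption).
        enforced = counters.get(PREFIX_ENFORCED + kind, 0)
        result[kind] = detected - enforced
    return result
```

For example, with `ejections_detected_consecutive_local_origin_failure` at 7 and `ejections_enforced_consecutive_local_origin_failure` at 5, the helper reports 2 detections of that kind that were not enforced.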
