upstream: outlier detector - distinguish upstream from internal errors (#4822)

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
cpakulski authored and mattklein123 committed Jun 27, 2019
1 parent eb65b20 commit 89d81e6
Showing 23 changed files with 1,412 additions and 253 deletions.
51 changes: 48 additions & 3 deletions api/envoy/admin/v2alpha/clusters.proto
@@ -28,9 +28,15 @@ message ClusterStatus {
// Denotes whether this cluster was added via API or configured statically.
bool added_via_api = 2;

// The success rate threshold used in the last interval. The threshold is used to eject hosts
// based on their success rate. See
// :ref:`Cluster outlier detection <arch_overview_outlier_detection>` statistics
// The success rate threshold used in the last interval.
// If
// :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is *false*, all errors (externally and locally generated) were used to calculate the threshold.
// If
// :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is *true*, only externally generated errors were used to calculate the threshold.
// The threshold is used to eject hosts based on their success rate. See
// :ref:`Cluster outlier detection <arch_overview_outlier_detection>` documentation for details.
//
// Note: this field may be omitted in any of the three following cases:
//
@@ -43,6 +49,23 @@ message ClusterStatus {

// Mapping from host address to the host's current status.
repeated HostStatus host_statuses = 4;

// The success rate threshold used in the last interval when only locally originated failures were
// taken into account and externally originated errors were treated as success.
// This field should be interpreted only when
// :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is *true*. The threshold is used to eject hosts based on their success rate.
// See :ref:`Cluster outlier detection <arch_overview_outlier_detection>` documentation for
// details.
//
// Note: this field may be omitted in any of the three following cases:
//
// 1. There were not enough hosts with enough request volume to proceed with success rate based
// outlier ejection.
// 2. The threshold is computed to be < 0 because a negative value implies that there was no
// threshold for that interval.
// 3. Outlier detection is not enabled for this cluster.
envoy.type.Percent local_origin_success_rate_ejection_threshold = 5;
}

// Current state of a particular host.
@@ -57,6 +80,14 @@ message HostStatus {
HostHealthStatus health_status = 3;

// Request success rate for this host over the last calculated interval.
// If
// :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is *false*, all errors (externally and locally generated) were used in success rate
// calculation. If
// :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is *true*, only externally generated errors were used in success rate calculation.
// See :ref:`Cluster outlier detection <arch_overview_outlier_detection>` documentation for
// details.
//
// Note: the message will not be present if host did not have enough request volume to calculate
// success rate or the cluster did not have enough hosts to run through success rate outlier
@@ -71,6 +102,20 @@ message HostStatus {

// The host's priority. If not configured, the value defaults to 0 (highest priority).
uint32 priority = 7;

// Request success rate for this host over the last calculated
// interval when only locally originated errors were taken into account and externally originated
// errors were treated as success.
// This field should be interpreted only when
// :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is *true*.
// See :ref:`Cluster outlier detection <arch_overview_outlier_detection>` documentation for
// details.
//
// Note: the message will not be present if host did not have enough request volume to calculate
// success rate or the cluster did not have enough hosts to run through success rate outlier
// ejection.
envoy.type.Percent local_origin_success_rate = 8;
}

// Health status for a host.
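The two success rate fields documented above can be illustrated with a small sketch. It is not Envoy's implementation: the outcome labels and the function name `success_rates` are hypothetical, and leaving locally failed requests out of the external calculation in split mode is an assumption. Only the rule that the local origin rate treats external errors as success comes from the field comments.

```python
from collections import Counter


def success_rates(outcomes, split_external_local_origin_errors=False):
    """Return (success_rate, local_origin_success_rate) as percentages.

    `outcomes` is a list of hypothetical labels: "ok", "external_error"
    or "local_error". The second value is None when the split is off.
    """
    counts = Counter(outcomes)
    total = len(outcomes)
    if total == 0:
        return None, None

    if not split_external_local_origin_errors:
        # All errors, external and local, lower the single success rate.
        return 100.0 * counts["ok"] / total, None

    # Assumption: requests that failed locally never produced an upstream
    # response, so they are left out of the external calculation.
    reached_upstream = counts["ok"] + counts["external_error"]
    external = 100.0 * counts["ok"] / reached_upstream if reached_upstream else None
    # From the field comment: external errors are treated as success here.
    local = 100.0 * (total - counts["local_error"]) / total
    return external, local
```

With six successes, three external errors and one local error, the sketch reports a single rate of 60% when the split is off, and a local origin rate of 90% when it is on.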
41 changes: 37 additions & 4 deletions api/envoy/api/v2/cluster/outlier_detection.proto
@@ -21,7 +21,8 @@ option (gogoproto.equal_all) = true;
// See the :ref:`architecture overview <arch_overview_outlier_detection>` for
// more information on outlier detection.
message OutlierDetection {
// The number of consecutive 5xx responses before a consecutive 5xx ejection
// The number of consecutive 5xx responses or local origin errors that are mapped
// to 5xx error codes before a consecutive 5xx ejection
// occurs. Defaults to 5.
google.protobuf.UInt32Value consecutive_5xx = 1;

@@ -71,14 +72,46 @@ message OutlierDetection {
// be 1900. Defaults to 1900.
google.protobuf.UInt32Value success_rate_stdev_factor = 9;

// The number of consecutive gateway failures (502, 503, 504 status or
// connection errors that are mapped to one of those status codes) before a
// consecutive gateway failure ejection occurs. Defaults to 5.
// The number of consecutive gateway failures (502, 503, 504 status codes)
// before a consecutive gateway failure ejection occurs. Defaults to 5.
google.protobuf.UInt32Value consecutive_gateway_failure = 10;

// The % chance that a host will be actually ejected when an outlier status
// is detected through consecutive gateway failures. This setting can be
// used to disable ejection or to ramp it up slowly. Defaults to 0.
google.protobuf.UInt32Value enforcing_consecutive_gateway_failure = 11
[(validate.rules).uint32.lte = 100];

// Determines whether to distinguish local origin failures from external errors. If set to true
// the following configuration parameters are taken into account:
// :ref:`consecutive_local_origin_failure<envoy_api_field_cluster.OutlierDetection.consecutive_local_origin_failure>`,
// :ref:`enforcing_consecutive_local_origin_failure<envoy_api_field_cluster.OutlierDetection.enforcing_consecutive_local_origin_failure>`
// and
// :ref:`enforcing_local_origin_success_rate<envoy_api_field_cluster.OutlierDetection.enforcing_local_origin_success_rate>`.
// Defaults to false.
bool split_external_local_origin_errors = 12;

// The number of consecutive locally originated failures before ejection
// occurs. Defaults to 5. Parameter takes effect only when
// :ref:`split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is set to true.
google.protobuf.UInt32Value consecutive_local_origin_failure = 13;

// The % chance that a host will be actually ejected when an outlier status
// is detected through consecutive locally originated failures. This setting can be
// used to disable ejection or to ramp it up slowly. Defaults to 100.
// Parameter takes effect only when
// :ref:`split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is set to true.
google.protobuf.UInt32Value enforcing_consecutive_local_origin_failure = 14
[(validate.rules).uint32.lte = 100];

// The % chance that a host will be actually ejected when an outlier status
// is detected through success rate statistics for locally originated errors.
// This setting can be used to disable ejection or to ramp it up slowly. Defaults to 100.
// Parameter takes effect only when
// :ref:`split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is set to true.
google.protobuf.UInt32Value enforcing_local_origin_success_rate = 15
[(validate.rules).uint32.lte = 100];
}
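How `split_external_local_origin_errors` routes failures to the consecutive counters can be sketched as follows. This is a simplified model under stated assumptions, not Envoy's detector: the class `ConsecutiveFailureTracker` and its `record` method are hypothetical, gateway failures are left out for brevity, and resetting both counters on any successful result is an assumption. The defaults of 5 and the mapping of local origin errors onto the 5xx counter when the split is off come from the comments above.

```python
class ConsecutiveFailureTracker:
    """Hypothetical per-host counters; not Envoy's actual data structure."""

    def __init__(self, split=False, consecutive_5xx=5,
                 consecutive_local_origin_failure=5):
        self.split = split
        self.limits = {
            "5xx": consecutive_5xx,
            "local_origin": consecutive_local_origin_failure,
        }
        self.counts = {"5xx": 0, "local_origin": 0}

    def record(self, status=None, local_failure=False):
        """Record one result; return the triggered check name or None."""
        if local_failure:
            # Split on: local failures have their own counter.
            # Split off: they are mapped to 5xx and share the 5xx counter.
            name = "local_origin" if self.split else "5xx"
        elif status is not None and 500 <= status <= 599:
            name = "5xx"
        else:
            # Assumption: a successful result resets both counters.
            self.counts = {"5xx": 0, "local_origin": 0}
            return None
        self.counts[name] += 1
        if self.counts[name] >= self.limits[name]:
            self.counts[name] = 0
            return name
        return None
```

In this model, five consecutive local failures trigger the 5xx check when the split is off and the local origin check when it is on.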
29 changes: 27 additions & 2 deletions api/envoy/data/cluster/v2alpha/outlier_detection_event.proto
@@ -47,12 +47,37 @@ message OutlierDetectionEvent {

// Type of ejection that took place
enum OutlierEjectionType {
// In case upstream host returns certain number of consecutive 5xx
// In case upstream host returns certain number of consecutive 5xx.
// If
// :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is *false*, all types of errors are treated as HTTP 5xx errors.
// See :ref:`Cluster outlier detection <arch_overview_outlier_detection>` documentation for
// details.
CONSECUTIVE_5XX = 0;
// In case upstream host returns certain number of consecutive gateway errors
CONSECUTIVE_GATEWAY_FAILURE = 1;
// Runs over aggregated success rate statistics from every host in cluster
// and selects hosts for which ratio of successful replies deviates from other hosts
// in the cluster.
// If
// :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is *false*, all errors (externally and locally generated) are used to calculate success rate
// statistics. See :ref:`Cluster outlier detection <arch_overview_outlier_detection>`
// documentation for details.
SUCCESS_RATE = 2;
// Consecutive local origin failures: connection failures, resets, timeouts, etc.
// This type of ejection happens only when
// :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is set to *true*.
// See :ref:`Cluster outlier detection <arch_overview_outlier_detection>` documentation for
// details.
CONSECUTIVE_LOCAL_ORIGIN_FAILURE = 3;
// Runs over aggregated success rate statistics for local origin failures
// for all hosts in the cluster and selects hosts for which success rate deviates from other
// hosts in the cluster. This type of ejection happens only when
// :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`
// is set to *true*.
// See :ref:`Cluster outlier detection <arch_overview_outlier_detection>` documentation for
// details.
SUCCESS_RATE_LOCAL_ORIGIN = 4;
}

// Represents possible action applied to upstream host
@@ -74,4 +99,4 @@ message OutlierEjectSuccessRate {
}

message OutlierEjectConsecutive {
}
}
15 changes: 15 additions & 0 deletions docs/root/configuration/cluster_manager/cluster_runtime.rst
@@ -42,6 +42,11 @@ outlier_detection.consecutive_gateway_failure
<envoy_api_field_cluster.OutlierDetection.consecutive_gateway_failure>`
setting in outlier detection

outlier_detection.consecutive_local_origin_failure
:ref:`consecutive_local_origin_failure
<envoy_api_field_cluster.OutlierDetection.consecutive_local_origin_failure>`
setting in outlier detection

outlier_detection.interval_ms
:ref:`interval_ms
<envoy_api_field_cluster.OutlierDetection.interval>`
@@ -67,11 +72,21 @@ outlier_detection.enforcing_consecutive_gateway_failure
<envoy_api_field_cluster.OutlierDetection.enforcing_consecutive_gateway_failure>`
setting in outlier detection

outlier_detection.enforcing_consecutive_local_origin_failure
:ref:`enforcing_consecutive_local_origin_failure
<envoy_api_field_cluster.OutlierDetection.enforcing_consecutive_local_origin_failure>`
setting in outlier detection

outlier_detection.enforcing_success_rate
:ref:`enforcing_success_rate
<envoy_api_field_cluster.OutlierDetection.enforcing_success_rate>`
setting in outlier detection

outlier_detection.enforcing_local_origin_success_rate
:ref:`enforcing_local_origin_success_rate
<envoy_api_field_cluster.OutlierDetection.enforcing_local_origin_success_rate>`
setting in outlier detection

outlier_detection.success_rate_minimum_hosts
:ref:`success_rate_minimum_hosts
<envoy_api_field_cluster.OutlierDetection.success_rate_minimum_hosts>`
8 changes: 6 additions & 2 deletions docs/root/configuration/cluster_manager/cluster_stats.rst
@@ -133,10 +133,14 @@ statistics will be rooted at *cluster.<name>.outlier_detection.* and contain the
ejections_overflow, Counter, Number of ejections aborted due to the max ejection %
ejections_enforced_consecutive_5xx, Counter, Number of enforced consecutive 5xx ejections
ejections_detected_consecutive_5xx, Counter, Number of detected consecutive 5xx ejections (even if unenforced)
ejections_enforced_success_rate, Counter, Number of enforced success rate outlier ejections
ejections_detected_success_rate, Counter, Number of detected success rate outlier ejections (even if unenforced)
ejections_enforced_success_rate, Counter, Number of enforced success rate outlier ejections. Exact meaning of this counter depends on :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>` config item. Refer to :ref:`Outlier Detection documentation<arch_overview_outlier_detection>` for details.
ejections_detected_success_rate, Counter, Number of detected success rate outlier ejections (even if unenforced). Exact meaning of this counter depends on :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>` config item. Refer to :ref:`Outlier Detection documentation<arch_overview_outlier_detection>` for details.
ejections_enforced_consecutive_gateway_failure, Counter, Number of enforced consecutive gateway failure ejections
ejections_detected_consecutive_gateway_failure, Counter, Number of detected consecutive gateway failure ejections (even if unenforced)
ejections_enforced_consecutive_local_origin_failure, Counter, Number of enforced consecutive local origin failure ejections
ejections_detected_consecutive_local_origin_failure, Counter, Number of detected consecutive local origin failure ejections (even if unenforced)
ejections_enforced_local_origin_success_rate, Counter, Number of enforced success rate outlier ejections for locally originated failures
ejections_detected_local_origin_success_rate, Counter, Number of detected success rate outlier ejections for locally originated failures (even if unenforced)
ejections_total, Counter, Deprecated. Number of ejections due to any outlier type (even if unenforced)
ejections_consecutive_5xx, Counter, Deprecated. Number of consecutive 5xx ejections (even if unenforced)

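The relationship between the detected and enforced counters listed above can be sketched as below. The helper `record_ejection` and the plain dictionary of counters are hypothetical, and modelling the enforcing percentage as a uniform random roll is an assumption about how a "% chance" setting would be applied; this is not Envoy's code.

```python
import random


def record_ejection(stats, kind, enforcing_pct, rng=random):
    """Update hypothetical counters for one detected outlier of type `kind`.

    Returns True when the ejection is enforced.
    """
    detected = f"ejections_detected_{kind}"
    enforced = f"ejections_enforced_{kind}"
    # The detected counter increases even if the ejection is not enforced.
    stats[detected] = stats.get(detected, 0) + 1
    # Assumption: enforcement is a uniform roll in [0, 100) against the
    # configured percentage, so 0 never enforces and 100 always does.
    if rng.random() * 100 < enforcing_pct:
        stats[enforced] = stats.get(enforced, 0) + 1
        return True
    return False
```

Under this model, setting an `enforcing_*` value to 0 keeps the detected counter moving while the enforced counter stays flat, which matches the "even if unenforced" wording in the table.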