
Templating



Here are templating tips and rules used to create detectors.

Variables

Providing variables is the key to allowing users to adapt the behavior to meet their requirements.

Templating can be seen as a mechanism that defines the behavior of a detector depending on variable values, which in turn change the underlying code (Terraform or SignalFlow).

Check the full list of available variables; most of them are covered in the sections below.

Default values

Providing (or not providing) default values for variables can help (or force) users to configure their monitoring properly.

  • define a default value as a recommendation for everybody when possible.
  • do not define a default value when the choice must be left to the user. For example, if it is not possible to advise a good generic threshold, leaving it unset forces users to define their own value adapted to their use case (see the sketch below).
  • if an alerting rule is too dangerous or tricky to deploy by default, it can be disabled by default to avoid too many false alerts while still letting advanced users enable and configure it.
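
As a minimal Terraform sketch (the variable name is illustrative): omitting the default makes the variable required, so users must supply a value suited to their use case.

```hcl
# No default: Terraform will refuse to plan until the user provides a
# value, forcing an explicit, use-case-specific choice.
variable "latency_threshold_critical" {
  type        = number
  description = "Critical threshold for latency (no sensible generic default)"
}
```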

Thresholds

This is probably the most obvious templating mechanism to explain: users need a basic way to adapt the thresholds of their detectors' alerting rules to their needs.

The threshold values are used in the SignalFlow program's when function and in rule descriptions to show the limit in alerts.

You can set thresholds from the threshold variables. Alternatively, you can decide to hard-code a value when it does not make sense to change it and exposing it would only provide a way to break the detector.

Each threshold implies its own alerting rule, and a detector can have multiple rules in order to operate at different severity levels: for example, Critical for CPU > 90% and Major for CPU > 85% (and <= 90%), as in the sketch below.
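
As a minimal sketch of the pattern (names and values are illustrative, not the modules' actual code), a threshold variable flows into both the SignalFlow when condition and the rule description:

```hcl
variable "cpu_threshold_critical" {
  type        = number
  default     = 90
  description = "Critical threshold for the CPU usage detector"
}

resource "signalfx_detector" "cpu" {
  name = format("%s %s", local.detector_name_prefix, "CPU usage")

  # The threshold variable is interpolated into the SignalFlow `when` condition.
  program_text = <<-EOF
    signal = data('cpu.utilization').publish('signal')
    detect(when(signal > ${var.cpu_threshold_critical})).publish('CRIT')
  EOF

  rule {
    severity     = "Critical"
    detect_label = "CRIT"
    # The same value gives context in the alert message.
    description  = "is too high > ${var.cpu_threshold_critical}"
  }
}
```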

Rules dependency

Think about rule dependencies for a detector with multiple rules. This avoids triggering multiple alerts for the same problem.

In the previous example, we do not want to raise both Critical AND Major alerts when CPU > 90%. There is no native feature in SignalFx to configure an explicit dependency between two rules, so you will have to "negate" your first condition in the second one.

That said, you can create detectors with only one rule, or with rules that do not depend on each other.
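
For example, a sketch of the negation pattern inside a detector's program_text (thresholds are illustrative):

```hcl
# The Major condition explicitly excludes the Critical range, so only
# one rule fires for a given datapoint.
program_text = <<-EOF
  signal = data('cpu.utilization').publish('signal')
  detect(when(signal > 90)).publish('CRIT')
  detect(when(signal > 85 and signal <= 90)).publish('MAJOR')
EOF
```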

Disabling

The disabled variables make it possible to disable alerting rules at different levels, depending on the scope of the variable.

It is very similar to thresholds, but there are also higher-level variables for bulk configuration: for example, disabling all rules of a detector, or only its critical rule.
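
A minimal sketch of how such variables can cascade (the names and the coalesce-based precedence are illustrative assumptions, not necessarily the modules' exact implementation):

```hcl
variable "detectors_disabled" {
  type    = bool
  default = false
}

variable "memory_disabled" {
  type    = bool
  default = null
}

variable "memory_disabled_critical" {
  type    = bool
  default = null
}

# The most specific variable wins; the module-wide switch is the fallback.
# The result feeds the `disabled` argument of the critical rule.
locals {
  memory_critical_disabled = coalesce(
    var.memory_disabled_critical,
    var.memory_disabled,
    var.detectors_disabled,
  )
}
```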

Naming

The alert subject and body are determined from several pieces of information:

  • the environment variable value
  • the severity chosen for the rule
  • if defined, the prefixes list, which is added between the severity and the environment
  • the threshold defined for this severity's rule
  • the name argument, which must be format("%s %s", local.detector_name_prefix, "[summary]"), where summary is a short description of what the check monitors, like Memory usage
  • the rule description argument, which must describe the breached condition, like is too high. In general, we try to append the threshold value for more context, like is too high > ${var.[id]_threshold_[severity]}.

Example

Suppose the following values:

  • the id of the detector is memory (used to prefix variable names)
  • environment = "Testing"
  • severity = "Critical"
  • prefixes = ["MyClient", "MyStack"]
  • threshold = 90
  • name = format("%s %s", local.detector_name_prefix, "Memory usage")
  • rule.description = "is too high > ${var.memory_threshold_critical}"

The resulting full alert name will be:

[Critical][MyClient][MyStack][Testing] Memory usage limit is too high > 90% (98.00) on {host=xxx}

The ending ({host=xxx}) will depend on the dimensions available on the metrics used in the signals.
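
For illustration, one possible way to build such a prefix in Terraform (this is a hedged sketch, not necessarily the shared locals' actual code):

```hcl
locals {
  # Produces "[MyClient][MyStack][Testing]" from the example values above;
  # the "[Critical]" part is assumed to be prepended per rule when the
  # alert fires, as shown in the example alert name.
  detector_name_prefix = join("", formatlist("[%s]", concat(var.prefixes, [var.environment])))
}
```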

Filtering

The filtering_custom variable allows you to change the filtering behavior of every detector in the module.

  • If it is not defined (the default is null), the default filtering policy of the module is used.
  • If it is defined as the "" empty string, no filtering policy is applied at all.
  • If it is defined as a SignalFlow filtering string, it is applied in place of the default filtering policy.
  • Finally, if a string is defined and filtering_append is set to true, both the default filtering policy (based on the Tagging convention) and the custom filters are used, combined with the and logical operator.

No matter what kind of filters are applied, the underlying metrics must have the related metadata.

To support this mechanism, you have to import the common internal filtering module and use its output as filters in every detector implemented in your module.

However, you can still add other filters specific to each detector directly in the code, in addition to this mandatory one, but you should rely only on metadata provided by the data source (or at least, generally available). Otherwise, you have to document the required "custom" dimension in the local module's README.md.
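
A hedged sketch of the wiring (the module path, inputs, and output name are assumptions for illustration; check the actual internal module):

```hcl
module "filtering" {
  source = "../internal/filtering"

  filtering_default = "filter('env', '${var.environment}')"
  filtering_custom  = var.filtering_custom
  filtering_append  = var.filtering_append
}

# In a detector's program_text, the module output is interpolated as the
# filtering policy, possibly combined with detector-specific filters:
#   signal = data('memory.utilization', filter=${module.filtering.signalflow}).publish('signal')
```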

Aggregation

SignalFx does not aggregate by default, so detectors evaluate every single active MTS matching a "signal". This is very convenient in general and allows applying a detector to all reporting resources (depending on the filtering) without knowing the available metadata in advance.

You must use the aggregation_function variable in the SignalFlow program to let users change it if desired: the available dimensions can depend on multiple factors (such as the extraDimensions or disableHostDimensions parameters, or the environment itself), so the user should be able to configure it.
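
A minimal sketch of how the variable is typically interpolated (names are illustrative):

```hcl
variable "cpu_aggregation_function" {
  type        = string
  default     = "" # empty by default: evaluate each MTS individually
  description = "Aggregation function and group by, e.g. \".mean(by=['cluster'])\""
}

# Inside the detector's program_text:
#   signal = data('cpu.utilization')${var.cpu_aggregation_function}.publish('signal')
```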

When possible, leave an empty value by default so the detector works in as many scenarios as possible. But if your detector needs to work at a higher level, you will have to define a value. It highly depends on the granularity of each metric and its corresponding source.

For example, a metric can serve multiple MTS, one for each member of a cluster, but you may want to apply the detector to the entire cluster "above" each member, because the aggregated statistic makes more sense for alerting purposes.

In this case, choose only dimensions that are available everywhere, so the detector works in most cases. This could be a dimension provided by the source of the metrics itself (i.e. the agent monitor) or a globally available reserved dimension like host.

But avoid using any dimension specific to an environment or a configuration, or you will have to explain why and how in the corresponding README.

For example, we will not use the container_id dimension, which only works in containerized environments, unless the module is dedicated to this kind of environment.

Transformation

You must use the transformation_function variable in the SignalFlow program even if you do not define a default value. This can help users adapt the sensitivity of alerting rules to their situation.

That said, it is generally recommended to set a default transformation, because we do not want to raise an alert as soon as conditions are met. Transforming all datapoints over a longer timeframe into one value gives a more reliable basis for defining rules.

For example, you can use the percentile method with 90 on a latency signal to ignore the few highest datapoints.
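
A sketch of the pattern (the variable name, metric, and default are illustrative):

```hcl
variable "latency_transformation_function" {
  type        = string
  default     = ".percentile(pct=90, over='15m')" # ignore the highest datapoints
  description = "Transformation applied to the signal (empty string to disable)"
}

# Inside the detector's program_text:
#   signal = data('latency')${var.latency_transformation_function}.publish('signal')
```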

Notifications

You must use the notifications variable to define the rules' notifications.

It allows the user to bind different alert recipient lists to each possible severity. See the recommended Notifications binding to follow best practices.

Playing with severities will help convey the criticality, the purpose, and the scope of an alert.
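
For illustration, a hedged sketch of per-severity recipients (the module address is omitted; recipient strings follow the SignalFx notification format and are examples only):

```hcl
module "my_detectors" {
  source = "..."

  environment = var.environment

  notifications = {
    critical = ["PagerDuty,credentialId"]
    major    = ["Email,ops@example.com"]
    minor    = []
    warning  = []
    info     = []
  }
}
```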

No data

Think about the "no data" case and handle it. Indeed, we often create detectors around normal, steady situations. That is good, but it is also important to think about how they behave when data is absent, sparse, or irregular.

Take a load balancer 5xx errors percentage as example:

  • In most cases, the load balancer will continually receive at least a few requests.
  • However, in some cases the traffic can be very rare and erratic. For example, in a testing environment, we could expect irregular traffic depending on developers' tests.

Now, imagine the load balancer receives only one request, resulting in a 5xx error, and no more traffic after that for a long time, say one day.

  • The percentage of errors will be considered high (one request received, one request in 5xx => 100% errors).
  • But with only one request in error, not persisting over a long enough period, we do not want to raise an alert.
  • However, without traffic there is no new data coming in, yet the detector is still evaluated, even though it is based on only one irrelevant datapoint.
  • Here, this single datapoint is an error, so the detector will raise an alert, and the alert could remain until new data comes in "to reset" the evaluation, here one day later.

To avoid this kind of undesired behavior, we encourage you to use an extrapolation policy or the fill SignalFlow method, as in the sketch below.
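
A hedged sketch inside a detector's program_text (metric names and thresholds are illustrative):

```hcl
# extrapolation='zero' makes absent datapoints count as 0, so the stale
# 100% error ratio does not persist once traffic stops; lasting requires
# the condition to hold for 15 minutes before alerting. The fill()
# stream method is a per-stream alternative.
program_text = <<-EOF
  errors   = data('lb.5xx', extrapolation='zero')
  requests = data('lb.requests', extrapolation='zero')
  signal = (errors / requests).scale(100).publish('signal')
  detect(when(signal > 90, lasting='15m')).publish('CRIT')
EOF
```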