Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Metrics UI] Fix No Data alerts on Metric Threshold with groupBy #111465

Merged
merged 17 commits into from
Sep 23, 2021

Conversation

Zacqary
Copy link
Contributor

@Zacqary Zacqary commented Sep 7, 2021

Summary

Fixes #111772

This fixes the Alert me when there's no data option on a Metric Threshold alert that uses an Alert per setting. Any previously detected groups will be stored in the rule state, and if they fail to be detected on the next execution, they'll be backfilled with null values so that they can report no data.

UPDATE: Document count aggregated values will be backfilled with 0 instead, fixing the behavior of doc_count <= 0 rules.

Testing

The integration test should cover this scenario, but to ensure it works with the actual Kibana stack:

  1. Start ingesting data from Metricbeat. Ideally with multiple nodes, but one node is sufficient to ensure this is working.
  2. Create a metric threshold alert rule with a groupBy value, e.g. host.name, and turn on Alert Me When There's No Data. Add a server log action for the Alert state. Either set a threshold that will not fire, or turn off the Alert on Status Change throttling, to ensure that you'll see a No Data alert (as No Data and Alert use the same status).
  3. Stop ingesting data. If you're ingesting data with multiple nodes, turn off only some of them, if possible.
  4. Ensure that you get a No Data alert from only the nodes you stopped ingesting data from.

Checklist

@Zacqary Zacqary added release_note:fix Feature:Metrics UI Metrics UI feature v8.0.0 Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v7.16.0 auto-backport Deprecated - use backport:version if exact versions are needed labels Sep 7, 2021
@Zacqary Zacqary marked this pull request as ready for review September 8, 2021 16:27
@Zacqary Zacqary requested a review from a team as a code owner September 8, 2021 16:27
@elasticmachine
Copy link
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@Zacqary Zacqary enabled auto-merge (squash) September 8, 2021 19:37
@Zacqary Zacqary added the v7.15.1 label Sep 9, 2021
@Zacqary
Copy link
Contributor Author

Zacqary commented Sep 9, 2021

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

user doesn't have permission to update head repository

@Zacqary
Copy link
Contributor Author

Zacqary commented Sep 9, 2021

Yes I do kibanamachine what are you talking about

@Zacqary
Copy link
Contributor Author

Zacqary commented Sep 9, 2021

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

user doesn't have permission to update head repository

@Zacqary
Copy link
Contributor Author

Zacqary commented Sep 9, 2021

Well imma break the law then whaddaya think of that

@Zacqary
Copy link
Contributor Author

Zacqary commented Sep 9, 2021

Lol kibanamachine you don't own me

@tylersmalley
Copy link
Contributor

@Zacqary ensure that Allow edits and access to secrets by maintainers is checked on this PR. It should be located at the bottom of the right column.

@Zacqary
Copy link
Contributor Author

Zacqary commented Sep 9, 2021

@tylersmalley Ahh thank you, not sure how that got unchecked

I blame kibanamachine. It has it out for me

@neptunian
Copy link
Contributor

neptunian commented Sep 14, 2021

Some behavior I encountered:

  • This is probably known issue: If I create a rule with no groupBy and select "alert me if there's no data", I receive the no data alert for * when there is no data. However if I change the groupBy to something like host.name I continue to receive the "No data" alert for * and do not get the "no data alert" for host.name. So no data alert when using groupBy only works if there was data previously for the field in groupBy.

  • When creating a rule where the threshold is being met to trigger an alert I always receive the warnings whether or not using groupBy: server log [10:03:16.642] [warning][infra][metricsRules][plugins] Could not find alert data for host-0

  • If an alert is triggered when using groupBy and "alert me if there's no data" and the groupBy and the condition is met, I get the threshold alert. If i change the groupBy to something different I now receive a "no data" alert on the previous group which is probably unexpected behavior to the user since there is data and we are just not using that groupBy anymore. Here is an example of switching groupBy from host.name to host.id and there is always data:
    Screen Shot 2021-09-14 at 10 34 35 AM

  • If an alert is triggered when using groupBy and "alert me if there's no data" and the groupBy field is changed to something different, it will continue to always alert the old/missing groupBy field alert for "no data". You will have to go into Stack Management to mute the individual alerts for each groupBy.

@neptunian
Copy link
Contributor

Looks good. Something that came up during a meeting:

If there is no data for a rule, the consensus seemed to be that they don't expect a "no data" alert per group. That they would just expect the single no data alert. @jasonrhodes correct? Sounds more like a separate rule with this behavior.

@jasonrhodes
Copy link
Member

@neptunian yeah, this is why I think clarifying no-data + group by vs document count 0 + group by is going to be essential before we merge anything. Let's hash it out in a design doc and come back here to summarize.

@Zacqary
Copy link
Contributor Author

Zacqary commented Sep 15, 2021

Let me see if I understand the expected behavior:

1. When alerting on a regular aggregation

  • When avg of system.cpu.user.pct is above 1.0 for the last 1 minute
  • Alert per: host.name
  • Every 1 minute

00:00:00

  • host-1: CPU usage at 0.6
  • host-2: CPU usage at 1.2
  • host-3: CPU usage at 0.4
  • Expected result: host-1 and host-3 are OK, host-2 is Active (Fired)

00:01:00

  • host-1: CPU usage at 0.4
  • host-2: CPU usage at 1.5
  • host-3: No data received
  • Actual result from this PR: host-1 is OK, host-2 is Active (Fired), host-3 is Active (No Data)
  • Expected result as per discussion: * is Active (No Data)

2. When alerting on a document count

  • When document count is below or equal to 0 for the last 1 minute
  • Alert per: host.name
  • Every 1 minute

00:00:00

  • host-1: 35 docs
  • host-2: 35 docs
  • host-3: 35 docs
  • Expected result: host-1, host-2, host-3 are OK

00:01:00

  • host-1: 35 docs
  • host-2: 35 docs
  • host-3: 0 docs
  • Actual result from this PR: host-1 and host-2 are OK, host-3 is Alert (No Data)
  • Expected result as per discussion: host-1 and host-2 are OK, host-3 is Alert (Fired)

The expected result of scenario 2 is what I want to see, and this PR doesn't get us there. I'm not sure I understand why the expected result of scenario 1 is more desirable than the actual result from this PR.

@neptunian
Copy link
Contributor

@neptunian yeah, this is why I think clarifying no-data + group by vs document count 0 + group by is going to be essential before we merge anything. Let's hash it out in a design doc and come back here to summarize.

Agreed. I do find having this PR helpful, for me anyway, to understand all the gotchas and technical limitations in coming up with a solution.

@Zacqary I agree. I think we need to define what "alert me if there's no data" means to the user. Whether its for no data across the board or for no documents for a particular group.

@Zacqary
Copy link
Contributor Author

Zacqary commented Sep 16, 2021

@neptunian For what it's worth, I know of at least one other prominent, widely-used alerting stack in which No Data alerts fire on a per-group basis when you're dealing with a groupBy alert. I don't think we need to decide what "alert me on no data" means in the vacuum of our own software, I think we need to decide if we're deviating from the behavior that users in the space already expect, and why.

@Zacqary
Copy link
Contributor Author

Zacqary commented Sep 16, 2021

So as it turns out, getting Document Count on a group to correctly report 0 instead of reporting as No Data is a one-line fix to this PR. I'm going to commit that just so a doc_count = 0 threshold functions as expected (since No Data isn't enabled by default in the UI).

This change would make the No Data option well and truly irrelevant for the Document Count aggregator. Not sure if disabling that checkbox in the UI is within the scope of this PR.

@Zacqary
Copy link
Contributor Author

Zacqary commented Sep 20, 2021

@elasticmachine merge upstream

@neptunian
Copy link
Contributor

So as it turns out, getting Document Count on a group to correctly report 0 instead of reporting as No Data is a one-line fix to this PR. I'm going to commit that just so a doc_count = 0 threshold functions as expected (since No Data isn't enabled by default in the UI).

This change would make the No Data option well and truly irrelevant for the Document Count aggregator. Not sure if disabling that checkbox in the UI is within the scope of this PR.

Were you going to commit that in a separate PR?

@Zacqary
Copy link
Contributor Author

Zacqary commented Sep 21, 2021

@neptunian Sorry looks like it didn't merge when I tried to push it last week. It's committed now.

Copy link
Member

@simianhacker simianhacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I created an alert for doc_count <= 0 with Alert on no data. Then I setup Slingshot to create data for 10 hosts, reduce it down to 5, and then finally to 0 (no data). After a bit, I started 5 back up, then 10, and back to 0; everything worked as expected.

image

@Zacqary
Copy link
Contributor Author

Zacqary commented Sep 22, 2021

@elasticmachine merge upstream

@Zacqary
Copy link
Contributor Author

Zacqary commented Sep 22, 2021

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@Zacqary Zacqary merged commit 4ac8bd2 into elastic:master Sep 23, 2021
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Sep 23, 2021
…stic#111465)

* [Metrics UI] Fix No Data alerts on Metric Threshold with groupBy

* Fix typecheck

* Add integration test for no data groupBy scenario

* Uncomment test files

* Uncomment test files

* Reset stored groups when groupBy parameter changes

* Test for groupBy arrays

* Fix initial No Data result when no groups are detected yet

* Fix linting error

* Fix detecting groups with doc count 0

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Sep 23, 2021
…stic#111465)

* [Metrics UI] Fix No Data alerts on Metric Threshold with groupBy

* Fix typecheck

* Add integration test for no data groupBy scenario

* Uncomment test files

* Uncomment test files

* Reset stored groups when groupBy parameter changes

* Test for groupBy arrays

* Fix initial No Data result when no groups are detected yet

* Fix linting error

* Fix detecting groups with doc count 0

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@kibanamachine
Copy link
Contributor

💚 Backport successful

Status Branch Result
7.x
7.15

The backport PRs will be merged automatically after passing CI.

@Zacqary Zacqary deleted the groupby-nodata-fix branch September 23, 2021 18:22
kibanamachine added a commit that referenced this pull request Sep 23, 2021
…1465) (#113022)

* [Metrics UI] Fix No Data alerts on Metric Threshold with groupBy

* Fix typecheck

* Add integration test for no data groupBy scenario

* Uncomment test files

* Uncomment test files

* Reset stored groups when groupBy parameter changes

* Test for groupBy arrays

* Fix initial No Data result when no groups are detected yet

* Fix linting error

* Fix detecting groups with doc count 0

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Co-authored-by: Zacqary Adam Xeper <Zacqary@users.noreply.github.com>
kibanamachine added a commit that referenced this pull request Sep 23, 2021
…1465) (#113021)

* [Metrics UI] Fix No Data alerts on Metric Threshold with groupBy

* Fix typecheck

* Add integration test for no data groupBy scenario

* Uncomment test files

* Uncomment test files

* Reset stored groups when groupBy parameter changes

* Test for groupBy arrays

* Fix initial No Data result when no groups are detected yet

* Fix linting error

* Fix detecting groups with doc count 0

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

Co-authored-by: Zacqary Adam Xeper <Zacqary@users.noreply.github.com>
lykkin pushed a commit to lykkin/kibana that referenced this pull request Sep 28, 2021
…stic#111465)

* [Metrics UI] Fix No Data alerts on Metric Threshold with groupBy

* Fix typecheck

* Add integration test for no data groupBy scenario

* Uncomment test files

* Uncomment test files

* Reset stored groups when groupBy parameter changes

* Test for groupBy arrays

* Fix initial No Data result when no groups are detected yet

* Fix linting error

* Fix detecting groups with doc count 0

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Deprecated - use backport:version if exact versions are needed Feature:Metrics UI Metrics UI feature release_note:fix Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v7.15.1 v7.16.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Metrics UI] Metric threshold rule type: fix group by + 0 check
7 participants