[Metrics UI] Create new Inventory Anomaly alert #74809

Closed
phillipb opened this issue Aug 11, 2020 · 10 comments · Fixed by #89244
Labels
Feature:Metrics UI Metrics UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services

Comments

@phillipb
Contributor

phillipb commented Aug 11, 2020

Add a new alert for ML anomaly data.

[Screenshot: Screen Shot 2020-08-11 at 3 21 29 PM]

Acceptance Criteria:

  • Should be able to create an alert from the dropdown in the header on the inventory screen
  • Alert should be easily configurable based on severity threshold (warning, minor, major, critical), ML job, and node type (Host, Kubernetes)
  • Should be able to filter for a specific host
  • Should be able to preview anomaly alerts
@phillipb phillipb added Feature:Metrics UI Metrics UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v7.10.0 labels Aug 11, 2020
@phillipb phillipb added this to the Metrics UI 7.10 milestone Aug 11, 2020
@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@Zacqary Zacqary self-assigned this Aug 13, 2020
@Zacqary
Contributor

Zacqary commented Aug 25, 2020

For the condition editor, the screenshot indicates:

WHEN [hosts-cpu-usage]

Should it instead be:

WHEN [Hosts] [CPU Usage]

like the Inventory alert already supports?

@phillipb
Contributor Author

@Zacqary good question. The first line should be for HOSTS, followed by the WHEN clause, similar to Inventory alerts.

@Zacqary
Contributor

Zacqary commented Aug 25, 2020

Alert should be easily configurable based on severity threshold (warning, minor, major, critical), ML job

Are we using numbers, as in the screenshot, or Warning/Minor/Major/Critical? What's the desired input?

Also, I don't see a field for ML job. Or is that [severity score]? And is the ML job per condition, or the same ML job for all conditions (like the FOR THE LAST param of threshold alerts)?

@phillipb
Contributor Author

phillipb commented Aug 26, 2020

@Zacqary We're not using numbers, we're using the 4 severity levels.

To my knowledge, this alert shouldn't have multiple conditions.

The ML job part is the host-cpu-usage. We should just call it CPU though. No need to use the exact job name.
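
As a rough sketch of the "4 severity levels" input, the levels could be mapped to minimum anomaly scores and compared against each anomaly. The cutoff values below (3, 25, 50, 75) are assumed for illustration and are not confirmed anywhere in this thread; `SEVERITY_THRESHOLDS`, `meetsSeverity`, and `severityOf` are hypothetical names.

```typescript
// Hypothetical names and cutoff values, for illustration only.
type Severity = 'warning' | 'minor' | 'major' | 'critical';

// Assumed minimum anomaly score (0-100) for each severity level.
const SEVERITY_THRESHOLDS: Record<Severity, number> = {
  warning: 3,
  minor: 25,
  major: 50,
  critical: 75,
};

// True when an anomaly score reaches the configured severity level.
function meetsSeverity(score: number, threshold: Severity): boolean {
  return score >= SEVERITY_THRESHOLDS[threshold];
}

// Highest severity label a score qualifies for, or null if below "warning".
function severityOf(score: number): Severity | null {
  const order: Severity[] = ['critical', 'major', 'minor', 'warning'];
  for (const level of order) {
    if (meetsSeverity(score, level)) return level;
  }
  return null;
}
```

With a single condition per alert, the user would pick one `Severity` value and the executor would only need `meetsSeverity(score, selected)`.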

@sorantis

sorantis commented Oct 22, 2020

Is it possible to add anomaly scores as metrics to the Inventory alerts flyout?
E.g. for Hosts, added to the dropdown
[Screenshot: Screen Shot 2020-10-22 at 14 49 11]

@Zacqary
Contributor

Zacqary commented Jan 20, 2021

Apologies for not picking up on this earlier, but now that I've started implementing this alert I realize I'm not sure about its action messaging behaviors.

  • What do we send to the context variables? Anomaly score, timestamp of the anomaly, anything else?
  • I'm assuming it's supposed to detect anomalies that happened since its last execution, e.g. a "check every: 5 minutes" alert looks at anomalies that happened over the past 5 minutes? What happens if there were multiple anomalies? Do we send all of them to the context variables, and if so, how do we want to format that? Or do we just send the most recent one? Or the most severe anomaly? What if the most recent anomaly isn't the most severe anomaly?
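
To make the second question concrete, here is a minimal sketch of the two selection strategies over one check interval. The `Anomaly` shape and all function names are assumptions for illustration, not the alert's actual implementation.

```typescript
// Hypothetical record shape; field names are assumptions.
interface Anomaly {
  timestamp: number; // epoch milliseconds
  score: number; // anomaly score, 0-100
}

// Keep only anomalies inside the lookback window that ends at `now`.
function anomaliesInWindow(anomalies: Anomaly[], now: number, intervalMs: number): Anomaly[] {
  return anomalies.filter((a) => a.timestamp > now - intervalMs && a.timestamp <= now);
}

// Strategy A: report the latest anomaly in the window.
function mostRecent(anomalies: Anomaly[]): Anomaly | undefined {
  return anomalies.reduce<Anomaly | undefined>(
    (best, a) => (best === undefined || a.timestamp > best.timestamp ? a : best),
    undefined
  );
}

// Strategy B: report the highest-scoring anomaly in the window.
function mostSevere(anomalies: Anomaly[]): Anomaly | undefined {
  return anomalies.reduce<Anomaly | undefined>(
    (best, a) => (best === undefined || a.score > best.score ? a : best),
    undefined
  );
}
```

The two strategies diverge whenever a high-scoring anomaly is followed by a milder one within the same interval, which is exactly the case the last question asks about.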

I'm worried that the alerting plugin's hyper-customizable action messages are driving us to neglect thinking about action messaging as a product design decision. Since they're hidden beneath a clickthrough they feel like a second-class part of the UI and it's easy for both designer and engineer to forget about them until very late in the process like this.

@sorantis

sorantis commented Jan 21, 2021

@Zacqary I agree, we should be more thoughtful about details like this.
The context variables:

  • timestamp
  • anomaly score (severity)
  • metric (detector)
  • influencer
  • summary (e.g. 29x higher)
  • actual value
  • typical value
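
The list above could be typed roughly as follows. This is only a sketch: the property names, the `describeDeviation` helper, and the way the summary string is derived from the actual and typical values are all assumptions.

```typescript
// Hypothetical shape; property names are illustrative.
interface AnomalyAlertContext {
  timestamp: string; // time of the anomaly, e.g. ISO 8601
  anomalyScore: number; // severity score
  metric: string; // detector
  influencer: string;
  summary: string; // e.g. "29x higher"
  actual: number;
  typical: number;
}

// Assumed derivation of the summary; expects non-negative metric values.
function describeDeviation(actual: number, typical: number): string {
  if (actual === typical) return 'no change';
  if (actual === 0 || typical === 0) return actual > typical ? 'higher' : 'lower';
  const higher = actual > typical;
  const factor = Math.round(higher ? actual / typical : typical / actual);
  return `${factor}x ${higher ? 'higher' : 'lower'}`;
}

// Assemble the context object passed to action message templates.
function buildContext(
  timestamp: string,
  anomalyScore: number,
  metric: string,
  influencer: string,
  actual: number,
  typical: number
): AnomalyAlertContext {
  const summary = describeDeviation(actual, typical);
  return { timestamp, anomalyScore, metric, influencer, summary, actual, typical };
}
```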

Regarding the second question, @grabowskit, can you advise?

@Zacqary
Contributor

Zacqary commented Jan 21, 2021

Should be able to filter for a specific host

I'm not seeing an obvious way to do this using KQL queries the way we do them with other alerts. Should we just allow you to filter by influencer? i.e. limit the filter field to either host.hostname or kubernetes.pod.uid, depending on node type?

@Zacqary
Contributor

Zacqary commented Jan 21, 2021

Did some research, and given the limited number of influencer fields, I'm not sure if it makes sense to use the KQL search bar for this alert. I think it would be better to use a dropdown where you can select [kubernetes.node.name, kubernetes.pod.uid, kubernetes.namespace] for k8s, or [host.name] for hosts, and then enter a search string for the field value.

Using the KQL component would allow the user to easily enter filter queries that could never possibly return anything, and be a headache to parse on the backend since the filter JSON it outputs can't be slipped right into the anomalies query. We need to separate the field name and the query string, and the KQL outputs these in an inconsistent format depending on whether wildcards are present or not.
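
A minimal sketch of the dropdown-plus-search-string idea, keeping the field name and the query string separate from the start. The field lists come from the comment above; the type names, function names, and the choice of `*` and `?` as wildcard characters are assumptions.

```typescript
// Hypothetical names throughout; only the field lists come from this thread.
type NodeType = 'host' | 'k8s';

const INFLUENCER_FIELDS: Record<NodeType, string[]> = {
  host: ['host.name'],
  k8s: ['kubernetes.node.name', 'kubernetes.pod.uid', 'kubernetes.namespace'],
};

interface InfluencerFilter {
  fieldName: string;
  fieldValue: string;
  isWildcard: boolean;
}

// Assumes "*" and "?" are the wildcard characters a user may type.
function hasWildcard(value: string): boolean {
  return /[*?]/.test(value);
}

// Reject fields that can never match for the node type, then return
// the field name and search string as separate, consistent properties.
function buildInfluencerFilter(
  nodeType: NodeType,
  fieldName: string,
  fieldValue: string
): InfluencerFilter {
  if (INFLUENCER_FIELDS[nodeType].indexOf(fieldName) === -1) {
    throw new Error(`Field "${fieldName}" is not an influencer for node type "${nodeType}"`);
  }
  const value = fieldValue.trim();
  if (value.length === 0) {
    throw new Error('Filter value must not be empty');
  }
  return { fieldName, fieldValue: value, isWildcard: hasWildcard(value) };
}
```

Because the output shape is the same whether or not a wildcard is present, the backend can decide how to query from `isWildcard` alone instead of parsing two different filter formats.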

5 participants