[8.0 only] Should the alerting services plugin always be enabled? #90934

Closed
mikecote opened this issue Feb 10, 2021 · 28 comments · Fixed by #113461
Labels: Breaking Change; core services (Issues related to enabling features across Kibana to leverage core services across domains); discuss; estimate:medium (Medium Estimated Level of Effort); Feature:Actions; Feature:Alerting; Feature:EventLog; Feature:Task Manager; Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@mikecote
Contributor

There don't seem to be use cases where users of Kibana would want to disable any of our plugins. If we can't find a valid reason, it feels like we should prevent our plugins from being disabled by deprecating the setting (ex: xpack.actions.enabled) in 7.x and removing the capability in 8.0. This will make things much more straightforward and allow other plugins to make the alerting-related plugins a required dependency.

@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote mikecote changed the title Should the alerting services plugin be allowed to be disabled? Should the alerting services plugin always be enabled? Feb 10, 2021
@mikecote
Contributor Author

cc @kobelb

@pmuellr
Member

pmuellr commented Feb 10, 2021

If there's still some requirement that customers be able to run Kibana without alerts "enabled", we could implement a "soft-disable" config key, that does not run alerts at all - perhaps disable CRUD operations or something as well, maybe still let you read though.

Seems preferable to making other plugins treat alerting as an optional plugin, with all that entails.

You can already "soft-disable" actions via the enabledActionTypes config key - you can just set it to an empty array to disable all the action types.

@Crazybus

In the meantime, would it make sense to update the wording in the documentation to make it clearer what this setting does?

https://www.elastic.co/guide/en/kibana/current/alert-action-settings-kb.html#general-alert-action-settings

xpack.actions.enabled | Feature toggle that enables Actions in Kibana. Defaults to true.

The current wording makes it sound like this setting is a feature toggle that only disables Kibana alerting, when in reality it disables the plugin and any other Kibana applications that depend on it.

@mikecote mikecote changed the title Should the alerting services plugin always be enabled? [8.0 only] Should the alerting services plugin always be enabled? Jun 15, 2021
@gmmorris gmmorris added the loe:large Large Level of Effort label Jul 6, 2021
@mikecote
Contributor Author

Based on #89584, I could not find explicit requirements to have our plugins disable-able today so I'm moving this discussion in the direction of removing .enabled in all alerting plugins (event log, task manager, alerting, actions).

Having dedicated Kibanas for alerting should be researched to ensure it addresses the requirements while thinking about concerns like having 0 Kibanas for alerting (by accident), situations where we'd be unable to schedule tasks / create rules in certain Kibanas, etc.

@pmuellr
Member

pmuellr commented Aug 10, 2021

Having dedicated Kibanas for alerting should be researched to ensure it addresses the requirements while thinking about concerns like having 0 Kibanas for alerting (by accident), situations where we'd be unable to schedule tasks / create rules in certain Kibanas, etc.

I think "Kibanas that only run alerts" and "Kibanas that don't run alerts" kind of story is probably independent of whether the base plugins should be enabled. You'd probably still want the alerting UIs to work in "Kibanas that don't run alerts", you just don't run the tasks on those Kibanas. But of course, we're nowhere near figuring out how this would really work, so ... just my 2¢.

@mikecote
Contributor Author

I think "Kibanas that only run alerts" and "Kibanas that don't run alerts" kind of story is probably independent of whether the base plugins should be enabled. You'd probably still want the alerting UIs to work in "Kibanas that don't run alerts", you just don't run the tasks on those Kibanas. But of course, we're nowhere near figuring out how this would really work, so ... just my 2¢.

++ I agree with everything mentioned here.

@chrisronline
Contributor

To clarify here, it sounds like the plan is to:

  • Deprecate the ability to disable these plugins in 7.x
  • Remove the .enabled config for these plugins in 8.0

Are we still discussing this, or are we happy with the above approach?

@mikecote
Contributor Author

Are we still discussing this, or are we happy with the above approach?

I'm 👍 to say we're happy with the above approach. The reason we originally added support for .enabled was to comply with the standard of being able to disable any plugin in Kibana. Now that the narrative has changed, I'm thinking we follow along, as there weren't prior requirements for this.

To clarify here, it sounds like the plan is to:

  • Deprecate the ability to disable these plugins in 7.x
  • Remove the .enabled config for these plugins in 8.0

Yup that's the correct plan 👍

@gmmorris gmmorris added the core services Issues related to enabling features across Kibana to leverage core services across domains label Aug 16, 2021
@chrisronline chrisronline self-assigned this Aug 17, 2021
@gmmorris gmmorris added the estimate:medium Medium Estimated Level of Effort label Aug 18, 2021
@gmmorris gmmorris removed the loe:large Large Level of Effort label Sep 2, 2021
@mikecote
Contributor Author

mikecote commented Sep 8, 2021

@kobelb @stacey-gammon @lukeelmers since deprecating the .enabled flags in our plugins (#108281), we've started hearing use-cases that our customers are using these flags for.

The most recent one is about preventing certain Kibana instances from running the alerting or actions plugin(s) because they don't want them running actions or alerting tasks. We don't support dedicated alerting instances at this time because it comes with side effects: some plugins (like Observability) become completely disabled, and the non-alerting instances cannot enqueue rule or action tasks for other Kibanas to pick up.

It seems removing the .enabled flag will cause some friction for some users who relied on such a flag to set up their deployment, but it's also something we don't recommend / support. We are transitioning towards an internal xpack.task_manager.internal.exclude_task_types flag (#111036) that allows us to temporarily disable certain task types to debug Kibana, but we feel it will also become the new norm and prevent us from keeping it internal / removing such configuration at a future time.

Question: With the context above, this will be a breaking change and cause some friction. Is your recommendation still to move forward with the removal of .enabled, and to create an ER for ourselves to officially support dedicated alerting Kibanas (which some users may be waiting on before upgrading)?

@kobelb
Contributor

kobelb commented Sep 8, 2021

Question: With the context above, this will be a breaking change and cause some friction. Is your recommendation still to move forward with the removal of .enabled, and to create an ER for ourselves to officially support dedicated alerting Kibanas (which some users may be waiting on before upgrading)?

Based on all of the information in this thread, it's my understanding that we do see benefit in preventing a specific Kibana node from executing alerts; however, we don't want to make all consumers of the alerting framework deal with the complexities of the plugin being entirely disabled. If this is the case, I think it'd make sense for us to remove the xpack.alerting.enabled and xpack.actions.enabled settings that disable the plugins entirely and instead add an xpack.alerting.rule_execution.enabled flag that only prevents the task manager from executing the alerting rules themselves.

Is this feasible?

@chrisronline
Contributor

@mikecote Does it feel possible that users might want to keep rule execution but disable action execution? A config like xpack.task_manager.internal.exclude_task_types gives us this flexibility out of the box - if we went down the route of xpack.alerting.rule_execution.enabled, we'd have to potentially duplicate this config across, at least, the actions plugin too (which is doable too)

@mikecote
Contributor Author

mikecote commented Sep 9, 2021

Thanks @kobelb for input! After chatting with @chrisronline offline (chrisroffline?) we're still not sure how a customer could manage to make Kibana alerting work on dedicated instances. So we've decided to pursue the path of having an internal xpack.task_manager.internal.exclude_task_types configuration for now (for ourselves, debugging purposes, etc) and take space/time to properly develop xpack.alerting.rule_execution.enabled and xpack.actions.action_execution.enabled in a way we would be comfortable to support when it becomes a priority.

@chrisronline
Contributor

@YulNaumenko @ymao1 @pmuellr Does anyone else have a strong opinion about the above direction?

@pmuellr
Member

pmuellr commented Sep 9, 2021

General direction seems fine to me.

w/r/t xpack.task_manager.internal.exclude_task_types - we already have xpack.actions.enabledActionTypes:

A list of action types that are enabled. It defaults to [*], enabling all types. The names for built-in Kibana action types are prefixed with a . and include: .server-log, .slack, .email, .index, .pagerduty, and .webhook. An empty list [] will disable all action types.

Disabled action types will not appear as an option when creating new connectors, but existing connectors and actions of that type will remain in Kibana and will not function.
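
A minimal sketch of the allowlist semantics quoted above, assuming the list is checked as a plain allowlist with a "*" wildcard entry. The helper name isActionTypeEnabled is made up for illustration and is not Kibana's actual implementation:

```typescript
// Illustrative only: isActionTypeEnabled is a hypothetical helper, not a Kibana API.
function isActionTypeEnabled(enabledActionTypes: string[], actionTypeId: string): boolean {
  // "*" (the documented default) enables every action type.
  if (enabledActionTypes.indexOf("*") !== -1) {
    return true;
  }
  // Otherwise only listed types are enabled, so an empty list disables them all.
  return enabledActionTypes.indexOf(actionTypeId) !== -1;
}

console.log(isActionTypeEnabled(["*"], ".email"));      // true
console.log(isActionTypeEnabled([".slack"], ".email")); // false
console.log(isActionTypeEnabled([], ".slack"));         // false
```

Under this reading, an empty list acts as the "soft-disable": the plugin stays loaded, but no action type passes the check.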

Would it make sense to have a similar setting for rule types? Customers could then essentially disable running rules by setting the value to [].

Then we'd be left with non-alerting tasks, and whether we'd want to disable those as well. Presumably task manager could have a similar setting (or does it already?).

I'm also wondering about our alerting tasks that aren't rule / connector execution - api key invalidation, etc - I'm assuming these won't have issues if "alerting is disabled", but not sure.

Also wondering what happens to rules already scheduled, when "alerting is disabled". Presumably the task docs still exist, the rules are still executed from a task manager POV, but then the rule executor would basically just return immediately. Could log a warning, maybe, since presumably the customer should go ahead and disable these.

@gmmorris
Contributor

gmmorris commented Sep 9, 2021

There's a potential unintended consequence of being able to disable specific task types on a Kibana instance - it can break features like "Run Now".

For example:
Say a customer has two Kibanas:

  1. Kibana A is configured to run all tasks
  2. Kibana B is configured to disable all alerting task types

A user uses Stack Management to update an existing rule from interval:1h to interval:5m.
When they click "save" the API call hits Kibana B, which updates the rule SO and then calls runNow on the rule in order to force the task to pull a fresh rule configuration from the SO. The runNow fails because that task type is disabled in Kibana B, but I think this failure is silent (limited to Server Log) as runNow is async.
The user is given the feedback that the update was successful, and the UX reflects back that the interval is now 5m because the rule SO was updated. Sadly, the task was not, and the rule doesn't run until the original schedule is reached, which might be a whole hour away.

In general, I worry this config could be naively used to create "alerting only Kibana" instances and this will likely have unintended consequences.
If we add this config it has to be marked experimental and unsafe somehow.

Note: it might actually fail the update, but that still sucks, as the user randomly fails to update the rule 50% of the time.
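
The two-Kibana scenario above can be modelled as a small simulation. This is a sketch under stated assumptions: every type, function, and task type name here (KibanaInstance, updateRuleInterval, "alerting:example") is hypothetical, the shared store stands in for Elasticsearch, and the silent runNow failure follows the "I think" in the comment rather than confirmed behaviour:

```typescript
// Hypothetical model of the scenario; none of these names are Kibana APIs.
interface RuleSavedObject { id: string; interval: string; }
interface TaskDoc { ruleId: string; taskType: string; interval: string; }

// State shared by every Kibana instance (stands in for Elasticsearch).
interface SharedStore { rule: RuleSavedObject; task: TaskDoc; }

interface KibanaInstance {
  label: string;
  excludedTaskTypes: string[];
  serverLog: string[];
}

// runNow makes the task pick up the fresh rule config, if this instance may run it.
function runNow(instance: KibanaInstance, store: SharedStore): void {
  if (instance.excludedTaskTypes.indexOf(store.task.taskType) !== -1) {
    // Assumed behaviour: the failure only reaches the server log.
    instance.serverLog.push("runNow failed: " + store.task.taskType + " is disabled on " + instance.label);
    return;
  }
  store.task.interval = store.rule.interval;
}

// The update API: write the rule SO, then trigger runNow without surfacing its result.
function updateRuleInterval(instance: KibanaInstance, store: SharedStore, interval: string): string {
  store.rule.interval = interval;
  runNow(instance, store);
  return "success"; // what the user sees either way
}

// Runs the 1h -> 5m update through the given instance and returns the resulting state.
function simulateUpdate(instance: KibanaInstance): SharedStore {
  const store: SharedStore = {
    rule: { id: "rule-1", interval: "1h" },
    task: { ruleId: "rule-1", taskType: "alerting:example", interval: "1h" },
  };
  updateRuleInterval(instance, store, "5m");
  return store;
}

const kibanaA: KibanaInstance = { label: "Kibana A", excludedTaskTypes: [], serverLog: [] };
const kibanaB: KibanaInstance = { label: "Kibana B", excludedTaskTypes: ["alerting:example"], serverLog: [] };

console.log(simulateUpdate(kibanaA).task.interval); // 5m: rule SO and task agree
console.log(simulateUpdate(kibanaB).task.interval); // 1h: rule SO says 5m, task still on the old schedule
console.log(kibanaB.serverLog.length);              // 1: the only trace of the failure
```

Which instance handles the request depends on the load balancer, which is why the outcome looks random from the user's side.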

@pmuellr
Member

pmuellr commented Sep 9, 2021

Ya, it's complicated. RFC?

Another thing I happened to think of, is that whatever we come up with here to "disable alerting", could also potentially be used to prevent the "rando Kibana with different encryption key screws up all the ESOs". Like, you should really opt-in to alerting (or task manager) via some pre-arranged "id", so that rando Kibanas wouldn't actually upset alerting / task manager the way they do today. Not saying it HAS to, but it would be nice to solve that issue as well, and whatever we do in general regarding this issue might help :-)

@chrisronline
Contributor

I'd like to revisit the problem statement as I understand it.

Problem

A user is experiencing a performance issue in Kibana and isn't sure where it's coming from. They'd like to disable various features of Kibana to isolate the issue. Because we deprecated the ability to disable the various plugins alerting owns (with the goal to remove this functionality in 8.x), users are unable to determine if the performance issue stems from background work or something else.

To ensure users still have a way to isolate performance problems, we need to expose a config (that is not a feature, not supported, and inherently unsafe to use for more than a short period of time while debugging) that allows them to stop all/some background tasks while they are debugging the performance issue. This config should exist at the task manager level so they have better control over what they want to disable (everything, just actions, just rules, some other background task, etc)

Concerns

A concern is if users start using this config to enable different use cases (such as a dedicated Kibana to execute rules/actions). This is not the intended use case and therefore the config will explicitly state that it is "not safe" to use outside of debugging/troubleshooting. I know there are varying opinions about how much control we should give users to "shoot themselves in the foot" but I feel we can counteract that in two ways:

  • Be very explicit that this config is unsafe/unstable
  • Ensure this config is "supportable", meaning it will not take long to solve a support case where the root cause is this config (logging, additional field in the TM health API, etc)

I'm fairly sure our current solution solves this well and I don't think we need to worry about edge cases, as we're not building a supported feature.

Thoughts?

@kobelb
Contributor

kobelb commented Sep 9, 2021

I had a quick call with @chrisronline and @mikecote on Zoom, and I'll summarize my conclusions. I think that all of the proposed options are tolerable:

  1. Add a xpack.task_manager.unsafe.exclude_task_types setting
  2. Don't add new settings and tell users to configure xpack.task_manager.poll_interval: 12345678901234567890 and xpack.task_manager.max_workers: 1
  3. Add xpack.alerting.rule_execution.enabled and xpack.actions.action_execution.enabled.

I will support whatever decision the team makes.

@mikecote
Contributor Author

mikecote commented Sep 9, 2021

We will be going with Option 1 as @kobelb mentioned above. The reason we are working on such a flag is what @chrisronline described here: #90934 (comment). The other discussion items provide use cases if this was an official feature, but we don't have time or scope to develop an official feature for 7.16 as we remove the ability to make our plugins disableable. To mitigate the concerns, we've documented in the Task Manager's README that "This configuration is experimental, unsupported and can only be used for temporary debugging purposes as it will cause Kibana to behave in unexpected ways.", which seems to capture the concerns mentioned above. A warning will also appear when starting Kibana.

we already have xpack.actions.enabledActionTypes

This configuration would allow Task Manager to claim tasks and fail them during executions (due to being disabled). It would be preferable not to have any part of the task run.

I'm also wondering about our alerting tasks that aren't rule / connector execution - api key invalidation, etc - I'm assuming these won't have issues if "alerting is disabled", but not sure.

The xpack.task_manager.unsafe.exclude_task_types route can handle this by setting the relevant task types as values.

Also wondering what happens to rules already scheduled, when "alerting is disabled"

In the xpack.task_manager.unsafe.exclude_task_types use case, Kibana will not be claiming as many or any tasks, so no claiming mutations will happen to those task documents, and they would remain untouched from a running perspective (Kibana can still schedule tasks, etc).
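
A rough sketch of the "never claimed, never mutated" behaviour described here. The names (TaskDocument, claimAvailableTasks) and the task type strings are hypothetical, and treating each excluded entry as a pattern where "*" matches any characters is an assumption made for this example, not something stated in the thread:

```typescript
// Hypothetical model of task claiming; not Task Manager's real code.
interface TaskDocument { id: string; taskType: string; taskStatus: "idle" | "claiming"; }

// Assumption: "*" in an excluded entry matches any run of characters.
function patternToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

function isExcluded(taskType: string, excludePatterns: string[]): boolean {
  return excludePatterns.some((pattern) => patternToRegExp(pattern).test(taskType));
}

// One claim cycle: only idle, non-excluded task documents are touched.
function claimAvailableTasks(tasks: TaskDocument[], excludePatterns: string[]): TaskDocument[] {
  const claimed: TaskDocument[] = [];
  for (const task of tasks) {
    if (task.taskStatus === "idle" && !isExcluded(task.taskType, excludePatterns)) {
      task.taskStatus = "claiming"; // the only mutation, and only for claimed docs
      claimed.push(task);
    }
  }
  return claimed;
}

const exampleTasks: TaskDocument[] = [
  { id: "t1", taskType: "alerting:example-rule", taskStatus: "idle" },
  { id: "t2", taskType: "actions:.email", taskStatus: "idle" },
  { id: "t3", taskType: "cleanup:example", taskStatus: "idle" },
];

const claimedTasks = claimAvailableTasks(exampleTasks, ["alerting:*", "actions:*"]);
console.log(claimedTasks.map((task) => task.id));            // only t3 is claimed
console.log(exampleTasks.map((task) => task.taskStatus));    // t1 and t2 remain "idle"
```

This contrasts with the enabledActionTypes route, where a task would be claimed first and then fail during execution.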

it can break features like "Run Now".

We've aimed our README documentation to mention that it will make Kibana behave in unexpected ways ("This configuration is experimental, unsupported and can only be used for temporary debugging purposes as it will cause Kibana to behave in unexpected ways."). So we are covered under that use case. We're also refraining from officially documenting this setting so we can remove it at a future time when we have a better idea of how to officially support dedicated Kibanas to run alerting and address these types of problems.

@leandrojmp

Hello, did this go anywhere?

So we are covered under that use case. We're also refraining from officially documenting this setting so we can remove it at a future time when we have a better idea of how to officially support dedicated Kibanas to run alerting and address these types of problems.

We are having some performance issues related to alerts and would like to have dedicated Kibana instances to run the alert tasks because it is impacting the overall usage of Kibana.

I couldn't find anything in the documentation about this, so I'm assuming that this has not moved forward.

Is this still on the roadmap?

@shanisagiv1

After discussing this internally recently, unfortunately, there is no plan to support that in the near future.

@pmuellr
Member

pmuellr commented Jul 23, 2024

We are having some performance issues related to alerts and would like to have dedicated Kibana instances to run the alert tasks because it is impacting the overall usage of Kibana.

I couldn't find anything in the documentation about this, so I'm assuming that this has not moved forward.

Sounds like the node.roles configuration may work for you: https://www.elastic.co/guide/en/kibana/current/settings.html

The default setting is * which means Kibana operates in both ui and background_tasks roles. But you can also run a set of Kibanas where they only have one role. Anything scheduled by task manager (like alerting rules) will run in the Kibanas with a background_tasks role, and HTTP requests are only served by Kibanas with a ui role.
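
As a sketch of the role behaviour just described, assuming only the two roles named in this comment. The functions below (resolveNodeRoles, servesHttpRequests, runsScheduledTasks) are hypothetical and only model the description; they are not Kibana's implementation:

```typescript
// Hypothetical model of node.roles as described above; not Kibana's code.
type NodeRole = "ui" | "background_tasks";

// "*" (the default) expands to both roles; anything else must be a known role.
function resolveNodeRoles(configured: string[]): NodeRole[] {
  if (configured.indexOf("*") !== -1) {
    return ["ui", "background_tasks"];
  }
  const resolved: NodeRole[] = [];
  for (const entry of configured) {
    if (entry === "ui") {
      resolved.push("ui");
    } else if (entry === "background_tasks") {
      resolved.push("background_tasks");
    } else {
      throw new Error("unknown role: " + entry);
    }
  }
  return resolved;
}

// HTTP requests are served only by nodes with the ui role.
function servesHttpRequests(roles: NodeRole[]): boolean {
  return roles.indexOf("ui") !== -1;
}

// Task manager work (such as alerting rules) runs only on background_tasks nodes.
function runsScheduledTasks(roles: NodeRole[]): boolean {
  return roles.indexOf("background_tasks") !== -1;
}

const defaultNode = resolveNodeRoles(["*"]);
const uiOnlyNode = resolveNodeRoles(["ui"]);
const backgroundOnlyNode = resolveNodeRoles(["background_tasks"]);

console.log(servesHttpRequests(defaultNode), runsScheduledTasks(defaultNode));               // true true
console.log(servesHttpRequests(uiOnlyNode), runsScheduledTasks(uiOnlyNode));                 // true false
console.log(servesHttpRequests(backgroundOnlyNode), runsScheduledTasks(backgroundOnlyNode)); // false true
```

A deployment split this way would need at least one node of each kind, which echoes the earlier concern in this thread about accidentally ending up with zero Kibanas running alerting.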

@leandrojmp

Hello @pmuellr,

I'm not sure this would work. I couldn't find a description of which tasks Kibana executes with the ui role and which with the background_tasks role, so I'm not sure this is documented.

But according to support it is not possible to have Kibana instances dedicated to Alerts only; even if I add more instances and do not put them behind our current Kibana LB, they would still run all tasks.

Support also said that there are already some enhancement requests about this and they opened another one with the number #22253.

We had a call with some Elastic Engineers today about other stuff and we also mentioned this, being able to uncouple the alerting functions from the other kibana functions.

@pmuellr
Member

pmuellr commented Jul 24, 2024

But according to support it is not possible to have Kibana instances dedicated to Alerts only

True, but only because there are other tasks (clean up, telemetry, etc) that run in the background_tasks nodes, besides alerting rules (and connector executions for alert notifications). Some of those can be expensive as well, so actually good to have all background tasks separate from "ui" processing.

I'm not sure this would work ...

But according to support it is not possible to have Kibana instances dedicated to Alerts only; even if I add more instances and do not put them behind our current Kibana LB, they would still run all tasks.

With the default config, for on-prem, true. For ESS, you'll start getting background_tasks and ui nodes generated once you go past 8GB in stateful. For serverless, there are always separate background_tasks and ui nodes, which are autoscaled separately.

@pmuellr
Member

pmuellr commented Jul 24, 2024

And ya, sorry, we don't have much in the way of docs on this. I've opened issue "add doc for node.roles" (#189116) to track ...

@leandrojmp

As noted in ER "Dedicated instances for Kibana rules" (#22253), this is already in use in stateful and serverless in ESS.

This issue is private, but would this also be the case for on-prem deployments? Because the answer I got from support is that this is not possible.

Not sure now if I should reopen the case linking this thread to get more information.

@pmuellr
Member

pmuellr commented Jul 24, 2024

Yes, this works for on-prem as well.

If you already had a support case open on this topic, I'd suggest re-opening it or opening a new one.

Sorry about the inaccessible link. It doesn't really say much more than my comment on ESS / serverless above ... (figured I'd just duplicate the info here).

9 participants