Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Task Manager] Investigate better handling of unregistered task types #122389

Closed
ymao1 opened this issue Jan 5, 2022 · 9 comments · Fixed by #123963
Closed

[Task Manager] Investigate better handling of unregistered task types #122389

ymao1 opened this issue Jan 5, 2022 · 9 comments · Fixed by #123963
Assignees
Labels
estimate:needs-research Estimated as too large and requires research to break down into workable issues Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@ymao1
Copy link
Contributor

ymao1 commented Jan 5, 2022

Currently it is possible for clustered Kibanas to get out of sync based on configuration and erroneously mark a task as “unrecognized”, causing it to never get claimed again. This can happen if a plugin is enabled in one Kibana node and disabled in another (like reporting). Currently, whichever Kibana is making its claim query will mark any task type that is not registered with it to be "unrecognized". This was done to explicitly identify task types that existed but were explicitly removed from Kibana. Any solution should ideally also allow actual deprecated task types to stop getting claimed.

Towards #119650

@ymao1 ymao1 added estimate:needs-research Estimated as too large and requires research to break down into workable issues Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jan 5, 2022
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@gmmorris
Copy link
Contributor

gmmorris commented Jan 10, 2022

It came up in planning that there is actually a potential bug here where disabling a task type can cause tasks of those type to become unrecognized which means they won't run anywhere in the system.
While this is an unsafe config we do not allow users to use, it can be used by Support, so we should address this.

That makes this a higher impact issue than I originally realised. 👍

@ymao1
Copy link
Contributor Author

ymao1 commented Jan 10, 2022

Upon further investigation, using xpack.task_manager.unsafe.exclude_task_types does work as advertised and will not put us in a position where alerting rules are marked as "unrecognized". I've updated the issue description and we can re-prioritize this issue lower.

@pmuellr
Copy link
Member

pmuellr commented Jan 12, 2022

Have we looked at using a terms query to filter out task types a particular TM doesn't know about? That seems like a simple answer, so there must be some reason we're not doing it :-)

One problem is that it won't be obvious why these tasks aren't running, when we look at task manager. But I think we'd be in a similar situation even if we used exclude_task_types, since we'd have to know the Kibana config of all the connected Kibanas, to understand why a particular task wasn't running.

@mikecote
Copy link
Contributor

I've added this issue to To-Do because it affects our friends in reporting (#120995) and the impact can be pretty high if ever an alerting rule ends up in such state (mismatched Kibana versions, temporarily disabled plugin, future ERs that may touch this, etc.)

@pmuellr
Copy link
Member

pmuellr commented Jan 24, 2022

Just added a reference to an SDH ^^^, with another interesting scenario.

APM has it's own config to allow telemetry to be enabled for APM, which is by default true, but can be set to false. When set to false, the APM telemetry task will not be registered, and so presumably it will become unregistered if it did previously get registered. Looking at the flow, it doesn't appear there's anyway the task will ever escape "unregistered" status, even if the APM config turns true again.

Relevant code starts here:

if (
plugins.taskManager &&
plugins.usageCollection &&
currentConfig.telemetryCollectionEnabled
) {
createApmTelemetry({
core,
config$,
usageCollector: plugins.usageCollection,
taskManager: plugins.taskManager,
logger: this.logger,
kibanaVersion: this.initContext.env.packageInfo.version,
});
}

Makes me wonder how many other task manager clients have similar "optional registration" capabilities. Seems like a problem area. Seems like you should never conditionally register tasks?

@ymao1 ymao1 self-assigned this Jan 26, 2022
@ymao1
Copy link
Contributor Author

ymao1 commented Jan 27, 2022

We have a few options:

Provide a way to explicitly de-register a task
Stop marking tasks as unrecognized due to lack of registration. It sounds like the problem the original "fix" was addressing was an internal dev-only problem when a plugin accidentally stopped registering a task type during development and then couldn't figure out why the task wasn't running when the task document existed. If we did this, we would need to add a migration to task manager to change any unrecognized tasks back to idle. We could (should?) also provide a way to explicitly de-register a task type. I believe core provides a way to de-register saved object types so we could see how they are doing it. (Unused SO types: https://github.com/elastic/kibana/blob/main/src/core/server/saved_objects/migrations/core/unused_types.ts). With explicit de-registration, we can update the claim query to only mark explicitly de-registered task types as unregistered.

Claim unregistered task types
Continue marking tasks as unregistered during the claiming process but allow task manager to query for unregistered tasks. This means if one Kibana has turned off reporting but another has it turned on, Kibana A might mark the task as unregistered but Kibana B will still retrieve the task in its query and claim it. This is probably the less desirable choice because it eats up task manager resources.

Curious if @elastic/response-ops-execution has any thoughts? I'm inclined to go for explicit de-registration.

@mikecote
Copy link
Contributor

I think the de-registration path makes sense to me too. That way, we can support cases like reporting where only dedicated Kibanas would claim tasks. Based on the original issue #84088, we should make sure we're not bringing back those problems as we change functionality 👍

@mikecote
Copy link
Contributor

In the future, if we need to go a step further, we could put a flag on the tasks that are sometimes unrecognized to help us know which tasks don't get claimed by Kibana when looking at the documents. We could keep the flagging process efficient by making Kibanas claim tasks where unrecognized === false || taskTypes in (known task types). I don't feel we need to go that far at this time.

@kobelb kobelb removed the Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) label Jan 31, 2022
@botelastic botelastic bot added the needs-team Issues missing a team label label Jan 31, 2022
@kobelb kobelb added the Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
estimate:needs-research Estimated as too large and requires research to break down into workable issues Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

6 participants