
Hook up discovery service to Task Manager health #194113

Conversation

@mikecote mikecote commented Sep 26, 2024

Resolves #192568

In this PR, I'm solving the issue where the task manager health API is unable to determine how many Kibana nodes are running. I'm doing so by leveraging the Kibana discovery service to get a count, instead of deriving it from an aggregation on the `.kibana_task_manager` index that counts the unique number of `ownerId` values. The aggregation approach requires tasks to be running and sufficiently distributed across the Kibana nodes to determine the number properly.

Note: This will only work when `mget` is the task claim strategy.
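The contrast between the two counting strategies can be sketched as follows. This is a minimal illustration only; names like `DiscoveredNode` and `ACTIVE_NODE_TIMEOUT_MS` are hypothetical and not the actual Kibana Task Manager implementation.

```typescript
// Hypothetical sketch: counting Kibana nodes from discovery-service
// heartbeat documents vs. from distinct ownerId values on claimed tasks.

interface DiscoveredNode {
  id: string;
  lastSeen: string; // ISO timestamp written by each node's heartbeat
}

// Assumed liveness window; the real timeout is configuration-dependent.
const ACTIVE_NODE_TIMEOUT_MS = 30_000;

// New approach: count nodes that have checked in recently.
function countActiveNodes(nodes: DiscoveredNode[], now: Date): number {
  return nodes.filter(
    (n) => now.getTime() - Date.parse(n.lastSeen) <= ACTIVE_NODE_TIMEOUT_MS
  ).length;
}

// Old approach (approximate): count distinct ownerId values on claimed
// tasks, which undercounts when some nodes currently hold no tasks.
function countDistinctOwnerIds(taskOwnerIds: Array<string | null>): number {
  return new Set(taskOwnerIds.filter((id): id is string => id !== null)).size;
}

const now = new Date('2024-09-26T12:00:00Z');
const nodes: DiscoveredNode[] = [
  { id: 'kibana-1', lastSeen: '2024-09-26T11:59:50Z' },
  { id: 'kibana-2', lastSeen: '2024-09-26T11:59:55Z' },
  { id: 'kibana-3', lastSeen: '2024-09-26T11:50:00Z' }, // stale node
];
console.log(countActiveNodes(nodes, now)); // 2
console.log(countDistinctOwnerIds(['kibana-1', 'kibana-1', null])); // 1
```

With three discovery documents but only one node actively claiming tasks, the heartbeat count still reports the live nodes, while the `ownerId` count would report one.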

To verify

1. Set `xpack.task_manager.claim_strategy: mget` in kibana.yml
2. Start the PR locally with Elasticsearch and Kibana running
3. Navigate to the `/api/task_manager/_health` route and confirm `observed_kibana_instances` is `1`
4. Apply the following code and restart Kibana

```
diff --git a/x-pack/plugins/task_manager/server/kibana_discovery_service/kibana_discovery_service.ts b/x-pack/plugins/task_manager/server/kibana_discovery_service/kibana_discovery_service.ts
index 090847032bf..69dfb6d1b36 100644
--- a/x-pack/plugins/task_manager/server/kibana_discovery_service/kibana_discovery_service.ts
+++ b/x-pack/plugins/task_manager/server/kibana_discovery_service/kibana_discovery_service.ts
@@ -59,6 +59,7 @@ export class KibanaDiscoveryService {
     const lastSeen = lastSeenDate.toISOString();
     try {
       await this.upsertCurrentNode({ id: this.currentNode, lastSeen });
+      await this.upsertCurrentNode({ id: `${this.currentNode}-2`, lastSeen });
       if (!this.started) {
         this.logger.info('Kibana Discovery Service has been started');
         this.started = true;
```

5. Navigate to the `/api/task_manager/_health` route and confirm `observed_kibana_instances` is `2`
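For steps 3 and 5 above, a small helper can pull `observed_kibana_instances` out of the health response body. The nesting shown (`stats.capacity_estimation.value.observed`) is an assumption about the report shape for illustration, not a verified schema:

```typescript
// Illustrative reader for the Task Manager health report. The interface
// below models only the one field we care about; the real response
// contains many more sections.

interface HealthResponse {
  stats?: {
    capacity_estimation?: {
      value?: { observed?: { observed_kibana_instances?: number } };
    };
  };
}

// Returns the observed instance count, or undefined if the section is
// absent (e.g. when a different claim strategy is configured).
function getObservedInstances(body: HealthResponse): number | undefined {
  return body.stats?.capacity_estimation?.value?.observed
    ?.observed_kibana_instances;
}

const sample: HealthResponse = {
  stats: {
    capacity_estimation: {
      value: { observed: { observed_kibana_instances: 2 } },
    },
  },
};
console.log(getObservedInstances(sample)); // 2
```

Optional chaining keeps the helper safe against partial responses, which matters here because the field is only populated under the `mget` claim strategy.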

@mikecote mikecote added labels: release_note:skip, Feature:Task Manager, Team:ResponseOps, v9.0.0, backport:prev-minor, v8.16.0 — Sep 26, 2024
@mikecote mikecote self-assigned this Sep 26, 2024
@mikecote mikecote marked this pull request as ready for review September 26, 2024 11:44
@mikecote mikecote requested a review from a team as a code owner September 26, 2024 11:44
@elasticmachine
Pinging @elastic/response-ops (Team:ResponseOps)

@mikecote
@elasticmachine merge upstream

@ymao1 ymao1 added the ci:cloud-deploy Create or update a Cloud deployment label Sep 26, 2024
@mikecote
PR that will deploy to Cloud: #194289

@ymao1 ymao1 left a comment
LGTM. Verified it works as expected using a cloud deployment.

```
@@ -237,7 +237,6 @@ export default function ({ getService }: FtrProviderContext) {
   expect(typeof workload.overdue).to.eql('number');

   expect(typeof workload.non_recurring).to.eql('number');
-  expect(typeof workload.owner_ids).to.eql('number');
```
@ymao1:
this should still be a valid assertion right? we're not removing it from the health report?

@mikecote:
The owner_ids value got removed from the health report when I removed the aggregation in x-pack/plugins/task_manager/server/monitoring/workload_statistics.ts. I figured it was no longer worth it given it always returns 0 as a value. I think that's ok?

@ymao1:
Ah gotcha 👍

@mikecote
mikecote commented Oct 1, 2024

@elasticmachine merge upstream

@mikecote
mikecote commented Oct 1, 2024

@elasticmachine merge upstream

@kibana-ci
kibana-ci commented Oct 1, 2024

💚 Build Succeeded

Metrics [docs]: ✅ unchanged
To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @mikecote

@mikecote mikecote merged commit d0d2032 into elastic:main Oct 2, 2024
38 checks passed
@kibanamachine
Starting backport for target branches: 8.x

https://github.com/elastic/kibana/actions/runs/11142828625

kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request Oct 2, 2024
(cherry picked from commit d0d2032)
@kibanamachine
💚 All backports created successfully

Target branch: 8.x

Note: Successful backport PRs will be merged automatically after passing CI.

Questions? Please refer to the Backport tool documentation.

kibanamachine added a commit that referenced this pull request Oct 2, 2024
…4685)

# Backport

This will backport the following commits from `main` to `8.x`:
- [Hook up discovery service to Task Manager health
(#194113)](#194113)

<!--- Backport version: 9.4.3 -->

### Questions?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)


Co-authored-by: Mike Côté <mikecote@users.noreply.github.com>
Labels

- backport:prev-minor — Backport to (8.x) the previous minor version (i.e. one version back from main)
- ci:cloud-deploy — Create or update a Cloud deployment
- Feature:Task Manager
- release_note:skip — Skip the PR/issue when compiling release notes
- Team:ResponseOps — Label for the ResponseOps team (formerly the Cases and Alerting teams)
- v8.16.0
- v9.0.0
Projects: none yet
Development

Successfully merging this pull request may close these issues.

Kibana Task Manager capacity estimation doesn't observe the right amount of Kibana nodes anymore
5 participants