Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make TaskManager health status error/warning reasons more visible #154045

Merged
merged 9 commits into from
Apr 18, 2023

Conversation

ersin-erdal
Copy link
Contributor

@ersin-erdal ersin-erdal commented Mar 30, 2023

Resolves: #152289

With this PR we make some of the debug logs warn and return the message to the health API as reason to add in the status summary message.

@ersin-erdal ersin-erdal added release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.8.0 labels Mar 30, 2023
@ersin-erdal ersin-erdal self-assigned this Mar 30, 2023
@@ -41,16 +41,30 @@ describe('logHealthMetrics', () => {
const { calculateHealthStatus } = jest.requireMock('./calculate_health_status');

// We must change from OK to Warning
(calculateHealthStatus as jest.Mock<HealthStatus>).mockImplementation(() => HealthStatus.OK);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These mocks were no-op, never stubbed the real function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like we changed anything though - except to change the return type to an object and not just the status. The parenthesized function invocations to mockImplementation() definitely look a little weird, but was there some kind of problem with them? They look like they could work.

But they do look confusing! :-). I wonder if there's a cleanup we can do here - in another issue/PR ...

@ersin-erdal ersin-erdal marked this pull request as ready for review March 31, 2023 10:33
@ersin-erdal ersin-erdal requested a review from a team as a code owner March 31, 2023 10:33
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@pmuellr
Copy link
Member

pmuellr commented Mar 31, 2023

\o/

Hacked a "bad reason" message into the code (because I said so!), saw this from /api/status:

{
  "status": {
    "...": "...",
    "plugins": {
      "...": "...",
      "taskManager": {
        "level": "degraded",
        "summary": "Task Manager is unhealthy - Reason: because I said so!"
      }
    }
  }
}

@ersin-erdal
Copy link
Contributor Author

\o/

Hacked a "bad reason" message into the code (because I said so!), saw this from /api/status:

{
  "status": {
    "...": "...",
    "plugins": {
      "...": "...",
      "taskManager": {
        "level": "degraded",
        "summary": "Task Manager is unhealthy - Reason: because I said so!"
      }
    }
  }
}

Sorry, I forgot to ping you Patrick.
I pushed a fix for this last week. Now it uses only the reasons we defined in the code.

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

`Health Status error threshold has been exceeded, resultFrequencySummary.Failed (${resultFrequencySummary.Failed}) is greater than error_threshold (${executionErrorThreshold.error_threshold})`
);
} else {
logger.debug(
logger.warn(
`Health Status warn threshold has been exceeded, resultFrequencySummary.Failed (${resultFrequencySummary.Failed}) is greater than warn_threshold (${executionErrorThreshold.warn_threshold})`
);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting that we don't return a negative HealthStatus here. But feels like it makes sense, since the task errors could very likely be caused by the user somehow, and not systemic (invalid parameters used in a rule / connector) - we shouldn't set the TM plugin status based on that. But I bet we used to :-)

@@ -41,16 +41,30 @@ describe('logHealthMetrics', () => {
const { calculateHealthStatus } = jest.requireMock('./calculate_health_status');

// We must change from OK to Warning
(calculateHealthStatus as jest.Mock<HealthStatus>).mockImplementation(() => HealthStatus.OK);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like we changed anything though - except to change the return type to an object and not just the status. The parenthesized function invocations to mockImplementation() definitely look a little weird, but was there some kind of problem with them? They look like they could work.

But they do look confusing! :-). I wonder if there's a cleanup we can do here - in another issue/PR ...

@@ -375,24 +375,24 @@ describe('Task Run Statistics', () => {
{ Success: 40, RetryScheduled: 40, Failed: 20, status: 'OK' },
]);

expect(logger.debug).toHaveBeenCalledTimes(5);
expect(logger.debug).toHaveBeenNthCalledWith(
expect(logger.warn).toHaveBeenCalledTimes(5);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm remembering now that these MIGHT have been warns before, but they were SDH generators so we changed them to debug. But even if so, seems fair to me!

@kibana-ci
Copy link
Collaborator

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

ESLint disabled line counts

id before after diff
securitySolution 432 435 +3

Total ESLint disabled count

id before after diff
securitySolution 512 515 +3

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ersin-erdal

@ersin-erdal ersin-erdal merged commit c76a68b into elastic:main Apr 18, 2023
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Apr 18, 2023
@ersin-erdal ersin-erdal deleted the 152289-tm-health-logs branch April 18, 2023 12:57
navarone-feekery pushed a commit to navarone-feekery/kibana that referenced this pull request Apr 18, 2023
…astic#154045)

Resolves: elastic#152289

With this PR we make some of the `debug` logs `warn` and return the
message to the health API as reason to add in the status summary
message.
saarikabhasi pushed a commit to saarikabhasi/kibana that referenced this pull request Apr 19, 2023
…astic#154045)

Resolves: elastic#152289

With this PR we make some of the `debug` logs `warn` and return the
message to the health API as reason to add in the status summary
message.
nikitaindik pushed a commit to nikitaindik/kibana that referenced this pull request Apr 25, 2023
…astic#154045)

Resolves: elastic#152289

With this PR we make some of the `debug` logs `warn` and return the
message to the health API as reason to add in the status summary
message.
pmuellr added a commit to pmuellr/kibana that referenced this pull request May 15, 2023
resolves elastic#156112

Change task manager logging on status errors from `warn` to `debug.
Making this change as we recently changed from `debug` to `warn`
in elastic#154045 .  But this ended
up too noisy, especially at Kibana startup.
pmuellr added a commit that referenced this pull request May 15, 2023
…157762)

resolves #156112

Change task manager logging on status errors from `warn` to `debug.
Making this change as we recently changed from `debug` to `warn` in
#154045 . But this ended up too
noisy, especially at Kibana startup.
kibanamachine pushed a commit to kibanamachine/kibana that referenced this pull request May 15, 2023
…lastic#157762)

resolves elastic#156112

Change task manager logging on status errors from `warn` to `debug.
Making this change as we recently changed from `debug` to `warn` in
elastic#154045 . But this ended up too
noisy, especially at Kibana startup.

(cherry picked from commit b542862)
kibanamachine added a commit that referenced this pull request May 15, 2023
…anges (#157762) (#157800)

# Backport

This will backport the following commits from `main` to `8.8`:
- [[ResponseOps][Task Manager] stop spamming the logs on status changes
(#157762)](#157762)

<!--- Backport version: 8.9.7 -->

### Questions ?
Please refer to the [Backport tool
documentation](https://github.com/sqren/backport)

<!--BACKPORT [{"author":{"name":"Patrick
Mueller","email":"patrick.mueller@elastic.co"},"sourceCommit":{"committedDate":"2023-05-15T20:21:16Z","message":"[ResponseOps][Task
Manager] stop spamming the logs on status changes (#157762)\n\nresolves
https://github.com/elastic/kibana/issues/156112\r\n\r\nChange task
manager logging on status errors from `warn` to `debug.\r\nMaking this
change as we recently changed from `debug` to `warn`
in\r\nhttps://github.com//pull/154045 . But this ended up
too\r\nnoisy, especially at Kibana
startup.","sha":"b542862904073982a00ecd7418cc77dbe567b2d0","branchLabelMapping":{"^v8.9.0$":"main","^v(\\d+).(\\d+).\\d+$":"$1.$2"}},"sourcePullRequest":{"labels":["release_note:skip","Feature:Task
Manager","Team:ResponseOps","v8.8.0","v8.9.0"],"number":157762,"url":"https://github.com/elastic/kibana/pull/157762","mergeCommit":{"message":"[ResponseOps][Task
Manager] stop spamming the logs on status changes (#157762)\n\nresolves
https://github.com/elastic/kibana/issues/156112\r\n\r\nChange task
manager logging on status errors from `warn` to `debug.\r\nMaking this
change as we recently changed from `debug` to `warn`
in\r\nhttps://github.com//pull/154045 . But this ended up
too\r\nnoisy, especially at Kibana
startup.","sha":"b542862904073982a00ecd7418cc77dbe567b2d0"}},"sourceBranch":"main","suggestedTargetBranches":["8.8"],"targetPullRequestStates":[{"branch":"8.8","label":"v8.8.0","labelRegex":"^v(\\d+).(\\d+).\\d+$","isSourceBranch":false,"state":"NOT_CREATED"},{"branch":"main","label":"v8.9.0","labelRegex":"^v8.9.0$","isSourceBranch":true,"state":"MERGED","url":"https://github.com/elastic/kibana/pull/157762","number":157762,"mergeCommit":{"message":"[ResponseOps][Task
Manager] stop spamming the logs on status changes (#157762)\n\nresolves
https://github.com/elastic/kibana/issues/156112\r\n\r\nChange task
manager logging on status errors from `warn` to `debug.\r\nMaking this
change as we recently changed from `debug` to `warn`
in\r\nhttps://github.com//pull/154045 . But this ended up
too\r\nnoisy, especially at Kibana
startup.","sha":"b542862904073982a00ecd7418cc77dbe567b2d0"}}]}]
BACKPORT-->

Co-authored-by: Patrick Mueller <patrick.mueller@elastic.co>
jasonrhodes pushed a commit that referenced this pull request May 17, 2023
…157762)

resolves #156112

Change task manager logging on status errors from `warn` to `debug.
Making this change as we recently changed from `debug` to `warn` in
#154045 . But this ended up too
noisy, especially at Kibana startup.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.8.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ResponseOps][Task Manager] when changing plugin status, indicate the reason in a logged message
5 participants