
[Alerting][Security] Rules fail due to a security exception: missing authentication credentials for REST request #118520

Open
gmmorris opened this issue Nov 15, 2021 · 15 comments
Labels
bug · estimate:needs-research · Feature:Alerting/RuleTypes · Feature:Stack Monitoring · Team:Monitoring · Team:ResponseOps · Team: SecuritySolution

Comments

@gmmorris
Contributor

Kibana version: 7.15.0

Looking at Kibana server logs on Cloud, I've noticed a high rate of security errors causing many of our rule types to fail.

Specifically:

Executing Alert default:.es-query:{uuid} has resulted in Error: security_exception: [security_exception] Reason: missing authentication credentials for REST request [/_security/user/_has_privileges], caused by: ""

...appears a lot and accounts for around 200 rule execution failures per minute.

Interestingly, this seems to happen predominantly to the following Rule Types:

  1. monitoring_shard_size
  2. .es-query
  3. siem.signals

So this is likely not something happening at the platform level, but rather something specific to the implementation of these three rule types.

@gmmorris gmmorris added the bug, Team:Security, Team:Infra Monitoring UI - DEPRECATED, Team:ResponseOps, Team: SecuritySolution, Feature:Alerting/RuleTypes, and estimate:needs-research labels on Nov 15, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@elasticmachine
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@elasticmachine
Contributor

Pinging @elastic/kibana-security (Team:Security)

@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@dhurley14
Contributor

The security solution executor uses the _has_privileges API to determine whether the rule can query the index patterns provided, so this is probably coming from our rules. In 7.16, if the _has_privileges API returns an error, we display a partial failure banner on the rule details page of the security solution.
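For readers less familiar with that check, here is a minimal sketch of what a privilege check of this shape looks like with the Elasticsearch JS client. The node URL, index patterns, and privileges are illustrative assumptions, not the exact values the detection engine uses:

```ts
import { Client } from '@elastic/elasticsearch';

// In Kibana the client would be scoped to the rule's API key; here we just
// point at a local cluster for illustration.
const client = new Client({ node: 'https://localhost:9200' });

async function checkRulePrivileges(indexPatterns: string[]): Promise<boolean> {
  // If this request goes out without credentials (e.g. the rule's API key was
  // lost), Elasticsearch answers with the exact error quoted in this issue:
  // "missing authentication credentials for REST request [/_security/user/_has_privileges]".
  const response = await client.security.hasPrivileges({
    index: [{ names: indexPatterns, privileges: ['read', 'view_index_metadata'] }],
  });

  if (!response.has_all_requested) {
    // In 7.16 the security solution surfaces this as a partial failure on the
    // rule details page rather than failing the whole execution.
    console.warn('Rule is missing read privileges for some index patterns', response.index);
  }
  return response.has_all_requested;
}

checkRulePrivileges(['logs-*', 'filebeat-*']).catch(console.error);
```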

@gmmorris
Contributor Author

The security solution executor uses the _has_privileges API to determine whether the rule can query the index patterns provided, so this is probably coming from our rules. In 7.16, if the _has_privileges API returns an error, we display a partial failure banner on the rule details page of the security solution.

Ah, that makes sense, thanks @dhurley14 !
So from your perspective, this is a valid error? As in, it's intentional in nature rather than an exception. 🤔

@pmuellr
Member

pmuellr commented Nov 17, 2021

I'd been thinking this could be caused by the API key / task doc race condition issues: #106292 and #110096.

Another source of this could be (see linked SDH issue above) a bad upgrade, where the original encryption key isn't available during the migration. It appears that in such cases we migrate the rule with the API key set to null. Clearly we want to "disable" the rule, but we can't really, since the task document still exists and needs to be deleted, which can't be done during the migration. We also presumably have an API key that should be invalidated, but we can't invalidate it since we couldn't recover it.

This gets complicated to reason about, because for "no security" deployments, the API key WILL BE null. Somehow we need a better guard rail here - check the API key after extraction, and if it's null and it's not supposed to be null (however we check "no security"), we should disable it then and there - with hopefully some kind of notification to the user. Maybe we need a disableReason or such ... relevant code here: alerting/server/task_runner/task_runner.ts
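A very rough sketch of what such a guard rail could look like; the helper names (isSecurityEnabled, disableRule, the disableReason value) are hypothetical, not existing Kibana APIs:

```ts
// Hypothetical guard, run right after the rule's encrypted attributes are
// loaded in the task runner. None of these helpers exist in Kibana today;
// this only illustrates the proposed check.
interface DecryptedRule {
  id: string;
  apiKey: string | null;
}

async function guardApiKey(
  rule: DecryptedRule,
  isSecurityEnabled: boolean,
  disableRule: (id: string, disableReason: string) => Promise<void>
): Promise<boolean> {
  // With security off, a null API key is expected and the rule may run.
  if (!isSecurityEnabled) {
    return true;
  }
  // With security on, a null API key means it was stripped (e.g. during a
  // migration with the wrong encryption key); running the rule would only
  // produce "missing authentication credentials", so disable it instead
  // and record why.
  if (rule.apiKey == null) {
    await disableRule(rule.id, 'api_key_missing');
    return false;
  }
  return true;
}
```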

In a Slack conversation, @ymao1 noted:

  • because we can't delete the task document in migration, we also can't set the rule to disabled in migration, as that would create another task document when it's later enabled, and then there would be two tasks for the rule

  • the migration logic was last updated in PR Gracefully handle decryption errors during ESO migrations #105968 - and there was some question whether we would fail the migration for cases like this. If the only problem was that the correct encryption key was not set during an upgrade, and a second migration could be run with the correct key, then failing the migration would be the best solution. I still feel like there are too many "if's" in that logic, and if we just failed on every decrypt failure we could be causing migration failures that we don't need to. (A sketch of the two options follows below.)
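To make the trade-off concrete, here is a simplified sketch of the two options being weighed. This is illustrative pseudologic only; the real behavior lives in the encryptedSavedObjects plugin's migration helpers from #105968, and all names below are made up:

```ts
// Illustrative only: not the actual ESO migration code.
type DecryptFailureMode = 'strip-and-continue' | 'fail-migration';

interface RuleDoc {
  id: string;
  attributes: { apiKey: string | null; [key: string]: unknown };
}

function onDecryptFailure(doc: RuleDoc, mode: DecryptFailureMode): RuleDoc {
  if (mode === 'fail-migration') {
    // Option discussed above: abort the migration so the operator can re-run
    // the upgrade with the correct encryption key configured.
    throw new Error(`Unable to decrypt "apiKey" for rule ${doc.id}; failing migration`);
  }
  // Current behavior: strip the encrypted attribute and keep migrating, which
  // later surfaces as "missing authentication credentials" at execution time.
  return { ...doc, attributes: { ...doc.attributes, apiKey: null } };
}
```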

@pmuellr
Member

pmuellr commented Nov 18, 2021

I was able to repro this: changing the encryption key across a migration will cause this error.

During migration, this was logged for every rule:

[error][encryptedSavedObjects][plugins] Failed to decrypt "apiKey" attribute: Unsupported state or unable to authenticate data
[WARN ][savedobjects-service] Decryption failed for encrypted Saved Object 
  "fbca0b70-4887-11ec-9ff1-157c47ec9f4a" of type "alert" with error: 
  Unable to decrypt attribute "apiKey". Encrypted attributes have 
  been stripped from the original document and migration will be applied but 
  this may cause errors later on.

It didn't lie! It did cause problems seconds later:

[plugins.alerting] Executing Alert default:.index-threshold:fe9efd60-4887-11ec-9ff1-157c47ec9f4a 
  has resulted in Error: security_exception: 
  [security_exception] Reason: missing authentication credentials for REST request [/_security/user/_has_privileges], 
  caused by: ""

Seems like we need to do better than logging during migration. I think we need to mark these somehow as not-runnable, and then disable them sometime after startup. I wonder if we could even do it DURING startup? Or does it need to be a cleanup task, so that not every Kibana instance will try to "fix" these?

Another possibility is fixing these as-needed - if we recognize we'll get this error because there SHOULD be an API key but there isn't one, disable the rule instead of running it. But then you won't know until you try to run it.
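For comparison, a sketch of the "cleanup after startup" variant; the repository interface here is invented for illustration and is not part of Kibana:

```ts
// Hypothetical one-shot cleanup that could run after startup (ideally on a
// single Kibana instance, e.g. via a task-manager task) instead of waiting
// for each rule to fail at execution time.
interface RuleRepository {
  findEnabledRulesMissingApiKey(): Promise<Array<{ id: string; name: string }>>;
  disable(id: string, reason: string): Promise<void>;
}

async function disableStrippedRules(repo: RuleRepository, log: (msg: string) => void): Promise<void> {
  const broken = await repo.findEnabledRulesMissingApiKey();
  for (const rule of broken) {
    await repo.disable(rule.id, 'api_key_lost_during_migration');
    // Some user-visible notification would still be needed; logging alone is
    // the part this issue argues is not enough.
    log(`Disabled rule "${rule.name}" (${rule.id}) because its API key is missing`);
  }
}
```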

@mikecote
Contributor

if we recognize we'll get this error because there SHOULD be an API key but there isn't one

One theory that may cause this: if a user sets up alerting rules with security disabled (xpack.security.enabled: false) and later enables security, their alerting rules would run into this problem because the apiKey field is empty, and now that security is enabled, a value is expected there.

Though I don't think this scenario is possible on Cloud (security is always enabled?).

@pmuellr
Member

pmuellr commented Nov 18, 2021

Ya, security is always on for cloud, but this could obviously happen on-prem. Thought about that for a second when I was doing my repro, but shoved it to the back of my mind. Obviously we need to take this into account though. We want to disable these, because they NEED an API key at that point, but I guess the question is - when do we make that call and actually disable them. And how do we notify the user that we disabled them.

@mikecote
Contributor

Obviously we need to take this into account though. We want to disable these, because they NEED an API key at that point, but I guess the question is - when do we make that call and actually disable them. And how do we notify the user that we disabled them.

This overlaps well with upcoming efforts to ensure alerting rules run continuously. This becomes a scenario where rules stop running indefinitely until a user intervenes. And we'll need to find a way to notify the user in these cases. So lots TBD :)

@banderror
Contributor

@deepikakeshav-qasource reproduced this issue in her test Cloud environment in #120872 without doing any Kibana upgrades - this was a fresh 8.0.0 deployment.

Could this mean that the race condition mentioned by @pmuellr might be the root cause in this case?

I'd been thinking this could be caused by the api key / task doc race condition issues: #106292 and #110096 .

I wasn't able to reproduce it though, even in the same Cloud environment where she managed to do that.

@kobelb kobelb added the needs-team label on Jan 31, 2022
@botelastic botelastic bot removed the needs-team label on Jan 31, 2022
@jugsofbeer

jugsofbeer commented Mar 3, 2022

Sadly this had the side effect of killing our Kibana nodes' connection to Elasticsearch, and requests to Kibana would display TLS handshake errors.

If we restarted Kibana it would work for 3 or 4 minutes, then the TLS errors returned.

After 2 days of chaos, we disabled all alerts... the problem was resolved temporarily and all errors stopped in our logs.

We have a few alerts that are throwing the _has_privileges error, so more investigation is needed.

We are running v7.16.3 on-premise. It began life as v6.4.2 and has been upgraded through versions over the past 3 years.

A support case was opened today as well. Let me know if you want the number.

@jportner
Contributor

jportner commented Mar 3, 2022

It sounds like this isn't a Platform Security issue, so I'll remove our team's label.

Support case was opened today as well. Let me know if you want the number.

@jugsofbeer Thanks for chiming in - it's helpful for us to know when users are affected by this issue, and we will get the right eyes on the support case!

@fopson

fopson commented Oct 26, 2022

We had the same issue with our on-prem deployment of 8.4.x. Some rules would produce this error for weeks. We found that if you edit the rule and re-save it, it stops failing.

Hope this helps.
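
If editing every affected rule by hand is impractical, a scripted variant of the same workaround might look like the sketch below. It assumes the 8.x alerting HTTP endpoints (_disable / _enable) and that re-enabling a rule regenerates its API key, which matches the edit-and-resave effect described above; KIBANA_URL, AUTH, and the rule id are placeholders:

```ts
// Sketch of the "re-save the rule" workaround done via the alerting HTTP API:
// disabling and re-enabling a rule should make Kibana issue it a fresh API key.
const KIBANA_URL = 'https://localhost:5601'; // placeholder
const AUTH = 'Basic ' + Buffer.from('elastic:changeme').toString('base64'); // placeholder credentials

async function recycleRuleApiKey(ruleId: string): Promise<void> {
  for (const action of ['_disable', '_enable'] as const) {
    const res = await fetch(`${KIBANA_URL}/api/alerting/rule/${ruleId}/${action}`, {
      method: 'POST',
      headers: { 'kbn-xsrf': 'true', Authorization: AUTH },
    });
    if (!res.ok) {
      throw new Error(`${action} failed for rule ${ruleId}: ${res.status} ${await res.text()}`);
    }
  }
}

recycleRuleApiKey('<rule id from the error message>').catch(console.error);
```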

@smith smith added the Team:Monitoring label and removed the Team:Infra Monitoring UI - DEPRECATED label on Nov 13, 2023