
[Discuss] Handle Saved Objects migrations for incompatible/conflicting Kibana configurations #131525

Open
azasypkin opened this issue May 4, 2022 · 8 comments
Labels
discuss, Feature:Migrations, Feature:Security/Encrypted Saved Objects, NeededFor:Alerting Services, Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)

Comments

@azasypkin
Member

Summary

Based on #101582 and in the context of Encrypted Saved Objects, if the Kibana instance that runs the migration has an incompatible encryption key, the Encrypted Saved Objects plugin will throw an error during the migration process (a "cannot decrypt" error) and leave the saved object untouched and unmigrated.

The alerting and actions plugins added code to strip the encrypted attributes whenever such a scenario occurs so that the migration can continue, at least for the unencrypted attributes. This leaves Kibana in a sub-optimal state (incompatible AAD, and users will have to disable and re-enable the rule rather than running an unmigrated rule).
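For illustration, here's a minimal sketch of what such an attribute-stripping migration looks like conceptually; the types, the `apiKey` attribute, and the `decrypt` callback are simplified stand-ins, not the actual alerting/actions plugin code:

```ts
// Simplified stand-ins for the real saved object and rule attribute types.
interface SavedObjectDoc<T> {
  id: string;
  type: string;
  attributes: T;
}

interface RuleAttributes {
  name: string;
  apiKey?: string; // encrypted attribute protected by AAD
}

// Migration wrapper that strips the encrypted attribute when decryption fails
// (e.g. because the encryption key differs), so the unencrypted attributes can
// still be migrated instead of failing the whole migration.
function migrateRuleStrippingOnDecryptError(
  doc: SavedObjectDoc<RuleAttributes>,
  decrypt: (doc: SavedObjectDoc<RuleAttributes>) => SavedObjectDoc<RuleAttributes>
): SavedObjectDoc<RuleAttributes> {
  try {
    return decrypt(doc); // happy path: the key matches, migrate as usual
  } catch (err) {
    // Incompatible encryption key: drop the encrypted attribute and continue.
    // The rule will only work again after the user disables and re-enables it
    // (which generates a fresh API key).
    const { apiKey, ...rest } = doc.attributes;
    return { ...doc, attributes: rest };
  }
}
```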

Ideally, in the case of migration failures caused by a conflicting configuration, users should abort the upgrade process, synchronize the encryption keys, and try the migration again.

Potential solutions

We have an RFC in progress (Detect incompatible configurations) that would allow the Encrypted Saved Objects plugin to detect and report incompatible/conflicting encryption keys, but the detection only happens after the saved objects migrations. The reason is that the detection mechanism relies on saved objects itself, and it's currently not possible to plug into the saved object migration pipeline earlier in the Kibana lifecycle.

These are very rough ideas of how we could handle this in a way that wouldn't require significant changes to the proposed configuration conflict checks design. The implementation and maintenance might be far from trivial, though.

Use a separate index for the configuration conflict checks

As discussed in the aforementioned RFC, it's possible to store the configuration state used for conflict detection in a separate index (see the RFC for pros and cons). This way the conflict detection can happen during start (as soon as the Elasticsearch service is ready), but before the saved object migrations kick in.

Note: technically, decoupling from saved objects makes it possible to run the config conflict checks even at the preboot stage, where we already have access to the configuration, the Elasticsearch service, and even the UI if we need to render anything.
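A minimal sketch of what the separate-index approach could look like, assuming a dedicated `.kibana_config_state` index and a hashed key fingerprint (both names are made up here; the real design is up to the RFC):

```ts
import { createHash } from 'crypto';
import { Client } from '@elastic/elasticsearch';

// Hypothetical index and document id for the persisted configuration state.
const CONFIG_STATE_INDEX = '.kibana_config_state';
const DOC_ID = 'encryption-key-fingerprint';

// Compares a non-reversible fingerprint of the local encryption key with the one
// written by other/previous Kibana instances, and aborts startup (and therefore
// the saved object migration) if they differ.
export async function checkEncryptionKeyConflict(client: Client, encryptionKey: string): Promise<void> {
  const fingerprint = createHash('sha256').update(encryptionKey).digest('hex');

  let existingFingerprint: string | undefined;
  try {
    const existing = await client.get<{ fingerprint: string }>({
      index: CONFIG_STATE_INDEX,
      id: DOC_ID,
    });
    existingFingerprint = existing._source?.fingerprint;
  } catch (err: any) {
    if (err?.meta?.statusCode !== 404) {
      throw err; // only a missing document is expected on first startup
    }
  }

  if (existingFingerprint !== undefined && existingFingerprint !== fingerprint) {
    throw new Error(
      'The configured encryption key differs from the one used by other Kibana instances; ' +
        'synchronize the keys and retry the upgrade.'
    );
  }

  // Record the fingerprint for subsequent startups/instances to compare against.
  await client.index({
    index: CONFIG_STATE_INDEX,
    id: DOC_ID,
    document: { fingerprint },
    refresh: 'wait_for',
  });
}
```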

Extend Saved Objects migration sub-system

Alternatively, we can modify the Saved Objects migration sub-system to perform migrations in multiple stages: in the initial stage we would migrate "service/critical/core" Saved Objects, and the rest in later stages. This would allow us to run the configuration state checks between these stages.
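A very rough sketch of the staging idea (the stage names and the shape of the hooks are made up; nothing like this exists in the migration sub-system today):

```ts
type MigrationStage = 'core' | 'default';

interface TypeToMigrate {
  name: string;
  stage: MigrationStage;
}

// Runs migrations in two passes: "service/critical/core" saved object types first,
// then everything else, with configuration conflict checks in between so that
// conflicting encryption keys are detected before plugin-owned encrypted objects
// are touched.
async function runStagedMigrations(
  types: TypeToMigrate[],
  migrateTypes: (typeNames: string[]) => Promise<void>,
  runConfigConflictChecks: () => Promise<void>
): Promise<void> {
  const namesFor = (stage: MigrationStage) =>
    types.filter((t) => t.stage === stage).map((t) => t.name);

  await migrateTypes(namesFor('core'));    // stage 1: core types
  await runConfigConflictChecks();         // abort here on conflicting configuration
  await migrateTypes(namesFor('default')); // stage 2: remaining types
}
```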

@mikecote @kobelb @lukeelmers @pgayvallet

@azasypkin azasypkin added the Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc), NeededFor:Alerting Services, Feature:Migrations, and Feature:Security/Encrypted Saved Objects labels on May 4, 2022
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@azasypkin azasypkin changed the title Handle Saved Objects migrations for incompatible/conflicting Kibana configurations [Discuss] Handle Saved Objects migrations for incompatible/conflicting Kibana configurations May 4, 2022
@rudolf
Contributor

rudolf commented Sep 6, 2022

Alerting and actions plugins added a code to strip the attributes whenever such a scenario occurs so that the migration can continue at least for the unencrypted attributes.

Is this really the behaviour we're looking for? When a user clicks "upgrade Kibana" they should know that if the upgrade succeeded everything is in a working state. Do they need to comb through the upgrade logs to verify that the upgrade really was successful and wouldn't have an impact on their business? It's much more painful to discover three weeks later that a certain rule was left disabled since your last "successful" upgrade.

When upgrades fail it causes downtime, and this has a high impact on users' business. But we shouldn't mitigate that by leaving them with a Kibana that's in an undefined state without any way to roll back without data loss. In my opinion we should just make it much simpler (automatic on Cloud) to roll back: https://github.com/elastic/cloud/issues/88885

@azasypkin
Member Author

Alerting and actions plugins added a code to strip the attributes whenever such a scenario occurs so that the migration can continue at least for the unencrypted attributes.

Is this really the behaviour we're looking for? When a user clicks "upgrade Kibana" they should know that if the upgrade succeeded everything is in a working state. Do they need to comb through the upgrade logs to verify that the upgrade really was successful and wouldn't have an impact on their business? It's much more painful to discover three weeks later that a certain rule was left disabled since your last "successful" upgrade.

I don't think we ever considered this an ideal solution. Based on #101582 (comment), it was picked as an intermediate "lesser evil" solution at that time. Moreover, ESS should be immune to this issue (an incompatible encryption key during upgrade) since encryption keys are managed by ESS and users cannot change them. It's mostly about self-managed setups, where we don't have control over the encryption keys.

@mikecote @kobelb can probably share more thoughts on that specific case and ideal future state though.

@rudolf
Contributor

rudolf commented Sep 9, 2022

This is a fairly recent addition but I believe #132984 introduced everything we need for the "ideal solution".

When a migration transform function throws, migrations will collect all the failures and log a list of all the documents that failed to successfully transform, as well as the thrown error message + stack.

Users can then decide to ignore these objects by specifying migrations.discardCorruptObjects: "8.4.0", which will delete the failed documents just for the upgrade to 8.4.0; if they leave their configuration untouched and later upgrade to 8.5.0, no failures will be ignored automatically.
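For reference, the setting lives in kibana.yml and is pinned to a single target version (shown below for the 8.4.0 example from this comment):

```yaml
# kibana.yml: discard (delete) documents that fail to migrate, but only for the
# upgrade to this specific version.
migrations.discardCorruptObjects: "8.4.0"
```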

So if encryption just threw a descriptive enough error and a user had "a few encrypted saved objects that can't be decrypted", this would allow them to basically delete those objects and successfully upgrade. BUT if they noticed that all their rules and all their other encrypted saved objects were failing, they would have the opportunity to fix their configuration file instead.

@kobelb
Contributor

kobelb commented Sep 9, 2022

The current behavior is not ideal. It was indeed chosen as the "lesser evil" because it's fairly common for users to have saved objects that are encrypted with different encryption keys, even in ESS. On-prem, encryption keys can be mismatched for any number of reasons, and it's not obvious to our users when this is the case. In ESS, this happens when users connect a local Kibana node to an ESS cluster without terminating the ESS Kibana node.

It's common for our users to eventually discover that they need to have their encryption keys synchronized or else various features just don't work. However, by this time they will have created saved-objects that have different encryption keys with no good way to make sure that they're all encrypted with the same key.

I'm a strong advocate of failing upgrades in situations like these. However, I don't think that we have all of the features that we need for the "ideal solution" and are missing the following:

  1. Graceful rollbacks in ECE/ESS when upgrades fail
  2. A way to prevent mismatched keys in the first place, at least in ESS where we don't support local Kibana nodes without terminating the ESS Kibana node

@azasypkin
Member Author

In ESS, this happens when users connect a local Kibana node to an ESS cluster without terminating the ESS Kibana node.

I thought that Cloud had explicitly/officially decided to not support this anymore? Or do you just mean that it'd be great to protect our users even if they're violating the "support terms"?

@kobelb
Contributor

kobelb commented Sep 9, 2022

I thought that Cloud had explicitly/officially decided to not support this anymore? Or do you just mean that it'd be great to protect our users even if they're violating the "support terms"?

Correct. Cloud has officially decided to not support this anymore, but there's nothing protecting our users from violating this and it's not obvious to our users that this isn't supported until they start to run into complications.

@mikecote
Contributor

mikecote commented Sep 9, 2022

+1 to what @kobelb said. I also wanted to note that our current migrations fail silently by removing the encrypted attributes and logging an error message. This way the SO still gets migrated on the unencrypted portions, and users can, for example, re-generate an API key to fix the rule. To make the discardCorruptObjects feature work, the response ops team will need to change our migration logic to throw errors instead of swallowing them.
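Roughly, the change being described would turn the stripping behavior sketched earlier into something like the following (again a simplified stand-in, not the actual response ops code):

```ts
// Re-throw a descriptive error instead of stripping the attributes, so the failed
// document is collected by core's migration failure report and can be handled via
// migrations.discardCorruptObjects or by fixing the encryption key configuration.
function migrateRuleFailingOnDecryptError<T>(
  doc: { id: string; type: string; attributes: T },
  decrypt: (doc: { id: string; type: string; attributes: T }) => { id: string; type: string; attributes: T }
) {
  try {
    return decrypt(doc);
  } catch (err) {
    throw new Error(
      `Unable to decrypt "${doc.type}:${doc.id}": ${(err as Error).message}. ` +
        'The encryption key likely differs from the one used to create this object.'
    );
  }
}
```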
