Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fail v2 Migrations when unknown document types are found #101052

Closed
joshdover opened this issue Jun 1, 2021 · 11 comments
Closed

Fail v2 Migrations when unknown document types are found #101052

joshdover opened this issue Jun 1, 2021 · 11 comments
Labels
project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@joshdover
Copy link
Contributor

In our initial design of the v2 Migrations, we decided that the ideal way to handle documents from disabled plugins was to fail migrations. See the option (3) in the RFC section describing why. However, in some scenarios, v1 Migrations allowed documents of unknown types to be preserved, so we decided to defer on this change until 8.0 since it was considered breaking.

During testing in #100171, we found that in some scenarios both v1 and v2 Migrations actually already exhibits this behavior, though somewhat by accident. If a customer disables a plugin, it's been growing more and more likely that this scenario would be encountered. With v2 migrations, this situation can actually be encountered on any upgrade where there are documents of an unknown type (or belonging to a disabled plugin) that have a migrationVersion specified. More details about the observed behavior going back to 6.8 can be viewed here: #100171 (comment).

We've decided that we could go ahead and commit to option (3) from the RFC in 7.x since this is already the behavior in most scenarios but not all:

3. Refuse to start a migration until the plugin is enabled or it's data deleted

Advantages:

  • We force users to enable a plugin or delete the documents which prevents these documents from creating future problems like a mapping update not being compatible because there are fields which are assumed to have been migrated.
  • We keep the index “clean”.

Disadvantages:

  • Since users have to take down outdated nodes before they can start the upgrade, they have to enter the downtime window before they know about this problem. This prolongs the downtime window and in many cases might cause an operations team to have to reschedule their downtime window to give them time to investigate the documents that need to be deleted. Logging an error on every startup could warn users ahead of time to mitigate this.
  • We don’t expose Kibana logs on Cloud so this will have to be escalated to support and could take 48hrs to resolve (users can safely rollback, but without visibility into the logs they might not know this). Exposing Kibana logs is on the cloud team’s roadmap.
  • It might not be obvious just from the saved object type, which plugin created these objects.

Committing to this option entails that we:

The primary risk I see to this change is that it effectively no longer allows users to disable plugins that have any stored Saved Objects, since Kibana will fail to startup. This seems like a major issue for cases where disabling things like task_manager or telemetry would are helpful for debugging a problem.

An alternative could be to only fail if the migrationVersion is newer than the current Kibana version, rather than failing if we don't have a known migration, as @pgayvallet shared here: #100171 (comment). This was previously considered, but discarded:

I was a while ago since we discussed that with @rudolf, and I don't remember exactly why, but I think this had some subtle implications though.

@kobelb Do you remember what the risks were to this option?

@joshdover joshdover added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient labels Jun 1, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-core (Team:Core)

@kobelb
Copy link
Contributor

kobelb commented Jun 1, 2021

@joshdover, I don't recall being part of a conversation about only failing if the migrationVersion is newer than the current Kibana version.

During testing in #100171, we found that in some scenarios both v1 and v2 Migrations actually already exhibits this behavior, though somewhat by accident

This almost certainly has to be by accident, because it won't allow users to follow our documentation for removing a plugin that relies on saved-object migrations: https://www.elastic.co/guide/en/kibana/7.13/kibana-plugins.html#update-remove-plugin. Will either option that is proposed allow users to remove plugins following our documentation? If not, we're going to be introducing a breaking change. However, if we think that the breaking change is subtle and a large number of our users shouldn't be impacted, we can document it as a breaking change and proceed.

@pgayvallet
Copy link
Contributor

This almost certainly has to be by accident

Yea, this is a bug. Which has been present for a long time, probably since the beginning of SO migrations, but a bug nevertheless.

If not, we're going to be introducing a breaking change

Technically we would not be introducing anything, as the problem has always been there. We would just be transforming a bug into a feature ™️ .

However, if we think that the breaking change is subtle and a large number of our users shouldn't be impacted, we can document it as a breaking change and proceed

Currently most of our users shouldn't be impacted, as triggering this situations requires to disable a plugin registering a SO type you were effectively using before upgrading Kibana. Effectively, there isn't really any reason to do that, so it's unlikely our customers are doing it (and the total lack of SDH issues about this problem seems like a good confirmation).

Now, if we think long-term, this feels more problematic. The best example would be third party plugins, where a scenario where a user tries a plugin for a bit and then choose to uninstall it.

However, committing to to option (3) from the RFC now, and choosing to change the behavior later is also a possibility.

@kobelb
Copy link
Contributor

kobelb commented Jun 2, 2021

Technically we would not be introducing anything, as the problem has always been there. We would just be transforming a bug into a feature ™️ .

We should not generalize the approach of "transforming a bug into a feature". The goal of not creating breaking changes is to provide a good experience for our users, so that Elastic products provide them joy. If this bug was to affect a large number of our users, this is a bug that we should absolutely be fixing and not embracing. However, if we can use the existence of a bug for a long time to justify the minimal impact of introducing a breaking change, that's fine.

Ignoring my concerns with the "transforming a bug into a feature" logic, the v1 and v2 migrations only fail when the migrationVersion field is present, right? Isn't one of the proposals to always fail to perform migrations when a plugin isn't present?

Now, if we think long-term, this feels more problematic. The best example would be third party plugins, where a scenario where a user tries a plugin for a bit and then choose to uninstall it.

👍 This is a scenario we should not be ignoring. If it's not incredibly common today, it will become incredibly common.

@TinaHeiligers
Copy link
Contributor

My 2c worth is I see the POV from both sides. However, as a current elastic user, I'd like the flexibility of being able to try a feature, disable it and still retain the possibility of being able to use my old data in some way or another. Be it through snapshots, exporting data before disabling the feature or any other way of holding on to it without impacting perf would be a huge bonus.

That being said, there is the cost of hanging on to docs that aren't being used indefinitely. We saw that with the fleet-agent docs that caused massive SO count explosions!

Whichever way we go, we need to make sure it's documented well and that folks know the risks involved.

Remembering back to when fleet was first introduced as beta, even with the very clear warning that there wasn't a migration path for the beta version, folks were still taken by surprise when they did migrate.

@pgayvallet
Copy link
Contributor

pgayvallet commented Jun 3, 2021

We should not generalize the approach of "transforming a bug into a feature". The goal of not creating breaking changes is to provide a good experience for our users, so that Elastic products provide them joy.

You probably misunderstood me on that one (I may have set my french cursor a bit too high). I'm strongly in favor of trying to find a solution for this potential ticking bomb instead of just keeping this behavior as it is currently. I was just saying that effectively, not fixing the issue should probably not be considered a breaking change, as this issue has always been there.

However, as a current elastic user, I'd like the flexibility of being able to try a feature, disable it and still retain the possibility of being able to use my old data in some way or another

I strongly agree.

One potential option I see would be to

  • do not fail when unknown objects are found during migration, and just migrate/copy then into the new index, preserving their migrationVersion (still note: adapting the DocumentMigrator, and the upper layers, to allow such behavior is NOT trivial, for a bunch of reasons I could enumerate if we decide to dig into that direction)
  • add a new management section (or leverage SOM) to be able to identify 'unknown'/'unused' objects and allow the user to delete them if they want to

Ideally, we would also be able to update the bin/kibana-plugin command and the plugin system to allow plugins to register 'cleanup' hooks during uninstall. We could, that way, allow plugins to prompt the user if he wants to remove the plugin's data when uninstalling. This is probably some significant work though, and we probably want to delay that until we focus more seriously on the 3rd party plugins problematic (which could benefit from a whole lot of other improvements too)

We saw that with the fleet-agent docs that caused massive SO count explosions

My gut feeling is that, now that we're performing full client-side reindex during the migration, we'll want to have #96626 quite soon. This would address that problem.

@pgayvallet
Copy link
Contributor

Another thing to take into account if we decide to copy unknown documents to the target index without failing the migration is the multi-instance scenario

  • instance1 has the foo type registered
  • instance2 has not the foo type registered
  • user triggers a migration by bumping Kibana and restarting both instances
  • both instances are started at the same time
  • both instances are therefor firing the migration
  • instance1 perform the migration 'first' (aka wins the race), and migrates the documents from foo
  • instance2 then perform the migration slightly later (expected behavior) but, as it's not aware of the type, it copies it without migrating it
  • correct documents migrated by instance1 are overridden by instance2
  • outch

So even if we decide to copy unknown docs during migration, I think we still should declare that having multiple instance with different plugin sets is not something that we're supporting.

@mshustov
Copy link
Contributor

mshustov commented Jun 3, 2021

add a new management section (or leverage SOM) to be able to identify 'unknown'/'unused' objects and allow the user to delete them if they want to

Doesn't it require the first option anyway? Or you propose the migration should fail and we ask a Kibana user to navigate to this UI to perform cleanup?

I think we still should declare that having multiple instance with different plugin sets is not something that we're supporting.

IMO it sounds like the most practical solution, as it's mostly an edge case.

do not fail when unknown objects are found during migration,

Btw how Kibana behaves when unknown SO types are imported? I'd assume it fails as well.

@joshdover
Copy link
Contributor Author

I think we still should declare that having multiple instance with different plugin sets is not something that we're supporting.

This should probably be the case anyways, though we'll need to determine how we're going to handle this if we ever support dedicated node types. We should document this in the "settings that must be the same" section: https://www.elastic.co/guide/en/kibana/7.13/production.html#load-balancing-kibana

  • do not fail when unknown objects are found during migration, and just migrate/copy then into the new index, preserving their migrationVersion (still note: adapting the DocumentMigrator, and the upper layers, to allow such behavior is NOT trivial, for a bunch of reasons I could enumerate if we decide to dig into that direction)

I'm +1 on this and this was the intended behavior during 7.x. I think essentially fixing this bug for 7.x is the best option we have right now, given that users use the ability to disable plugins in some cases.

I don't think failing migrations with unknown document types is a practical option unless and until we adopt not letting any 1st party plugins from being disabled (#89584). That change could only happen at 8.0 at the earliest.

If there's no objections, I propose we close this issue and open a new one to list out the steps needed to update DocumentMigrator to support preserving documents of an unknown type with a migrationVersion specified. @pgayvallet would you mind doing this since you already have the context?

@kobelb
Copy link
Contributor

kobelb commented Jun 3, 2021

So even if we decide to copy unknown docs during migration, I think we still should declare that having multiple instance with different plugin sets is not something that we're supporting.

Agreed, 100%

I'm +1 on this and this was the intended behavior during 7.x. I think essentially fixing this bug for 7.x is the best option we have right now, given that users use the ability to disable plugins in some cases.

👍

@pgayvallet
Copy link
Contributor

Closing in favor of #101351

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc
Projects
None yet
Development

No branches or pull requests

6 participants