Fail v2 Migrations when unknown document types are found #101052

joshdover · 2021-06-01T13:26:27Z

In our initial design of the v2 Migrations, we decided that the ideal way to handle documents from disabled plugins was to fail migrations. See the option (3) in the RFC section describing why. However, in some scenarios, v1 Migrations allowed documents of unknown types to be preserved, so we decided to defer on this change until 8.0 since it was considered breaking.

During testing in #100171, we found that in some scenarios both v1 and v2 Migrations actually already exhibits this behavior, though somewhat by accident. If a customer disables a plugin, it's been growing more and more likely that this scenario would be encountered. With v2 migrations, this situation can actually be encountered on any upgrade where there are documents of an unknown type (or belonging to a disabled plugin) that have a migrationVersion specified. More details about the observed behavior going back to 6.8 can be viewed here: #100171 (comment).

We've decided that we could go ahead and commit to option (3) from the RFC in 7.x since this is already the behavior in most scenarios but not all:

3. Refuse to start a migration until the plugin is enabled or it's data deleted

Advantages:

We force users to enable a plugin or delete the documents which prevents these documents from creating future problems like a mapping update not being compatible because there are fields which are assumed to have been migrated.

We keep the index “clean”.

Disadvantages:

Since users have to take down outdated nodes before they can start the upgrade, they have to enter the downtime window before they know about this problem. This prolongs the downtime window and in many cases might cause an operations team to have to reschedule their downtime window to give them time to investigate the documents that need to be deleted. Logging an error on every startup could warn users ahead of time to mitigate this.

We don’t expose Kibana logs on Cloud so this will have to be escalated to support and could take 48hrs to resolve (users can safely rollback, but without visibility into the logs they might not know this). Exposing Kibana logs is on the cloud team’s roadmap.

It might not be obvious just from the saved object type, which plugin created these objects.

Committing to this option entails that we:

Add an explicit step that will fail a migration if it detects unknown types in the mappings & documents
Remove the OUTDATED_DOCUMENTS_* steps
Close Cleanup migrated docs on OUTDATED_DOCUMENTS_SEARCH step #97965 since it is no longer needed

The primary risk I see to this change is that it effectively no longer allows users to disable plugins that have any stored Saved Objects, since Kibana will fail to startup. This seems like a major issue for cases where disabling things like task_manager or telemetry would are helpful for debugging a problem.

An alternative could be to only fail if the migrationVersion is newer than the current Kibana version, rather than failing if we don't have a known migration, as @pgayvallet shared here: #100171 (comment). This was previously considered, but discarded:

I was a while ago since we discussed that with @rudolf, and I don't remember exactly why, but I think this had some subtle implications though.

@kobelb Do you remember what the risks were to this option?

The text was updated successfully, but these errors were encountered:

elasticmachine · 2021-06-01T13:26:29Z

Pinging @elastic/kibana-core (Team:Core)

kobelb · 2021-06-01T17:00:34Z

@joshdover, I don't recall being part of a conversation about only failing if the migrationVersion is newer than the current Kibana version.

During testing in #100171, we found that in some scenarios both v1 and v2 Migrations actually already exhibits this behavior, though somewhat by accident

This almost certainly has to be by accident, because it won't allow users to follow our documentation for removing a plugin that relies on saved-object migrations: https://www.elastic.co/guide/en/kibana/7.13/kibana-plugins.html#update-remove-plugin. Will either option that is proposed allow users to remove plugins following our documentation? If not, we're going to be introducing a breaking change. However, if we think that the breaking change is subtle and a large number of our users shouldn't be impacted, we can document it as a breaking change and proceed.

pgayvallet · 2021-06-02T11:24:07Z

This almost certainly has to be by accident

Yea, this is a bug. Which has been present for a long time, probably since the beginning of SO migrations, but a bug nevertheless.

If not, we're going to be introducing a breaking change

Technically we would not be introducing anything, as the problem has always been there. We would just be transforming a bug into a feature ™️ .

However, if we think that the breaking change is subtle and a large number of our users shouldn't be impacted, we can document it as a breaking change and proceed

Currently most of our users shouldn't be impacted, as triggering this situations requires to disable a plugin registering a SO type you were effectively using before upgrading Kibana. Effectively, there isn't really any reason to do that, so it's unlikely our customers are doing it (and the total lack of SDH issues about this problem seems like a good confirmation).

Now, if we think long-term, this feels more problematic. The best example would be third party plugins, where a scenario where a user tries a plugin for a bit and then choose to uninstall it.

However, committing to to option (3) from the RFC now, and choosing to change the behavior later is also a possibility.

kobelb · 2021-06-02T15:26:16Z

Technically we would not be introducing anything, as the problem has always been there. We would just be transforming a bug into a feature ™️ .

We should not generalize the approach of "transforming a bug into a feature". The goal of not creating breaking changes is to provide a good experience for our users, so that Elastic products provide them joy. If this bug was to affect a large number of our users, this is a bug that we should absolutely be fixing and not embracing. However, if we can use the existence of a bug for a long time to justify the minimal impact of introducing a breaking change, that's fine.

Ignoring my concerns with the "transforming a bug into a feature" logic, the v1 and v2 migrations only fail when the migrationVersion field is present, right? Isn't one of the proposals to always fail to perform migrations when a plugin isn't present?

Now, if we think long-term, this feels more problematic. The best example would be third party plugins, where a scenario where a user tries a plugin for a bit and then choose to uninstall it.

👍 This is a scenario we should not be ignoring. If it's not incredibly common today, it will become incredibly common.

TinaHeiligers · 2021-06-02T21:16:59Z

My 2c worth is I see the POV from both sides. However, as a current elastic user, I'd like the flexibility of being able to try a feature, disable it and still retain the possibility of being able to use my old data in some way or another. Be it through snapshots, exporting data before disabling the feature or any other way of holding on to it without impacting perf would be a huge bonus.

That being said, there is the cost of hanging on to docs that aren't being used indefinitely. We saw that with the fleet-agent docs that caused massive SO count explosions!

Whichever way we go, we need to make sure it's documented well and that folks know the risks involved.

Remembering back to when fleet was first introduced as beta, even with the very clear warning that there wasn't a migration path for the beta version, folks were still taken by surprise when they did migrate.

pgayvallet · 2021-06-03T06:15:56Z

We should not generalize the approach of "transforming a bug into a feature". The goal of not creating breaking changes is to provide a good experience for our users, so that Elastic products provide them joy.

You probably misunderstood me on that one (I may have set my french cursor a bit too high). I'm strongly in favor of trying to find a solution for this potential ticking bomb instead of just keeping this behavior as it is currently. I was just saying that effectively, not fixing the issue should probably not be considered a breaking change, as this issue has always been there.

However, as a current elastic user, I'd like the flexibility of being able to try a feature, disable it and still retain the possibility of being able to use my old data in some way or another

I strongly agree.

One potential option I see would be to

do not fail when unknown objects are found during migration, and just migrate/copy then into the new index, preserving their migrationVersion (still note: adapting the DocumentMigrator, and the upper layers, to allow such behavior is NOT trivial, for a bunch of reasons I could enumerate if we decide to dig into that direction)
add a new management section (or leverage SOM) to be able to identify 'unknown'/'unused' objects and allow the user to delete them if they want to

Ideally, we would also be able to update the bin/kibana-plugin command and the plugin system to allow plugins to register 'cleanup' hooks during uninstall. We could, that way, allow plugins to prompt the user if he wants to remove the plugin's data when uninstalling. This is probably some significant work though, and we probably want to delay that until we focus more seriously on the 3rd party plugins problematic (which could benefit from a whole lot of other improvements too)

We saw that with the fleet-agent docs that caused massive SO count explosions

My gut feeling is that, now that we're performing full client-side reindex during the migration, we'll want to have #96626 quite soon. This would address that problem.

pgayvallet · 2021-06-03T09:18:20Z

Another thing to take into account if we decide to copy unknown documents to the target index without failing the migration is the multi-instance scenario

instance1 has the foo type registered
instance2 has not the foo type registered
user triggers a migration by bumping Kibana and restarting both instances
both instances are started at the same time
both instances are therefor firing the migration
instance1 perform the migration 'first' (aka wins the race), and migrates the documents from foo
instance2 then perform the migration slightly later (expected behavior) but, as it's not aware of the type, it copies it without migrating it
correct documents migrated by instance1 are overridden by instance2
outch

So even if we decide to copy unknown docs during migration, I think we still should declare that having multiple instance with different plugin sets is not something that we're supporting.

mshustov · 2021-06-03T12:52:41Z

add a new management section (or leverage SOM) to be able to identify 'unknown'/'unused' objects and allow the user to delete them if they want to

Doesn't it require the first option anyway? Or you propose the migration should fail and we ask a Kibana user to navigate to this UI to perform cleanup?

I think we still should declare that having multiple instance with different plugin sets is not something that we're supporting.

IMO it sounds like the most practical solution, as it's mostly an edge case.

do not fail when unknown objects are found during migration,

Btw how Kibana behaves when unknown SO types are imported? I'd assume it fails as well.

joshdover · 2021-06-03T13:50:30Z

I think we still should declare that having multiple instance with different plugin sets is not something that we're supporting.

This should probably be the case anyways, though we'll need to determine how we're going to handle this if we ever support dedicated node types. We should document this in the "settings that must be the same" section: https://www.elastic.co/guide/en/kibana/7.13/production.html#load-balancing-kibana

do not fail when unknown objects are found during migration, and just migrate/copy then into the new index, preserving their migrationVersion (still note: adapting the DocumentMigrator, and the upper layers, to allow such behavior is NOT trivial, for a bunch of reasons I could enumerate if we decide to dig into that direction)

I'm +1 on this and this was the intended behavior during 7.x. I think essentially fixing this bug for 7.x is the best option we have right now, given that users use the ability to disable plugins in some cases.

I don't think failing migrations with unknown document types is a practical option unless and until we adopt not letting any 1st party plugins from being disabled (#89584). That change could only happen at 8.0 at the earliest.

If there's no objections, I propose we close this issue and open a new one to list out the steps needed to update DocumentMigrator to support preserving documents of an unknown type with a migrationVersion specified. @pgayvallet would you mind doing this since you already have the context?

kobelb · 2021-06-03T14:33:17Z

So even if we decide to copy unknown docs during migration, I think we still should declare that having multiple instance with different plugin sets is not something that we're supporting.

Agreed, 100%

I'm +1 on this and this was the intended behavior during 7.x. I think essentially fixing this bug for 7.x is the best option we have right now, given that users use the ability to disable plugins in some cases.

👍

pgayvallet · 2021-06-04T06:49:13Z

Closing in favor of #101351

joshdover added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient labels Jun 1, 2021

joshdover mentioned this issue Jun 1, 2021

[saved objects] Add migrations v2 integration test for scenario with multiple Kibana instances. #100171

Closed

lukeelmers added a commit to lukeelmers/kibana that referenced this issue Jun 1, 2021

Remove second test pending outcome of elastic#101052.

6e3675f

pgayvallet mentioned this issue Jun 4, 2021

SO migration v2: fail the migration early if unknown types are encountered #101351

Closed

pgayvallet closed this as completed Jun 4, 2021

joshdover mentioned this issue Jul 6, 2021

Add test that ensures that SO types are not removed #104418

Closed

joshdover mentioned this issue Aug 4, 2021

Refuse to start with unknown saved object types in 8.0 #107678

Closed

petrklapka mentioned this issue Dec 15, 2022

Issue Description Length Check (Sync Test - Ignore this issue and the description it contains!) (From Jira) #147637

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fail v2 Migrations when unknown document types are found #101052

Fail v2 Migrations when unknown document types are found #101052

joshdover commented Jun 1, 2021

elasticmachine commented Jun 1, 2021

kobelb commented Jun 1, 2021

pgayvallet commented Jun 2, 2021

kobelb commented Jun 2, 2021 •

edited

Loading

TinaHeiligers commented Jun 2, 2021

pgayvallet commented Jun 3, 2021 •

edited

Loading

pgayvallet commented Jun 3, 2021

mshustov commented Jun 3, 2021

joshdover commented Jun 3, 2021

kobelb commented Jun 3, 2021

pgayvallet commented Jun 4, 2021

Fail v2 Migrations when unknown document types are found #101052

Fail v2 Migrations when unknown document types are found #101052

Comments

joshdover commented Jun 1, 2021

elasticmachine commented Jun 1, 2021

kobelb commented Jun 1, 2021

pgayvallet commented Jun 2, 2021

kobelb commented Jun 2, 2021 • edited Loading

TinaHeiligers commented Jun 2, 2021

pgayvallet commented Jun 3, 2021 • edited Loading

pgayvallet commented Jun 3, 2021

mshustov commented Jun 3, 2021

joshdover commented Jun 3, 2021

kobelb commented Jun 3, 2021

pgayvallet commented Jun 4, 2021

kobelb commented Jun 2, 2021 •

edited

Loading

pgayvallet commented Jun 3, 2021 •

edited

Loading