Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resilient saved object migration algorithm #78413

Merged
merged 94 commits into from
Dec 15, 2020
Merged

Conversation

rudolf
Copy link
Contributor

@rudolf rudolf commented Sep 24, 2020

Summary

Implements: #75780

Current status

  • Having tried to clone indices we ran into several versions of Kibana which have incompatible mappings making cloning impossible.
  • I tried to just do a reindex operation instead of cloning, but this had a risk of lost deletes
  • I tried to use an alias and write block to create my own 'reindex block', this works most of the time, but there was a risk that the node that started the reindex is no longer part of the cluster, but that the node is still sending reindex batches which could lead to lost deletes
  • Finally we're reindexing into a temp index, then cloning that index into a target index. Deletes only happen on the cloned target index so that prevents any lost deletes. We delete the temp index after completing the migration. This means that it's possible for us to do two full reindex operations when one instance doesn't yet see that the migration is complete, but the temp index gets deleted before it initiates a reindex. This adds unnecessary load but otherwise doesn't have an impact.
  • I've done the following manual testing:
    • Test migrations against a data set of 100k documents, performance is similar
    • Tested with a 6.8.13 data set (the integration tests include datasets for 7.3, 7.7 and 8.0)
    • Tested that all actions are idempotent by repeating the following process:
      1. Start Kibana
      2. Fail at step N
      3. Start Kibana
      4. Fail at step N
      5. N++; repeat

Checklist

Delete any items that are not applicable to this PR.

For maintainers

@rudolf rudolf requested a review from a team as a code owner September 24, 2020 12:27
@rudolf rudolf marked this pull request as draft September 24, 2020 12:27
src/core/server/saved_objects/migrationsv2/index.ts Outdated Show resolved Hide resolved
};
const control = state.controlState;
if (control === 'INIT') {
return delay(Actions.fetchAliases(client, ['.kibana', '.kibana_current', '.kibana_7.11.0']));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the action is tightly coupled with the state. Is it possible to collocate them closer?

src/core/server/saved_objects/migrationsv2/index.ts Outdated Show resolved Hide resolved
src/core/server/saved_objects/migrationsv2/index.ts Outdated Show resolved Hide resolved
src/core/server/saved_objects/migrationsv2/index.ts Outdated Show resolved Hide resolved
src/core/server/saved_objects/migrationsv2/index.ts Outdated Show resolved Hide resolved
src/core/server/saved_objects/migrationsv2/index.ts Outdated Show resolved Hide resolved
Copy link
Contributor

@joshdover joshdover left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I'm very excited about this FSM-based approach. This will make it much easier for us to test very specific states in isolation. I took a very similar approach with the reindexing mechanism in the Upgrade Assistant and it was much easier to reason about.

If we add detailed debug logging for the state & actions, understanding failures in production will also be much eaiser for our support team. We should definitely add detailed documentation on what these states mean and how to diagnose problems (but make clear that these states are not 'stable' so that we can change them between versions, if necessary)

src/core/server/saved_objects/migrationsv2/index.ts Outdated Show resolved Hide resolved
@rudolf rudolf force-pushed the so-migrations branch 3 times, most recently from 67bcbe1 to 6ad7a54 Compare November 17, 2020 13:55
@rudolf rudolf added project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc labels Nov 27, 2020
@kobelb
Copy link
Contributor

kobelb commented Nov 30, 2020

The new saved-object migrations algorithm relies on the fact that the migrations defined by the plugins are deterministic. Do we have any controls in place to ensure that all current and future migrations are deterministic?

@rudolf rudolf force-pushed the so-migrations branch 2 times, most recently from 7d7db7c to 8d011b5 Compare December 2, 2020 16:15
@rudolf rudolf requested a review from a team as a code owner December 14, 2020 11:52
Copy link
Contributor

@pgayvallet pgayvallet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the code coverage is way better now.

Still got some nits/comments, but nothing that significant.

Comment on lines +68 to +70
if (log.type === 'transition') {
logger.info(logMessagePrefix + `${log.prevControlState} -> ${log.controlState}`, log.state);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: logger.info(${logMessagePrefix} ${log.prevControlState} -> ${log.controlState}, log.state); (multiple occurences)

});

const countUntilThree = countUntilModel(3);
const finalStateP = stateActionMachine(state, next, countUntilThree);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: this can throw. I would at least move it to a beforeAll or beforeEach block

Comment on lines +86 to +96
pipe(
TaskEither.tryCatch(
() => transformRawDocs(state.outdatedDocuments),
(e) => {
throw e;
}
),
TaskEither.chain((docs) =>
Actions.bulkOverwriteTransformedDocuments(client, state.targetIndex, docs)
)
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: should probably be 'extracted' to the action creator

Comment on lines +125 to +127
if (state.controlState === 'DONE' || state.controlState === 'FATAL') {
// Return null if we're in one of the terminating states
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ultraNIT: could introduce either

  • isTerminal(controlState)
  • terminalStates array + terminalStates.includes(state.controlState)

src/core/server/saved_objects/migrationsv2/model.ts Outdated Show resolved Hide resolved
Comment on lines +689 to +698
const outdatedDocumentsQuery = {
bool: {
should: Object.entries(migrationVersionPerType).map(([type, latestVersion]) => ({
bool: {
must: { term: { type } },
must_not: { term: { [`migrationVersion.${type}`]: latestVersion } },
},
})),
},
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a little strange that this is put in the state, but OTOH not doing it would force to put migrationVersionPerType in it instead, so not sure which one is the best

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah @restrry had a good idea to create a KibanaModel that could contain all the state and logic that's static, things we don't collect as part of the algorithm but rather that's injected into it at the start. We could still attach the KibanaModel onto the state, but it could for instance be a class and wouldn't have to be a POJO like the rest of the state.

Didn't get around to it though :(

Comment on lines +196 to +203
if (stateP.controlState === 'INIT') {
const res = resW as ExcludeRetryableEsError<ResponseType<typeof stateP.controlState>>;

if (Either.isRight(res)) {
const indices = res.right;
const aliases = getAliases(indices);

if (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file would still need to be split into per-state files.

@rudolf rudolf added v7.12.0 and removed v7.11.0 labels Dec 15, 2020
@rudolf
Copy link
Contributor Author

rudolf commented Dec 15, 2020

@elasticmachine merge upstream

@rudolf
Copy link
Contributor Author

rudolf commented Dec 15, 2020

I'm merging this so that we can get a snapshot build for QA to test with and so that we can backport it to 7.x (7.12) which will make it a lot simpler to test with 6.x data.

@rudolf rudolf merged commit 89bd0fb into elastic:master Dec 15, 2020
@rudolf rudolf deleted the so-migrations branch December 15, 2020 20:40
@kibanamachine kibanamachine added the backport missing Added to PRs automatically when the are determined to be missing a backport. label Dec 17, 2020
@kibanamachine
Copy link
Contributor

Looks like this PR has a backport PR but it still hasn't been merged. Please merge it ASAP to keep the branches relatively in sync.

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Metrics [docs]

Distributable file count

id before after diff
default 47241 48013 +772
oss 27540 27859 +319

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

Looks like this PR has a backport PR but it still hasn't been merged. Please merge it ASAP to keep the branches relatively in sync.

rudolf added a commit that referenced this pull request Dec 18, 2020
* Resilient saved object migration algorithm (#78413)

* Initial structure of migration state-action machine

* Fix type import

* Retries with exponential back off

* Use discriminated union for state type

* Either type for actions

* Test exponential retries

* TaskEither types for actions

* Fetch indices instead of aliases so we can collect all index state in one request

* Log document id if transform fails

* WIP: Legacy pre-migrations

* UPDATE_TARGET_MAPPINGS

* WIP OUTDATED_DOCUMENTS_TRANSFORM

* Narrow res types depending on control state

* OUTDATED_DOCUMENTS_TRANSFORM

* Use .kibana instead of .kibana_current

* rename control states TARGET_DOCUMENTS* -> OUTDATED_DOCUMENTS*

* WIP MARK_VERSION_INDEX_READY

* Fix and expand INIT -> * transition tests

* Add alias/index name helper functions

* Add feature flag for enabling v2 migrations

* split state_action_machine, reindex legacy indices

* Don't use a scroll search for migrating outdated documents

* model: test control state progressions

* Action integration tests

* Fix existing tests and type errors

* snapshot_in_progress_exception can only happen when closing/deleting an index

* Retry steps up to 10 times

* Update api.md documentation files

* Further actions integration tests

* Action unit tests

* Fix actions integration tests

* Rename actions to be more domain-specific

* Apply suggestions from code review

Co-authored-by: Josh Dover <me@joshdover.com>

* Review feedback: polish and flesh out inline comments

* Fix unhandled rejections in actions unit tests

* model: only delay retryable_es_client_error, reset for other left responses

* Actions unit tests

* More inline comments

* Actions: Group index settings under 'index' key

* bulkIndex -> bulkOverwriteTransformedDocuments to be more domain specific

* state_action_machine tests, fix and add additional tests

* Action integration tests: updateAndPickupMappings, searchForOutdatedDocuments

* oops: uncomment commented out code

* actions integration tests: rejection for createIndex

* update state properties: clearer names, mark all as readonly

* add state properties currentAlias, versionAlias, legacyIndex and test for invalid version scheme in index names

* Use CONSTANTS for constants :D

* Actions: Clarify behaviour and impact of acknowledged: false responses

* Use consistent vocabulary for action responses

* KibanaMigrator test for migrationsV2

* KibanaMigrator test for FATAL state and action exceptions in v2 migrations

* Fix ts error in test

* Refactor: split index file up into a file per model, next, types

* next: use partial application so we don't generate a nextActionMap on every call

* move logic from index.ts to migrations_state_action_machine.ts and test

* add test

* use `Root` to allow specifying oss mode

* Add fix and todo tests for reindexing with preMigrationScript

* Dump execution log of state transitions and responses if we hit FATAL

* add 7.3 xpack tests

* add 100k test data

* Reindex instead of cloning for migrations

* Skip 100k x-pack integration test

* MARK_VERSION_INDEX_READY_CONFLICT for dealing with different versions migrating in parallel

* Track elapsed time

* Fix tests

* Model: make exhaustiveness checks more explicit

* actions integration tests: add additional tests from CR

* migrations_state_action_machine fix flaky test

* Fix flaky integration test

* Reserve FATAL termination only for situations which we never can recover from such as later version already migrated the index

* Handle incompatible_mapping_exception caused by another instance

* Cleanup logging

* Fix/stabilize integration tests

* Add REINDEX_SOURCE_TO_TARGET_VERIFY step

* Strip tests archives of */.DS_Store and __MAC_OSX

* Task manager migrations: remove invalid kibana property when converting legacy indices

* Add disabled mappings for removed field in map saved object type

* verifyReindex action: use count API

* REINDEX_BLOCK_* to prevent lost deletes (needs tests)

* Split out 100k docs integration test so that it has it's own kibana process

* REINDEX_BLOCK_* action tests

* REINDEX_BLOCK_* model tests

* Include original error message when migration_state_machine throws

* Address some CR nits

* Fix TS errors

* Fix bugs

* Reindex then clone to prevent lost deletes

* Fix tests

Co-authored-by: Josh Dover <me@joshdover.com>
Co-authored-by: pgayvallet <pierre.gayvallet@elastic.co>
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
# Conflicts:
#	rfcs/text/0013_saved_object_migrations.md

* createRootWithCorePlugins support for cliArgs which wasn't backported to 7.x

* Attempt to stabilize cloneIndex integration tests (#86123)

* Attempt to stabilize cloneIndex integration tests

* Unskip test

* return resolves/rejects and add assertions counts to each test

* Await don't return expect promises

* Await don't return expect promises for other tests too

* Remove 8.0.0 integration test
@kibanamachine kibanamachine removed the backport missing Added to PRs automatically when the are determined to be missing a backport. label Dec 18, 2020
@rudolf rudolf mentioned this pull request Mar 22, 2021
17 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Embedding Embedding content via iFrame project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient release_note:fix Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc v7.12.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants