instance reconciler: in-doubt migrations should go to failed if their VMMs are observed to have failed #6520

Open
gjcolombo opened this issue Sep 4, 2024 · 3 comments
Labels
nexus Related to nexus

Comments

@gjcolombo
Contributor

When Nexus observes or is told that a VMM is unexpectedly no longer resident on a sled where it lived previously, Nexus takes control of the state machine (assuming the sled has abandoned it) and moves the VMM to Failed. Similar logic applies to migrations: if a migration-participant VMM has disappeared, then the sled has also abdicated the state machine for that side of the migration, and Nexus should also move that migration-half to Failed.
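To make the proposal concrete, here is a minimal sketch of the rule described above. All of the types and the function name (`VmmState`, `MigrationState`, `Side`, `Migration`, `fail_side_for_missing_vmm`) are hypothetical simplifications for illustration, not Nexus's actual data model or API.

```rust
// Hypothetical, simplified states; the real Nexus types have more variants.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum VmmState {
    Running,
    Failed,
}

#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum MigrationState {
    InProgress,
    Completed,
    Failed,
}

impl MigrationState {
    /// A terminal state must never be overwritten.
    fn is_terminal(self) -> bool {
        matches!(self, MigrationState::Completed | MigrationState::Failed)
    }
}

/// Which half of the migration the missing VMM was responsible for.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Side {
    Source,
    Target,
}

/// Hypothetical migration record with one state per participant.
#[derive(Debug)]
struct Migration {
    source_state: MigrationState,
    target_state: MigrationState,
}

/// Called once Nexus has concluded that the VMM on `side` is no longer
/// resident on its sled. The sled has abandoned both state machines, so
/// Nexus moves the VMM and that VMM's half of the migration to Failed.
fn fail_side_for_missing_vmm(
    vmm: &mut VmmState,
    migration: Option<&mut Migration>,
    side: Side,
) {
    *vmm = VmmState::Failed;
    if let Some(m) = migration {
        let half = match side {
            Side::Source => &mut m.source_state,
            Side::Target => &mut m.target_state,
        };
        // Leave a half that already finished (or already failed) alone.
        if !half.is_terminal() {
            *half = MigrationState::Failed;
        }
    }
}
```

In this sketch only the half owned by the missing VMM is touched; the other half stays under the control of whichever sled still hosts the other participant.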

@hawkw
Member

hawkw commented Sep 4, 2024

Is instance-reconciler what we're calling the update saga now? :)

@hawkw hawkw self-assigned this Sep 4, 2024
@hawkw
Member

hawkw commented Sep 4, 2024

Hmm, I suppose we should also make sure to do this if we see an update that moves a target VMM to Destroyed without touching the migration?

@hawkw
Member

hawkw commented Sep 5, 2024

Greg and I discussed this at length yesterday. My opinion is that this is not all that urgent to fix, as the migration table is currently only used to tell instance-update sagas when an in-progress migration has completed or failed. Its main purpose is to tell the update saga whether migration IDs need to be unset and/or the active VMM ID has changed. If either VMM has transitioned to Failed, the update saga already knows that it needs to clean up a VMM and potentially unset the migration IDs. So, it would be nice if the update saga also cleaned up the migration records, but it's the only consumer of the migration table anyway, and leaving migrations in progress when a VMM moves to Failed isn't actually a problem as far as the update saga is concerned.

I do still think it's worth fixing, especially if the migration table is ever used for other purposes in the future. If we were to, for example, use it to generate a UI showing an instance's migration history, it would, of course, be wrong to leave behind permanently InProgress migrations after a VMM goes to Failed. But it's not terribly urgent, IMO.
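As a rough illustration of the "permanently InProgress" problem, the sketch below scans a set of migration records for halves that are still InProgress even though the corresponding VMM is already Failed or Destroyed. The record layout and the names `MigrationRecord`, `half_is_stale`, and `stale_migration_ids` are assumptions made up for this example, not the real migration table schema; treating Destroyed the same as Failed here is likewise an assumption based on the earlier comment about target VMMs going to Destroyed.

```rust
// Hypothetical, simplified states for illustration only.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum VmmState {
    Running,
    Destroyed,
    Failed,
}

#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum MigrationState {
    InProgress,
    Completed,
    Failed,
}

/// Hypothetical flattened view of one migration: each participant's VMM
/// state alongside that participant's half of the migration state.
struct MigrationRecord {
    id: u32,
    source_vmm: VmmState,
    source_state: MigrationState,
    target_vmm: VmmState,
    target_state: MigrationState,
}

/// A half is stale if its VMM is gone but the half still says InProgress:
/// nothing is left that could ever advance it.
fn half_is_stale(vmm: VmmState, half: MigrationState) -> bool {
    half == MigrationState::InProgress
        && matches!(vmm, VmmState::Failed | VmmState::Destroyed)
}

/// Returns the IDs of migrations with at least one stale half, in input order.
fn stale_migration_ids(records: &[MigrationRecord]) -> Vec<u32> {
    records
        .iter()
        .filter(|r| {
            half_is_stale(r.source_vmm, r.source_state)
                || half_is_stale(r.target_vmm, r.target_state)
        })
        .map(|r| r.id)
        .collect()
}
```

A migration-history UI built on records like these would show every ID returned by `stale_migration_ids` as still running forever, which is the failure mode described above.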
