instance reconciler: in-doubt migrations should go to failed if their VMMs are observed to have failed #6520

gjcolombo · 2024-09-04T20:40:21Z

When Nexus observes or is told that a VMM is unexpectedly no longer resident on a sled where it lived previously, Nexus takes control of the state machine (assuming the sled has abandoned it) and moves the VMM to Failed. Similar logic applies to migrations: if a migration-participant VMM has disappeared, then the sled has also abdicated the state machine for that side of the migration, and Nexus should also move that migration-half to Failed.
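As a rough illustration of the rule proposed above, here is a minimal sketch in Rust. The types and the function `reconcile_migration_half` are hypothetical and heavily simplified; they are not the actual Nexus types or update-saga code, and they only model the decision for one side (source or target) of a migration.

```rust
/// Hypothetical, simplified VMM states (the real Nexus state set is larger).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum VmmState {
    Running,
    Failed,
}

/// Hypothetical, simplified state of one side of a migration record.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum MigrationState {
    Pending,
    InProgress,
    Completed,
    Failed,
}

impl MigrationState {
    /// A half that has already completed or failed is never rewritten.
    fn is_terminal(self) -> bool {
        matches!(self, MigrationState::Completed | MigrationState::Failed)
    }
}

/// Decide the next state for one migration half, given the observed state
/// of the VMM on that side. Assumption: once the VMM is Failed, the sled has
/// given up the state machine, so Nexus may move an in-doubt half to Failed.
fn reconcile_migration_half(vmm: VmmState, half: MigrationState) -> MigrationState {
    if half.is_terminal() {
        return half;
    }
    match vmm {
        VmmState::Failed => MigrationState::Failed,
        VmmState::Running => half,
    }
}
```

In this sketch, a half that is still Pending or InProgress moves to Failed when its VMM is Failed, while a half that already reached Completed or Failed is left alone.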

hawkw · 2024-09-04T20:47:21Z

Is instance-reconciler what we're calling the update saga now? :)

hawkw · 2024-09-04T21:17:11Z

Hmm, I suppose we should also make sure to do this if we see an update that moves a target VMM to Destroyed without touching the migration?

hawkw · 2024-09-05T16:15:08Z

Greg and I discussed this at length yesterday. My opinion is that fixing this is not all that urgent, as the migration table is currently only used to indicate to instance-update sagas when an in-progress migration has completed or failed. Its main purpose is to tell the update saga whether migration IDs need to be unset and/or whether the active VMM ID has changed. If either VMM has transitioned to Failed, the update saga already knows that it needs to clean up a VMM and potentially unset the migration IDs. So, it would be nice if the update saga also cleaned up the migration records, but it's the only consumer of the migration table anyway, and leaving them in progress when a VMM moves to Failed isn't actually a problem as far as the update saga is concerned.

I do still think it's worth fixing, especially if the migration table is ever used for other purposes in the future. If we were to, for example, use it to generate a UI showing an instance's migration history, it would, of course, be wrong to leave behind permanently InProgress migrations after a VMM goes to Failed. But, it's not terribly urgent to fix immediately IMO.

gjcolombo added the nexus label (Related to nexus) Sep 4, 2024

gjcolombo mentioned this issue Sep 4, 2024

[nexus] Mark VMMs on Expunged sleds as Failed #6519

Merged

hawkw self-assigned this Sep 4, 2024

hawkw mentioned this issue Sep 21, 2024

consider renaming "instance update" saga to "instance reconciliation" #6631

Open
