
[ Cherry-Pick ] Fix stuck rebuilds and stuck nexus subsystems #1721

Merged
merged 3 commits into release/2.7 from rebuild-rel on Aug 27, 2024

Conversation

tiagolobocastro
Contributor

@tiagolobocastro tiagolobocastro commented Aug 14, 2024

fix(nvmx/retire): disconnect failed controllers

When we are pausing the nexus, all IO must be flushed before
the subsystem pause completes.
If we can't flush the IO then the pause is stuck forever...

The issue we have seen is that when IOs are stuck there is
nothing that can fail them and allow the pause to complete.
One way this can happen is when the controller has failed, as
in that case the IO queues no longer appear to be polled.

A first fix is to piggy-back on the adminq polling failure and
use it to drive the removal of the failed child devices from
the nexus per-core channels.

A better approach may be needed in the future so that IOs can
be timed out even when no completions are processed in a given
I/O qpair.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>
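
A minimal Rust sketch of the retire path this commit describes, for illustration only; the types and names here (`NexusChannel`, `ChildHandle`, `retire_child`, the example URI) are hypothetical stand-ins, not the actual Mayastor API:

```rust
// Illustrative only: the admin-queue poller's failure path drives retirement
// of the failed child from every per-core nexus channel, so outstanding IO can
// be failed and a subsystem pause can complete instead of waiting on IO queues
// that are no longer polled.

use std::collections::HashMap;

/// Per-core view of the nexus children (one instance per reactor core).
struct NexusChannel {
    children: HashMap<String, ChildHandle>,
}

struct ChildHandle {
    uri: String,
}

impl NexusChannel {
    /// Invoked from the admin-queue poll-failure path: drop the handle so no
    /// new IO is submitted to the failed child and outstanding IO can be
    /// completed with an error, letting the pause/flush make progress.
    fn retire_child(&mut self, uri: &str) -> Option<ChildHandle> {
        self.children.remove(uri)
    }
}

fn main() {
    let uri = "nvmf://node-1/replica-1".to_string();
    let mut channel = NexusChannel {
        children: HashMap::from([(uri.clone(), ChildHandle { uri: uri.clone() })]),
    };

    // Admin-queue polling reported a controller failure for this child:
    let retired = channel.retire_child(&uri);
    assert!(retired.is_some());
    assert!(channel.children.is_empty());
    println!("retired child: {}", retired.unwrap().uri);
}
```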

fix(opts): convert adminq poll period to us

This seems to have been mistakenly added as ms.
In practice this would have caused no harm, as the value is not
currently overridden by the helm chart.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>
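
A hedged sketch of the unit fix; the constant, function name, and default value are assumptions rather than the real option-handling code:

```rust
// The configured admin-queue poll period is expressed in milliseconds, but the
// lower-level option expects microseconds, so the value must be scaled rather
// than passed through unchanged.

use std::time::Duration;

const DEFAULT_ADMINQ_POLL_PERIOD_MS: u64 = 1; // hypothetical default

fn adminq_poll_period_us(configured_ms: Option<u64>) -> u64 {
    let period = Duration::from_millis(configured_ms.unwrap_or(DEFAULT_ADMINQ_POLL_PERIOD_MS));
    // Previously the millisecond value was used directly; the fix converts it.
    period.as_micros() as u64
}

fn main() {
    assert_eq!(adminq_poll_period_us(None), 1_000);
    assert_eq!(adminq_poll_period_us(Some(10)), 10_000);
    println!("adminq poll period: {} us", adminq_poll_period_us(Some(10)));
}
```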

fix(rebuild): ensure comms channel is drained on drop

When the rebuild backend is dropped, we must also drain the async channel.
This covers a corner case where a message may be sent at the same time as
we're dropping, in which case that message would hang.

This is not a hang in production, as there we have timeouts which would
eventually cancel the future and allow the drop, but it can still
lead to timeouts and confusion.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>
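
A self-contained sketch of the drain-on-drop pattern, with std channels standing in for the async channel used by the real backend; all names (`RebuildRequest`, `RebuildBackend`) are hypothetical:

```rust
// When the backend is dropped, any request that raced into the channel is
// drained and answered, so the caller waiting on a reply is not left hanging
// until an outer timeout fires.

use std::sync::mpsc::{channel, Receiver, Sender};

struct RebuildRequest {
    op: String,
    reply: Sender<Result<(), String>>, // stand-in for a oneshot reply channel
}

struct RebuildBackend {
    requests: Receiver<RebuildRequest>,
}

impl Drop for RebuildBackend {
    fn drop(&mut self) {
        // Drain every request still sitting in the channel and fail it
        // explicitly rather than silently discarding it.
        while let Ok(request) = self.requests.try_recv() {
            let _ = request
                .reply
                .send(Err(format!("rebuild backend gone, '{}' cancelled", request.op)));
        }
    }
}

fn main() {
    let (to_backend, from_frontend) = channel::<RebuildRequest>();
    let backend = RebuildBackend { requests: from_frontend };

    // A request races in just as the backend is being torn down.
    let (reply_tx, reply_rx) = channel();
    to_backend
        .send(RebuildRequest { op: "get-stats".into(), reply: reply_tx })
        .unwrap();

    drop(backend); // drop drains the channel and answers the pending request

    assert!(reply_rx.recv().unwrap().is_err());
    println!("pending request answered instead of hanging");
}
```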

@tiagolobocastro tiagolobocastro changed the title fix(rebuild): ensure comms channel is drained on drop Fix stuck rebuilds and stuck nexus subsystems Aug 16, 2024
@tiagolobocastro tiagolobocastro changed the title Fix stuck rebuilds and stuck nexus subsystems [ Cherry-Pick ] Fix stuck rebuilds and stuck nexus subsystems Aug 16, 2024
@tiagolobocastro
Contributor Author

bors merge

bors-openebs-mayastor bot pushed a commit that referenced this pull request Aug 27, 2024
1721: [ Cherry-Pick ] Fix stuck rebuilds and stuck nexus subsystems r=tiagolobocastro a=tiagolobocastro

@bors-openebs-mayastor

Build failed:

@tiagolobocastro
Contributor Author

hmm weird build failed error...
bors merge

bors-openebs-mayastor bot pushed a commit that referenced this pull request Aug 27, 2024
1721: [ Cherry-Pick ] Fix stuck rebuilds and stuck nexus subsystems r=tiagolobocastro a=tiagolobocastro

@bors-openebs-mayastor

Build failed:

@tiagolobocastro
Contributor Author

Disabled worker-2 for now... let's try again
bors merge

bors-openebs-mayastor bot pushed a commit that referenced this pull request Aug 27, 2024
1721: [ Cherry-Pick ] Fix stuck rebuilds and stuck nexus subsystems r=tiagolobocastro a=tiagolobocastro

@bors-openebs-mayastor

Timed out.

@tiagolobocastro
Contributor Author

bors merge

@bors-openebs-mayastor

Build succeeded:

@bors-openebs-mayastor bors-openebs-mayastor bot merged commit 5b36521 into release/2.7 Aug 27, 2024
4 checks passed
@bors-openebs-mayastor bors-openebs-mayastor bot deleted the rebuild-rel branch August 27, 2024 17:49