
[ Cherry-Pick ] Fix stuck rebuilds and stuck nexus subsystems #1721

Merged
merged 3 commits into release/2.7 from rebuild-rel on Aug 27, 2024

Conversation

tiagolobocastro
Contributor

@tiagolobocastro tiagolobocastro commented Aug 14, 2024

fix(nvmx/retire): disconnect failed controllers

When we are pausing the nexus, all IO must be flushed before
the subsystem pause completes.
If we can't flush the IO then the pause is stuck forever...

The issue we have seen is that when IOs are stuck there is
nothing that can fail them and allow the pause to complete.
One way this can happen is when the controller has failed, as
in that case the IO queues no longer appear to be polled.

A first fix is to piggy-back on the adminq polling failure and
use it to drive the removal of the failed child devices from
the nexus per-core channels.

A better approach may be needed in the future so that IOs can
be timed out even when no completions are processed in a given
I/O qpair.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>
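
A minimal Rust sketch of the retire path this commit describes, for illustration only; the types and names here (`NexusChannel`, `ChildHandle`, `retire_child`, the example URI) are hypothetical stand-ins, not the actual Mayastor API:

```rust
// Illustrative only: the admin-queue poller's failure path drives retirement
// of the failed child from every per-core nexus channel, so outstanding IO can
// be failed and a subsystem pause can complete instead of waiting on IO queues
// that are no longer polled.

use std::collections::HashMap;

/// Per-core view of the nexus children (one instance per reactor core).
struct NexusChannel {
    children: HashMap<String, ChildHandle>,
}

struct ChildHandle {
    uri: String,
}

impl NexusChannel {
    /// Invoked from the admin-queue poll-failure path: drop the handle so no
    /// new IO is submitted to the failed child and outstanding IO can be
    /// completed with an error, letting the pause/flush make progress.
    fn retire_child(&mut self, uri: &str) -> Option<ChildHandle> {
        self.children.remove(uri)
    }
}

fn main() {
    let uri = "nvmf://node-1/replica-1".to_string();
    let mut channel = NexusChannel {
        children: HashMap::from([(uri.clone(), ChildHandle { uri: uri.clone() })]),
    };

    // Admin-queue polling reported a controller failure for this child:
    let retired = channel.retire_child(&uri);
    assert!(retired.is_some());
    assert!(channel.children.is_empty());
    println!("retired child: {}", retired.unwrap().uri);
}
```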

fix(opts): convert adminq poll period to us

This seems to have been mistakenly added as ms.
In practice this would have caused no harm, as the value is not
currently overridden by the helm chart.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>
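
A hedged sketch of the unit fix; the constant, function name, and default value are assumptions rather than the real option-handling code:

```rust
// The configured admin-queue poll period is expressed in milliseconds, but the
// lower-level option expects microseconds, so the value must be scaled rather
// than passed through unchanged.

use std::time::Duration;

const DEFAULT_ADMINQ_POLL_PERIOD_MS: u64 = 1; // hypothetical default

fn adminq_poll_period_us(configured_ms: Option<u64>) -> u64 {
    let period = Duration::from_millis(configured_ms.unwrap_or(DEFAULT_ADMINQ_POLL_PERIOD_MS));
    // Previously the millisecond value was used directly; the fix converts it.
    period.as_micros() as u64
}

fn main() {
    assert_eq!(adminq_poll_period_us(None), 1_000);
    assert_eq!(adminq_poll_period_us(Some(10)), 10_000);
    println!("adminq poll period: {} us", adminq_poll_period_us(Some(10)));
}
```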

fix(rebuild): ensure comms channel is drained on drop

When the rebuild backend is dropped, we must also drain the async channel.
This covers a corner case where a message may be sent at the same time as
we're dropping, in which case that message would hang.

This is not a hang in production, as there we have timeouts which would
eventually cancel the future and allow the drop, but it can still
lead to timeouts and confusion.

Signed-off-by: Tiago Castro <tiagolobocastro@gmail.com>
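
A self-contained sketch of the drain-on-drop pattern, with std channels standing in for the async channel used by the real backend; all names (`RebuildRequest`, `RebuildBackend`) are hypothetical:

```rust
// When the backend is dropped, any request that raced into the channel is
// drained and answered, so the caller waiting on a reply is not left hanging
// until an outer timeout fires.

use std::sync::mpsc::{channel, Receiver, Sender};

struct RebuildRequest {
    op: String,
    reply: Sender<Result<(), String>>, // stand-in for a oneshot reply channel
}

struct RebuildBackend {
    requests: Receiver<RebuildRequest>,
}

impl Drop for RebuildBackend {
    fn drop(&mut self) {
        // Drain every request still sitting in the channel and fail it
        // explicitly rather than silently discarding it.
        while let Ok(request) = self.requests.try_recv() {
            let _ = request
                .reply
                .send(Err(format!("rebuild backend gone, '{}' cancelled", request.op)));
        }
    }
}

fn main() {
    let (to_backend, from_frontend) = channel::<RebuildRequest>();
    let backend = RebuildBackend { requests: from_frontend };

    // A request races in just as the backend is being torn down.
    let (reply_tx, reply_rx) = channel();
    to_backend
        .send(RebuildRequest { op: "get-stats".into(), reply: reply_tx })
        .unwrap();

    drop(backend); // drop drains the channel and answers the pending request

    assert!(reply_rx.recv().unwrap().is_err());
    println!("pending request answered instead of hanging");
}
```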

@tiagolobocastro tiagolobocastro changed the title fix(rebuild): ensure comms channel is drained on drop Fix stuck rebuilds and stuck nexus subsystems Aug 16, 2024
@tiagolobocastro tiagolobocastro changed the title Fix stuck rebuilds and stuck nexus subsystems [ Cherry-Pick ] Fix stuck rebuilds and stuck nexus subsystems Aug 16, 2024
@tiagolobocastro
Contributor Author

bors merge

bors-openebs-mayastor bot pushed a commit that referenced this pull request Aug 27, 2024
1721: [ Cherry-Pick ] Fix stuck rebuilds and stuck nexus subsystems r=tiagolobocastro a=tiagolobocastro

@bors-openebs-mayastor

Build failed:

@tiagolobocastro
Contributor Author

hmm weird build failed error...
bors merge

bors-openebs-mayastor bot pushed a commit that referenced this pull request Aug 27, 2024
1721: [ Cherry-Pick ] Fix stuck rebuilds and stuck nexus subsystems r=tiagolobocastro a=tiagolobocastro

@bors-openebs-mayastor

Build failed:

@tiagolobocastro
Contributor Author

Disabled worker-2 for now... let's try again
bors merge

bors-openebs-mayastor bot pushed a commit that referenced this pull request Aug 27, 2024
1721: [ Cherry-Pick ] Fix stuck rebuilds and stuck nexus subsystems r=tiagolobocastro a=tiagolobocastro

@bors-openebs-mayastor

Timed out.

@tiagolobocastro
Contributor Author

bors merge

@bors-openebs-mayastor

Build succeeded:

@bors-openebs-mayastor bors-openebs-mayastor bot merged commit 5b36521 into release/2.7 Aug 27, 2024
4 checks passed
@bors-openebs-mayastor bors-openebs-mayastor bot deleted the rebuild-rel branch August 27, 2024 17:49