Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix data race between zil_commit() and zil_suspend() #14514

Merged
merged 1 commit into from
Mar 1, 2023

Conversation

ryao
Copy link
Contributor

@ryao ryao commented Feb 21, 2023

Motivation and Context

openzfsonwindows#206 found that it is possible to trip VERIFY(list_is_empty(&lwb->lwb_itxs)) when a zil_commit() is delayed by the scheduler long enough for a parallel zil_suspend() operation to exit zil_commit_impl(). This is a data race. To prevent this, we introduce a zilog->zl_commit_lock rwlock to ensure that all outstanding zil_commit() operations finish before zil_suspend() begins and that subsequent operations fallback to txg_wait_synced() after zil_suspend() has begun.

On PREEMPT_RT Linux kernels, the rw_enter() implementation suffers from writer starvation. This means that a ZIL intensive system can delay zil_suspend() indefinitely. This is a pre-existing problem that affects everything that uses rw locks, so it needs to be addressed in the SPL. However, builds against PREEMPT_RT Linux kernels are currently broken due to a GPL symbol issue (#11097), so we can safely disregard that issue for now.

Description

We modify zil_commit() to grab a read lock for both its suspend check and the full duration of zil_commit_impl(), although not for the duration of txg_wait_synced() when the suspend check shows that ZIL is suspended. We also modify zil_suspend() to grab a write lock before grabbing zilog->zl_lock and release it after it has incremented zilog->zl_suspend. The result is that all outstanding zil_commit() operations finish before zil_suspend() begins and subsequent zil_commit() operations fallback to txg_wait_synced() as expected. This prevents the scheduler from adding arbitrarily long waits to zil_commit() that will cause it to run zil_commit_impl() on a ZIL that is either already suspending or already suspended.

Note that grabbing the write lock before grabbing zilog->zl_lock is intended to prevent a lock inversion deadlock between zil_commit() and zil_suspend() on the two locks.

How Has This Been Tested?

The buildbot can test it.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@ryao ryao changed the title Introduce zilog->zl_commit_lock to protect zil_suspend() Fix data race between zil_commit() and zil_suspend() Feb 21, 2023
@ryao
Copy link
Contributor Author

ryao commented Feb 21, 2023

This appears to be a regression introduced by 1ce23dc.

@ryao
Copy link
Contributor Author

ryao commented Feb 28, 2023

I have rebased this on master and repushed.

Copy link
Member

@amotin amotin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would call this lock zl_suspend_lock, since that is what it protects.

Also I am worrying about one more contention point per ZIL. Even though in most cases it won't block, it is still a couple of atomics on shared variable. This reminds me about FreeBSD's rms_rlock() primitives, but may be there are some other solutions too.

module/zfs/zil.c Outdated Show resolved Hide resolved
@ryao
Copy link
Contributor Author

ryao commented Mar 1, 2023

I would call this lock zl_suspend_lock, since that is what it protects.

I will change the name when I do my next push.

Also I am worrying about one more contention point per ZIL. Even though in most cases it won't block, it is still a couple of atomics on shared variable. This reminds me about FreeBSD's rms_rlock() primitives, but may be there are some other solutions too.

I could not see a better way to close the race.

openzfsonwindows#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed
by the scheduler long enough for a parallel `zil_suspend()` operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_suspend_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to `txg_wait_synced()`
after `zil_suspend()` has begun.

On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers
from writer starvation. This means that a ZIL intensive system can delay
`zil_suspend()` indefinitely. This is a pre-existing problem that
affects everything that uses rw locks, so it needs to be addressed in
the SPL.  However, builds against `PREEMPT_RT` Linux kernels are
currently broken due to a GPL symbol issue (openzfs#11097), so we can safely
disregard that issue for now.

Reported-by: Arun KV <arun.kv@datacore.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
@ryao
Copy link
Contributor Author

ryao commented Mar 1, 2023

The latest push rebases on master and addresses both of @amotin's comments.

@behlendorf behlendorf added the Status: Accepted Ready to integrate (reviewed, tested) label Mar 1, 2023
@behlendorf behlendorf merged commit 4c856fb into openzfs:master Mar 1, 2023
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Mar 3, 2023
#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed
by the scheduler long enough for a parallel `zil_suspend()` operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_suspend_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to `txg_wait_synced()`
after `zil_suspend()` has begun.

On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers
from writer starvation. This means that a ZIL intensive system can delay
`zil_suspend()` indefinitely. This is a pre-existing problem that
affects everything that uses rw locks, so it needs to be addressed in
the SPL.  However, builds against `PREEMPT_RT` Linux kernels are
currently broken due to a GPL symbol issue (openzfs#11097), so we can safely
disregard that issue for now.

Reported-by: Arun KV <arun.kv@datacore.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes openzfs#14514
datacore-rm pushed a commit to DataCoreSoftware/openzfs that referenced this pull request Mar 7, 2023
openzfsonwindows#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed
by the scheduler long enough for a parallel `zil_suspend()` operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_suspend_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to `txg_wait_synced()`
after `zil_suspend()` has begun.

On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers
from writer starvation. This means that a ZIL intensive system can delay
`zil_suspend()` indefinitely. This is a pre-existing problem that
affects everything that uses rw locks, so it needs to be addressed in
the SPL.  However, builds against `PREEMPT_RT` Linux kernels are
currently broken due to a GPL symbol issue (openzfs#11097), so we can safely
disregard that issue for now.

Reported-by: Arun KV <arun.kv@datacore.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes openzfs#14514
datacore-rm pushed a commit to DataCoreSoftware/openzfs that referenced this pull request Mar 21, 2023
openzfsonwindows#206 found that it is possible to trip
`VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed
by the scheduler long enough for a parallel `zil_suspend()` operation to
exit `zil_commit_impl()`. This is a data race. To prevent this, we
introduce a `zilog->zl_suspend_lock` rwlock to ensure that all
outstanding `zil_commit()` operations finish before `zil_suspend()`
begins and that subsequent operations fallback to `txg_wait_synced()`
after `zil_suspend()` has begun.

On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers
from writer starvation. This means that a ZIL intensive system can delay
`zil_suspend()` indefinitely. This is a pre-existing problem that
affects everything that uses rw locks, so it needs to be addressed in
the SPL.  However, builds against `PREEMPT_RT` Linux kernels are
currently broken due to a GPL symbol issue (openzfs#11097), so we can safely
disregard that issue for now.

Reported-by: Arun KV <arun.kv@datacore.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes openzfs#14514
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants