
Taskq locking optimization #32

Closed
behlendorf opened this issue Nov 4, 2010 · 1 comment

@behlendorf
Contributor

Recent oprofile results suggest that tq->tq_lock can be highly contended on a 16-core system when you're attempting to perform a large number of small work items. This results in a significant impact to taskq throughput and lots of wasted CPU cycles, 19% according to oprofile. Additionally, this lock has to disable interrupts, so there may be widespread performance implications. This appears to be much less of an issue on smaller SMP systems.

To address this issue we need to look into refactoring the taskq code to minimize the length of time this lock is held. Currently this single lock protects everything in the taskq structure: all the values, flags, list heads, and wait queues. This was done to keep the implementation simple, but we could benefit from finer-grained locking and clever use of atomic types.

Fundamentally, only the pending and priority lists need to be protected by a spin_lock_irqsave(), which disables interrupts. This is done to ensure that while one of the taskq threads is dequeuing a request with the lock held, we never service an interrupt which dispatches a new item to the same taskq. That would lead to a deadlock. If we can minimize what needs to be locked in taskq_dispatch(), say by making it atomic or by using per-cpu data structures, we can then minimize the need to use spin_lock_irqsave().
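
As a rough sketch of the constraint (simplified, with hypothetical field and type names, not the actual SPL code), the dequeue path in a taskq thread has to look roughly like this; if it took a plain spin_lock() instead, an interrupt handler dispatching to the same taskq on the same CPU would spin on tq_lock and deadlock:

```c
/*
 * Simplified sketch, hypothetical names.  The taskq thread must dequeue
 * with interrupts disabled: if tq_lock were held with interrupts enabled
 * and an interrupt handler called taskq_dispatch() on the same taskq
 * from the same CPU, the handler would spin on tq_lock forever.
 */
static taskq_ent_t *
example_dequeue(taskq_t *tq)
{
	taskq_ent_t *t = NULL;
	unsigned long flags;

	spin_lock_irqsave(&tq->tq_lock, flags);
	if (!list_empty(&tq->tq_pend_list)) {
		t = list_first_entry(&tq->tq_pend_list, taskq_ent_t,
		    tqent_list);
		list_del_init(&t->tqent_list);
	}
	spin_unlock_irqrestore(&tq->tq_lock, flags);

	return (t);
}
```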

In fact, as of 2.6.36 we can take advantage of some new kernel infrastructure. It's in this release that concurrency managed workqueues (cmwq) first appear, and they perform a very similar function to taskqs. So much so that it appears we should be able to layer the taskq API on top of cmwqs. This has a lot of advantages, including keeping the kernel thread count to a minimum. The previous workqueues were not used for the taskqs because the API was far too limited; incidentally, the kernel developers decided the same thing, which is why we now have cmwqs.
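
As a sketch of what that layering might look like (hypothetical wrapper names, not a worked-out design), each dispatched item could be translated into a work_struct handed to a cmwq-backed workqueue created with alloc_workqueue():

```c
#include <linux/workqueue.h>
#include <linux/slab.h>

typedef void (task_func_t)(void *);

/* Hypothetical wrapper: one work_struct per dispatched taskq item. */
struct taskq_work {
	struct work_struct	tw_work;
	task_func_t		*tw_func;
	void			*tw_arg;
};

static void
taskq_work_fn(struct work_struct *work)
{
	struct taskq_work *tw = container_of(work, struct taskq_work, tw_work);

	tw->tw_func(tw->tw_arg);
	kfree(tw);
}

/* Dispatch by handing the item to the shared cmwq worker pool. */
static int
taskq_dispatch_cmwq(struct workqueue_struct *wq, task_func_t func, void *arg)
{
	struct taskq_work *tw;

	tw = kmalloc(sizeof (*tw), GFP_ATOMIC);
	if (tw == NULL)
		return (-ENOMEM);

	INIT_WORK(&tw->tw_work, taskq_work_fn);
	tw->tw_func = func;
	tw->tw_arg = arg;

	queue_work(wq, &tw->tw_work);

	return (0);
}
```

The harder part of the mapping would be the rest of the taskq semantics, e.g. expressing taskq_wait() in terms of flush_workqueue() and cancellation in terms of cancel_work_sync(), which this sketch ignores.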

Anyway, that's great for folks running bleeding-edge kernels, but we may still want to look into optimizing the existing taskq implementation. This support is so new it likely won't be appearing in things like RHEL6, which is what we'll likely be using.

@nedbass
Contributor

nedbass commented Jan 18, 2012

I've found that the large memory allocation in the splat test may fail on some systems. We should probably scale back the number of tasks before this is merged.

behlendorf pushed a commit that referenced this issue Jan 18, 2012
Add a test designed to generate contention on the taskq spinlock by
using a large number of threads (100) to perform a large number (131072)
of trivial work items from a single queue.  This simulates conditions
that may occur with the zio free taskq when a 1TB file is removed from a
ZFS filesystem, for example.  This test should always pass.  Its purpose
is to provide a benchmark to easily measure the effectiveness of taskq
optimizations using statistics from the kernel lock profiler.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
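For reference, the shape of such a contention test against the public taskq API would be roughly the following (a simplified sketch, not the actual splat test code; the thread count, item count, and minalloc/maxalloc values here are only illustrative):

```c
#include <sys/taskq.h>

#define	CONTENTION_THREADS	100
#define	CONTENTION_TASKS	131072

/*
 * Each work item does almost nothing, so the cost being measured is
 * the dispatch/dequeue locking itself.
 */
static void
contention_task(void *arg)
{
	atomic_inc_64((uint64_t *)arg);
}

static int
taskq_contention_test(void)
{
	taskq_t *tq;
	uint64_t count = 0;
	int i;

	tq = taskq_create("contention", CONTENTION_THREADS, maxclsyspri,
	    50, INT_MAX, TASKQ_PREPOPULATE);
	if (tq == NULL)
		return (-ENOMEM);

	for (i = 0; i < CONTENTION_TASKS; i++)
		taskq_dispatch(tq, contention_task, &count, TQ_SLEEP);

	taskq_wait(tq);
	taskq_destroy(tq);

	return (count == CONTENTION_TASKS ? 0 : -EINVAL);
}
```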
nedbass added a commit to nedbass/spl that referenced this issue Jan 19, 2012
This reverts commit ec2b410.

A race condition was introduced by which a wake_up() call can be lost
after the taskq thread determines there are no pending work items,
leading to deadlock:

1. taskq thread enables interrupts
2. dispatcher thread runs, queues work item, calls wake_up()
3. taskq thread runs, adds self to waitq, sleeps

This could easily happen if an interrupt for an I/O completion was
outstanding at the point where the taskq thread re-enables interrupts,
just before the call to add_wait_queue_exclusive().  The handler would
run immediately within the race window.

Issue openzfs#32
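The fix, in sketch form (simplified, hypothetical names; using the prepare_to_wait_exclusive()/finish_wait() helpers rather than a bare add_wait_queue_exclusive() call), is to put the thread on the wait queue before rechecking for pending work, so a wake_up() issued in between still finds a waiter:

```c
#include <linux/sched.h>
#include <linux/wait.h>

static void
taskq_thread_wait(taskq_t *tq)
{
	DEFINE_WAIT(wait);

	/* Register on the wait queue *before* rechecking the condition. */
	prepare_to_wait_exclusive(&tq->tq_work_waitq, &wait,
	    TASK_INTERRUPTIBLE);

	/*
	 * A dispatcher that queues an item and calls wake_up() after this
	 * check now finds us on the queue, so the wakeup cannot be lost.
	 */
	if (list_empty(&tq->tq_pend_list))
		schedule();

	finish_wait(&tq->tq_work_waitq, &wait);
}
```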
nedbass added a commit to nedbass/spl that referenced this issue Jan 19, 2012
Testing has shown that tq->tq_lock can be highly contended when a
large number of small work items are dispatched.  The lock hold time
is reduced by the following changes:

1) Use exclusive threads in the work_waitq

When a single work item is dispatched we only need to wake a single
thread to service it.  The current implementation uses non-exclusive
threads so all threads are woken when the dispatcher calls wake_up().
If a large number of threads are in the queue this overhead can become
non-negligible.

2) Conditionally add/remove threads from work waitq

Taskq threads need only add themselves to the work wait queue if
there are no pending work items.

Issue openzfs#32
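A sketch of the dispatcher side under these two changes (simplified, hypothetical names): because the taskq threads register as exclusive waiters, wake_up() wakes at most one sleeping thread per dispatched item, and a thread that finds pending work never joins the wait queue at all:

```c
/* Simplified sketch, not the actual SPL dispatch path. */
static void
example_dispatch_wake(taskq_t *tq, taskq_ent_t *t)
{
	unsigned long flags;

	spin_lock_irqsave(&tq->tq_lock, flags);
	list_add_tail(&t->tqent_list, &tq->tq_pend_list);
	spin_unlock_irqrestore(&tq->tq_lock, flags);

	/* Exclusive waiters: this wakes a single sleeping taskq thread. */
	wake_up(&tq->tq_work_waitq);
}
```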
behlendorf pushed a commit that referenced this issue Jan 20, 2012
This reverts commit ec2b410.

A race condition was introduced by which a wake_up() call can be lost
after the taskq thread determines there are no pending work items,
leading to deadlock:

1. taskq thread enables interrupts
2. dispatcher thread runs, queues work item, calls wake_up()
3. taskq thread runs, adds self to waitq, sleeps

This could easily happen if an interrupt for an I/O completion was
outstanding at the point where the taskq thread re-enables interrupts,
just before the call to add_wait_queue_exclusive().  The handler would
run immediately within the race window.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
behlendorf pushed a commit that referenced this issue Jan 20, 2012
Testing has shown that tq->tq_lock can be highly contended when a
large number of small work items are dispatched.  The lock hold time
is reduced by the following changes:

1) Use exclusive threads in the work_waitq

When a single work item is dispatched we only need to wake a single
thread to service it.  The current implementation uses non-exclusive
threads so all threads are woken when the dispatcher calls wake_up().
If a large number of threads are in the queue this overhead can become
non-negligible.

2) Conditionally add/remove threads from work waitq

Taskq threads need only add themselves to the work wait queue if
there are no pending work items.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32