
msync occasionally takes several seconds (up to ~15 seconds) #12662

Closed
shaan1337 opened this issue Oct 20, 2021 · 5 comments · Fixed by #12790
Labels
Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments

@shaan1337
Contributor

shaan1337 commented Oct 20, 2021

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 20.04
Kernel Version 5.11.0-1018-gcp (on a Google Cloud VM)
Architecture x86_64
OpenZFS Version zfs-0.8.3-1ubuntu12.12, zfs-kmod-2.0.2-1ubuntu5.1

Describe the problem you're observing

When running EventStoreDB (https://github.com/EventStore/EventStore) on a Google Cloud VM, we occasionally see writes taking several seconds in the logs (it happens 1-2 times per day under a constant write load):

{"@t":"2021-10-19T10:51:11.5248104Z","@mt":"SLOW QUEUE MSG [{queue}]: {message} - {elapsed}ms. Q: {prevQueueCount}/{curQueueCount}.","@l":"Debug","queue":"StorageWriterQueue","message":"WritePrepares","elapsed":1574,"prevQueueCount":0,"curQueueCount":0,"SourceContext":"EventStore.Core.Bus.QueuedHandlerMRES","ProcessId":1100082,"ThreadId":10}

In the above case, it took ~1.5 seconds to complete a write, but sometimes it can take up to ~15 seconds.

After doing an strace, I've come to the conclusion that msync is the culprit. We have a small memory-mapped file (8 bytes large) which is updated and flushed very regularly (every ~2 milliseconds).

  • Note that we also have another regular (not memory-mapped) file in the same directory that is being flushed with fsync at the same time but no delay is incurred for it.
  • During the slow write, there are no pending reads/writes on the system (verified with iostat and zfs iostat)
  • In our code, msync is followed by an fsync on the same file descriptor but they run on the same thread, so the calls do not overlap.
  • Once the slow write delay is over, everything continues normally (i.e. it does not hang forever)
  • We don't seem to see the same problem on other cloud platforms (AWS, Azure) although they should be running in a similar environment. EDIT: The issue also seems to happen on AWS/Azure
  • EDIT: We also tested with a non-zfs cluster and the issue has not happened after 48+ hours of continuous writes
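
For reference, here is a minimal C sketch of the write pattern described above (illustrative only: the file name, the one-second reporting threshold and the exact loop are assumptions, and this is neither the actual EventStoreDB code nor the repro attached further below):

/*
 * Sketch of the write pattern: an 8-byte file is mmap'd, updated,
 * msync'd and fsync'd in a tight loop (~every 2 ms), and any flush
 * that takes longer than one second is reported.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000.0 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    int fd = open("checkpoint.chk", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, sizeof(uint64_t)) != 0)
        return 1;

    uint64_t *pos = mmap(NULL, sizeof(uint64_t), PROT_READ | PROT_WRITE,
        MAP_SHARED, fd, 0);
    if (pos == MAP_FAILED)
        return 1;

    for (uint64_t i = 0;; i++) {
        struct timespec start, end;

        *pos = i;                                /* update the 8-byte value */
        clock_gettime(CLOCK_MONOTONIC, &start);
        msync(pos, sizeof(uint64_t), MS_SYNC);   /* flush the dirty page */
        fsync(fd);                               /* then fsync the same fd */
        clock_gettime(CLOCK_MONOTONIC, &end);

        double ms = elapsed_ms(start, end);
        if (ms > 1000.0)
            printf("slow msync: %f ms\n", ms);

        usleep(2000);                            /* ~2 ms between flushes */
    }
}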

After doing an ftrace on msync (__x64_sys_msync), I've got the following output, which shows that the delay is incurred in wait_on_page_writeback():

 StorageWriterQu-1100154 [000] 1634640669.215961: funcgraph_entry:                   |  __x64_sys_msync() {
 StorageWriterQu-1100154 [000] 1634640669.215962: funcgraph_entry:                   |    down_read() {
 StorageWriterQu-1100154 [000] 1634640669.215963: funcgraph_entry:                   |      _cond_resched() {
 StorageWriterQu-1100154 [000] 1634640669.215963: funcgraph_entry:        0.355 us   |        rcu_all_qs();
 StorageWriterQu-1100154 [000] 1634640669.215963: funcgraph_exit:         0.775 us   |      }
 StorageWriterQu-1100154 [000] 1634640669.215964: funcgraph_exit:         1.409 us   |    }
 StorageWriterQu-1100154 [000] 1634640669.215964: funcgraph_entry:                   |    find_vma() {
 StorageWriterQu-1100154 [000] 1634640669.215964: funcgraph_entry:        0.186 us   |      vmacache_find();
 StorageWriterQu-1100154 [000] 1634640669.215965: funcgraph_exit:         0.636 us   |    }
 StorageWriterQu-1100154 [000] 1634640669.215965: funcgraph_entry:        0.173 us   |    up_read();
 StorageWriterQu-1100154 [000] 1634640669.215965: funcgraph_entry:                   |    vfs_fsync_range() {
 StorageWriterQu-1100154 [000] 1634640669.215966: funcgraph_entry:                   |      zpl_fsync() {
 StorageWriterQu-1100154 [000] 1634640669.215966: funcgraph_entry:                   |        filemap_write_and_wait_range() {
 StorageWriterQu-1100154 [000] 1634640669.215967: funcgraph_entry:                   |          __filemap_fdatawrite_range() {
 StorageWriterQu-1100154 [000] 1634640669.215967: funcgraph_entry:        0.306 us   |            _raw_spin_lock();
 StorageWriterQu-1100154 [000] 1634640669.215967: funcgraph_entry:        0.199 us   |            wbc_attach_and_unlock_inode();
 StorageWriterQu-1100154 [000] 1634640669.215968: funcgraph_entry:                   |            do_writepages() {
 StorageWriterQu-1100154 [000] 1634640669.215968: funcgraph_entry:      + 78.604 us  |              zpl_writepages();
 StorageWriterQu-1100154 [000] 1634640669.216047: funcgraph_exit:       + 79.316 us  |            }
 StorageWriterQu-1100154 [000] 1634640669.216047: funcgraph_entry:        0.235 us   |            wbc_detach_inode();
 StorageWriterQu-1100154 [000] 1634640669.216048: funcgraph_exit:       + 81.505 us  |          }
 StorageWriterQu-1100154 [000] 1634640669.216048: funcgraph_entry:                   |          __filemap_fdatawait_range() {
 StorageWriterQu-1100154 [000] 1634640669.216048: funcgraph_entry:                   |            pagevec_lookup_range_tag() {
 StorageWriterQu-1100154 [000] 1634640669.216048: funcgraph_entry:        0.430 us   |              find_get_pages_range_tag();
 StorageWriterQu-1100154 [000] 1634640669.216049: funcgraph_exit:         0.783 us   |            }
 StorageWriterQu-1100154 [000] 1634640669.216049: funcgraph_entry:                   |            wait_on_page_writeback() {
 StorageWriterQu-1100154 [000] 1634640669.216049: funcgraph_entry:        0.265 us   |              page_mapping();
 StorageWriterQu-1100154 [000] 1634640669.216050: funcgraph_entry:      $ 1571904 us |              wait_on_page_bit();
 StorageWriterQu-1100154 [001] 1634640670.787956: funcgraph_exit:       $ 1571907 us |            }
 StorageWriterQu-1100154 [001] 1634640670.787957: funcgraph_entry:                   |            __pagevec_release() {
 StorageWriterQu-1100154 [001] 1634640670.787957: funcgraph_entry:      + 11.553 us  |              lru_add_drain_cpu();
 StorageWriterQu-1100154 [001] 1634640670.787969: funcgraph_entry:        0.405 us   |              release_pages();
 StorageWriterQu-1100154 [001] 1634640670.787970: funcgraph_exit:       + 13.001 us  |            }
 StorageWriterQu-1100154 [001] 1634640670.787970: funcgraph_entry:                   |            _cond_resched() {
 StorageWriterQu-1100154 [001] 1634640670.787970: funcgraph_entry:        0.178 us   |              rcu_all_qs();
 StorageWriterQu-1100154 [001] 1634640670.787971: funcgraph_exit:         0.652 us   |            }
 StorageWriterQu-1100154 [001] 1634640670.787971: funcgraph_exit:       $ 1571923 us |          }
 StorageWriterQu-1100154 [001] 1634640670.787971: funcgraph_entry:        0.350 us   |          filemap_check_errors();
 StorageWriterQu-1100154 [001] 1634640670.787972: funcgraph_exit:       $ 1572005 us |        }
 StorageWriterQu-1100154 [001] 1634640670.787972: funcgraph_entry:        0.399 us   |        crhold();
 StorageWriterQu-1100154 [001] 1634640670.787973: funcgraph_entry:                   |        zfs_fsync() {
 StorageWriterQu-1100154 [001] 1634640670.787974: funcgraph_entry:                   |          tsd_set() {
 StorageWriterQu-1100154 [001] 1634640670.787974: funcgraph_entry:                   |            tsd_hash_search() {
 StorageWriterQu-1100154 [001] 1634640670.787974: funcgraph_entry:        0.279 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.787975: funcgraph_exit:         0.749 us   |            }
 StorageWriterQu-1100154 [001] 1634640670.787975: funcgraph_entry:                   |            tsd_hash_search() {
 StorageWriterQu-1100154 [001] 1634640670.787975: funcgraph_entry:        0.255 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.787976: funcgraph_exit:         0.622 us   |            }
 StorageWriterQu-1100154 [001] 1634640670.787976: funcgraph_entry:                   |            tsd_hash_add_pid() {
 StorageWriterQu-1100154 [001] 1634640670.787976: funcgraph_entry:        1.214 us   |              spl_kmem_alloc();
 StorageWriterQu-1100154 [001] 1634640670.787977: funcgraph_entry:        0.186 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.787978: funcgraph_entry:        0.238 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.787978: funcgraph_exit:         2.413 us   |            }
 StorageWriterQu-1100154 [001] 1634640670.787979: funcgraph_entry:                   |            tsd_hash_add() {
 StorageWriterQu-1100154 [001] 1634640670.787979: funcgraph_entry:        0.578 us   |              spl_kmem_alloc();
 StorageWriterQu-1100154 [001] 1634640670.787980: funcgraph_entry:        0.253 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.787980: funcgraph_entry:        1.398 us   |              tsd_hash_search();
 StorageWriterQu-1100154 [001] 1634640670.787982: funcgraph_entry:        0.350 us   |              tsd_hash_search();
 StorageWriterQu-1100154 [001] 1634640670.787982: funcgraph_entry:        0.250 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.787983: funcgraph_exit:         4.591 us   |            }
 StorageWriterQu-1100154 [001] 1634640670.787983: funcgraph_exit:         9.417 us   |          }
 StorageWriterQu-1100154 [001] 1634640670.787984: funcgraph_entry:                   |          rrm_enter_read() {
 StorageWriterQu-1100154 [001] 1634640670.787984: funcgraph_entry:                   |            rrw_enter_read() {
 StorageWriterQu-1100154 [001] 1634640670.787984: funcgraph_entry:        0.850 us   |              rrw_enter_read_impl();
 StorageWriterQu-1100154 [001] 1634640670.787985: funcgraph_exit:         1.398 us   |            }
 StorageWriterQu-1100154 [001] 1634640670.787985: funcgraph_exit:         1.791 us   |          }
 StorageWriterQu-1100154 [001] 1634640670.787986: funcgraph_entry:                   |          zil_commit() {
 StorageWriterQu-1100154 [001] 1634640670.787986: funcgraph_entry:        0.171 us   |            spa_writeable();
 StorageWriterQu-1100154 [001] 1634640670.787987: funcgraph_entry:                   |            zil_commit_impl() {
 StorageWriterQu-1100154 [001] 1634640670.787987: funcgraph_entry:        1.771 us   |              zil_async_to_sync();
 StorageWriterQu-1100154 [001] 1634640670.787989: funcgraph_entry:        1.622 us   |              zil_alloc_commit_waiter();
 StorageWriterQu-1100154 [001] 1634640670.787991: funcgraph_entry:        9.633 us   |              zil_commit_itx_assign();
 StorageWriterQu-1100154 [001] 1634640670.788001: funcgraph_entry:        4.155 us   |              zil_commit_writer();
 StorageWriterQu-1100154 [001] 1634640670.788005: funcgraph_entry:        0.784 us   |              zil_commit_waiter();
 StorageWriterQu-1100154 [001] 1634640670.788006: funcgraph_entry:        0.605 us   |              zil_free_commit_waiter();
 StorageWriterQu-1100154 [001] 1634640670.788007: funcgraph_exit:       + 20.187 us  |            }
 StorageWriterQu-1100154 [001] 1634640670.788007: funcgraph_exit:       + 21.044 us  |          }
 StorageWriterQu-1100154 [001] 1634640670.788007: funcgraph_entry:                   |          rrm_exit() {
 StorageWriterQu-1100154 [001] 1634640670.788008: funcgraph_entry:                   |            rrw_exit() {
 StorageWriterQu-1100154 [001] 1634640670.788008: funcgraph_entry:        0.325 us   |              mutex_lock();
 StorageWriterQu-1100154 [001] 1634640670.788008: funcgraph_entry:        0.263 us   |              __cv_broadcast();
 StorageWriterQu-1100154 [001] 1634640670.788009: funcgraph_entry:        0.279 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.788009: funcgraph_entry:        0.259 us   |              mutex_unlock();
 StorageWriterQu-1100154 [001] 1634640670.788010: funcgraph_exit:         2.051 us   |            }
 StorageWriterQu-1100154 [001] 1634640670.788010: funcgraph_exit:         2.497 us   |          }
 StorageWriterQu-1100154 [001] 1634640670.788010: funcgraph_entry:                   |          tsd_set() {
 StorageWriterQu-1100154 [001] 1634640670.788010: funcgraph_entry:                   |            tsd_hash_search() {
 StorageWriterQu-1100154 [001] 1634640670.788011: funcgraph_entry:        0.184 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.788011: funcgraph_exit:         0.708 us   |            }
 StorageWriterQu-1100154 [001] 1634640670.788011: funcgraph_entry:                   |            tsd_remove_entry() {
 StorageWriterQu-1100154 [001] 1634640670.788012: funcgraph_entry:        0.174 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.788012: funcgraph_entry:        0.175 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.788013: funcgraph_entry:        0.246 us   |              tsd_hash_del.constprop.0();
 StorageWriterQu-1100154 [001] 1634640670.788013: funcgraph_entry:        0.175 us   |              _raw_spin_lock();
 StorageWriterQu-1100154 [001] 1634640670.788013: funcgraph_entry:        0.191 us   |              tsd_hash_del.constprop.0();
 StorageWriterQu-1100154 [001] 1634640670.788014: funcgraph_entry:        1.373 us   |              tsd_hash_dtor();
 StorageWriterQu-1100154 [001] 1634640670.788015: funcgraph_exit:         4.004 us   |            }
 StorageWriterQu-1100154 [001] 1634640670.788016: funcgraph_exit:         5.289 us   |          }
 StorageWriterQu-1100154 [001] 1634640670.788016: funcgraph_exit:       + 42.299 us  |        }
 StorageWriterQu-1100154 [001] 1634640670.788016: funcgraph_entry:        0.204 us   |        crfree();
 StorageWriterQu-1100154 [001] 1634640670.788016: funcgraph_exit:       $ 1572050 us |      }
 StorageWriterQu-1100154 [001] 1634640670.788017: funcgraph_exit:       $ 1572051 us |    }
 StorageWriterQu-1100154 [001] 1634640670.788017: funcgraph_entry:                   |    fput() {
 StorageWriterQu-1100154 [001] 1634640670.788017: funcgraph_entry:        0.437 us   |      fput_many();
 StorageWriterQu-1100154 [001] 1634640670.788018: funcgraph_exit:         0.874 us   |    }
 StorageWriterQu-1100154 [001] 1634640670.788018: funcgraph_exit:       $ 1572058 us |  }

I've done an additional ftrace, this time only on wait_on_page_bit(), and got the following output:

 StorageWriterQu-1100154 [006] 1634719033.236598: funcgraph_entry:                   |  wait_on_page_bit() {
 StorageWriterQu-1100154 [006] 1634719033.236601: funcgraph_entry:        0.519 us   |    _raw_spin_lock_irq();
 StorageWriterQu-1100154 [006] 1634719033.236602: funcgraph_entry:                   |    io_schedule() {
 StorageWriterQu-1100154 [006] 1634719033.236602: funcgraph_entry:        0.316 us   |      io_schedule_prepare();
 StorageWriterQu-1100154 [006] 1634719033.236603: funcgraph_entry:                   |      schedule() {
 StorageWriterQu-1100154 [006] 1634719033.236603: funcgraph_entry:                   |        rcu_note_context_switch() {
 StorageWriterQu-1100154 [006] 1634719033.236603: funcgraph_entry:        0.379 us   |          rcu_qs();
 StorageWriterQu-1100154 [006] 1634719033.236604: funcgraph_exit:         0.950 us   |        }
 StorageWriterQu-1100154 [006] 1634719033.236604: funcgraph_entry:        0.358 us   |        _raw_spin_lock();
 StorageWriterQu-1100154 [006] 1634719033.236605: funcgraph_entry:        0.419 us   |        update_rq_clock();
 StorageWriterQu-1100154 [006] 1634719033.236606: funcgraph_entry:                   |        psi_task_change() {
 StorageWriterQu-1100154 [006] 1634719033.236606: funcgraph_entry:        0.300 us   |          psi_flags_change();
 StorageWriterQu-1100154 [006] 1634719033.236607: funcgraph_entry:                   |          psi_group_change() {
 StorageWriterQu-1100154 [006] 1634719033.236607: funcgraph_entry:        0.321 us   |            record_times();
 StorageWriterQu-1100154 [006] 1634719033.236608: funcgraph_exit:         1.029 us   |          }
 StorageWriterQu-1100154 [006] 1634719033.236608: funcgraph_entry:                   |          psi_group_change() {
 StorageWriterQu-1100154 [006] 1634719033.236608: funcgraph_entry:        0.288 us   |            record_times();
 StorageWriterQu-1100154 [006] 1634719033.236609: funcgraph_exit:         0.897 us   |          }
 StorageWriterQu-1100154 [006] 1634719033.236609: funcgraph_entry:                   |          psi_group_change() {
 StorageWriterQu-1100154 [006] 1634719033.236610: funcgraph_entry:        0.305 us   |            record_times();
 StorageWriterQu-1100154 [006] 1634719033.236610: funcgraph_exit:         0.883 us   |          }
 StorageWriterQu-1100154 [006] 1634719033.236610: funcgraph_exit:         4.667 us   |        }
 StorageWriterQu-1100154 [006] 1634719033.236611: funcgraph_entry:                   |        dequeue_task_fair() {
 StorageWriterQu-1100154 [006] 1634719033.236611: funcgraph_entry:                   |          dequeue_entity() {
 StorageWriterQu-1100154 [006] 1634719033.236611: funcgraph_entry:                   |            update_curr() {
 StorageWriterQu-1100154 [006] 1634719033.236612: funcgraph_entry:        0.350 us   |              update_min_vruntime();
 StorageWriterQu-1100154 [006] 1634719033.236612: funcgraph_entry:                   |              cpuacct_charge() {
 StorageWriterQu-1100154 [006] 1634719033.236613: funcgraph_entry:        0.299 us   |                rcu_read_unlock_strict();
 StorageWriterQu-1100154 [006] 1634719033.236613: funcgraph_exit:         0.888 us   |              }
 StorageWriterQu-1100154 [006] 1634719033.236614: funcgraph_entry:                   |              __cgroup_account_cputime() {
 StorageWriterQu-1100154 [006] 1634719033.236614: funcgraph_entry:        0.285 us   |                cgroup_rstat_updated();
 StorageWriterQu-1100154 [006] 1634719033.236614: funcgraph_exit:         0.854 us   |              }
 StorageWriterQu-1100154 [006] 1634719033.236615: funcgraph_entry:        0.296 us   |              rcu_read_unlock_strict();
 StorageWriterQu-1100154 [006] 1634719033.236615: funcgraph_exit:         3.909 us   |            }
 StorageWriterQu-1100154 [006] 1634719033.236616: funcgraph_entry:        0.268 us   |            __update_load_avg_se();
 StorageWriterQu-1100154 [006] 1634719033.236616: funcgraph_entry:        0.304 us   |            __update_load_avg_cfs_rq();
 StorageWriterQu-1100154 [006] 1634719033.236617: funcgraph_entry:        0.293 us   |            clear_buddies();
 StorageWriterQu-1100154 [006] 1634719033.236617: funcgraph_entry:        0.284 us   |            update_cfs_group();
 StorageWriterQu-1100154 [006] 1634719033.236618: funcgraph_entry:        0.301 us   |            update_min_vruntime();
 StorageWriterQu-1100154 [006] 1634719033.236618: funcgraph_exit:         7.319 us   |          }
 StorageWriterQu-1100154 [006] 1634719033.236619: funcgraph_entry:                   |          dequeue_entity() {
 StorageWriterQu-1100154 [006] 1634719033.236619: funcgraph_entry:                   |            update_curr() {
 StorageWriterQu-1100154 [006] 1634719033.236619: funcgraph_entry:        0.328 us   |              __calc_delta();
 StorageWriterQu-1100154 [006] 1634719033.236620: funcgraph_entry:        0.300 us   |              update_min_vruntime();
 StorageWriterQu-1100154 [006] 1634719033.236620: funcgraph_exit:         1.487 us   |            }
 StorageWriterQu-1100154 [006] 1634719033.236621: funcgraph_entry:        0.305 us   |            __update_load_avg_se();
 StorageWriterQu-1100154 [006] 1634719033.236621: funcgraph_entry:        0.296 us   |            __update_load_avg_cfs_rq();
 StorageWriterQu-1100154 [006] 1634719033.236622: funcgraph_entry:        0.279 us   |            clear_buddies();
 StorageWriterQu-1100154 [006] 1634719033.236622: funcgraph_entry:                   |            update_cfs_group() {
 StorageWriterQu-1100154 [006] 1634719033.236623: funcgraph_entry:        0.328 us   |              reweight_entity();
 StorageWriterQu-1100154 [006] 1634719033.236623: funcgraph_exit:         0.939 us   |            }
 StorageWriterQu-1100154 [006] 1634719033.236624: funcgraph_entry:        0.287 us   |            update_min_vruntime();
 StorageWriterQu-1100154 [006] 1634719033.236624: funcgraph_exit:         5.516 us   |          }
 StorageWriterQu-1100154 [006] 1634719033.236624: funcgraph_entry:        0.239 us   |          hrtick_update();
 StorageWriterQu-1100154 [006] 1634719033.236625: funcgraph_exit:       + 14.232 us  |        }
 StorageWriterQu-1100154 [006] 1634719033.236625: funcgraph_entry:                   |        __delayacct_blkio_start() {
 StorageWriterQu-1100154 [006] 1634719033.236626: funcgraph_entry:        0.366 us   |          ktime_get();
 StorageWriterQu-1100154 [006] 1634719033.236626: funcgraph_exit:         0.921 us   |        }
 StorageWriterQu-1100154 [006] 1634719033.236627: funcgraph_entry:                   |        pick_next_task_fair() {
 StorageWriterQu-1100154 [006] 1634719033.236627: funcgraph_entry:                   |          newidle_balance() {
 StorageWriterQu-1100154 [006] 1634719033.236627: funcgraph_entry:        0.285 us   |            __msecs_to_jiffies();
 StorageWriterQu-1100154 [006] 1634719033.236628: funcgraph_entry:        0.245 us   |            rcu_read_unlock_strict();
 StorageWriterQu-1100154 [006] 1634719033.236628: funcgraph_exit:         1.419 us   |          }
 StorageWriterQu-1100154 [006] 1634719033.236628: funcgraph_exit:         1.972 us   |        }
 StorageWriterQu-1100154 [006] 1634719033.236629: funcgraph_entry:                   |        put_prev_task_fair() {
 StorageWriterQu-1100154 [006] 1634719033.236629: funcgraph_entry:                   |          put_prev_entity() {
 StorageWriterQu-1100154 [006] 1634719033.236629: funcgraph_entry:        0.278 us   |            check_cfs_rq_runtime();
 StorageWriterQu-1100154 [006] 1634719033.236630: funcgraph_exit:         0.840 us   |          }
 StorageWriterQu-1100154 [006] 1634719033.236630: funcgraph_entry:                   |          put_prev_entity() {
 StorageWriterQu-1100154 [006] 1634719033.236630: funcgraph_entry:        0.281 us   |            check_cfs_rq_runtime();
 StorageWriterQu-1100154 [006] 1634719033.236631: funcgraph_exit:         0.829 us   |          }
 StorageWriterQu-1100154 [006] 1634719033.236631: funcgraph_exit:         2.501 us   |        }
 StorageWriterQu-1100154 [006] 1634719033.236632: funcgraph_entry:                   |        pick_next_task_idle() {
 StorageWriterQu-1100154 [006] 1634719033.236632: funcgraph_entry:                   |          set_next_task_idle() {
 StorageWriterQu-1100154 [006] 1634719033.236632: funcgraph_entry:                   |            __update_idle_core() {
 StorageWriterQu-1100154 [006] 1634719033.236632: funcgraph_entry:        0.287 us   |              rcu_read_unlock_strict();
 StorageWriterQu-1100154 [006] 1634719033.236633: funcgraph_exit:         0.835 us   |            }
 StorageWriterQu-1100154 [006] 1634719033.236633: funcgraph_exit:         1.372 us   |          }
 StorageWriterQu-1100154 [006] 1634719033.236633: funcgraph_exit:         1.934 us   |        }
 StorageWriterQu-1100154 [006] 1634719033.236634: funcgraph_entry:        0.225 us   |        psi_task_switch();
 StorageWriterQu-1100154 [006] 1634719033.236634: funcgraph_entry:        1.438 us   |        __traceiter_sched_switch();
 StorageWriterQu-1100154 [006] 1634719038.397231: funcgraph_entry:        2.358 us   |        finish_task_switch();
 StorageWriterQu-1100154 [006] 1634719038.397233: funcgraph_exit:       $ 5160630 us |      }
 StorageWriterQu-1100154 [006] 1634719038.397233: funcgraph_exit:       $ 5160631 us |    }
 StorageWriterQu-1100154 [006] 1634719038.397234: funcgraph_entry:        0.206 us   |    finish_wait();
 StorageWriterQu-1100154 [006] 1634719038.397237: funcgraph_entry:        0.280 us   |    irq_enter_rcu();
 StorageWriterQu-1100154 [006] 1634719038.397237: funcgraph_entry:                   |    __sysvec_irq_work() {
 StorageWriterQu-1100154 [006] 1634719038.397238: funcgraph_entry:                   |      __wake_up() {
 StorageWriterQu-1100154 [006] 1634719038.397238: funcgraph_entry:                   |        __wake_up_common_lock() {
 StorageWriterQu-1100154 [006] 1634719038.397238: funcgraph_entry:        0.181 us   |          _raw_spin_lock_irqsave();
 StorageWriterQu-1100154 [006] 1634719038.397239: funcgraph_entry:        0.199 us   |          __wake_up_common();
 StorageWriterQu-1100154 [006] 1634719038.397239: funcgraph_entry:        0.197 us   |          __lock_text_start();
 StorageWriterQu-1100154 [006] 1634719038.397240: funcgraph_exit:         1.614 us   |        }
 StorageWriterQu-1100154 [006] 1634719038.397240: funcgraph_exit:         1.981 us   |      }
 StorageWriterQu-1100154 [006] 1634719038.397240: funcgraph_entry:                   |      __wake_up() {
 StorageWriterQu-1100154 [006] 1634719038.397240: funcgraph_entry:                   |        __wake_up_common_lock() {
 StorageWriterQu-1100154 [006] 1634719038.397240: funcgraph_entry:        0.178 us   |          _raw_spin_lock_irqsave();
 StorageWriterQu-1100154 [006] 1634719038.397241: funcgraph_entry:                   |          __wake_up_common() {
 StorageWriterQu-1100154 [006] 1634719038.397242: funcgraph_entry:                   |            autoremove_wake_function() {
 StorageWriterQu-1100154 [006] 1634719038.397242: funcgraph_entry:                   |              default_wake_function() {
 StorageWriterQu-1100154 [006] 1634719038.397243: funcgraph_entry:                   |                try_to_wake_up() {
 StorageWriterQu-1100154 [006] 1634719038.397243: funcgraph_entry:        0.427 us   |                  _raw_spin_lock_irqsave();
 StorageWriterQu-1100154 [006] 1634719038.397243: funcgraph_entry:        0.300 us   |                  __traceiter_sched_waking();
 StorageWriterQu-1100154 [006] 1634719038.397244: funcgraph_entry:                   |                  select_task_rq_fair() {
 StorageWriterQu-1100154 [006] 1634719038.397244: funcgraph_entry:                   |                    select_idle_sibling() {
 StorageWriterQu-1100154 [006] 1634719038.397244: funcgraph_entry:        0.252 us   |                      available_idle_cpu();
 StorageWriterQu-1100154 [006] 1634719038.397245: funcgraph_entry:        0.180 us   |                      cpus_share_cache();
 StorageWriterQu-1100154 [006] 1634719038.397245: funcgraph_entry:        0.227 us   |                      available_idle_cpu();
 StorageWriterQu-1100154 [006] 1634719038.397246: funcgraph_entry:        0.210 us   |                      available_idle_cpu();
 StorageWriterQu-1100154 [006] 1634719038.397246: funcgraph_entry:        0.175 us   |                      available_idle_cpu();
 StorageWriterQu-1100154 [006] 1634719038.397247: funcgraph_entry:        0.195 us   |                      available_idle_cpu();
 StorageWriterQu-1100154 [006] 1634719038.397247: funcgraph_entry:        0.212 us   |                      available_idle_cpu();
 StorageWriterQu-1100154 [006] 1634719038.397247: funcgraph_exit:         3.202 us   |                    }
 StorageWriterQu-1100154 [006] 1634719038.397248: funcgraph_entry:        0.182 us   |                    rcu_read_unlock_strict();
 StorageWriterQu-1100154 [006] 1634719038.397248: funcgraph_exit:         4.042 us   |                  }
 StorageWriterQu-1100154 [006] 1634719038.397248: funcgraph_entry:                   |                  set_task_cpu() {
 StorageWriterQu-1100154 [006] 1634719038.397248: funcgraph_entry:                   |                    migrate_task_rq_fair() {
 StorageWriterQu-1100154 [006] 1634719038.397249: funcgraph_entry:                   |                      remove_entity_load_avg() {
 StorageWriterQu-1100154 [006] 1634719038.397249: funcgraph_entry:        0.376 us   |                        __update_load_avg_blocked_se();
 StorageWriterQu-1100154 [006] 1634719038.397250: funcgraph_entry:        0.261 us   |                        _raw_spin_lock_irqsave();
 StorageWriterQu-1100154 [006] 1634719038.397250: funcgraph_entry:        0.193 us   |                        __lock_text_start();
 StorageWriterQu-1100154 [006] 1634719038.397251: funcgraph_exit:         1.612 us   |                      }
 StorageWriterQu-1100154 [006] 1634719038.397251: funcgraph_exit:         2.327 us   |                    }
 StorageWriterQu-1100154 [006] 1634719038.397251: funcgraph_entry:        0.174 us   |                    set_task_rq_fair();
 StorageWriterQu-1100154 [006] 1634719038.397252: funcgraph_exit:         3.537 us   |                  }
 StorageWriterQu-1100154 [006] 1634719038.397252: funcgraph_entry:        0.217 us   |                  ttwu_queue_wakelist();
 StorageWriterQu-1100154 [006] 1634719038.397252: funcgraph_entry:        0.216 us   |                  _raw_spin_lock();
 StorageWriterQu-1100154 [006] 1634719038.397253: funcgraph_entry:        0.289 us   |                  update_rq_clock();
 StorageWriterQu-1100154 [006] 1634719038.397253: funcgraph_entry:                   |                  ttwu_do_activate() {
 StorageWriterQu-1100154 [006] 1634719038.397253: funcgraph_entry:                   |                    psi_task_change() {
 StorageWriterQu-1100154 [006] 1634719038.397254: funcgraph_entry:        0.236 us   |                      psi_flags_change();
 StorageWriterQu-1100154 [006] 1634719038.397255: funcgraph_entry:                   |                      psi_group_change() {
 StorageWriterQu-1100154 [006] 1634719038.397256: funcgraph_entry:        0.215 us   |                        record_times();
 StorageWriterQu-1100154 [006] 1634719038.397256: funcgraph_entry:                   |                        queue_delayed_work_on() {
 StorageWriterQu-1100154 [006] 1634719038.397257: funcgraph_entry:                   |                          __queue_delayed_work() {
 StorageWriterQu-1100154 [006] 1634719038.397257: funcgraph_entry:                   |                            add_timer() {
 StorageWriterQu-1100154 [006] 1634719038.397257: funcgraph_entry:        0.359 us   |                              lock_timer_base();
 StorageWriterQu-1100154 [006] 1634719038.397258: funcgraph_entry:        0.181 us   |                              detach_if_pending();
 StorageWriterQu-1100154 [006] 1634719038.397258: funcgraph_entry:        0.176 us   |                              get_nohz_timer_target();
 StorageWriterQu-1100154 [006] 1634719038.397258: funcgraph_entry:        0.219 us   |                              _raw_spin_lock();
 StorageWriterQu-1100154 [006] 1634719038.397259: funcgraph_entry:        0.476 us   |                              internal_add_timer();
 StorageWriterQu-1100154 [006] 1634719038.397259: funcgraph_entry:        0.205 us   |                              __lock_text_start();
 StorageWriterQu-1100154 [006] 1634719038.397260: funcgraph_exit:         3.069 us   |                            }
 StorageWriterQu-1100154 [006] 1634719038.397260: funcgraph_exit:         3.513 us   |                          }
 StorageWriterQu-1100154 [006] 1634719038.397260: funcgraph_exit:         3.904 us   |                        }
 StorageWriterQu-1100154 [006] 1634719038.397260: funcgraph_exit:         5.155 us   |                      }
 StorageWriterQu-1100154 [006] 1634719038.397261: funcgraph_entry:                   |                      psi_group_change() {
 StorageWriterQu-1100154 [006] 1634719038.397261: funcgraph_entry:        0.196 us   |                        record_times();
 StorageWriterQu-1100154 [006] 1634719038.397262: funcgraph_exit:         0.838 us   |                      }
 StorageWriterQu-1100154 [006] 1634719038.397262: funcgraph_entry:                   |                      psi_group_change() {
 StorageWriterQu-1100154 [006] 1634719038.397263: funcgraph_entry:        0.192 us   |                        record_times();
 StorageWriterQu-1100154 [006] 1634719038.397263: funcgraph_exit:         0.831 us   |                      }
 StorageWriterQu-1100154 [006] 1634719038.397263: funcgraph_entry:                   |                      psi_group_change() {
 StorageWriterQu-1100154 [006] 1634719038.397263: funcgraph_entry:        0.226 us   |                        record_times();
 StorageWriterQu-1100154 [006] 1634719038.397264: funcgraph_exit:         0.649 us   |                      }
 StorageWriterQu-1100154 [006] 1634719038.397264: funcgraph_exit:       + 10.517 us  |                    }
 StorageWriterQu-1100154 [006] 1634719038.397264: funcgraph_entry:                   |                    enqueue_task_fair() {
 StorageWriterQu-1100154 [006] 1634719038.397264: funcgraph_entry:                   |                      enqueue_entity() {
 StorageWriterQu-1100154 [006] 1634719038.397265: funcgraph_entry:        0.234 us   |                        update_curr();
 StorageWriterQu-1100154 [006] 1634719038.397265: funcgraph_entry:        0.321 us   |                        __update_load_avg_cfs_rq();
 StorageWriterQu-1100154 [006] 1634719038.397266: funcgraph_entry:        0.198 us   |                        attach_entity_load_avg();
 StorageWriterQu-1100154 [006] 1634719038.397266: funcgraph_entry:        0.180 us   |                        update_cfs_group();
 StorageWriterQu-1100154 [006] 1634719038.397267: funcgraph_entry:        0.196 us   |                        __enqueue_entity();
 StorageWriterQu-1100154 [006] 1634719038.397267: funcgraph_exit:         2.875 us   |                      }
 StorageWriterQu-1100154 [006] 1634719038.397268: funcgraph_entry:                   |                      enqueue_entity() {
 StorageWriterQu-1100154 [006] 1634719038.397268: funcgraph_entry:        0.176 us   |                        update_curr();
 StorageWriterQu-1100154 [006] 1634719038.397268: funcgraph_entry:        0.240 us   |                        __update_load_avg_se();
 StorageWriterQu-1100154 [006] 1634719038.397269: funcgraph_entry:        0.276 us   |                        __update_load_avg_cfs_rq();
 StorageWriterQu-1100154 [006] 1634719038.397270: funcgraph_entry:                   |                        update_cfs_group() {
 StorageWriterQu-1100154 [006] 1634719038.397270: funcgraph_entry:        0.194 us   |                          reweight_entity();
 StorageWriterQu-1100154 [006] 1634719038.397270: funcgraph_exit:         0.698 us   |                        }
 StorageWriterQu-1100154 [006] 1634719038.397271: funcgraph_entry:        0.174 us   |                        __enqueue_entity();
 StorageWriterQu-1100154 [006] 1634719038.397271: funcgraph_exit:         3.732 us   |                      }
 StorageWriterQu-1100154 [006] 1634719038.397272: funcgraph_entry:        0.249 us   |                      hrtick_update();
 StorageWriterQu-1100154 [006] 1634719038.397272: funcgraph_exit:         7.731 us   |                    }
 StorageWriterQu-1100154 [006] 1634719038.397272: funcgraph_entry:                   |                    ttwu_do_wakeup() {
 StorageWriterQu-1100154 [006] 1634719038.397272: funcgraph_entry:                   |                      check_preempt_curr() {
 StorageWriterQu-1100154 [006] 1634719038.397273: funcgraph_entry:                   |                        resched_curr() {
 StorageWriterQu-1100154 [006] 1634719038.397273: funcgraph_entry:                   |                          native_smp_send_reschedule() {
 StorageWriterQu-1100154 [006] 1634719038.397273: funcgraph_entry:        3.417 us   |                            x2apic_send_IPI();
 StorageWriterQu-1100154 [006] 1634719038.397277: funcgraph_exit:         3.830 us   |                          }
 StorageWriterQu-1100154 [006] 1634719038.397277: funcgraph_exit:         4.254 us   |                        }
 StorageWriterQu-1100154 [006] 1634719038.397277: funcgraph_exit:         4.756 us   |                      }
 StorageWriterQu-1100154 [006] 1634719038.397277: funcgraph_entry:        0.941 us   |                      __traceiter_sched_wakeup();
 StorageWriterQu-1100154 [006] 1634719038.397279: funcgraph_exit:         6.337 us   |                    }
 StorageWriterQu-1100154 [006] 1634719038.397279: funcgraph_exit:       + 25.545 us  |                  }
 StorageWriterQu-1100154 [006] 1634719038.397279: funcgraph_entry:        0.207 us   |                  __lock_text_start();
 StorageWriterQu-1100154 [006] 1634719038.397279: funcgraph_exit:       + 36.927 us  |                }
 StorageWriterQu-1100154 [006] 1634719038.397280: funcgraph_exit:       + 37.294 us  |              }
 StorageWriterQu-1100154 [006] 1634719038.397280: funcgraph_exit:       + 38.393 us  |            }
 StorageWriterQu-1100154 [006] 1634719038.397280: funcgraph_exit:       + 39.452 us  |          }
 StorageWriterQu-1100154 [006] 1634719038.397280: funcgraph_entry:        0.197 us   |          __lock_text_start();
 StorageWriterQu-1100154 [006] 1634719038.397281: funcgraph_exit:       + 40.621 us  |        }
 StorageWriterQu-1100154 [006] 1634719038.397281: funcgraph_exit:       + 40.974 us  |      }
 StorageWriterQu-1100154 [006] 1634719038.397281: funcgraph_exit:       + 44.034 us  |    }
 StorageWriterQu-1100154 [006] 1634719038.397281: funcgraph_entry:                   |    irq_exit_rcu() {
 StorageWriterQu-1100154 [006] 1634719038.397282: funcgraph_entry:        0.194 us   |      idle_cpu();
 StorageWriterQu-1100154 [006] 1634719038.397282: funcgraph_exit:         0.670 us   |    }
 StorageWriterQu-1100154 [006] 1634719038.397283: funcgraph_exit:       $ 5160753 us |  }

It looks like the kernel is waiting for ZFS to complete the page write to stable storage.

Describe how to reproduce the problem

  1. Start EventStoreDB (20.10.4)
  2. Write data constantly to a stream in the database
  3. Check for SLOW QUEUE messages after 1 day in the EventStoreDB logs

I'm attempting a minimal repro with a C application that repeatedly flushes an 8-byte memory-mapped file. I've seen the issue happen once with it, which confirms that the issue is not specific to EventStoreDB (I'm still trying to get the repro to trigger consistently).

Include any warning/errors/backtraces from the system logs

Nothing in the system logs.
The machine is quite powerful and has plenty of free memory.

shaan1337 added the Type: Defect label Oct 20, 2021
@shaan1337
Contributor Author

shaan1337 commented Oct 27, 2021

After some more investigation, I've made a few changes to the initial issue. It definitely looks related to ZFS, since:

  • The issue doesn't seem to happen on non-zfs file systems
  • The issue happened once with a custom C application (so it's not a problem with EventStoreDB, e.g. garbage collection)
  • The issue also seems to happen on AWS/Azure (we initially thought it was an issue specific to Google Cloud VMs which have a custom kernel)

@rincebrain
Contributor

Could try #12284 and see if it makes life any better.

Could also try twiddling WBT off - there have been issues with huge amounts of IO contention eventually deadlocking with that, and I don't know whether the fix for that made it into stable.

@shaan1337
Contributor Author

@rincebrain thanks!

We're not using an SLOG on these machines, so it doesn't appear that #12284 would apply in our case.
We'll try turning off write-back throttling to see if it fixes the issue.

@shaan1337
Contributor Author

We've turned off WBT (with echo 0 > /sys/block/sdb/queue/wbt_lat_usec) and we can see that the issue still occurs.

@shaan1337
Contributor Author

shaan1337 commented Nov 19, 2021

I've finally got a consistent repro: repro.working.zip

It requires changing the kernel's dirty page writeback settings to make the issue much more reproducible:

sudo sysctl -w "vm.dirty_expire_centisecs=1"
sudo sysctl -w "vm.dirty_background_bytes=1"
sudo sysctl -w "vm.dirty_writeback_centisecs=1"

It appears that the delay occurs when both the application and the kernel are writing back the same dirty page. The relevant part of the code is:

if (PageWriteback(pp)) {

~/repro
[2021-11-18 10:31:31] slow msync: 2303.270000 ms
[2021-11-18 10:31:47] slow msync: 4050.328000 ms
[2021-11-18 10:31:57] slow msync: 2838.579000 ms
[2021-11-18 10:32:02] slow msync: 4380.869000 ms
[2021-11-18 10:32:22] slow msync: 4681.770000 ms
[2021-11-18 10:32:28] slow msync: 2799.192000 ms
[2021-11-18 10:32:33] slow msync: 3692.178000 ms
[2021-11-18 10:32:38] slow msync: 5001.400000 ms
[2021-11-18 10:32:43] slow msync: 3201.233000 ms
...

The page stays stuck in the writeback state until the transaction group containing that write closes. As a workaround, the ZFS transaction group timeout (zfs_txg_timeout) can be reduced, but this will cause more frequent writes to stable storage (set to 1 second below):

echo 1 > /sys/module/zfs/parameters/zfs_txg_timeout

shaan1337 added a commit to shaan1337/zfs that referenced this issue Nov 24, 2021
…teback, do a commit to speed up its completion

If we're doing a writeback with WB_SYNC_NONE, add a small delay to make sure we catch any WB_SYNC_ALL waiters due to a race condition in filemap_write_and_wait_range()

Signed-off-by: Shaan Nobee <sniper111@gmail.com>
Closes openzfs#12662
shaan1337 added a commit to shaan1337/zfs that referenced this issue Nov 25, 2021
…teback, do a commit to speed up its completion

If we're doing a writeback with WB_SYNC_NONE, add a small delay to make sure we catch any WB_SYNC_ALL waiters due to a race condition in filemap_write_and_wait_range()
Introduce a parameter named zfs_page_writeback_no_sync_delay_us to control the small delay. The default value is 0 since even without the artificial delay, the issue occurs much more rarely now.

Signed-off-by: Shaan Nobee <sniper111@gmail.com>
Closes openzfs#12662
shaan1337 added a commit to shaan1337/zfs that referenced this issue Nov 25, 2021
…teback, do a commit to speed up its completion

If we're doing a writeback with WB_SYNC_NONE, add a small delay to make sure we catch any WB_SYNC_ALL waiters due to a race condition in filemap_write_and_wait_range()
Introduce a parameter named zfs_page_writeback_no_sync_delay_us to control the small delay.

Signed-off-by: Shaan Nobee <sniper111@gmail.com>
Closes openzfs#12662
shaan1337 added a commit to shaan1337/zfs that referenced this issue Nov 29, 2021
…teback, do a commit to speed up its completion

Applies to kernel versions >= 4.10.0:
If we're doing a writeback with WB_SYNC_NONE, add a small delay to make sure we catch any WB_SYNC_ALL waiters due to a race condition in filemap_write_and_wait_range()
Introduce a parameter named zfs_page_writeback_no_sync_delay_us to control the small delay.

Signed-off-by: Shaan Nobee <sniper111@gmail.com>
Closes openzfs#12662
behlendorf pushed a commit that referenced this issue May 3, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete 
since they wait for the transaction group to close before being 
committed. This is usually not a problem since the caller does not 
need to wait. However, if we're simultaneously doing a writeback 
with WB_SYNC_ALL (e.g via msync), the latter can block for several 
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE 
writeback since it needs to wait for the transaction to complete 
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts 
  and even completes. But when it's about to check if the PG_writeback 
  bit has been cleared, another writeback with WB_SYNC_NONE starts. 
  The sync page writeback ends up waiting for the non-sync page 
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a 
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up 
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync 
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shaan Nobee <sniper111@gmail.com>
Closes #12662
Closes #12790
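
For illustration, the tracking idea described in the commit message could look roughly like the following userspace sketch (this is not the actual #12790 patch; the struct, the function names and the small demo in main are made up):

/*
 * Illustrative sketch: per-file counters of in-flight sync (WB_SYNC_ALL)
 * and non-sync (WB_SYNC_NONE) page writebacks. A commit is triggered as
 * soon as the two kinds of writeback overlap, instead of letting the
 * sync writeback wait for the transaction group to close.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct wb_state {
    atomic_int sync_writes;   /* WB_SYNC_ALL writebacks in flight */
    atomic_int async_writes;  /* WB_SYNC_NONE writebacks in flight */
};

static void wb_begin(struct wb_state *st, bool sync)
{
    atomic_fetch_add(sync ? &st->sync_writes : &st->async_writes, 1);
}

static void wb_end(struct wb_state *st, bool sync)
{
    atomic_fetch_sub(sync ? &st->sync_writes : &st->async_writes, 1);
}

/* Committing is beneficial when a sync and a non-sync writeback overlap. */
static bool commit_now(struct wb_state *st, bool sync)
{
    if (sync && atomic_load(&st->async_writes) > 0)
        return true;   /* case 2: a non-sync writeback is already active */
    if (!sync && atomic_load(&st->sync_writes) > 0)
        return true;   /* case 1: a sync writeback is already waiting */
    return false;
}

int main(void)
{
    struct wb_state st = { 0 };

    wb_begin(&st, false);   /* kernel flusher thread: WB_SYNC_NONE */
    wb_begin(&st, true);    /* msync path: WB_SYNC_ALL */
    printf("commit now: %d\n", commit_now(&st, true));   /* prints 1 */
    wb_end(&st, true);
    wb_end(&st, false);
    return 0;
}

In the actual fix the commit happens inside ZFS itself, so the WB_SYNC_ALL waiter no longer has to wait for the transaction group to close on its own (up to zfs_txg_timeout).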
nicman23 pushed a commit to nicman23/zfs that referenced this issue Aug 22, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete 
since they wait for the transaction group to close before being 
committed. This is usually not a problem since the caller does not 
need to wait. However, if we're simultaneously doing a writeback 
with WB_SYNC_ALL (e.g via msync), the latter can block for several 
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE 
writeback since it needs to wait for the transaction to complete 
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts 
  and even completes. But when it's about to check if the PG_writeback 
  bit has been cleared, another writeback with WB_SYNC_NONE starts. 
  The sync page writeback ends up waiting for the non-sync page 
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a 
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up 
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync 
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shaan Nobee <sniper111@gmail.com>
Closes openzfs#12662
Closes openzfs#12790
nicman23 pushed a commit to nicman23/zfs that referenced this issue Aug 22, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete 
since they wait for the transaction group to close before being 
committed. This is usually not a problem since the caller does not 
need to wait. However, if we're simultaneously doing a writeback 
with WB_SYNC_ALL (e.g via msync), the latter can block for several 
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE 
writeback since it needs to wait for the transaction to complete 
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts 
  and even completes. But when it's about to check if the PG_writeback 
  bit has been cleared, another writeback with WB_SYNC_NONE starts. 
  The sync page writeback ends up waiting for the non-sync page 
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a 
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up 
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync 
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shaan Nobee <sniper111@gmail.com>
Closes openzfs#12662
Closes openzfs#12790
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Sep 12, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shaan Nobee <sniper111@gmail.com>
Closes openzfs#12662
Closes openzfs#12790
beren12 pushed a commit to beren12/zfs that referenced this issue Sep 19, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete 
since they wait for the transaction group to close before being 
committed. This is usually not a problem since the caller does not 
need to wait. However, if we're simultaneously doing a writeback 
with WB_SYNC_ALL (e.g via msync), the latter can block for several 
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE 
writeback since it needs to wait for the transaction to complete 
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts 
  and even completes. But when it's about to check if the PG_writeback 
  bit has been cleared, another writeback with WB_SYNC_NONE starts. 
  The sync page writeback ends up waiting for the non-sync page 
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a 
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up 
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync 
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shaan Nobee <sniper111@gmail.com>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this issue Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shaan Nobee <sniper111@gmail.com>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this issue Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shaan Nobee <sniper111@gmail.com>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this issue Sep 23, 2022
Page writebacks with WB_SYNC_NONE can take several seconds to complete
since they wait for the transaction group to close before being
committed. This is usually not a problem since the caller does not
need to wait. However, if we're simultaneously doing a writeback
with WB_SYNC_ALL (e.g via msync), the latter can block for several
seconds (up to zfs_txg_timeout) due to the active WB_SYNC_NONE
writeback since it needs to wait for the transaction to complete
and the PG_writeback bit to be cleared.

This commit deals with 2 cases:

- No page writeback is active. A WB_SYNC_ALL page writeback starts
  and even completes. But when it's about to check if the PG_writeback
  bit has been cleared, another writeback with WB_SYNC_NONE starts.
  The sync page writeback ends up waiting for the non-sync page
  writeback to complete.

- A page writeback with WB_SYNC_NONE is already active when a
  WB_SYNC_ALL writeback starts. The WB_SYNC_ALL writeback ends up
  waiting for the WB_SYNC_NONE writeback.

The fix works by carefully keeping track of active sync/non-sync
writebacks and committing when beneficial.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shaan Nobee <sniper111@gmail.com>
Closes openzfs#12662
Closes openzfs#12790
andrewc12 pushed a commit to andrewc12/openzfs that referenced this issue Sep 23, 2022
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Jun 2, 2023
tonyhutter pushed a commit that referenced this issue Jun 5, 2023