
improve batching done in zil_commit() #6337

Closed
wants to merge 1 commit into from

Conversation

@prakashsurya
Member

prakashsurya commented Jul 11, 2017

This is a first attempt at porting the OpenZFS pull request here:

 - openzfs/openzfs#421

This hasn't seen much testing apart from verifying that it compiles in an
Ubuntu VM and doesn't panic under a very light sync write
workload with fio. I'll have to rely on the automated testing and
volunteers to verify the correctness and performance; if needed, I can
try to allocate some time to do this myself, but I doubt I'll have the
time to do that in the near future.

One big difference between this version and the version posted to
OpenZFS is how the lifetime of an itx_t structure differs between the
codebases. In this Linux port, the itx_t structure maintains an
itx_callback field, and this callback needs to be called after the itx
has been committed to disk. Thus, the itx_t must wait to be destroyed
until it's been committed to disk. This is in contrast to how the
itx_t is immediately destroyed once it's committed to an lwb in the
OpenZFS codebase, but prior to the lwb being committed to disk.
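
For context, the Linux itx_t carries the callback fields mentioned above; this is an abridged sketch, not the full definition:

typedef struct itx {
        /* ... other fields ... */
        zil_callback_t  itx_callback;       /* called once the itx is safe on disk */
        void            *itx_callback_data; /* argument passed to the callback */
        /* ... other fields ... */
} itx_t;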

To adapt the new algorithm to this itx_callback requirement, an
lwb_itxs list was added to the lwb_t structure. Each time an itx is
committed to an lwb, it is added to the lwb's lwb_itxs list. This list
is used to keep track of the itxs such that their callback can be called
after the itx is safe on disk, and only then will the itx be destroyed.
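
To make this concrete, here's a rough sketch of the idea, simplified from the description above rather than copied from the actual diff (zil_lwb_commit_itx and zil_lwb_flush_done are placeholder names):

typedef struct lwb {
        /* ... existing lwb_t fields ... */
        list_t  lwb_itxs;       /* itx_t's committed to this lwb */
} lwb_t;

/* When an itx is committed to an lwb, track it rather than freeing it. */
static void
zil_lwb_commit_itx(lwb_t *lwb, itx_t *itx)
{
        list_insert_tail(&lwb->lwb_itxs, itx);
}

/* Once the lwb is safe on disk, fire the callbacks and destroy the itxs. */
static void
zil_lwb_flush_done(lwb_t *lwb)
{
        itx_t *itx;

        while ((itx = list_remove_head(&lwb->lwb_itxs)) != NULL) {
                if (itx->itx_callback != NULL)
                        itx->itx_callback(itx->itx_callback_data);
                zil_itx_destroy(itx);
        }
}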

Additionally, this attempts to address the same deficiencies that
Richard Yao is addressing in the following pull requests:

 - #6133
 - openzfs/openzfs#404

It'd be great to perform some head-to-head comparisons of this with
Richard's work, and see which parts of each design work best.

Signed-off-by: Prakash Surya <prakash.surya@delphix.com>

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@behlendorf
Contributor

behlendorf commented Jul 11, 2017

It would be great if you could update the commit message to reference the SPL PR. With that hint the buildbot will be able to build and test this PR.

Requires-spl: refs/pull/627/head

[edit] Oh you did! Nevermind!

@prakashsurya
Member Author

Yup, I added that directive. I'm not quite sure how to interpret the test failures yet (although I haven't spent much time trying); is there any documentation to help me out with that?

@behlendorf
Contributor

The console and ztest logs are my favorite place to start if the test suite doesn't run to completion. We're currently not collecting cores, but a stack trace is often more than enough. We've added a little bit of documentation describing how to run the same tests locally, but it'd be nice to add more.

[ 2562.047197] VERIFY(!list_is_empty(list)) failed
[ 2562.050654] PANIC at list.h:126:list_remove()
[ 2562.054008] Showing stack for process 24511
[ 2562.054013] CPU: 1 PID: 24511 Comm: z_null_int Tainted: P           OE   4.9.2-kmemleak #1
[ 2562.054014] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
[ 2562.054017]  ffff93b70b063b50 ffffffffad619156 ffffffffc0e749ff 000000000000007e
[ 2562.054023]  ffff93b70b063b60 ffffffffc0a6a802 ffff93b70b063ce8 ffffffffc0a6aad9
[ 2562.054029]  ffff8828b3158000 ffff882900000028 ffff93b70b063cf8 ffff93b70b063c98
[ 2562.054035] Call Trace:
[ 2562.054044]  [<ffffffffad619156>] dump_stack+0xb0/0xea
[ 2562.054065]  [<ffffffffc0a6a802>] spl_dumpstack+0x62/0x70 [spl]
[ 2562.054080]  [<ffffffffc0a6aad9>] spl_panic+0xd9/0x120 [spl]
[ 2562.054087]  [<ffffffffad21d7e7>] ? stack_trace_call+0x67/0xb0
[ 2562.054093]  [<ffffffffadc9e387>] ? ftrace_call+0x5/0x34
[ 2562.054098]  [<ffffffffad34c27a>] ? __delete_object+0x5a/0xb0
[ 2562.054116]  [<ffffffffc0a6aa05>] ? spl_panic+0x5/0x120 [spl]
[ 2562.054274]  [<ffffffffc0d4eced>] zil_lwb_flush_vdevs_done+0x22d/0x640 [zfs]
[ 2562.054402]  [<ffffffffc0d65dd6>] zio_done+0xa16/0x2310 [zfs]
[ 2562.054533]  [<ffffffffc0d653c5>] ? zio_done+0x5/0x2310 [zfs]
[ 2562.054663]  [<ffffffffc0d59749>] zio_execute+0x149/0x3f0 [zfs]
[ 2562.054680]  [<ffffffffc0a667ed>] taskq_thread+0x36d/0x760 [spl]
[ 2562.054687]  [<ffffffffad10bc80>] ? try_to_wake_up+0x680/0x680
[ 2562.054701]  [<ffffffffc0a66480>] ? taskq_thread_spawn+0x80/0x80 [spl]
[ 2562.054707]  [<ffffffffad0fabd9>] kthread+0x119/0x150
[ 2562.054713]  [<ffffffffad0faac0>] ? kthread_flush_work+0x1f0/0x1f0
[ 2562.054717]  [<ffffffffadc9ccf5>] ret_from_fork+0x25/0x30

@prakashsurya
Member Author

Perfect, thanks!

@prakashsurya
Member Author

Any clues about these build failures:

In file included from /usr/src/spl-0.7.0/include/sys/sysmacros.h:31:0,
                 from /tmp/zfs-build-buildbot-DH81uCAS/BUILD/zfs-kmod-0.7.0/_kmod_build_3.13.0-117-generic/../zfs-0.7.0/module/zfs/zio.c:27:
/tmp/zfs-build-buildbot-DH81uCAS/BUILD/zfs-kmod-0.7.0/_kmod_build_3.13.0-117-generic/../zfs-0.7.0/module/zfs/zio.c: In function 'zio_shrink':
/usr/src/spl-0.7.0/include/sys/debug.h:76:64: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
 #define VERIFY3P(x,y,z) VERIFY3_IMPL(x, y, z, uintptr_t, "%p", (void *))
                                                                ^
/usr/src/spl-0.7.0/include/sys/debug.h:70:6: note: in definition of macro 'VERIFY3_IMPL'
      CAST (LEFT), CAST (RIGHT)))
      ^
/usr/src/spl-0.7.0/include/sys/debug.h:113:26: note: in expansion of macro 'VERIFY3P'
 #define ASSERT3P(x,y,z)  VERIFY3P(x, y, z)
                          ^
/tmp/zfs-build-buildbot-DH81uCAS/BUILD/zfs-kmod-0.7.0/_kmod_build_3.13.0-117-generic/../zfs-0.7.0/module/zfs/zio.c:1163:2: note: in expansion of macro 'ASSERT3P'
  ASSERT3P(zio->io_orig_size, ==, zio->io_size);
  ^
/usr/src/spl-0.7.0/include/sys/debug.h:76:64: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
 #define VERIFY3P(x,y,z) VERIFY3_IMPL(x, y, z, uintptr_t, "%p", (void *))
                                                                ^
/usr/src/spl-0.7.0/include/sys/debug.h:70:19: note: in definition of macro 'VERIFY3_IMPL'
      CAST (LEFT), CAST (RIGHT)))
                   ^
/usr/src/spl-0.7.0/include/sys/debug.h:113:26: note: in expansion of macro 'VERIFY3P'
 #define ASSERT3P(x,y,z)  VERIFY3P(x, y, z)
                          ^
/tmp/zfs-build-buildbot-DH81uCAS/BUILD/zfs-kmod-0.7.0/_kmod_build_3.13.0-117-generic/../zfs-0.7.0/module/zfs/zio.c:1163:2: note: in expansion of macro 'ASSERT3P'
  ASSERT3P(zio->io_orig_size, ==, zio->io_size);
  ^
cc1: all warnings being treated as errors

?

I don't see these when I build locally, and I didn't modify the ASSERT3P macro, so I'm confused as to how my change might be causing these. Are these build failures normal and/or safe to ignore?

module/zfs/zio.c Outdated
ASSERT(zio->io_orig_size == zio->io_size);
ASSERT(size <= zio->io_size);
ASSERT3P(zio->io_executor, ==, NULL);
ASSERT3P(zio->io_orig_size, ==, zio->io_size);
Contributor

This should be ASSERT3U. The build failure you're seeing is coming from a 32-bit system, where the pointer size is 32-bit and not 64-bit. The ASSERT3P macro forces a cast of a uint64_t to a uintptr_t, which is only 32-bit on the failing architectures. It looks like ASSERT3P is used incorrectly in zio_wait() too, and maybe elsewhere.
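
That is, the size comparison presumably wants the unsigned-integer variant, along these lines (the pointer comparison against NULL can stay ASSERT3P):

ASSERT3P(zio->io_executor, ==, NULL);
ASSERT3U(zio->io_orig_size, ==, zio->io_size);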

Member Author

Thanks! I should have caught that myself; sorry for the trouble.

prakashsurya mentioned this pull request Aug 19, 2017
@prakashsurya
Member Author

prakashsurya commented Aug 22, 2017

@behlendorf this is causing a hang when running the mmap_read_001_pos test case, due to readmmap hanging when it calls msync; I can easily reproduce it in my VM:

delphix@ubuntu:~$ pstree -p 2337
test-runner.py(2337)-+-sudo(3183)---mmap_read_001_p(3185)---readmmap(3198)
                     `-{test-runner.py}(3184)
delphix@ubuntu:~$ sudo cat /proc/3198/stack
[<ffffffff9bdafb68>] wait_on_page_bit_common+0x118/0x1d0
[<ffffffff9bdafd34>] __filemap_fdatawait_range+0x114/0x190
[<ffffffff9bdafdc4>] filemap_fdatawait_range+0x14/0x30
[<ffffffff9bdb2477>] filemap_write_and_wait_range+0x57/0x90
[<ffffffffc08f049d>] zpl_fsync+0x3d/0x110 [zfs]
[<ffffffff9be7b93b>] vfs_fsync_range+0x4b/0xb0
[<ffffffff9bdf6af2>] SyS_msync+0x182/0x200
[<ffffffff9c4d453b>] entry_SYSCALL_64_fastpath+0x1e/0xad
[<ffffffffffffffff>] 0xffffffffffffffff

Do you have any clues as to what might be going on? I'm not familiar with the mmap code; are there Linux-specific changes that I may have broken in this patch? I'll have to do some reading to better understand how filemap_write_and_wait_range is supposed to work...
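
For reference, the failing path boils down to an mmap'd write followed by msync(MS_SYNC). A minimal sketch of that sequence (not the actual readmmap.c; it assumes the test file already exists and is at least a page long):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
        size_t len = 4096;
        int fd = open("/var/tmp/testdir/test-read-file", O_RDWR);
        if (fd < 0)
                return (1);

        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED)
                return (1);

        buf[313] = 1;   /* dirty a page in the mapping */

        /*
         * MS_SYNC ends up in zpl_fsync() -> filemap_write_and_wait_range(),
         * which is where the hang above is observed.
         */
        if (msync(buf, len, MS_SYNC) != 0)
                return (1);

        (void) munmap(buf, len);
        (void) close(fd);
        return (0);
}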

FWIW, I can run the readmmap binary outside of the test framework (with the test process still stuck) and it appears to work:

delphix@ubuntu:~$ sudo ./tests/zfs-tests/cmd/readmmap/readmmap /var/tmp/testdir/test-read-file
good data from read: buf[313]=1

@behlendorf
Contributor

Do you have any clues as to what might be going on?

I'm not sure of the details, but I've got a good guess where to start looking.

The stack you posted is what I'd expect to happen if for some reason the zil_callback_t passed to zfs_log_write() didn't run on each page. filemap_write_and_wait_range() is waiting to be notified that each page is safe on disk. The callback (zfs_putpage_commit_cb) is responsible for that notification by ending the writeback on each dirty page and marking it clean.

I haven't had a chance to carefully go over the patch, but is it possible there's a case that was missed where the callback doesn't run?

For background there's a nice description in commit 119a394 regarding why this was needed for Linux.
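
For reference, that commit callback is tiny; roughly (paraphrasing from memory, not from this patch) it just clears any error and ends writeback on the page that zfs_putpage() marked:

/* itx callback registered by zfs_putpage() for each page logged to the ZIL */
static void
zfs_putpage_commit_cb(void *arg)
{
        struct page *pp = arg;

        ClearPageError(pp);
        end_page_writeback(pp);
}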

@prakashsurya
Member Author

Thanks, Brian. That's basically what I was thinking too, so it's good to get a second opinion. I'll revisit the code and see if I botched something. OpenZFS doesn't have the "itx callback", so I wouldn't be surprised if I got that subtly wrong when pulling this code over to Linux.
