[WIP] Prototype for systemd and fstab integration #4943

Closed

Conversation

@tuxoko (Contributor) commented Aug 8, 2016

Recently I have been cooking up some stuff for better systemd and fstab integration. This is what I've got so far.

What does this achieve:

  1. Event-driven zpool import from udev; no more nonsense like waiting for udev-settle.
  2. Configurable selective import via fstab or an explicit systemd service file.
  3. Using fstab for mounting, so that systemd can build proper mount dependencies.

So how does this work?
I created a zready command, which udev fires upon discovering ZFS devices. The zready command reads the label and builds up the vdev tree under /dev/zpool/<pool>/. When it determines the pool is ready to import, it creates a file called /dev/zpool/<pool>/ready. (Note: the current zready is just a hack I came up with. Since I'm not familiar with the label code, it might not handle every case.)
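
For illustration, the udev hook could look roughly like the rule below (a sketch only; the rule file name, the zready path, and the exact match keys are assumptions, not necessarily what this prototype ships):

# 90-zready.rules (hypothetical)
SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="zfs_member", RUN+="/sbin/zready $devnode"

Once zready has seen every vdev of a pool, it drops the ready file and the zpool@<pool>.path unit described below takes over.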

There are two systemd unit files, zpool@.path and zpool@.service. zpool@<pool>.path watches /dev/zpool/<pool>/ready, and zpool@<pool>.service does the import with a dependency on zpool@<pool>.path. Note that you don't need to configure these two unit files at all if you use fstab as described below, but you can still explicitly enable a zpool@.service if you want.
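
For example, to explicitly enable the import service for a hypothetical pool named tank (rather than going through fstab), it would just be:

systemctl enable zpool@tank.service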

To let systemd mount stuff on boot, you only need to add an entry like this to fstab:
rpool/home /home zfs rw,defaults,x-systemd.requires=zpool@rpool.service
This will automatically make the mount depend on the service unit.
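
Behind the scenes, systemd-fstab-generator turns that entry into a mount unit roughly like the following (an illustrative approximation, not a file shipped by this PR):

# home.mount, as generated from the fstab line above (approximate)
[Unit]
Requires=zpool@rpool.service
After=zpool@rpool.service

[Mount]
What=rpool/home
Where=/home
Type=zfs
Options=rw,defaults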

For using ZFS as root, each distribution probably has its own way to do it. But if you use systemd in the initrd, you can add root=rpool/root rootfstype=zfs rootflags=x-systemd.requires=zpool@rpool.service to the kernel cmdline, and systemd will automatically build the dependency just as it does with fstab.


Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
@behlendorf (Contributor) left a comment:

This is really cool and I think it's a promising approach for better integration with systemd and fstab. Just a few questions.

  • What cleans up the /dev/zpool directory if the system crashes?
  • How are the ready files removed when devices are removed from the system?

#define ZPOOL_PATH "/dev/zpool"

static nvlist_t *
get_config(const char *dev)
@behlendorf (Contributor) commented Sep 21, 2016:

The existing zpool_read_label() function already does most of this and is part of libzfs. Rather than have two versions of this function I'd suggest extending zpool_read_label() as needed. The interfaces are slightly different but not really in a significant way.
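
As a rough sketch of that suggestion (the three-argument zpool_read_label() signature used here is an assumption about the libzfs of this era and may differ), get_config() could shrink to something like:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <libzfs.h>

/*
 * Sketch only: read the vdev label via libzfs instead of carrying a local
 * copy of the label-reading logic.
 */
static nvlist_t *
get_config(const char *dev)
{
	nvlist_t *config = NULL;
	int num_labels = 0;
	int fd;

	if ((fd = open(dev, O_RDONLY)) < 0)
		return (NULL);
	if (zpool_read_label(fd, &config, &num_labels) != 0)
		config = NULL;
	(void) close(fd);
	return (config);	/* NULL when no valid label was found */
}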

	sprintf(dir, "%llu", (u_longlong_t)vguid);
	if (mkdir(dir, 0755) < 0 && errno != EEXIST) {
		fprintf(stderr, "failed to mkdir %s: %s\n", dir, strerror(errno));
		exit(1);
Contributor:

I know it's a prototype but best to return (1); and return error from main().
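
A minimal sketch of that change, propagating the error instead of exiting on the spot:

	if (mkdir(dir, 0755) < 0 && errno != EEXIST) {
		fprintf(stderr, "failed to mkdir %s: %s\n", dir, strerror(errno));
		return (1);	/* main() then returns this error to the caller */
	}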


[Path]
PathExists=/dev/zpool/%i/ready
Unit=zpool@%i.service
Contributor:

These templates are pretty clever.
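
To spell out the substitution: for a hypothetical pool named tank, zpool@tank.path effectively becomes

[Path]
PathExists=/dev/zpool/tank/ready
Unit=zpool@tank.service

so creating /dev/zpool/tank/ready is what starts zpool@tank.service.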

RemainAfterExit=yes
ExecStartPre=/sbin/modprobe zfs
ExecStart=@sbindir@/zpool import -EN %i
ExecStop=@sbindir@/zpool export %i
Contributor:

We probably don't want the export.

ExecStop=@sbindir@/zpool export %i

[Install]
WantedBy=local-fs.target
Contributor:

If you added WantedBy=zfs-mount.service and renamed this to zfs-import-udev@.service.in then this could be a third pool import option. Users would then just need to disable the other zfs-* services if they didn't want to use them. This would allow the same import service to be used for both use cases, which would be nice.
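
A minimal sketch of the suggested change, with the rest of the unit left as above:

# zfs-import-udev@.service.in
[Install]
WantedBy=zfs-mount.service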

@@ -2221,6 +2222,9 @@ zpool_do_import(int argc, char **argv)
	case 'D':
		do_destroyed = B_TRUE;
		break;
	case 'E':
		do_imported = B_TRUE;
		break;
Contributor:

You'll want to add this option to the man page and comment above. I'm surprised to see E was already part of the getopt string above.

err = 0;
log_history = B_FALSE;
goto exit_early;
}
Contributor:

What about the other error cases below?

@tuxoko (Contributor, Author) commented Sep 21, 2016

@behlendorf
I think most systems have devtmpfs for /dev, so it should get cleaned up between boots. This is also a requirement, because we need /dev to stay the same between the initrd and the root filesystem.

The whole detection and ready logic is quite hacky. I'm hoping libzfs can already do this sort of thing, but since I'm not familiar with it, I just came up with this prototype mainly to show that we can integrate well with systemd.

@rlaager rlaager changed the title [RFC] Prototype for systemd and fstab integration [WIP] Prototype for systemd and fstab integration Sep 27, 2016
@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Sep 30, 2016
@mailinglists35 commented:

Is this ready for testing? I'd like to try it out.

@tuxoko (Contributor, Author) commented Jan 2, 2017

@mailinglists35
You can try it out; it should work for a simple pool. But the whole detection code is a quick hack that probably needs to be rewritten, and there may be corner cases that won't work. I did this mainly for demonstration purposes, but if people find it interesting, I'll find time to work on it again.

@rlaager (Member) commented Jan 6, 2017

What are the criteria for a pool being "ready"? For example, if I have a 6-disk raidz2 pool, is it ready when 4 devices arrive, when 6 devices arrive, or something else? The behavior I think is desired is: once 4 devices are ready, start a timer; if the timer expires or all 6 devices are ready, import the pool.

In terms of this being production-ready, it would definitely need to handle corner cases like duplicate pool names (e.g. old devices), etc. I haven't reviewed the code, so I'm not saying it doesn't do that now.

Am I correct in understanding that this always imports every available pool? How would I handle a system where I only want to import certain pools?

I'd personally like to mount all the filesystems upon pool import. I'd rather not have to add everything to /etc/fstab. Is there a clean way to do that?

@Mic92 (Contributor) commented Jan 18, 2017

I think relying on /dev being a devtmpfs is a sane choice. Systemd and udev already make this assumption.

@intelfx commented Jan 21, 2017

The zready command reads the label and builds up the vdev tree under /dev/zpool/<pool>/. When it determines the pool is ready to import, it creates a file called /dev/zpool/<pool>/ready. (Note: the current zready is just a hack I came up with. Since I'm not familiar with the label code, it might not handle every case.)

Nice. So that's how you solved the hierarchical zpool naming problem.

@intelfx commented Jan 21, 2017

@rlaager

I'd personally like to mount all the filesystems upon pool import. I'd rather not have to add everything to /etc/fstab. Is there a clean way to do that?

If zfs has code to mount filesystems upon import automatically (I recall there was something like that, though I don't use zfs anymore), then you could just pull in zpool@<zpool>.service from your local-fs.target.
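
For instance, with a hypothetical pool named tank, that pull-in is a one-liner:

systemctl add-wants local-fs.target zpool@tank.service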

@intelfx commented Jan 21, 2017

@rlaager

Am I correct in understanding that this always imports every available pool? How would I handle a system where I only want to import certain pools?


@tuxoko, this is something to consider. Not everyone wants their pools to be imported automatically (the opposite is also true though — in simple use cases it could be useful).

@tuxoko (Contributor, Author) commented Feb 2, 2017

@rlaager

What are the criteria for a pool being "ready"? For example, if I have a 6-disk raidz2 pool, is it ready when 4 devices arrive, when 6 devices arrive, or something else? The behavior I think is desired is: once 4 devices are ready, start a timer; if the timer expires or all 6 devices are ready, import the pool.

Currently it waits for all devices to come up. I do want to have some kind of configurable timeout-and-continue behavior, but I haven't figured out the best way to do it.

In terms of this being production-ready, it would definitely need to handle corner cases like duplicate pool names

Yes, currently when zready finds the same pool name with a different pool GUID, it errors out. I'm not sure if this is good enough. Feel free to make suggestions.

Am I correct in understanding that this always imports every available pool? How would I handle a system where I only want to import certain pools?

No. As I mentioned in the description:
2. Configurable selective import via fstab or an explicit systemd service file.
It doesn't import automatically; a pool is imported only if you enable its zpool@.service or specify it in fstab.

I'd personally like to mount all the filesystems upon pool import. I'd rather not have to add everything to /etc/fstab. Is there a clean way to do that?

We can add a helper script to make it easy to add or remove entries in fstab, or perhaps use a generator to generate systemd mount units. The point is that we want to let systemd handle the mounting, because systemd handles mounting much better than ZFS does.
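
A rough sketch of such a helper (everything here is illustrative, including the pool name; it is not part of this PR):

#!/bin/sh
# Emit fstab lines for a pool's mountable datasets, each depending on the
# zpool@<pool>.service import unit via x-systemd.requires.
pool="tank"	# hypothetical pool name

zfs list -H -r -t filesystem -o name,mountpoint "$pool" |
while read -r name mountpoint; do
	case "$mountpoint" in
	none|legacy|-) continue ;;
	esac
	printf '%s %s zfs rw,defaults,x-systemd.requires=zpool@%s.service\n' \
	    "$name" "$mountpoint" "$pool"
done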

@pdf commented Feb 8, 2017

I can definitely understand wanting better integration with systemd, but going via fstab to get there doesn't seem like an optimal approach. Is there some other reasoning for wanting fstab integration that I'm not grokking, or is this primarily for targeting systemd?

What happens when fstab disagrees with the ZFS mountpoint?

@intelfx commented Feb 8, 2017

@pdf

going via fstab to get there doesn't seem like an optimal approach

Why? Either you are going to generate systemd units for mountpoints directly or you use fstab. There is no other way to have support for hotplug and dependency tracking, unless you reimplement it from scratch.

@intelfx commented Feb 8, 2017

@pdf

If I understand correctly, the "ZFS mountpoints" are applied by zfs itself at import time. If that's what you are talking about, these mountpoints are not known to systemd before they appear, so for example they cannot be waited for or triggered via a unit's dependencies. That's the first side of the problem.

The second side of the problem is depending on a particular pool and its devices. systemd can wait for a device to appear before calling mount(8), or it can shut down the mountpoint and all services depending on it if a device disappears, but for that it needs to know the mapping between a mountpoint and a real block device. If that mapping is provided, it waits for a device node to appear and then waits for a udev rule to say "ready" on that device. This is used in btrfs — any device node belonging to a filesystem is marked "not ready" unless all device nodes are present.
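
For reference, the btrfs mechanism is a udev rule plus a udev builtin; abridged and paraphrased from systemd's 64-btrfs.rules, so treat the exact matching as approximate:

SUBSYSTEM=="block", ACTION!="remove", ENV{ID_FS_TYPE}=="btrfs", IMPORT{builtin}="btrfs ready $devnode"
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

The builtin reports whether all member devices of the filesystem have been seen, and SYSTEMD_READY=0 keeps the corresponding .device unit from being considered ready until they have.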

In zfs, this is a tougher problem because we mount pools, not devices.

@pdf commented Feb 8, 2017

@intelfx wrote:

Either you are going to generate systemd units for mountpoints directly or you use fstab.

The former requires no user action, supports the standard ZFS workflow dynamically, and avoids requiring double-entry, and potential confusion if the mountpoint ZFS property gets out of sync with fstab.

these mountpoints are not known to systemd before they appear, so for example they cannot be waited for or triggered via a unit's dependencies.

Seems like this should be doable if we can ensure that pool imports occur early enough and block until the .mount units corresponding to the enabled mountpoints are generated; after that, standard systemd dependency management should work as expected. I'm not certain how feasible this is though.

The second side of the problem is depending on a particular pool and its devices. systemd can wait for a device to appear before calling mount(8), or it can shutdown the mountpoint and all services depending on it if a device disappears, but for that it needs to know the mapping between a mountpoint and a real block device.

It doesn't have to be a real block device though, right? This PR appears to be using a newly implemented device node for this purpose. So, is your only issue here determining which pools should be imported? Could a generator for pools be used here? So, if a user wants to disable importing a pool, they just disable the service? If the service default is disabled, there will be no surprise imports on boot either, though this does require the user to enable any pools they do want imported on boot.

@tuxoko (Contributor, Author) commented Feb 8, 2017

The problem with using a generator is that you can't specify when generators should fire; all generators fire early on, and you would need to have your pool imported by that time. Guaranteeing that would mean forcing the import in the initrd. That wouldn't be an issue if you have a root zpool and only one pool, but for everyone else you'd be delaying their boot for no reason.

Another problem is that the generator would need to set up proper dependencies with regard to the other, non-ZFS mountpoints in fstab. But why bother doing that when systemd already does it for you if you use fstab?

The former requires no user action, supports the standard ZFS workflow dynamically, and avoids requiring double-entry, and potential confusion if the mountpoint ZFS property gets out of sync with fstab.

This is not really an issue; we can just have a script to generate fstab. The current way also requires the user to enable zfs-mount.service, so the burden on the user is more or less the same. We could even hook up the script so that a new fstab is generated automatically whenever you change a mountpoint, but I'm not sure that's a good idea.

@intelfx commented Feb 8, 2017

@pdf

The former requires no user action, supports the standard ZFS workflow dynamically, and avoids requiring double-entry, and potential confusion if the mountpoint ZFS property gets out of sync with fstab.

Yes, except that it won't actually work. See below.

Seems like this should be doable if we can ensure that pool imports occur early enough and block until the .mount units corresponding to the enabled mountpoints are generated; after that, standard systemd dependency management should work as expected. I'm not certain how feasible this is though.

early enough

This is a big "no". This is inherently racy, and this is something I wanted to get rid of in the first place when I filed issue #4178. We can't rely on something being imported "early enough", because buses can be slow, controllers can be hotplugged, USB disks added and removed at runtime, and so on.

It doesn't have to be a real block device though, right? This PR appears to be using a newly implemented device node for this purpose. So, is your only issue here determining which pools should be imported? Could a generator for pools be used here? So, if a user wants to disable importing a pool, they just disable the service? If the service default is disabled, there will be no surprise imports on boot either, though this does require the user to enable any pools they do want imported on boot.

It does have to be a real block device — something to match in udev. This is not the only way though: this PR requires a user to add a specific parameter in fstab that makes systemd add an artificial dependency on a service that waits for a pool and imports it (instead of adding a "natural" dependency on a block device).

@pdf commented Feb 9, 2017

@tuxoko

The problem with using a generator is that you can't specify when generators should fire; all generators fire early on, and you would need to have your pool imported by that time. Guaranteeing that would mean forcing the import in the initrd. That wouldn't be an issue if you have a root zpool and only one pool, but for everyone else you'd be delaying their boot for no reason.

I'm not sure this is actually a terrible thing. For users who want to optimise boot times by having some pools not imported during early boot (I suspect this is the minority), a kernel parameter could be used, as with other generators. Filesystems are going to need to mount before other services are started in any case, so the total time to boot is likely not changed if all pools have filesystems that will be mounted, but I can see other possible issues with trying to do this in initrd.

Would it make sense, as an alternative to trying to discover all available pools early (I can see how this may be problematic), to use a cache file similar to Illumos?

I will say that I'm not certain this generator approach is actually optimal, but I figure it's worth having the discussion now, rather than after code gets merged.

Another problem is that the generator would need to set up proper dependencies with regard to the other, non-ZFS mountpoints in fstab. But why bother doing that when systemd already does it for you if you use fstab?

I think you'll find that systemd already does this for you if you've generated the mount units - they don't explicitly specify dependent filesystems, just a dependent device, and where the mountpoint is.

This is not really an issue; we can just have a script to generate fstab. The current way also requires the user to enable zfs-mount.service, so the burden on the user is more or less the same. We could even hook up the script so that a new fstab is generated automatically whenever you change a mountpoint, but I'm not sure that's a good idea.

I'm not entirely opposed to this if it is the best we can do, but I think there are some issues to consider here. What happens if the user edits fstab? Do their changes get clobbered, or does the script stop working on conflict? In the latter case, what is the behaviour when fstab disagrees with the ZFS mountpoint? Which source wins? What about other ZFS props that interact with mounting (e.g. canmount vs. auto/noauto)?

We lose the single source of truth (currently the ZFS props), unless the generator script runs automatically, and always wins, but I guarantee this will cause confusion.

@intelfx

This is a big "no". This is inherently racy, and this is something I wanted to get rid of in the first place when I filed issue #4178. We can't rely on something being imported "early enough", because buses can be slow, controllers can be hotplugged, USB disks added and removed at runtime, and so on.

I think there are two scenarios to consider here - mountpoints required at boot (that's essentially everything in fstab, unless noauto, and see above for canmount conflict), and hot-plugged pools. My comment about 'early enough' is related to the former - imports need to happen in a timely fashion for mountpoint introspection, to allow datasets to be mounted for local-fs.target, and for proper mount unit ordering. That's not to say udev triggers would not be involved, and I suspect we could generate new units on hotplug.

It does have to be a real block device — something to match in udev.

You specifically mention in #4178 that it does not have to be a real block device, which is what I was getting at.

@tuxoko (Contributor, Author) commented Feb 9, 2017

I'm not sure this is actually a terrible thing.

It is terrible, because you don't know how many pools there are to import during the initrd. You could use an additional config file, or worse, a cache file like you said, but that has already proven to be painful.

@seschwar commented May 5, 2017

@tuxoko wrote:

The problem with using a generator is that you can't specify when generators should fire (...)

systemd.generator(7) states:

systemd(1) will execute those binaries very early at bootup and at configuration reload time [emphasis added] — (...)

So if systemctl daemon-reload gets run after a successful zpool import, messing with fstab could be avoided.
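
As a sketch of that idea (illustrative only; the drop-in path and systemctl location are assumptions, and this is not something the PR implements), the import service itself could trigger the reload:

# /etc/systemd/system/zpool@.service.d/reload.conf (hypothetical drop-in)
[Service]
ExecStartPost=/usr/bin/systemctl daemon-reload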

@behlendorf (Contributor) commented:

Closed in favor of #7329 which has been merged.

@behlendorf behlendorf closed this Apr 6, 2018
@aerusso (Contributor) commented Apr 7, 2018

The zready infrastructure here, which deals with zpool import, is complementary to #7329, which deals with dataset mounting. In fact, #7329 was partially born out of a desire to satisfy the last ingredient required for this PR:

To let systemd mount stuff on boot, you only need to add an entry like this to fstab:
rpool/home /home zfs rw,defaults,x-systemd.requires=zpool@rpool.service
This will automatically make the mount depend on the service unit.

@behlendorf (Contributor) commented:

My mistake, reopening. Although it probably makes sense to open a new PR with a refreshed version of zready.

@Baughn commented Jun 19, 2018

So considering what I'm working on for NixOS, I should probably ask: Is there a rough timeline for when this may be merged and released?

@behlendorf behlendorf added Status: Inactive Not being actively updated and removed Status: Work in Progress Not yet ready for general review labels Sep 25, 2018
@behlendorf (Contributor) commented:

@tuxoko @aerusso I'd love to see this make the 0.8 release. Any chance we can get this over the finish line?

@darkbasic commented:

Agreed, I'd love it if this could make it in time for 0.8 :)

@ahrens ahrens added the Status: Work in Progress Not yet ready for general review label Sep 27, 2018
@@ -2443,12 +2452,13 @@ zpool_do_import(int argc, char **argv)
}

if (err == 1) {
exit_early:
Contributor:

Looks like a memory leak here?

@behlendorf (Contributor) commented:

Closing due to inactivity. If someone has the inclination to pick up this work, please open a new PR.
