Proposal: containerized mount utilities in pods #589

Merged: 1 commit merged into kubernetes:master on Nov 9, 2017

Conversation

jsafrane
Member

@kubernetes/sig-storage-proposals @kubernetes/sig-node-proposals

@k8s-ci-robot added the `cncf-cla: yes` label (Indicates the PR's author has signed the CNCF CLA.) on Apr 28, 2017
### Controller
* There will be new parameters to kube-controller-manager and kubelet:
  * `--experimental-mount-namespace`, which specifies a dedicated namespace where all pods with mount utilities reside.
  * `--experimental-mount-plugins`, which contains a comma-separated list of all volume plugins that should run their utilities in pods instead of on the host. The list must also include all flex volume drivers. Without this option, controllers and kubelet would not know whether a plugin should use a pod with mount utilities or the host directly, especially on startup when the daemon set may not yet be fully deployed on all nodes.
* If a volume plugin is on that list, the controller finds a pod in the dedicated namespace with label `mount.kubernetes.io/foo=true` (or `mount.kubernetes.io/flexvolume/foo=true` for flex volumes) and calls the volume plugin with `VolumeExec` pointing to the pod. All utilities that the volume plugin executes for attach/detach/provision/delete run in that pod as `kubectl exec <pod> xxx` (of course, we'll use the clientset interface instead of executing `kubectl`).
Member Author

I admit I don't like this experimental-mount-plugins cmdline option. Can anyone find a bulletproof way for kubelet / a controller to reliably determine whether a volume plugin should execute its utilities on the host or in a pod? Especially when kubelet starts in a fresh cluster, the pod with mount utilities may not be running yet, and kubelet must know whether it should wait for it or not. Kubelet must not try to run the utilities on the host, because the utilities there may be the wrong version or wrongly configured.

## User story
Admin wants to run Kubernetes on a distro that does not ship `mount.glusterfs`, which is needed for GlusterFS volumes.
1. Admin installs Kubernetes in any way.
2. Admin modifies the controller-manager and kubelet command lines to include `--experimental-mount-namespace=foo` and `--experimental-mount-plugins=kubernetes.io/glusterfs` so Kubernetes knows which volume plugins should use utilities in pods and in which namespace to find these pods. This should probably be part of the installation in the future.
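Purely as an illustration of step 2, the kubelet invocation might end up looking roughly like this (how the command line is managed and the rest of the flags are assumptions, not part of the proposal):

```sh
# hypothetical kubelet command line with the proposed flags added;
# all other existing kubelet flags stay unchanged
kubelet \
  --experimental-mount-namespace=foo \
  --experimental-mount-plugins=kubernetes.io/glusterfs
```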
Contributor

Why not leave discovery of the appropriate mount plugins as a vendor-specific requirement? Kubelet execs a script or binary that knows which container or service to talk to for each type of storage.

Member Author

You then need to deploy the script to all nodes and masters, and that's exactly what I'd like to avoid. Otherwise I could deploy the mount utilities directly there, right? I see GCI, Atomic Host and CoreOS as mostly immutable images with some configuration in /etc that just starts Kubernetes with the right options (and even that is complicated enough!)

Contributor

All these container-optimized distros do have some writable stateful partitions. That would be necessary for other parts of the system like CNI. How does this align with CSI?

Member Author

CSI does not dictate any specific way its drivers will run. @saad-ali expects they will run as a DaemonSet.

With the CNI approach (a script in /opt/cni/bin), we would need a way to deploy it on the master. This is OK for OpenShift, but would it be fine for GKE, where users do not have access to masters and so can't install an attach/detach/provision/delete script for their favorite storage? And how would it talk to Kubernetes to find the right pod in which to do the attaching/provisioning?

Contributor

Why does GKE need to support this on the masters? User pods will not be scheduled to the masters and they would not need to have the binaries installed.

Member Author

The PV controller on the master needs a way to execute Ceph utilities when creating a volume, and the attach/detach controller uses the same utilities to attach/detach the volume. Now it uses plain exec. When we move the Ceph utilities from the master somewhere else, we need to tell controller-manager where the utilities are.

* User creates a pod that uses a GlusterFS volume. Kubelet finds a sidecar template for gluster, injects it into the pod and runs it before any mount operation. It then uses `docker exec mount <what> <where>` to mount Gluster volumes for the pod (see the sketch below). After that, it starts init containers and the "real" pod containers.
* User deletes the pod. Kubelet kills all "real" containers in the pod and uses the sidecar container to unmount gluster volumes. Finally, it kills the sidecar container.

-> User does not need to configure anything and sees the pod Running as usual.
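For illustration only, the `docker exec mount <what> <where>` step above might expand to something like this (the container name, server address and target path are made up; the real kubelet paths depend on the pod UID and volume name):

```sh
# hypothetical: kubelet mounting a GlusterFS volume through the injected sidecar
docker exec gluster-sidecar-container \
  mount -t glusterfs 192.168.0.1:/myvolume \
  /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~glusterfs/myvolume
```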
Contributor

Have you considered packaging the mount scripts into the infra container? Kubelet would then have to exec into the infra container to mount a volume. This would alter the pod lifecycle in the kubelet, where volumes are now set up prior to starting a pod.
The advantage is that all storage-related processes belonging to a pod are contained within the pod's boundary and their lifecycle is tied to that of the pod.

Member Author

  1. rkt does not use an infrastructure container; it "holds" the network namespace in another way.
  2. Using long-running pods better reflects CSI, as it will run one long-running process on each node. @saad-ali, can you confirm?

I will add a note about it to the proposal

Member Author

Note added. To be honest, the infrastructure container would look compelling to me if we did not want to mimic the long-running processes of CSI.

Contributor

Infra container is an implementation detail for the docker integration, I'd not recommend using it. In fact, CRI in its current state would not allow you to exec into an "infra" container.

Member Author

Note added (and thanks for spotting this)



## Implementation notes
Flex volumes won't be changed in the alpha implementation of this PR. Flex volumes will still need their utilities (and binaries in /usr/libexec/kubernetes) on all hosts.
Contributor

Is there some reason for this flex volume note?

Member

As mentioned above, we are hoping that flex utils will eventually be moved to pods as well, with label `mount.kubernetes.io/flexvolume/foo=true`, but we are not considering that as part of the alpha implementation.


Disadvantages:
* One container for all mount utilities. Admin needs to make a single container that holds utilities for e.g. both gluster and nfs and whatnot.
* Needs some refactoring in kubelet - now kubelet mounts everything and then starts containers. We would need kubelet to start some container(s) first, then mount, then run the rest. This is probably possible, but needs better analysis (and I got lost in kubelet...)
Contributor

The sidecar container approach above also requires about the same level of kubelet refactoring. Might want to add it to the "drawbacks" of side-car too.

Member Author

Added to drawbacks.

Disadvantages:
* One container for all mount utilities. Admin needs to make a single container that holds utilities for e.g. both gluster and nfs and whatnot.
* Needs some refactoring in kubelet - now kubelet mounts everything and then starts containers. We would need kubelet to start some container(s) first, then mount, then run the rest. This is probably possible, but needs better analysis (and I got lost in kubelet...)
* Short-living processes instead of long-running ones that would mimic CSI.
Contributor

What's the advantage of the long-running processes?

Member Author

The advantage is that it mimics our current design of CSI and we can catch bugs or even discover that it's not ideal before CSI is standardized.


### Infrastructure containers

Mount utilities could also be part of the infrastructure container that holds the network namespace (when using Docker). Today it's typically a simple `pause` container that does not do anything; it could hold mount utilities too.
Contributor

As mentioned above, this'd only work for the legacy, pre-CRI docker integration.

## User story
Admin wants to run Kubernetes on a distro that does not ship `mount.glusterfs`, which is needed for GlusterFS volumes.
1. Admin installs Kubernetes in any way.
2. Admin modifies the controller-manager and kubelet command lines to include `--experimental-mount-namespace=foo` and `--experimental-mount-plugins=kubernetes.io/glusterfs` so Kubernetes knows which volume plugins should use utilities in pods and in which namespace to find these pods. This should probably be part of the installation in the future.
Contributor

I think the discovery process is far from ideal. One would need to enumerate a list of plugins via a kubelet flag (which is static) before kubelet starts and before the (dynamic) DaemonSet pods are created. Any change to the plugin list will require restarting kubelet. Can we try finding other discovery methods?

Member Author

I agree that the discovery is not ideal. What are other options? AFAIK there is no magic way to configure kubelet dynamically. Is it possible to have a config object somewhere from which all kubelets and controller-manager would reliably get the list of volume plugins that are supposed to be containerized?

The list is needed only at startup, when kubelet gets its first pods from the scheduler - a pod that uses e.g. a gluster volume may be scheduled before the daemon set for gluster is started or before the daemon set controller spawns a pod on the node. With the list, kubelet knows that it should wait. Without the list, it blindly tries to mount the Gluster volume on the host, which is likely to fail with something as ugly as `wrong fs type, bad option, bad superblock on 192.168.0.1:/foo, missing codepage or helper program, or other error`. mount's stderr and exit codes are not helpful at all here.

When all daemon sets are up and running, we don't need --experimental-mount-plugins at all and dynamic discovery works.

Member Author

I removed --experimental-mount-plugins for now, but it will behave exactly as I described in the previous comment - weird errors may appear in pod events during kubelet startup, before a pod with mount utilities is scheduled and started.

@jsafrane
Member Author

I updated the proposal with current development:

  • Added Terminology section to clear some confusion
  • Removed --experimental-mount-plugins option
  • During alpha (hopefully 1.7), this feature must be explicitly enabled using kubelet --experimental-mount-namespace=foo so we don't break working clusters accidentally. This may change during beta/GA!
  • During alpha, no controller changes will be done, as it is only the Ceph RBD provisioner that needs to execute stuff on the master. I may implement it if time permits, I am just not sure...

* `--experimental-mount-namespace`, which specifies a dedicated namespace where all pods with mount utilities reside. It would default to `kube-mount`.
* Whenever the PV or attach/detach controller needs to call a volume plugin, it looks for *any* running pod in the specified namespace with label `mount.kubernetes.io/foo=true` (or `mount.kubernetes.io/flexvolume/foo=true` for flex volumes) and calls the volume plugin so that all mount utilities are executed as `kubectl exec <pod> xxx` (of course, we'll use the clientset interface instead of executing `kubectl`).
* If such pod does not exist, it executes the mount utilities on the host as usual.
* During alpha, no controller-manager changes will be done. That means that Ceph RBD provisioner will still require `/usr/bin/rbd` installed on the master. All other volume plugins will work without any problem, as they don't execute any utility when attaching/detaching/provisioning/deleting a volume.
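A rough sketch (not the actual implementation) of how a controller could do this label-based discovery with client-go; the namespace `kube-mount` and the label format come from the text above, while the function name, error handling and the exact client-go version/signature are assumptions:

```go
package mountpods

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// findMountPod returns the name of any running pod in the dedicated namespace
// that advertises mount utilities for the given volume plugin (e.g. "glusterfs").
// The controller would then exec the utility in that pod via the pods/exec
// subresource, the same mechanism `kubectl exec` uses.
func findMountPod(client kubernetes.Interface, plugin string) (string, error) {
	pods, err := client.CoreV1().Pods("kube-mount").List(context.TODO(), metav1.ListOptions{
		LabelSelector: fmt.Sprintf("mount.kubernetes.io/%s=true", plugin),
	})
	if err != nil {
		return "", err
	}
	for _, p := range pods.Items {
		if p.Status.Phase == corev1.PodRunning {
			return p.Name, nil
		}
	}
	return "", fmt.Errorf("no running pod with mount utilities for plugin %q", plugin)
}
```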


the rbd provisioner has been pulled out to here:
https://github.com/kubernetes-incubator/external-storage/tree/master/ceph/rbd

so the container can be built with the right ceph version already.

* One DaemonSet must provide *all* utilities that are needed to provision, attach, mount, unmount, detach and delete a volume for one volume plugin, including `mkfs` and `fsck` utilities if they're needed.
* E.g. `mkfs.ext4` is likely to be available on all hosts, but a pod with mount utilities should not depend on that nor use it.
* The only exceptions are kernel modules. They are not portable across distros and they should be on the host.
* It is expected that these daemon sets will run privileged pods that will see host's `/proc`, `/dev`, `/sys`, `/var/lib/kubelet` and such. Especially `/var/lib/kubelet` must be mounted with shared mount propagation so kubelet can see mounts created by the pods.
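To make the rules above concrete, a minimal sketch of such a DaemonSet follows; the image, names and label value are placeholders, and `mountPropagation: Bidirectional` is today's pod-spec field for the shared propagation described above, not something defined by this proposal:

```yaml
# Hypothetical example of a DaemonSet with GlusterFS mount utilities.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: glusterfs-mount-utils
  namespace: kube-mount
spec:
  selector:
    matchLabels:
      app: glusterfs-mount-utils
  template:
    metadata:
      labels:
        app: glusterfs-mount-utils
        mount.kubernetes.io/glusterfs: "true"
    spec:
      containers:
      - name: mount-utils
        image: example.com/glusterfs-mount-utils:latest   # placeholder image
        securityContext:
          privileged: true
        volumeMounts:
        - name: kubelet-dir
          mountPath: /var/lib/kubelet
          mountPropagation: Bidirectional   # mounts made here must be visible to the host
        - name: dev
          mountPath: /dev
      volumes:
      - name: kubelet-dir
        hostPath:
          path: /var/lib/kubelet
      - name: dev
        hostPath:
          path: /dev
```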

Are mounts constrained to be performed under /var/lib/kubelet? If so, this seems to be a contract detail between controller/kubelet/daemonset that should be mentioned.

Member Author

No, any directory can be shared. It's up to the system admin + the author of the privileged pods to make sure it can be shared (i.e. it's on a mount with shared mount propagation) and that it's safe to share (e.g. systemd inside a container does not like /sys/fs/cgroup being shared to the host; I don't remember the exact error message but it simply won't start).

@jsafrane
Member Author

I updated the proposal with the latest discussion on sig-node and with @tallclair.

  • Kubelet won't use docker exec <pod> mount <what> <where> to mount things in a pod; it will talk to a gRPC service running in the pod instead (following the docker shim example).
    • Allows for much easier discovery of mount pods by kubelet, no magic namespaces or labels.
    • Opens a question of how controller-manager will talk to these pods, see open items at the bottom.

This is basically a new proposal and needs a complete re-review. I left the original proposal as a separate commit so we can roll back easily.

We considered this user story:
* Admin installs Kubernetes.
* Admin configures Kubernetes to use sidecar container with template XXX for glusterfs mount/unmount operations and pod with template YYY for glusterfs provision/attach/detach/delete operations. These templates would be yaml files stored somewhere.
* User creates a pod that uses a GlusterFS volume. Kubelet finds a sidecar template for gluster, injects it into the pod and runs it before any mount operation. It then uses `docker exec mount <what> <where>` to mount Gluster volumes for the pod. After that, it starts init containers and the "real" pod containers.
Contributor

Do we need to extend the pod spec to do this (sidecar template injection) operation, or can it be done with the existing pod spec, or achieved by kube jobs or a similar mechanism?

-> User does not need to configure anything and sees the pod Running as usual.
-> Admin needs to set up the templates.

Similarly, when attaching/detaching a volume, attach/detach controller would spawn a pod on a random node and the controller would then use `kubectl exec <the pod> <any attach/detach utility>` to attach/detach the volume. E.g. Ceph RBD volume plugin needs to execute things during attach/detach. After the volume is attached, the controller would kill the pod.
Contributor

Can it be a random node? Or should it be the same node where the pod is getting scheduled?


I think it may be driver specific. For some drivers, it's probably best to exec into the container on the host that is going to have the volume?

For my k8s systems, I tend to have user-provided code running in containers. I usually segregate these onto differently labeled nodes than the control plane. In this configuration, the container doing the reach-back to, say, OpenStack to move volumes around from VM to VM should never run on the user-reachable nodes, as access to the secret for volume manipulation would be really bad. With k8s 1.7+, the secret is inaccessible to nodes that don't have a pod referencing the secret. So targeted exec would be much, much better for that use case.

Member Author

For the attach/detach operation, a random node is IMO OK. All state is kept in the attach/detach controller and volume plugins, not in the utilities that are executed by a volume plugin. Note that Ceph RBD is the only plugin that executes something during attach.

As for chasing secrets, that's actually a benefit of a pod with mount utilities - any secrets that are needed to talk to the backend storage can easily be made available to the pod via env. variables or a Secret volume. And since only os.exec will be delegated to the pod, the whole command line will be provided to the pod, including all necessary credentials.
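Purely illustrative (the pod name and the rbd arguments are made up), a controller-driven attach delegated to such a pod could look like:

```sh
# hypothetical: attach-side Ceph RBD command executed in a pod with mount utilities
kubectl -n kube-mount exec rbd-mount-utils-x7k2p -- \
  rbd map mypool/myimage --id admin --keyring /etc/ceph/keyring
```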

Similarly, when attaching/detaching a volume, attach/detach controller would spawn a pod on a random node and the controller would then use `kubectl exec <the pod> <any attach/detach utility>` to attach/detach the volume. E.g. Ceph RBD volume plugin needs to execute things during attach/detach. After the volume is attached, the controller would kill the pod.

Advantages:
* It's probably easier to update the templates than update the DaemonSet.
Contributor

I have one doubt here: how are we going to control the version of the required mount utils? For example, if the mount utils need to be a particular version, can we specify that in the template? Does that also mean there can be more than one sidecar container if the user wishes?

* It's probably easier to update the templates than update the DaemonSet.

Drawbacks:
* Admin needs to store the templates somewhere. Where?
Contributor

Can't we make use of a ConfigMap or similar mechanism for these templates? Just a thought 👍

* One DaemonSet must provide *all* utilities that are needed to provision, attach, mount, unmount, detach and delete a volume for a volume plugin, including `mkfs` and `fsck` utilities if they're needed.
* E.g. `mkfs.ext4` is likely to be available on all hosts, but a pod with mount utilities should not depend on that nor use it.
* The only exception are kernel modules. They are not portable across distros and they *should* be on the host.
* It is expected that these daemon sets will run privileged pods that will see host's `/proc`, `/dev`, `/sys`, `/var/lib/kubelet` and such. Especially `/var/lib/kubelet` must be mounted with shared mount propagation so kubelet can see mounts created by the pods.
Contributor

If we have one DaemonSet per volume plugin and we share /dev among these containers, there is a risk or security concern, isn't there?


Not sure it's any worse than what is there today.


There are some things to take care of when mounting /dev into a container, i.e. you need to take care of the pts device so as not to break the console. And there are other things to take care of as well.

Because of this I wonder if it makes sense to add an API flag to signal that a container should get the host's proc, sys, and dev paths. If we had such a flag, it would be much better defined what a container gets when it is told to get the host view of these three directories.

Also a side note: we cannot prevent it, but mounting the host's dev directory into a privileged container can cause quite a lot of confusion (actually any setup where more than one udev is running can cause problems).

@fabiand Aug 21, 2017

To be more precise on the "things" - please take a look here https://github.com/kubevirt/libvirt/blob/master/libvirtd.sh#L5-L42 to see the workarounds *cough* hacks *cough* we need to do to have libvirt running "properly" (we don't use all features, just a subset, and they work well so far) in a container.

Member Author

We can't influence how Docker (or another container runtime) creates/binds /dev and /sys. Once such a flag is available in Docker/Moby and CRI we could expose it via the Kubernetes API, but it's a long process. Until then we're stuck with workarounds done inside the container.

I'll make sure we ship a well documented sample of such a mount container. That's why it's an alpha feature, so we know all these workarounds before going to beta/stable.

* All volume plugins need to be updated to use a new `mount.Exec` interface to call external utilities like `mount`, `mkfs`, `rbd lock` and such. Implementation of the interface will be provided by caller and will lead either to simple `os.exec` on the host or a gRPC call to a socket in `/var/lib/kubelet/plugin-sockets/` directory.
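A rough sketch of what such an interface could look like; the actual `mount.Exec` definition in the implementation PR may differ, and only the trivial host-side implementation is shown:

```go
package mount

import osexec "os/exec"

// Exec abstracts running a mount/volume utility either directly on the host
// or remotely in a pod with mount utilities.
type Exec interface {
	// Run executes cmd with args and returns its combined output.
	Run(cmd string, args ...string) ([]byte, error)
}

// hostExec is the trivial implementation that runs utilities on the host.
// A second implementation would forward the call over gRPC to the socket in
// /var/lib/kubelet/plugin-sockets/ as described above.
type hostExec struct{}

func (hostExec) Run(cmd string, args ...string) ([]byte, error) {
	return osexec.Command(cmd, args...).CombinedOutput()
}
```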

### Controllers
TODO: how will controller-manager talk to a remote pod? It's relatively easy to do something like `kubectl exec <mount pod>` from controller-manager, however it's harder to *discover* the right pod.
Contributor

Maybe we could make use of a labelling/selector mechanism based on the pod content.

Member

What are the tradeoffs of using exec vs. http to serve this? My hunch is that this should just be a service model, with a Kubernetes service that provides the volume plugin (how the controller manager identifies the service could be up for debate - predefined name? labels? namespace?). The auth{n/z} is a bit more complicated with that model though.

Member Author

kubectl exec is easy to implement, does not need a new protocol and can be restricted by RBAC. With HTTP, we need to define and maintain the protocol, its implementation, have a db for auth{n,z}, generate certificates, ...

> how the controller manager identifies the service could be up for debate - predefined name? labels? namespace

Getting rid of namespaces / labels was the reason why we have gRPC over UNIX sockets. If we have half of the system using gRPC and the second half using kubectl exec, why don't we use kubectl exec (or gRPC) for everything?

* Update the pod.
* Remove the taint.

Is there a way to do this with a DaemonSet rolling update? Is there any better way to do this upgrade?
Contributor

IIRC, rolling update is yet to come for DaemonSets - need to check the current status though. However, there is an option called --cascade=false, and it's possible to do a rolling update manually or in a scripted way. Not sure if that is what you are looking for.

Member
@tallclair Jul 28, 2017

DaemonSets support rolling update as of 1.6 (https://kubernetes.io/docs/tasks/manage-daemon/update-daemon-set/)

Member Author

I am asking if there are some tricks to do a DaemonSet rolling update that would drain a node first before updating the pod. Otherwise I need to fall back to --cascade=false and do the update manually, as @humblec suggests.
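For illustration, the manual node-by-node fallback could look roughly like this (the taint key, node name and pod name are made up):

```sh
# hypothetical manual upgrade of the mount-utility pod on one node
kubectl taint nodes node1 mount-utils-upgrade=true:NoSchedule
kubectl drain node1 --ignore-daemonsets
# drain does not evict DaemonSet pods, so delete the old mount-utility pod
# explicitly; the DaemonSet controller re-creates it from the new template
kubectl -n kube-mount delete pod glusterfs-mount-utils-abc12
kubectl uncordon node1
kubectl taint nodes node1 mount-utils-upgrade:NoSchedule-
```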

## Goal
Kubernetes should be able to run all utilities that are needed to provision/attach/mount/unmount/detach/delete volumes in *pods* instead of running them on *the host*. The host can be a minimal Linux distribution without tools to create e.g. Ceph RBD or mount GlusterFS volumes.

## Secondary objectives
Member

What are the goals around adding (or removing) volume plugins dynamically? In other words, do you expect the pods serving the volume plugins to be deployed at cluster creation time, or at a later time? How about removing plugins?

Member Author

Volume plugins are not real plugins, they're hardcoded in Kubernetes.

It does not really matter when the pods with mount utilities are deployed - I would expect that they should be deployed during Kubernetes installation because the cluster admin plans storage ahead (e.g. has an existing NFS server), however I can imagine that the cluster admin could deploy pods for Gluster volumes later, as the NFS server becomes full or so.

The only exception is flex plugin drivers. In 1.7, they needed to be installed before kubelet and controller-manager started. In #833 we're trying to change that to a more dynamic model, where flex drivers can be added/removed dynamically, and this proposal could be easily extended so flex drivers run in pods. So admins could dynamically install/remove flex drivers running in pods. Again, I would expect that this would mostly be done during installation of a cluster. And #833 is a better place to discuss it.

## Requirements on DaemonSets with mount utilities
These are rules that need to be followed by DaemonSet authors:
* One DaemonSet can serve mount utilities for one or more volume plugins. We expect that one volume plugin per DaemonSet will be the most popular choice.
* One DaemonSet must provide *all* utilities that are needed to provision, attach, mount, unmount, detach and delete a volume for a volume plugin, including `mkfs` and `fsck` utilities if they're needed.
Member

> that are needed to provision

Provisioning is a "cluster level" operation, and is handled by the volume controller rather than the Kubelet, right? In that case, I don't think they need to be handled by the same pod. In practice it's probably often the same utilities that handle both, but I don't think it should be a hard requirement.

Member Author

Yes, technically it does not need to be the same pod.

On the other hand, the only internal volume plugin that needs to execute something during provisioning or attach/detach (i.e. initiated by controller-manager) is Ceph RBD, which needs /usr/bin/rbd. The same utility is then needed by kubelet to finish attachment of the device.

* One DaemonSet must provide *all* utilities that are needed to provision, attach, mount, unmount, detach and delete a volume for a volume plugin, including `mkfs` and `fsck` utilities if they're needed.
* E.g. `mkfs.ext4` is likely to be available on all hosts, but a pod with mount utilities should not depend on that nor use it.
* The only exception are kernel modules. They are not portable across distros and they *should* be on the host.
* It is expected that these daemon sets will run privileged pods that will see host's `/proc`, `/dev`, `/sys`, `/var/lib/kubelet` and such. Especially `/var/lib/kubelet` must be mounted with shared mount propagation so kubelet can see mounts created by the pods.
Member

> Especially /var/lib/kubelet must be mounted with shared mount propagation so kubelet can see mounts created by the pods.

This only applies if the Kubelet is running in a container, right? Also, it needs slave mount propagation, not shared, right? (Pardon my ignorance of this subject)


No, shared is needed.

slave:
(u)mount events on the host show up in containers; events in the container don't affect the host.

shared:
(u)mount events initiated from either the host or the container show up on the other side.

If you want a (u)mount event in the mount-utility container to show up to kubelet, it needs shared.

Member Author

> This only applies if the Kubelet is running in a container, right

No. Kubelet running on the host must see mounts created by a pod. Therefore we need shared mount propagation from the pod to the host. With slave propagation in the pod, the mount would be visible only in the pod and not on the host.
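A minimal host-side illustration of the difference, assuming Docker (the image name is a placeholder):

```sh
# on the host: make /var/lib/kubelet a shared mount point
mount --bind /var/lib/kubelet /var/lib/kubelet
mount --make-rshared /var/lib/kubelet

# run the mount-utility container with shared (not slave) propagation, so mounts
# it creates under /var/lib/kubelet become visible to kubelet on the host
docker run --privileged -v /var/lib/kubelet:/var/lib/kubelet:rshared mount-utils-image
```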

* E.g. `mkfs.ext4` is likely to be available on all hosts, but a pod with mount utilities should not depend on that nor use it.
* The only exception are kernel modules. They are not portable across distros and they *should* be on the host.
* It is expected that these daemon sets will run privileged pods that will see host's `/proc`, `/dev`, `/sys`, `/var/lib/kubelet` and such. Especially `/var/lib/kubelet` must be mounted with shared mount propagation so kubelet can see mounts created by the pods.
* The pods with mount utilities should run some simple init as PID 1 that reaps zombies of potential fuse daemons.
Member

> that reaps zombies of potential fuse daemons.

What does this mean? I believe the zombie process issue was fixed in 1.6 (kubernetes/kubernetes#36853)

Member Author

@yujuhong says in #589 (comment) that the infrastructure pod ("pause") is an implementation detail of the docker integration and other container engines may not use it.

Member

Yes, but if zombie processes are an issue for other runtimes, they should have a built-in way of dealing with them. It shouldn't be necessary to implement reaping in the pod, unless it's expected to generate a lot of zombie processes, I believe. ( @yujuhong does this sound right? )


Could it be an option to provide a base container for these mount-util containers, which has a sane PID 1?

Member Author

I'd like to stay distro agnostic here and let the DaemonSet authors use anything they want. For NFS, simple Alpine Linux + busybox init could be enough; for Gluster and Ceph a more powerful distro is needed.

* The only exception are kernel modules. They are not portable across distros and they *should* be on the host.
* It is expected that these daemon sets will run privileged pods that will see host's `/proc`, `/dev`, `/sys`, `/var/lib/kubelet` and such. Especially `/var/lib/kubelet` must be mounted with shared mount propagation so kubelet can see mounts created by the pods.
* The pods with mount utilities should run some simple init as PID 1 that reaps zombies of potential fuse daemons.
* The pods with mount utilities run a daemon with gRPC server that implements `VolumExecService` defined below.
Member

> VolumExecService

nit: I'd prefer VolumePluginService, or some other variation. I think Exec in this case is a bit unclear.


### gRPC API

`VolumeExecService` is a simple gRPC service that allows to execute anything via gRPC:
Member

Is there a CSI API proposal out? Does this align with that? It might be worth using the CSI API in its current state, if it's sufficient.

Member Author

CSI is too complicated. Also, this would require a completely new implementation of at least the gluster, nfs, CephFS, Ceph RBD, git volume, iSCSI, FC and ScaleIO volume plugins, which is IMO too much. Keeping the plugins as they are and just using an interface that defers os.Exec to a pod where appropriate is IMO much simpler and carries no risk of breaking existing (and tested!) volume plugins.


message ExecRequest {
// Command to execute
string cmd = 1;
Member

This should be abstracted so that the Kubelet doesn't need to understand the specifics of the volume type. I believe this is what the volume interfaces defined in https://github.com/kubernetes/kubernetes/blob/4a73f19aed1f95b3fde1177074aee2a8bec1196e/pkg/volume/volume.go do? In that case, this API should probably mirror those interfaces.

Member Author

Again, that would require me to rewrite the volume plugins. Volume plugins need e.g. access to CloudProvider or SecretManager; I can't put them into pods easily. And such a pod would have access to all Kubernetes secrets...

The whole idea of ExecRequest/Response is to take existing and tested volume plugins and replace all os.Exec calls with <abstract exec interface>.Exec. Kubelet would provide the right interface implementation, leading to os.Exec or gRPC. No big changes in the volume plugins*, simple changes in Kubelet, one common VolumeExec server daemon for all pods with mount utilities.

It does not leak any volume-specific knowledge to kubelet / controller-manager. It's a dumb exec interface, common to all volume plugins.

*) One or two plugins would still need nontrivial refactoring to pass the interface from the place where it's available to the place where it's needed, but that's another story.
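For readers, a hedged reconstruction of how the full messages around the `ExecRequest` fragment quoted above might look; only `cmd` appears in the excerpt, every other name and field is an assumption:

```proto
syntax = "proto3";

// Sketch only: the "dumb exec" service discussed above.
service VolumeExecService {
  rpc Exec(ExecRequest) returns (ExecResponse) {}
}

message ExecRequest {
  // Command to execute
  string cmd = 1;
  // Arguments of the command (assumed field)
  repeated string args = 2;
}

message ExecResponse {
  // Exit code of the command (assumed field)
  int32 exit_code = 1;
  // Captured stdout + stderr (assumed field)
  bytes output = 2;
}
```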

* Authors of container images with mount utilities can then add this `volume-exec` daemon to their image, they don't need to care about anything else.

### Upgrade
Upgrade of the DaemonSet with pods with mount utilities needs to be done node by node and with extra care. The pods may run fuse daemons, and killing such a pod with a glusterfs fuse daemon would kill all pods that use glusterfs on the same node.
Member

Would it kill the pods, or just cause IO errors?

Member Author

IO errors. I guess the health probe should fail and the pod should be rescheduled (or a deployment / replica set will create a new one).


* Authors of container images with mount utilities can then add this `volume-exec` daemon to their image, they don't need to care about anything else.

### Upgrade
Member

What happens if the kubelet can't reach the pod serving a volume plugin (either due to an update, or some other error) when a pod with a volume is deleted? Will the Kubelet keep retrying until it is able to unmount the volume? What are the implications of being unable to unmount the volume?

Member Author

Yes, kubelet retries indefinitely.

And if the pod with mount utilities is not available for a longer time... I checked the volume plugins; most (if not all) run umount on the host. So the volume gets unmounted cleanly and data won't be corrupted. Detaching an iSCSI/FC/Ceph RBD disk may be a different story. The disk may stay attached forever, and then it depends on the backend whether it supports attaching the volume to a different node.

As I wrote, update of the DaemonSet is a very tricky operation and the node should be drained first.

Member

Does unmounting block pod deletion? I.e. will the pod be stuck in a terminated state until the volume utility pod is able to be reached?

Member Author

No, unmounting happens after a pod is deleted.

-> User does not need to configure anything and sees the pod Running as usual.
-> Admin needs to set up the templates.

Similarly, when attaching/detaching a volume, attach/detach controller would spawn a pod on a random node and the controller would then use `kubectl exec <the pod> <any attach/detach utility>` to attach/detach the volume. E.g. Ceph RBD volume plugin needs to execute things during attach/detach. After the volume is attached, the controller would kill the pod.
Member

would the pod need to be a highly privileged pod, likely with hostpath volume mount privileges?

Member Author

yes

Member

This is the first instance I'm aware of where a controller would be required to have the ability to create privileged pods. Not necessarily a blocker, but that is a significant change.

Contributor

iic, for mounts we only need CAP_SYS_ADMIN; however, if we export `/dev` we need privileged pods.

@tallclair
Member

I misunderstood the original intent of this proposal. I thought the goal was to get much closer to the desired end-state of true CSI plugins. However, I now see that this is just providing the binary utilities for the existing (hard-coded) plugins.

Given that, I'm afraid I want to go back on my original suggestions. Since this really is an exec interface, I think the original proposal of using the native CRI exec (specifically, ExecSync) makes sense.

@jsafrane
Member Author

@tallclair ExecSync looks usable.

I'd like to revisit the UNIX sockets. We still need a way to run stuff in pods with mount utilities from controller-manager, which cannot use UNIX sockets. So there must be a way (namespaces, labels) to find these pods. Why can't kubelet use the same mechanism instead of UNIX sockets? It's easy to do kubectl exec <pod> mount -t glusterfs ... from kubelet (using ExecSync in the background) and it's currently the easiest way to reach pods from controller-manager.

@jsafrane
Member Author

@tallclair I just had a meeting with @saad-ali and @thockin and we agreed that UNIX sockets are better for now; we care about mount in 1.8 and we'll see if we ever need to implement attach/detach, and how.

So, ExecSync indeed looks usable. I am not sure about the whole RuntimeService service - can the gRPC endpoint that runs in a pod with mount utilities implement just the ExecSync part of it? Would it be better to create a new RemoteExecService service with just ExecSync in it? Is it possible to have two services with the same ExecSync rpc function?
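For context, the ExecSync part of the CRI runtime API under discussion looks roughly like this (an approximate excerpt; check the actual CRI proto for authoritative names and field numbers):

```proto
// Approximate excerpt of the CRI RuntimeService; only the ExecSync pieces are shown.
service RuntimeService {
  // ExecSync runs a command in a container synchronously.
  rpc ExecSync(ExecSyncRequest) returns (ExecSyncResponse) {}
}

message ExecSyncRequest {
  string container_id = 1;  // container in which to run the command
  repeated string cmd = 2;  // command and its arguments
  int64 timeout = 3;        // timeout in seconds; 0 means no timeout
}

message ExecSyncResponse {
  bytes stdout = 1;
  bytes stderr = 2;
  int32 exit_code = 3;
}
```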

@jsafrane
Member Author

Trying to resurrect the discussion, I am still interested in this proposal.

@tallclair, looking at the device plugin gRPC API, it looks better to me to follow that approach and introduce a "container exec API" with an ExecSync call, instead of re-using CRI. What do you think about it?

@tallclair
Member

The device plugin API is a higher-level abstraction than just arbitrary exec. I wasn't a part of the meeting where it was decided to stick with a socket interface, but I don't see the value in implementing an alternative arbitrary command exec interface rather than relying on ExecSync. I'm happy to join another discussion.

@jsafrane force-pushed the containerized-mount branch 3 times, most recently from 6802ff5 to 72189ee on October 3, 2017 09:51
@jsafrane
Member Author

jsafrane commented Oct 3, 2017

Reworked according to result of the latest discussion:

  • Use CRI ExecSync to execute stuff in pods with mount utilities (= no new gRPC interface)

  • Use files in /var/lib/kubelet/plugin-containers/<plugin name> for discovery of pods with mount utilities.

  • Added a note about containerized kubelet for completeness, no extra changes necessary.
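As a sketch of the discovery mechanism in the second bullet above (the file contents and name escaping are assumptions; the implementation PR defines the exact format):

```sh
# hypothetical: a pod with GlusterFS mount utilities registers itself so that
# kubelet can find the right container and run mount utilities in it via CRI ExecSync
echo "<container-id-of-the-utility-container>" \
  > /var/lib/kubelet/plugin-containers/kubernetes.io~glusterfs
```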

@tallclair @dchen1107 @thockin @vishh PTAL

@jsafrane
Member Author

jsafrane commented Oct 4, 2017

Implementation of this proposal is at kubernetes/kubernetes#53440 - it's quite small and well contained.

@castrojo
Member

This change is Reviewable


To sum it up, it's just a DaemonSet that spawns privileged pods, running a simple init and registering itself into Kubernetes by placing a file into a well-known location.

**Note**: It may be quite difficult to create a pod that sees the host's `/dev` and `/sys`, contains the necessary kernel modules, does the initialization right and reaps zombies. We're going to provide a template with all this. During alpha, it is expected that this template will be polished as we encounter new bugs, corner cases, and systemd / udev / docker weirdness.
Member

> During alpha,

Is this expected to ever leave alpha? I thought this was a temporary hack while we wait for CSI?

Member Author

I removed all notes about alpha in the text and added a note about the feature gate and that it's going to be alpha forever.

@jsafrane force-pushed the containerized-mount branch 2 times, most recently from d797c65 to 69c780f on October 18, 2017 12:54
@jsafrane
Member Author

I squashed all the commits, the PR is ready to be merged.

In a personal meeting with @tallclair and @saad-ali we agreed that all volume plugins are going to be moved to CSI eventually, so this proposal has a limited lifetime. CSI drivers will have a different discovery mechanism and all the kubelet changes proposed here won't be needed.

I still think this PR is useful, as it allows us to create tests for internal volume plugins so we can check their CSI counterparts for regressions in e2e tests. Wherever the CSI drivers will live, Kubernetes still needs to keep its backward compatibility and make sure that old PVs keep working.

@jsafrane
Member Author

/assign @tallclair

@k8s-github-robot added the `sig/storage` label (Categorizes an issue or PR as relevant to SIG Storage.) on Oct 27, 2017
@tallclair
Member

/lgtm

@k8s-ci-robot added the `lgtm` label ("Looks good to me", indicates that a PR is ready to be merged.) on Oct 31, 2017
@jsafrane
Member Author

jsafrane commented Nov 9, 2017

Why is this not merged? "pull-community-verify — Waiting for status to be reported"
/test all

@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue.

@k8s-github-robot k8s-github-robot merged commit f04cbac into kubernetes:master Nov 9, 2017
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request Nov 14, 2017
Automatic merge from submit-queue (batch tested with PRs 54005, 55127, 53850, 55486, 53440). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Containerized mount utilities

This is implementation of kubernetes/community#589

@tallclair @vishh @dchen1107 PTAL
@kubernetes/sig-node-pr-reviews 

**Release note**:
```release-note
Kubelet supports running mount utilities and the final mount in a container instead of running them on the host.
```
MadhavJivrajani pushed a commit to MadhavJivrajani/community that referenced this pull request Nov 30, 2021
Automatic merge from submit-queue.

Proposal: containerized mount utilities in pods

@kubernetes/sig-storage-proposals  @kubernetes/sig-node-proposals
danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023
* Update TOC members list

* Removed Joshua Blatt and moved him to Emeritus list
* Removed various admin privileges for Josh

* Added nrjpoddar as repo admin
Labels
`cncf-cla: yes` (Indicates the PR's author has signed the CNCF CLA.) · `lgtm` ("Looks good to me", indicates that a PR is ready to be merged.) · `sig/storage` (Categorizes an issue or PR as relevant to SIG Storage.) · `size/L` (Denotes a PR that changes 100-499 lines, ignoring generated files.)