Docker plugin fails to start after upgrade to 20.10+ #507

Closed
djesernik opened this issue Jan 14, 2021 · 17 comments

Comments

@djesernik

Describe the bug
After upgrading the Docker plugin netapp/trident-plugin to version 20.10, the plugin cannot be re-enabled and fails with this error:
Error response from daemon: dial unix /run/docker/plugins/2311ab6e7b3b1461539ddc3c783ac65ad9a179d44fffd76b53aff9dd844ae1c6/netapp.sock: connect: no such file or directory
Upon following the troubleshooting steps in https://netapp-trident.readthedocs.io/en/stable-v20.10/docker/troubleshooting.html and checking the logs with journalctl, the following messages are logged:

Jan 14 10:59:35 hostname dockerd[14554]: time="2021-01-14T10:59:35-07:00" level=error msg="time=\"2021-01-14T17:59:35Z\" level=info msg=\"Running Trident storage orchestrator.\" binary=/netapp/trident build_time=\"Fri Jan  8 19:10:34 UTC 2021\" version=20.10.1" plugin=2311ab6e7b3b1461539ddc3c783ac65ad9a179d44fffd76b53aff9dd844ae1c6
Jan 14 10:59:35 hostname dockerd[14554]: time="2021-01-14T10:59:35-07:00" level=error msg="time=\"2021-01-14T17:59:35Z\" level=fatal msg=\"Insufficient arguments provided for Trident to start.  Specify k8sAPIServer (for Kubernetes) or configPath (for Docker) or csiEndpoint (for CSI).\"" plugin=2311ab6e7b3b1461539ddc3c783ac65ad9a179d44fffd76b53aff9dd844ae1c6
Jan 14 10:59:47 hostname dockerd[14554]: time="2021-01-14T10:59:47.360433094-07:00" level=error msg="Handler for POST /v1.40/plugins/netapp:latest/enable returned error: dial unix /run/docker/plugins/2311ab6e7b3b1461539ddc3c783ac65ad9a179d44fffd76b53aff9dd844ae1c6/netapp.sock: connect: no such file or directory"
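
The journal entries above nest the plugin's own log line inside dockerd's `msg` field, and dockerd reports every line at `level=error` even when the plugin itself logged `level=info`. A minimal Python sketch for pulling out the inner level and message is shown here; the function name `parse_plugin_log` and the shortened sample line are illustrative assumptions, and the patterns only cover the line layout seen in this report.

```python
import re

# Outer dockerd entry: msg="<escaped plugin output>" followed by plugin=<id>.
OUTER = re.compile(r'msg="(?P<inner>(?:[^"\\]|\\.)*)"\s+plugin=(?P<plugin>\S+)')
# The plugin's own entry, once unescaped: level=<level> msg="<text>".
INNER = re.compile(r'level=(?P<level>\w+)\s+msg="(?P<msg>[^"]*)"')


def parse_plugin_log(line):
    """Return the plugin id, inner level and inner message, or None."""
    outer = OUTER.search(line)
    if outer is None:
        return None  # not a plugin log line (e.g. a dockerd handler error)
    # Undo one level of quote escaping added by dockerd.
    inner_text = outer.group("inner").replace('\\"', '"')
    inner = INNER.search(inner_text)
    if inner is None:
        return None
    return {
        "plugin": outer.group("plugin"),
        "level": inner.group("level"),
        "msg": inner.group("msg"),
    }


# Abbreviated sample in the same layout as the journal output above.
SAMPLE = (
    r'Jan 14 10:59:35 hostname dockerd[14554]: time="2021-01-14T10:59:35-07:00" '
    r'level=error msg="time=\"2021-01-14T17:59:35Z\" level=fatal '
    r'msg=\"Insufficient arguments provided for Trident to start.\"" '
    r'plugin=2311ab6e7b3b'
)

if __name__ == "__main__":
    print(parse_plugin_log(SAMPLE))
```

Filtering on the inner level makes it easier to separate the one `fatal` line from the informational startup lines that dockerd also tags as errors.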

Environment
Provide accurate information about the environment to help us reproduce the issue.

  • Trident version: 20.10
  • Trident installation flags used: docker plugin install --grant-all-permissions --alias netapp netapp/trident-plugin:19.10 config=/etc/netappdvp/config.json from the original install, though this is an upgrade
  • Container runtime: Docker 19.03.14-CE
  • Kubernetes version: N/A
  • Kubernetes orchestrator: N/A
  • Kubernetes enabled feature gates: N/A
  • OS: Ubuntu 18.04.5 LTS
  • NetApp backend types: ONTAP-NAS
  • Other: N/A

To Reproduce
Steps to reproduce the behavior:
With a previous version of netapp/trident-plugin installed (I tested with 19.10, 20.04, and 20.07), run through the steps in https://netapp-trident.readthedocs.io/en/latest/docker/use/managing.html#updating-trident

  1. Disable plugin
$ sudo docker plugin disable -f netapp:latest
netapp:latest
  2. Upgrade plugin
$ sudo docker plugin upgrade --skip-remote-check --grant-all-permissions netapp:latest netapp/trident-plugin:20.10
Upgrading plugin netapp:latest from netapp/trident-plugin:20.07 to netapp/trident-plugin:20.10
20.10: Pulling from netapp/trident-plugin
09043c5b1668: Download complete
Digest: sha256:bb73c2da04224603ce60a88546143b5045b235234c937c07331b4d4a1236da59
Status: Downloaded newer image for netapp/trident-plugin:20.10
Upgraded plugin netapp:latest to netapp/trident-plugin:20.10
  3. Enable plugin
$ sudo docker plugin enable netapp:latest
Error response from daemon: dial unix /run/docker/plugins/2311ab6e7b3b1461539ddc3c783ac65ad9a179d44fffd76b53aff9dd844ae1c6/netapp.sock: connect: no such file or directory

Expected behavior
Expected a response of

$ sudo docker plugin enable netapp:latest
netapp:latest

Additional context
While researching the reported error, I found the corresponding code at https://github.com/NetApp/trident/blob/stable/v20.10/main.go#L129-L132, which suggests that configPath may not be set. However, when the plugin was installed with docker plugin install, the parameter config=/etc/netappdvp/config.json was passed. I can confirm that it is set: running docker plugin inspect netapp shows Config.Env.config Value: "config.json" as well as this section of the output:

        "Id": "2311ab6e7b3b1461539ddc3c783ac65ad9a179d44fffd76b53aff9dd844ae1c6",
        "Name": "netapp:latest",
        "PluginReference": "docker.io/netapp/trident-plugin:20.10",
        "Settings": {
            "Args": [],
            "Devices": [],
            "Env": [
                "debug=true",
                "rest=false",
                "config=/etc/netappdvp/config.json"
            ],

I was able to downgrade back to 20.07 which functions as expected.
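
To check the same setting without reading the full inspect output by eye, the `Settings.Env` list can be turned into a dictionary. This is a minimal sketch, assuming the output of `docker plugin inspect` is a JSON array of plugin objects shaped like the excerpt above; the helper name `plugin_env` and the `SAMPLE` document are illustrative, not part of Trident or Docker.

```python
import json


def plugin_env(inspect_output, name):
    """Return Settings.Env of the named plugin as a {key: value} dict."""
    for plugin in json.loads(inspect_output):
        if plugin.get("Name") == name:
            env = plugin.get("Settings", {}).get("Env", [])
            # Entries look like KEY=VALUE; split on the first '=' only.
            return dict(entry.split("=", 1) for entry in env if "=" in entry)
    raise KeyError(name)


# Hypothetical complete document mirroring the excerpt in this report.
SAMPLE = """
[
  {
    "Id": "2311ab6e7b3b1461539ddc3c783ac65ad9a179d44fffd76b53aff9dd844ae1c6",
    "Name": "netapp:latest",
    "PluginReference": "docker.io/netapp/trident-plugin:20.10",
    "Settings": {
      "Args": [],
      "Devices": [],
      "Env": ["debug=true", "rest=false", "config=/etc/netappdvp/config.json"]
    }
  }
]
"""

if __name__ == "__main__":
    print(plugin_env(SAMPLE, "netapp:latest")["config"])
```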

@djesernik djesernik added the bug label Jan 14, 2021
@gnarl gnarl added the tracked label Jan 15, 2021
@djesernik
Author

I attempted the upgrade again from 20.07, this time to the latest version, 21.01.1. The same issue occurs: the Docker plugin fails to enable/start and returns this fatal error:

Insufficient arguments provided for Trident to start.  Specify k8sAPIServer (for Kubernetes) or configPath (for Docker) or csiEndpoint (for CSI).

@djesernik djesernik changed the title Docker plugin fails to start after upgrade to 20.10 Docker plugin fails to start after upgrade to 20.10+ Feb 16, 2021
@xd999e

xd999e commented Mar 10, 2021

Hi @djesernik
I think I have the same problem. I tested different Trident versions with my test client yesterday. For me, the last working version is 20.04. I did not upgrade the plugin; I did a fresh install and tested it with 19.10, 20.01, 20.04, 20.10, and 21.01.

docker plugin rm netapp:latest -f
docker plugin install --grant-all-permissions --alias netapp netapp/trident-plugin:<version> config=config.json

I'm able to start the plugin with every version. I'm also able to see the NetApp volumes with "docker volume ls", but I'm not able to mount any volume inside my container.

Error message:
docker: Error response from daemon: open /var/lib/docker/plugins/.../propagated-mount/vol: no such file or directory.

Not sure if this is the same problem. Are you able to use any fresh-installed version above 20.04? Maybe this is just a problem on my side and I need to create a case.

docker-ce client 20.10.5
docker-ce server 20.10.5
containerd: 1.4.3
crun: 0.18 (default)
docker-init: 0.19.0

@gnarl
Contributor

gnarl commented Mar 11, 2021

Hi @xd999e,

There is a NetApp support case open on this issue and the team is investigating it. If you need immediate assistance you can also contact NetApp support. We will update this issue when we have more information.

@oleimann

oleimann commented Mar 26, 2021

Trident 20.10 differs from 20.07 in that it is now based on distroless instead of Alpine, which has at least two consequences:

  1. Upgrading or installing it as a Docker plugin requires you to provide the config file name, but without the path.
  2. It forces use of the docker-ce documented path /var/lib/docker-volumes, which may not fit some installs (some RHEL systems appear to use Docker with /etc/docker-volumes), causing issues in managing volumes.
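
If the first point holds, a value such as config=/etc/netappdvp/config.json from the original report would need to become config=config.json. The following is a small Python sketch of that check; `bare_config_name` is a hypothetical helper, and the rule it encodes comes from this comment only, not from the Trident documentation.

```python
import posixpath


def bare_config_name(value):
    """Return (is_bare, suggested) for a plugin 'config' setting.

    is_bare is True when the value has no directory component;
    suggested is the file name with any leading path stripped.
    """
    suggested = posixpath.basename(value)
    return value == suggested, suggested


if __name__ == "__main__":
    for value in ("/etc/netappdvp/config.json", "config.json"):
        print(value, bare_config_name(value))
```

Note that a later comment in this thread reports already using the bare file name and still hitting a mount error, so this check may rule out one cause without resolving the issue.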

@xd999e

xd999e commented Mar 30, 2021

We are running Ubuntu 18.04/20.04 and we use the config file name (without path) at install.
We still get the error message:

/var/lib/docker/493216.493216/plugins/b19b3c65a944fe4ca6cfd345a921f3b1b676568092579ee7899a7458441ac631/propagated-mount/<volname>: no such file or directory.

When browsing into the directory, I can find everything up to "propagated-mount". This directory is not present. (everything else is.)

This issue appears when I try to use the volume with a container (not while creating the volume, which works).

@markschuren

Same issue here on 2 production hosts. Currently we're unable to upgrade to anything higher than 20.07 because of this issue.
Insufficient arguments provided for Trident to start. Specify k8sAPIServer (for Kubernetes) or configPath (for Docker) or csiEndpoint (for CSI).
Is there any progress on this issue?
Or is there any workaround, like manually changing the plugin config to add the missing "configPath" argument somewhere?

@gnarl
Contributor

gnarl commented Apr 30, 2021

This issue is fixed with commit ce346f3 and is included in the Trident 21.04 release.

@gnarl gnarl closed this as completed Apr 30, 2021
@markschuren

markschuren commented May 10, 2021

I think this should be reopened. Just tried again, updating one host from 20.07 to 21.04, but I keep getting the same error. Trident does not start after upgrade. Same error message as with 20.10 or 21.01 versions before:

dockerd[1251]: time="2021-05-10T13:25:55+02:00" level=error msg="time=\"2021-05-10T11:25:55Z\" level=info msg=\"Running Trident storage orchestrator.\" binary=/netapp/trident build_time=\"Sat May  1 00:10:59 UTC 2021\" version=21.04.0" plugin=939bd8ea1d536ce50d4619d9c78
dockerd[1251]: time="2021-05-10T13:25:55+02:00" level=error msg="time=\"2021-05-10T11:25:55Z\" level=fatal msg=\"Insufficient arguments provided for Trident to start.  Specify k8sAPIServer (for Kubernetes) or configPath (for Docker) or csiEndpoint (for CSI).\"" plugin=9

After downgrading to 20.07 it works.

@xd999e

xd999e commented May 17, 2021

Yep, still the same error too.
"no such file or directory." - 20.07 is the last working version.

@gnarl gnarl reopened this May 17, 2021
@rgadwagner

Can add that 21.07 is the exact same. Still throwing a config error on trying to re-enable the plugin.

Do we have a workaround here? I don't think we're going to be able to "uninstall/reinstall" with existing netapp volumes in use by containers right? To "fix" this would be a complete shutdown of the cluster / containers to uninstall/reinstall the driver?

@jayooin

jayooin commented Sep 30, 2021

Hello,
Here is what I found, after troubleshooting for 2h to find what's wrong.

My system:
OS: RockyLinux8.4
Disks:

  • /
  • /var/lib/docker

I have two systems, which act differently.
Machine A has a separate mount point for /var/lib/docker, while Machine B has /var/lib/docker directly on the main / filesystem.

With the version 21.07 of the plugin

  • It works on Machine B, no problems.
  • On Machine A, it fails to mount on the right path and instead creates a /plugins/xxxx at root level, while /var/lib/plugins/xxx/propagated-mount remains empty. Note that the volume is created on NetApp; only the mount is incorrect.

However, I went back down to version 20.07 on Machine A, and it works as intended.

The error messages:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting "/var/lib/docker/plugins/903ffb35c151c88a60aa810b5b49fbcd1344cd9e7b702eae2f9aa1b38ec9b567/propagated-mount/netappdvp_testVol" to rootfs at "/myvol" caused: stat /var/lib/docker/plugins/903ffb35c151c88a60aa810b5b49fbcd1344cd9e7b702eae2f9aa1b38ec9b567/propagated-mount/netappdvp_testVol: no such file or directory: unknown.
ERRO[0003] error waiting for container: context canceled

Where it actually mounted:

netappserver:/netappdvp_testVol on /plugins/903ffb35c151c88a60aa810b5b49fbcd1344cd9e7b702eae2f9aa1b38ec9b567/propagated-mount/netappdvp_testVol type nfs (rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=xxx.xxx.xxx.xxx,mountvers=3,mountport=635,mountproto=udp,local_lock=none,addr=xxx.xxx.xxx.xxx)
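
The mismatch between the path Docker expects and the path that was actually mounted can be detected from `mount` output. The sketch below is a minimal Python example under stated assumptions: it expects lines in the `source on target type fstype (options)` layout shown above, treats `/var/lib/docker` as the Docker root, and uses a hypothetical helper name `misplaced_plugin_mounts`; the plugin ids in the sample are shortened placeholders.

```python
import re

# One line of `mount` output: <source> on <target> type <fstype> (<options>)
MOUNT_LINE = re.compile(
    r"^(?P<source>\S+) on (?P<target>\S+) type (?P<fstype>\S+) \((?P<options>[^)]*)\)$"
)


def misplaced_plugin_mounts(mount_output, docker_root="/var/lib/docker"):
    """List (source, target) pairs whose propagated-mount path is not
    under <docker_root>/plugins/."""
    expected_prefix = docker_root.rstrip("/") + "/plugins/"
    misplaced = []
    for line in mount_output.splitlines():
        match = MOUNT_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the expected layout
        target = match.group("target")
        if "/propagated-mount/" in target and not target.startswith(expected_prefix):
            misplaced.append((match.group("source"), target))
    return misplaced


# Shortened sample: the first mount is misplaced, the second is where Docker looks.
SAMPLE = """\
netappserver:/netappdvp_testVol on /plugins/903ffb35/propagated-mount/netappdvp_testVol type nfs (rw,relatime,vers=3)
netappserver:/netappdvp_other on /var/lib/docker/plugins/903ffb35/propagated-mount/netappdvp_other type nfs (rw,relatime,vers=3)
"""

if __name__ == "__main__":
    for source, target in misplaced_plugin_mounts(SAMPLE):
        print(f"{source} mounted at unexpected path {target}")
```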

@crossond

crossond commented Feb 15, 2022 via email

@rgadwagner

@jayooin We're not seeing the exact same behavior (I assume by your statement of "creates /plugins/xxx at root level" you actually mean /plugins/xxx off of / and not /var/lib/docker/plugins/xxx off of /var/lib/docker, right?) but it is interesting to note that you're not having problems with a box whose /var/lib/docker is NOT its own mount point from the root drive...our setup ALSO has /var/lib/docker as its own mount.

We do not see additional directories created off the / path, /var/lib/docker/plugins is a directory that exists on our systems and under that we get a hash of the netapp plugin path. I haven't dug in to the information under that much, except for the times where the netapp driver goes off the rails and stops writing to NFS and starts writing to Local Host, but that's an ENTIRELY different problem than what we're discussing here.

And yes, I can confirm that 20.07 is the last version of the plugin to be functional.

@gnarl
Contributor

gnarl commented Feb 18, 2022

Hi @rgadwagner, we're looking into this issue. Can you let us know which version of Docker and which Operating System and version you are using?

@rgadwagner

Plenty of various versions from 18.07 all the way up to 20.10 of docker. The OS is almost exclusively CentOS7. The last time we tried updating past 20.07 the OS Test Bed was the most recent patch (at the time) of Docker 20.10 on a fully patched (at that time) CentOS7.

Please keep in mind that it LOOKS like this only exists with installations that have been brought up from prior versions. To truly test this you're going to want to install a version of the driver at like 18.04 or something like that, run containers off of it and then update to beyond 20.07 netapp...from what I understand, the issue doesn't occur with a fresh install or with an upgrade from an installation made past a certain point.

I can tell you we've had this driver installed since before it became Trident (though we've uninstalled/reinstalled on various machines) and there's a memory at the back of my mind that says there was some major architectural change during one of the versions a few years back.
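
The test scenario described above (install an old plugin version, then upgrade past 20.07) can be written down as a dry-run command plan. This Python sketch only builds the argument lists and does not execute anything; the commands and flags are the ones quoted earlier in this issue, while the function name `upgrade_plan` and the choice of 19.10 and 20.10 as tags are assumptions for illustration. Running containers against the old version between the install and disable steps is left as a manual step.

```python
def upgrade_plan(alias, image, old_tag, new_tag, config):
    """Return the docker commands for an install-then-upgrade test as argv lists."""
    ref = f"{alias}:latest"
    return [
        # Install the old version, as in the original report.
        ["docker", "plugin", "install", "--grant-all-permissions",
         "--alias", alias, f"{image}:{old_tag}", f"config={config}"],
        # Upgrade sequence from the documented update procedure.
        ["docker", "plugin", "disable", "-f", ref],
        ["docker", "plugin", "upgrade", "--skip-remote-check",
         "--grant-all-permissions", ref, f"{image}:{new_tag}"],
        ["docker", "plugin", "enable", ref],
    ]


if __name__ == "__main__":
    plan = upgrade_plan("netapp", "netapp/trident-plugin", "19.10", "20.10",
                        "/etc/netappdvp/config.json")
    for argv in plan:
        print(" ".join(argv))
```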

@gnarl
Contributor

gnarl commented Feb 18, 2022

@rgadwagner, thanks for the detailed information. Establishing the scenario that needs to be tested is very helpful. You're right that there have been a few architectural changes over the years. Thanks for sticking with Trident all this time.

@gnarl
Contributor

gnarl commented Mar 23, 2022

This issue is fixed with commit 7943641 and will be included in the Trident 22.04 release.
