
msg="GRPC error: rpc error: code = NotFound desc = node XXXXX was not found" #328

Closed
titansmc opened this issue Jan 10, 2020 · 19 comments

@titansmc

Describe the bug
Following the basic example in the documentation fails to attach the volume to the Pod.

Environment
Provide accurate information about the environment to help us reproduce the issue.

[root@k3n trident-installer]# ./tridentctl -n trident get backend
+----------------------+----------------+--------------------------------------+--------+---------+
|         NAME         | STORAGE DRIVER |                 UUID                 | STATE  | VOLUMES |
+----------------------+----------------+--------------------------------------+--------+---------+
| ontapnas_10.11.5.186 | ontap-nas      | 57a270cb-051a-4107-8146-1111111e7a5 | online |       2 |
+----------------------+----------------+--------------------------------------+--------+---------+


[root@k3n trident-installer]# ./tridentctl -n trident  version
+----------------+----------------+
| SERVER VERSION | CLIENT VERSION |
+----------------+----------------+
| 19.10.0        | 19.10.0        |
+----------------+----------------+

Docker

Client: Docker Engine - Community
 Version:           19.03.5
 API version:       1.39 (downgraded from 1.40)
 Go version:        go1.12.12
 Git commit:        633a0ea
 Built:             Wed Nov 13 07:25:41 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.7
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       2d0083d
  Built:            Thu Jun 27 17:26:28 2019
  OS/Arch:          linux/amd64
  Experimental:     false

k8s version

[root@k3n trident-installer]# kubectl  version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:16:51Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.5", GitCommit:"20c265fef0741dd71a66480e35bd69f18351daea", GitTreeState:"clean", BuildDate:"2019-10-15T19:07:57Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
[root@k3n trident-installer]# kubectl get nodes
NAME          STATUS   ROLES    AGE   VERSION
k1m.domain.com   Ready    master   28d   v1.15.5
k3n.domain.com   Ready    <none>   28d   v1.15.5
k4n.domain.com   Ready    <none>   28d   v1.15.5

To Reproduce
Follow the basic example

Expected behavior
The created volume is attached to the Pod.

Additional context
I also see in the logs errors related to iSCSI, which I believe we are not using.

time="2019-12-12T09:47:06Z" level=warning msg="Couldn't retrieve volume transaction logs: Unable to find key"
time="2019-12-12T09:47:06Z" level=info msg="Trident bootstrapped successfully."
time="2019-12-12T09:47:06Z" level=info msg="Activating plain CSI helper frontend."
time="2019-12-12T09:47:06Z" level=info msg="Activating CSI frontend."
time="2019-12-12T09:47:06Z" level=info msg="Listening for GRPC connections." name=/plugin/csi.sock net=unix
time="2019-12-12T09:47:06Z" level=error msg="Error gathering initiator names."
time="2019-12-12T09:47:06Z" level=error msg="Could not get iSCSI initiator name." error="exit status 1"

@kmwm3

kmwm3 commented Jan 14, 2020

I am having the same issue.

Docker version 18.06.2-ce
K8s version 1.16.3
Trident version 19.10
Storage driver - ontap-nas

tridentctl logs

"Node info not found." node=<node_name>
"GRPC error: rpc error: code = NotFound desc = node <node_name> was not found"

kubectl describe pod that's requesting the pvc
AttachVolume.Attach failed for volume "pvc-245d157b-f450-4fed-8e0b-29affcb6d53b" : rpc error: code = NotFound desc = node <node_name> was not found

I think this may have something to do with an old install that did not clean up properly. How can we completely remove Trident to try again? I have tried clearing out the Trident entries in /var/lib/kubelet and in /var/lib/trident, but to no avail so far.

@balaramesh
Contributor

@titansmc and @kmwm3, can you share some more info on your k8s environment? Are you running vanilla k8s? What's the OS on your underlying nodes?

@titansmc
Author

titansmc commented Jan 14, 2020 via email

@kmwm3

kmwm3 commented Jan 15, 2020

@titansmc and @kmwm3, can you share some more info on your k8s environment? Are you running vanilla k8s? What's the OS on your underlying nodes?

I am running vanilla k8s on RHEL 7.7.

@teramucho

I have the same issue.

The problem is that Trident did not pick up all of my cluster nodes. Going through the logs, only some of the cluster nodes joined, so PVCs can only be mounted on those specific nodes; on all others they fail.


time="2020-02-04T09:18:10Z" level=debug msg="Authenticated by HTTPS REST frontend." peerCert=trident-node
time="2020-02-04T09:18:10Z" level=debug msg="REST API call received." duration="1.523µs" method=PUT requestID=bosjdknr0f3d5tg4cl0g route=AddOrUpdateNode uri=/trident/v1/node/ddp-deveco-master02
time="2020-02-04T09:18:10Z" level=info msg="Added a new node." handler=AddOrUpdateNode node=ddp-deveco-master02
time="2020-02-04T09:18:10Z" level=debug msg="REST API call complete." duration=6.158862ms method=PUT requestID=bosjdknr0f3d5tg4cl0g route=AddOrUpdateNode uri=/trident/v1/node/ddp-deveco-master02
time="2020-02-04T09:18:17Z" level=debug msg="REST API call received." duration="2.491µs" method=GET requestID=bosjdmfr0f3d5tg4cl10 route=GetVersion uri=/trident/v1/version
time="2020-02-04T09:18:17Z" level=debug msg="REST API call complete." duration="161.897µs" method=GET requestID=bosjdmfr0f3d5tg4cl10 route=GetVersion uri=/trident/v1/version
time="2020-02-04T09:18:34Z" level=debug msg="Authenticated by HTTPS REST frontend." peerCert=trident-node
time="2020-02-04T09:18:34Z" level=debug msg="REST API call received." duration="1.538µs" method=PUT requestID=bosjdqnr0f3d5tg4cl1g route=AddOrUpdateNode uri=/trident/v1/node/ddp-deveco-master03
time="2020-02-04T09:18:34Z" level=info msg="Added a new node." handler=AddOrUpdateNode node=ddp-deveco-master03
time="2020-02-04T09:18:34Z" level=debug msg="REST API call complete." duration=5.725727ms method=PUT requestID=bosjdqnr0f3d5tg4cl1g route=AddOrUpdateNode uri=/trident/v1/node/ddp-deveco-master03
time="2020-02-04T09:18:58Z" level=debug msg="Storage class updated in cache." name=nfs-client parameters="map[backendType:ontap-nas snapshots:true]" provisioner=csi.trident.netapp.io
time="2020-02-04T09:19:08Z" level=debug msg="REST API call received." duration="3.05µs" method=POST requestID=bosje37r0f3d5tg4cl20 route=AddBackend uri=/trident/v1/backend

@gnarl
Contributor

gnarl commented Feb 4, 2020

@teramucho, Kubernetes calls Trident's API to add the node once it is successfully registered. If a node in the cluster isn't added to Trident then that node may not have properly registered. Check the Trident node and driver registrar sidecar logs for errors. Also, check the kubelet logs. If this doesn't resolve your issue please contact NetApp Support.
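The checks described above can be sketched as shell commands. This is only a sketch: the pod label (app=node.csi.trident.netapp.io) and container names (trident-main, driver-registrar) are assumptions based on a default Trident CSI install and may differ in your cluster.

```shell
# Find the Trident node pod running on the unregistered worker
# (label is an assumption for a default Trident CSI install).
kubectl -n trident get pods -l app=node.csi.trident.netapp.io -o wide

# Inspect the Trident container and the driver-registrar sidecar of
# that pod; replace <trident-node-pod> with the pod name from above.
kubectl -n trident logs <trident-node-pod> -c trident-main
kubectl -n trident logs <trident-node-pod> -c driver-registrar

# Kubelet logs on the node itself (systemd-based hosts).
journalctl -u kubelet --since "1 hour ago" | grep -i csi
```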

@gnarl
Contributor

gnarl commented Feb 5, 2020

All, a fix was just merged to address a situation where K8S DNS is not configured properly, which can lead to the error reported in this issue. Trident patches that contain the fix will be released in the near future. Thanks for your patience.

@gnarl
Contributor

gnarl commented Feb 28, 2020

This issue was fixed with the Trident 20.01.1 release.

@presidenten

presidenten commented Apr 25, 2020

@gnarl Still got the issue on one of our clusters:

 $ tridentctl -n trident get backend
+------------------+----------------+--------------------------------------+--------+---------+
|       NAME       | STORAGE DRIVER |                 UUID                 | STATE  | VOLUMES |
+------------------+----------------+--------------------------------------+--------+---------+
| <redacted>       | ontap-nas      | <redacted>                           | online |       1 |
+------------------+----------------+--------------------------------------+--------+---------+
$
$
$ tridentctl -n trident version
+----------------+----------------+
| SERVER VERSION | CLIENT VERSION |
+----------------+----------------+
| 20.01.1        | 20.01.0        |
+----------------+----------------+

Trident can't find a few of the nodes in the cluster:

time="2020-04-25T14:28:49Z" level=error msg="Node info not found." node=node020
time="2020-04-25T14:28:49Z" level=error msg="GRPC error: rpc error: code = NotFound desc = node node020 was not found"
time="2020-04-25T14:28:49Z" level=error msg="Node info not found." node=node020
time="2020-04-25T14:28:49Z" level=error msg="GRPC error: rpc error: code = NotFound desc = node node020 was not found"
time="2020-04-25T14:28:50Z" level=error msg="Node info not found." node=node018
time="2020-04-25T14:28:50Z" level=error msg="GRPC error: rpc error: code = NotFound desc = node node018 was not found"
time="2020-04-25T14:28:50Z" level=error msg="Node info not found." node=node018
time="2020-04-25T14:28:50Z" level=error msg="GRPC error: rpc error: code = NotFound desc = node node018 was not found"

Any ideas what to try to get them up and running again?

These machines were correctly connected before. We then reinstalled the cluster (as training for new ops), and now the nodes don't get added anymore.
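One way to see which nodes are failing registration is to pull the distinct node names out of the controller logs. A minimal sketch, pure text processing; the sample lines below just mirror the log format shown above (in a live cluster, feed the same pipeline from `tridentctl -n trident logs` instead of the sample file):

```shell
# Sample Trident controller log lines (inlined here for illustration).
cat > /tmp/trident.log <<'EOF'
time="2020-04-25T14:28:49Z" level=error msg="Node info not found." node=node020
time="2020-04-25T14:28:50Z" level=error msg="Node info not found." node=node018
EOF

# Extract the distinct node names that failed registration.
grep -o 'node=[^ ]*' /tmp/trident.log | cut -d= -f2 | sort -u
```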

@ramancde

Is there any update on this issue?
Do we have a fix?

@bigg01

bigg01 commented Jun 29, 2020

I have this problem too, running OCP4: 80% of the nodes are working, the other 20% fail.

./tridentctl version
+----------------+----------------+
| SERVER VERSION | CLIENT VERSION |
+----------------+----------------+
| 20.04.0        | 20.04.0        |
+----------------+----------------+

Server Version: 4.4.9
Kubernetes Version: v1.17.1+912792b

The node is missing because the TridentNode object was not created; this is visible with "oc get tridentnode".
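One way to spot the gap is to diff the registered TridentNode objects against the cluster's node list. A minimal sketch (assumes the TridentNode CRD introduced with CSI Trident and a default "trident" namespace):

```shell
# Nodes that Trident knows about vs. nodes in the cluster; any name
# printed by comm is in the cluster but missing from Trident.
oc get tridentnodes -n trident -o name | sed 's|.*/||' | sort > /tmp/registered
oc get nodes -o name | sed 's|.*/||' | sort > /tmp/cluster
comm -13 /tmp/registered /tmp/cluster
```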

@gnarl gnarl reopened this Jun 30, 2020
@gnarl
Contributor

gnarl commented Jul 22, 2020

Hi @presidenten, @ramancde, and @bigg01, we've investigated the issue and have not been able to reproduce it. If you see the issue again, please contact NetApp support and provide Trident logs so that we can determine what is causing the issue.

@torirevilla
Contributor

torirevilla commented Aug 6, 2020

There are two likely scenarios why Trident does not find a Kubernetes node. It can be because of a networking issue within Kubernetes or a DNS issue. The Trident node daemonset that runs on each Kubernetes node must be able to communicate with the Trident controller to register the node with Trident. If networking changes occurred after Trident was installed this problem may only be observed with new Kubernetes nodes that are added to the cluster.
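That node-to-controller path can be probed directly. A sketch only: the service name trident-csi and namespace trident assume a default install, <trident-node-pod> is a placeholder for the node pod on the affected worker, and the container name trident-main is an assumption.

```shell
# From the node pod on the affected worker, check that the controller
# service resolves; failure here matches the registration errors above.
kubectl -n trident exec <trident-node-pod> -c trident-main -- \
  nslookup trident-csi.trident.svc.cluster.local
```

If nslookup is not available inside the Trident image, the same lookup can be run from any debug pod scheduled on that node.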

@khatrig

khatrig commented Sep 3, 2020

There are two likely scenarios why Trident does not find a Kubernetes node. It can be because of a networking issue within Kubernetes or a DNS issue. The Trident node daemonset that runs on each Kubernetes node must be able to communicate with the Trident controller to register the node with Trident. If networking changes occurred after Trident was installed this problem may only be observed with new Kubernetes nodes that are added to the cluster.

This matches the kind of issue I am facing. Only newly added nodes won't register with the trident. I tried restarting the trident pods, tried removing/adding the impacted nodes but nothing helps. There have been no networking changes on the cluster and I don't see any networking/DNS related issues on the cluster.

Any pointers on how I can investigate this further?

@oleimann

oleimann commented Sep 3, 2020

The same error about not finding the node (not registered with the Trident controller) seems to happen with K8s 1.17 and Trident 20.07 when the Kubernetes Autoscaler adds a node to schedule a pod: the PV for the pod doesn't get attached as a consequence, and the Pod stays Pending.
Do nodes in the "free pool" need to be prepared with Trident somehow, so the daemon is available when the node starts up and it can register?

@gnarl
Contributor

gnarl commented Sep 3, 2020

@khatrig and @oleimann,

As indicated above we haven't been able to reproduce this issue yet. Please open a case with NetApp support so that we can collect additional information.

To open a case with NetApp, please go to https://mysupport.netapp.com/site/.

  • Bottom left, click on 'Contact Support'.
  • Find the appropriate number for your region to call in, or log in.
  • Note: Trident is not listed on the page, but it is a product supported by NetApp based on a supported NetApp storage SN.
  • Open the case on the NetApp storage SN and provide a description of the problem.
  • Be sure to mention that the product is Trident on Kubernetes, and provide the details. Mention this GitHub issue.
  • The case will be directed to Trident support engineers for a response.

@khatrig

khatrig commented Sep 15, 2020

In my case, it turned out to be a DNS issue on some nodes: the trident-csi pod running on those nodes could not resolve the trident-csi.trident service, and hence could not register the node.

@gnarl
Contributor

gnarl commented Sep 15, 2020

@khatrig thanks for updating this issue.

@gnarl
Contributor

gnarl commented Oct 7, 2020

For everyone who encountered this issue, it was determined that either a DNS or a networking issue kept the Trident node DaemonSet from registering with the Trident controller. Commit 8e51987 improves the Info log message to help the Trident user resolve this registration issue.
