nvidia-driver-daemonset pod CrashLoopBackOff after restart #83

Closed · rbo opened this issue Sep 20, 2020 · 12 comments

rbo commented Sep 20, 2020

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD? (kubectl describe clusterpolicies --all-namespaces)
$ kubectl describe clusterpolicies --all-namespaces
Name:         cluster-policy
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  nvidia.com/v1
Kind:         ClusterPolicy
Metadata:
  Creation Timestamp:  2020-09-20T12:31:58Z
  Generation:          1
  Resource Version:    511518
  Self Link:           /apis/nvidia.com/v1/clusterpolicies/cluster-policy
  UID:                 25d885e7-63b5-4f1c-b0c5-d0dd4f1cc52e
Spec:
  Dcgm Exporter:
    Image:       dcgm-exporter
    Repository:  nvidia
    Version:     1.7.2-2.0.0-rc.9-ubi8
  Device Plugin:
    Image:       k8s-device-plugin
    Repository:  nvidia
    Version:     1.0.0-beta6-ubi8
  Driver:
    Image:       driver
    Repository:  nvidia
    Version:     440.64.00
  Operator:
    Default Runtime:  crio
  Toolkit:
    Image:       container-toolkit
    Repository:  nvidia
    Version:     1.0.2-ubi8
Status:
  State:  ready
Events:   <none>
$

1. Issue or feature description

If you delete the nvidia-driver-daemonset pod after a successful NVIDIA GPU Operator 1.1.7-r2 installation on OpenShift 4.4.16, the new pod gets stuck in CrashLoopBackOff. The problem is:

+ echo 'Could not unload NVIDIA driver kernel modules, driver is in use'
Could not unload NVIDIA driver kernel modules, driver is in use

nvidia-driver-daemonset-phpf7-nvidia-driver-ctr.log

Of course, the kernel module is still loaded. Rebooting the node solves the problem, because the daemonset can then install and load the kernel modules again. If you delete the pod again, the problem reappears.

Technically it is not really a problem; the GPUs keep working the whole time. But the user experience is not great.
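
For reference, a quick way to confirm the state on the node (a sketch; the node name is a placeholder):

# Check on the node that the nvidia module is still loaded and in use
# (the refcnt file is the same one the driver container inspects).
$ oc debug node/<gpu-node> -- chroot /host sh -c 'lsmod | grep ^nvidia; cat /sys/module/nvidia/refcnt'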


2. Steps to reproduce the issue

Delete one of the nvidia-driver-daemonset pods, e.g. as sketched below.
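
A sketch of the reproduction (namespace and pod name as they appear in this cluster; yours will differ):

$ kubectl get pods -n gpu-operator-resources -l app=nvidia-driver-daemonset
$ kubectl delete pod -n gpu-operator-resources nvidia-driver-daemonset-phpf7
# The DaemonSet recreates the pod, and the new pod ends up in CrashLoopBackOff.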

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces

  • kubernetes daemonset status: kubectl get ds --all-namespaces

  • If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME

  • If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME

  • Output of running a container on the GPU machine: docker run -it alpine echo foo

  • Docker configuration file: cat /etc/docker/daemon.json

  • Docker runtime configuration: docker info | grep runtime

  • NVIDIA shared directory: ls -la /run/nvidia

  • NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit

  • NVIDIA driver directory: ls -la /run/nvidia/driver

  • kubelet logs journalctl -u kubelet > kubelet.logs

@pichuang

Same issue here.


creydr commented Sep 25, 2020

We have the same issue on vanilla Kubernetes.

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • only ipmi_msghandler, because of "Is the i2c_core module required with a generic kernel version" #24
  • Did you apply the CRD? (kubectl describe clusterpolicies --all-namespaces)

In our case this leads to unallocatable GPUs:

$ kubectl describe node xy
...
Capacity:
  ...
  nvidia.com/gpu:     0
Allocatable:
  ...
  nvidia.com/gpu:     0


pdmack commented Sep 25, 2020

Is this an install on a standalone GPU host where the drivers are already installed? Also, @creydr, can you find an existing issue or create a new one for your specific problem?


creydr commented Sep 28, 2020

The drivers have not been installed beforehand. @pdmack we have the same issue (getting a "Could not unload NVIDIA driver kernel modules, driver is in use"). The only difference is that the capacity & allocatable resources are unset as well.


geoberle commented Oct 2, 2020

Is this an install on a standalone GPU host where the drivers are already installed? Also, @creydr, can you find an existing issue or create a new one for your specific problem?

We see the same problem on a fresh RHCOS 4.4 node where no modifications have been made to the operating system whatsoever (no drivers were installed prior to the operator installation).

@shivamerla

It's a known issue currently: the driver container always tries to unload and then install and reload drivers with the same version. We will look into this in future releases.
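
For context, a simplified sketch of the unload guard in the driver container's entrypoint (not the actual nvidia-driver script; the refcnt check and the error message mirror the log excerpts in this thread):

# Abort the reinstall if the nvidia module still has users beyond its own
# dependent modules; otherwise unload so the driver can be installed again.
nvidia_refs=0
if [ -f /sys/module/nvidia/refcnt ]; then
    nvidia_refs=$(< /sys/module/nvidia/refcnt)
fi
if [ "$nvidia_refs" -gt 2 ]; then
    echo 'Could not unload NVIDIA driver kernel modules, driver is in use'
    exit 1
fi
rmmod nvidia-modeset nvidia-uvm nvidia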

shivamerla self-assigned this Dec 17, 2020

maximveksler commented Mar 20, 2021

We are seeing the same issue on gpu-operator 1.6.2:

root@e2689e558c04:/code/infrastructure/ansible# kubectl get pods -A
NAMESPACE                NAME                                                          READY   STATUS             RESTARTS   AGE
kube-system              helm-install-traefik-xldb8                                    0/1     Completed          0          2d21h
prometheus               svclb-prometheus-grafana-dgcsr                                1/1     Running            3          2d21h
kube-system              metrics-server-86cbb8457f-tkgrm                               1/1     Running            3          2d21h
kube-system              coredns-854c77959c-7lh5v                                      1/1     Running            3          2d21h
kube-system              local-path-provisioner-5ff76fc89d-cpfd5                       1/1     Running            3          2d21h
prometheus               prometheus-kube-prometheus-operator-8cfb4bbd6-wl545           1/1     Running            3          2d21h
kube-system              svclb-traefik-6dv4f                                           2/2     Running            6          2d21h
prometheus               prometheus-prometheus-kube-prometheus-prometheus-0            2/2     Running            7          2d21h
prometheus               prometheus-grafana-8ff59c97-r82tg                             2/2     Running            6          2d21h
default                  influxdb-0                                                    1/1     Running            3          2d21h
kube-system              traefik-6f9cbd9bd4-6k6sx                                      1/1     Running            3          2d21h
prometheus               svclb-prometheus-grafana-hhp4d                                1/1     Running            7          2d21h
kube-system              svclb-traefik-9425f                                           2/2     Running            21         2d21h
gpu-operator             gpu-operator-node-feature-discovery-master-7994f664cc-qp5tj   1/1     Running            0          23s
gpu-operator             gpu-operator-node-feature-discovery-worker-zvdzh              1/1     Running            0          23s
gpu-operator             gpu-operator-node-feature-discovery-worker-vknf7              1/1     Running            0          23s
gpu-operator             gpu-operator-bdbb55b77-qgzjx                                  1/1     Running            0          23s
gpu-operator-resources   nvidia-driver-daemonset-mkf8v                                 0/1     CrashLoopBackOff   1          19s
root@e2689e558c04:/code/infrastructure/ansible#  kubectl describe pod nvidia-driver-daemonset-mkf8v -n gpu-operator-resources
Name:         nvidia-driver-daemonset-mkf8v
Namespace:    gpu-operator-resources
Priority:     0
Node:         decider/172.19.0.2
Start Time:   Sat, 20 Mar 2021 12:12:15 +0000
Labels:       app=nvidia-driver-daemonset
              controller-revision-hash=5cd4745b76
              pod-template-generation=1
Annotations:  scheduler.alpha.kubernetes.io/critical-pod:
Status:       Running
IP:           10.42.1.65
IPs:
  IP:           10.42.1.65
Controlled By:  DaemonSet/nvidia-driver-daemonset
Containers:
  nvidia-driver-ctr:
    Container ID:  docker://00034c05e6db44f2c9fb5abd4e0d14d70fbef42e9b4eb246c994c8bb4ce169a4
    Image:         nvcr.io/nvidia/driver:460.32.03-ubuntu18.04
    Image ID:      docker-pullable://nvcr.io/nvidia/driver@sha256:a2064f81deaaa907fdab3bc68af5c0ad321e573aee562736f8863097d71a71aa
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-driver
    Args:
      init
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver error: failed to process request: unknown
      Exit Code:    128
      Started:      Sat, 20 Mar 2021 12:14:55 +0000
      Finished:     Sat, 20 Mar 2021 12:14:55 +0000
    Ready:          False
    Restart Count:  5
    Environment:    <none>
    Mounts:
      /dev/log from dev-log (rw)
      /etc/containers/oci/hooks.d from config (rw)
      /run/nvidia from run-nvidia (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from nvidia-driver-token-rmh9t (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:
  var-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:
  dev-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nvidia-driver
    Optional:  false
  nvidia-driver-token-rmh9t:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nvidia-driver-token-rmh9t
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  nvidia.com/gpu.present=true
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
                 nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  5m4s                  default-scheduler  Successfully assigned gpu-operator-resources/nvidia-driver-daemonset-mkf8v to decider
  Normal   Pulled     3m49s (x5 over 5m3s)  kubelet            Container image "nvcr.io/nvidia/driver:460.32.03-ubuntu18.04" already present on machine
  Normal   Created    3m49s (x5 over 5m3s)  kubelet            Created container nvidia-driver-ctr
  Warning  Failed     3m48s (x5 over 5m3s)  kubelet            Error: failed to start container "nvidia-driver-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver error: failed to process request: unknown
  Warning  BackOff    3m3s (x10 over 5m1s)  kubelet            Back-off restarting failed container

The OS is Ubuntu 18.04.

No NVIDIA driver is installed at the OS level.

The Kubernetes distribution is k3s.

root@e2689e558c04:/code/infrastructure/ansible# kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4+k3s1", GitCommit:"838a906ab5eba62ff529d6a3a746384eba810758", GitTreeState:"clean", BuildDate:"2021-02-22T19:49:27Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}

Docker is

root@e2689e558c04:/code/infrastructure/ansible# docker --version
Docker version 20.10.5, build 55c4c88


lgc0313 commented Mar 24, 2021

Same error on OCP 4.6.19 and gpu-operator 1.6.2.

@jackjliew

Same error on OCP 4.6.29 and gpu-operator 1.7.1.


shivamerla commented Jul 20, 2021

@jackjliew v1.8.0 adds a new initContainer to the driver pod which handles cleanup of the driver modules to avoid this. Let me know if you want to verify with early-access bits.
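
Once on v1.8.0+, a quick way to confirm the cleanup initContainer is present on the driver pod (a sketch; the initContainer name and image can differ between releases):

$ kubectl get pods -n gpu-operator-resources -l app=nvidia-driver-daemonset \
    -o jsonpath='{.items[0].spec.initContainers[*].name}'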


mattpielm commented Sep 28, 2021

So no solution until 1.8.x?
(Same issue: OCP 4.7.24, NVIDIA operator 1.7.1. The nodes were fine; we restarted the daemonset to update the proxy variable on the pods.)

========== NVIDIA Software Installer ==========

+ echo -e '\n========== NVIDIA Software Installer ==========\n'
Starting installation of NVIDIA driver version 460.73.01 for Linux kernel version 4.18.0-305.10.2.el8_4.x86_64

+ echo -e 'Starting installation of NVIDIA driver version 460.73.01 for Linux kernel version 4.18.0-305.10.2.el8_4.x86_64\n'
+ exec
+ flock -n 3
+ echo 653899
+ trap 'echo '\''Caught signal'\''; exit 1' HUP INT QUIT PIPE TERM
+ trap _shutdown EXIT
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
Stopping NVIDIA persistence daemon...
+ echo 'Stopping NVIDIA persistence daemon...'
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
+ '[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
Unloading NVIDIA driver kernel modules...
+ echo 'Unloading NVIDIA driver kernel modules...'
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ nvidia_modeset_refs=0
+ rmmod_args+=("nvidia-modeset")
+ (( ++nvidia_deps ))
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ nvidia_uvm_refs=0
+ rmmod_args+=("nvidia-uvm")
+ (( ++nvidia_deps ))
+ '[' -f /sys/module/nvidia/refcnt ']'
+ nvidia_refs=155
+ rmmod_args+=("nvidia")
+ '[' 155 -gt 2 ']'
+ echo 'Could not unload NVIDIA driver kernel modules, driver is in use'
Could not unload NVIDIA driver kernel modules, driver is in use
+ return 1
+ exit 1
+ _shutdown
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
Stopping NVIDIA persistence daemon...
+ local nvidia_modeset_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
+ '[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
Unloading NVIDIA driver kernel modules...
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ nvidia_modeset_refs=0
+ rmmod_args+=("nvidia-modeset")
+ (( ++nvidia_deps ))
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ nvidia_uvm_refs=0
+ rmmod_args+=("nvidia-uvm")
+ (( ++nvidia_deps ))
+ '[' -f /sys/module/nvidia/refcnt ']'
+ nvidia_refs=155
+ rmmod_args+=("nvidia")
+ '[' 155 -gt 2 ']'
+ echo 'Could not unload NVIDIA driver kernel modules, driver is in use'
Could not unload NVIDIA driver kernel modules, driver is in use
+ return 1
+ return 1

@shivamerla

@mattpielm This has been fixed with v1.8.x. Please use v1.8.2, which was released last week to the RH catalog. By the way, you don't need to edit proxy variables manually on the driver DaemonSet; the GPU Operator will automatically inject them if they are set in the cluster-wide proxy.
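
For reference, the cluster-wide proxy values the operator picks up live on the cluster Proxy object, e.g. (a sketch; output depends on your cluster configuration):

$ oc get proxy/cluster -o jsonpath='{.spec}'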
