nvidia-driver-daemonset pod CrashLoopBackOff after restart #83

Closed · rbo opened this issue Sep 20, 2020 · 12 comments

rbo commented Sep 20, 2020

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD? (kubectl describe clusterpolicies --all-namespaces)
$ kubectl describe clusterpolicies --all-namespaces
Name:         cluster-policy
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  nvidia.com/v1
Kind:         ClusterPolicy
Metadata:
  Creation Timestamp:  2020-09-20T12:31:58Z
  Generation:          1
  Resource Version:    511518
  Self Link:           /apis/nvidia.com/v1/clusterpolicies/cluster-policy
  UID:                 25d885e7-63b5-4f1c-b0c5-d0dd4f1cc52e
Spec:
  Dcgm Exporter:
    Image:       dcgm-exporter
    Repository:  nvidia
    Version:     1.7.2-2.0.0-rc.9-ubi8
  Device Plugin:
    Image:       k8s-device-plugin
    Repository:  nvidia
    Version:     1.0.0-beta6-ubi8
  Driver:
    Image:       driver
    Repository:  nvidia
    Version:     440.64.00
  Operator:
    Default Runtime:  crio
  Toolkit:
    Image:       container-toolkit
    Repository:  nvidia
    Version:     1.0.2-ubi8
Status:
  State:  ready
Events:   <none>
$

1. Issue or feature description

If you delete the nvidia-driver-daemonset pod after a successful NVIDIA GPU Operator 1.1.7-r2 installation on OpenShift 4.4.16, the new pod gets stuck in CrashLoopBackOff. The problem is:

+ echo 'Could not unload NVIDIA driver kernel modules, driver is in use'
Could not unload NVIDIA driver kernel modules, driver is in use

nvidia-driver-daemonset-phpf7-nvidia-driver-ctr.log

Of course, the kernel module is still loaded. Rebooting the node solves the problem, because the daemonset can then install and load the kernel modules again. If you delete the pod again, the problem reappears.

Technically it is not really a problem; the GPUs keep working the whole time. But the user experience is not great.
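
For reference, a quick way to confirm the state on the node (a sketch; the node name is a placeholder):

# Check on the node that the nvidia module is still loaded and in use
# (the refcnt file is the same one the driver container inspects).
$ oc debug node/<gpu-node> -- chroot /host sh -c 'lsmod | grep ^nvidia; cat /sys/module/nvidia/refcnt'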


2. Steps to reproduce the issue

Delete one of the nvidia-driver-daemonset pods, e.g. as sketched below.
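
A sketch of the reproduction (namespace and pod name as they appear in this cluster; yours will differ):

$ kubectl get pods -n gpu-operator-resources -l app=nvidia-driver-daemonset
$ kubectl delete pod -n gpu-operator-resources nvidia-driver-daemonset-phpf7
# The DaemonSet recreates the pod, and the new pod ends up in CrashLoopBackOff.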

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces

  • kubernetes daemonset status: kubectl get ds --all-namespaces

  • If a pod/ds is in an error state or pending state kubectl describe pod -n NAMESPACE POD_NAME

  • If a pod/ds is in an error state or pending state kubectl logs -n NAMESPACE POD_NAME

  • Output of running a container on the GPU machine: docker run -it alpine echo foo

  • Docker configuration file: cat /etc/docker/daemon.json

  • Docker runtime configuration: docker info | grep runtime

  • NVIDIA shared directory: ls -la /run/nvidia

  • NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit

  • NVIDIA driver directory: ls -la /run/nvidia/driver

  • kubelet logs journalctl -u kubelet > kubelet.logs

@pichuang

Same issue here.


creydr commented Sep 25, 2020

We have the same issue on vanilla Kubernetes.

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • only ipmi_msghandler, because of "Is the i2c_core module required with a generic kernel version" #24
  • Did you apply the CRD? (kubectl describe clusterpolicies --all-namespaces)

In our case this leads to unallocatable GPUs:

$ kubectl describe node xy
...
Capacity:
  ...
  nvidia.com/gpu:     0
Allocatable:
  ...
  nvidia.com/gpu:     0


pdmack commented Sep 25, 2020

Is this an install on a standalone GPU host where the drivers are already installed? Also, @creydr, can you find an existing issue or create a new one for your specific problem?


creydr commented Sep 28, 2020

The drivers have not been installed beforehand. @pdmack we have the same issue (getting a "Could not unload NVIDIA driver kernel modules, driver is in use"). The only difference is that the capacity & allocatable resources are unset as well.


geoberle commented Oct 2, 2020

Is this an install on a standalone GPU host where the drivers are already installed? Also, @creydr, can you find an existing issue or create a new one for your specific problem?

We see the same problem on a fresh RHCOS 4.4 node where no modifications have been made to the operating system whatsoever (no drivers were installed prior to the operator installation).

@shivamerla

It's a known issue currently: the driver container always tries to unload and then install and reload drivers with the same version. We will look into this in future releases.
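
For context, a simplified sketch of the unload guard in the driver container's entrypoint (not the actual nvidia-driver script; the refcnt check and the error message mirror the log excerpts in this thread):

# Abort the reinstall if the nvidia module still has users beyond its own
# dependent modules; otherwise unload so the driver can be installed again.
nvidia_refs=0
if [ -f /sys/module/nvidia/refcnt ]; then
    nvidia_refs=$(< /sys/module/nvidia/refcnt)
fi
if [ "$nvidia_refs" -gt 2 ]; then
    echo 'Could not unload NVIDIA driver kernel modules, driver is in use'
    exit 1
fi
rmmod nvidia-modeset nvidia-uvm nvidia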

shivamerla self-assigned this Dec 17, 2020

maximveksler commented Mar 20, 2021

We are seeing the same issue on gpu-operator 1.6.2:

root@e2689e558c04:/code/infrastructure/ansible# kubectl get pods -A
NAMESPACE                NAME                                                          READY   STATUS             RESTARTS   AGE
kube-system              helm-install-traefik-xldb8                                    0/1     Completed          0          2d21h
prometheus               svclb-prometheus-grafana-dgcsr                                1/1     Running            3          2d21h
kube-system              metrics-server-86cbb8457f-tkgrm                               1/1     Running            3          2d21h
kube-system              coredns-854c77959c-7lh5v                                      1/1     Running            3          2d21h
kube-system              local-path-provisioner-5ff76fc89d-cpfd5                       1/1     Running            3          2d21h
prometheus               prometheus-kube-prometheus-operator-8cfb4bbd6-wl545           1/1     Running            3          2d21h
kube-system              svclb-traefik-6dv4f                                           2/2     Running            6          2d21h
prometheus               prometheus-prometheus-kube-prometheus-prometheus-0            2/2     Running            7          2d21h
prometheus               prometheus-grafana-8ff59c97-r82tg                             2/2     Running            6          2d21h
default                  influxdb-0                                                    1/1     Running            3          2d21h
kube-system              traefik-6f9cbd9bd4-6k6sx                                      1/1     Running            3          2d21h
prometheus               svclb-prometheus-grafana-hhp4d                                1/1     Running            7          2d21h
kube-system              svclb-traefik-9425f                                           2/2     Running            21         2d21h
gpu-operator             gpu-operator-node-feature-discovery-master-7994f664cc-qp5tj   1/1     Running            0          23s
gpu-operator             gpu-operator-node-feature-discovery-worker-zvdzh              1/1     Running            0          23s
gpu-operator             gpu-operator-node-feature-discovery-worker-vknf7              1/1     Running            0          23s
gpu-operator             gpu-operator-bdbb55b77-qgzjx                                  1/1     Running            0          23s
gpu-operator-resources   nvidia-driver-daemonset-mkf8v                                 0/1     CrashLoopBackOff   1          19s
root@e2689e558c04:/code/infrastructure/ansible#  kubectl describe pod nvidia-driver-daemonset-mkf8v -n gpu-operator-resources
Name:         nvidia-driver-daemonset-mkf8v
Namespace:    gpu-operator-resources
Priority:     0
Node:         decider/172.19.0.2
Start Time:   Sat, 20 Mar 2021 12:12:15 +0000
Labels:       app=nvidia-driver-daemonset
              controller-revision-hash=5cd4745b76
              pod-template-generation=1
Annotations:  scheduler.alpha.kubernetes.io/critical-pod:
Status:       Running
IP:           10.42.1.65
IPs:
  IP:           10.42.1.65
Controlled By:  DaemonSet/nvidia-driver-daemonset
Containers:
  nvidia-driver-ctr:
    Container ID:  docker://00034c05e6db44f2c9fb5abd4e0d14d70fbef42e9b4eb246c994c8bb4ce169a4
    Image:         nvcr.io/nvidia/driver:460.32.03-ubuntu18.04
    Image ID:      docker-pullable://nvcr.io/nvidia/driver@sha256:a2064f81deaaa907fdab3bc68af5c0ad321e573aee562736f8863097d71a71aa
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-driver
    Args:
      init
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver error: failed to process request: unknown
      Exit Code:    128
      Started:      Sat, 20 Mar 2021 12:14:55 +0000
      Finished:     Sat, 20 Mar 2021 12:14:55 +0000
    Ready:          False
    Restart Count:  5
    Environment:    <none>
    Mounts:
      /dev/log from dev-log (rw)
      /etc/containers/oci/hooks.d from config (rw)
      /run/nvidia from run-nvidia (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from nvidia-driver-token-rmh9t (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:
  var-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:
  dev-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nvidia-driver
    Optional:  false
  nvidia-driver-token-rmh9t:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nvidia-driver-token-rmh9t
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  nvidia.com/gpu.present=true
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
                 nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  5m4s                  default-scheduler  Successfully assigned gpu-operator-resources/nvidia-driver-daemonset-mkf8v to decider
  Normal   Pulled     3m49s (x5 over 5m3s)  kubelet            Container image "nvcr.io/nvidia/driver:460.32.03-ubuntu18.04" already present on machine
  Normal   Created    3m49s (x5 over 5m3s)  kubelet            Created container nvidia-driver-ctr
  Warning  Failed     3m48s (x5 over 5m3s)  kubelet            Error: failed to start container "nvidia-driver-ctr": Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver error: failed to process request: unknown
  Warning  BackOff    3m3s (x10 over 5m1s)  kubelet            Back-off restarting failed container

The OS is Ubuntu 18.04.

No NVIDIA driver is installed at the OS level.

The Kubernetes distribution is k3s.

root@e2689e558c04:/code/infrastructure/ansible# kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4+k3s1", GitCommit:"838a906ab5eba62ff529d6a3a746384eba810758", GitTreeState:"clean", BuildDate:"2021-02-22T19:49:27Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}

Docker is

root@e2689e558c04:/code/infrastructure/ansible# docker --version
Docker version 20.10.5, build 55c4c88


lgc0313 commented Mar 24, 2021

Same error on OCP 4.6.19 and gpu-operator 1.6.2.

@jackjliew

Same error on OCP 4.6.29 and gpu-operator 1.7.1.


shivamerla commented Jul 20, 2021

@jackjliew v1.8.0 adds a new initContainer to the driver pod which handles cleanup of the driver modules to avoid this. Let me know if you want to verify with early-access bits.
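
Once on v1.8.0+, a quick way to confirm the cleanup initContainer is present on the driver pod (a sketch; the initContainer name and image can differ between releases):

$ kubectl get pods -n gpu-operator-resources -l app=nvidia-driver-daemonset \
    -o jsonpath='{.items[0].spec.initContainers[*].name}'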


mattpielm commented Sep 28, 2021

So no solution until 1.8.x?
(Same issue: OCP 4.7.24, NVIDIA operator 1.7.1. The nodes were fine; we restarted the daemonset to update the proxy variable on the pods.)

========== NVIDIA Software Installer ==========

+ echo -e '\n========== NVIDIA Software Installer ==========\n'
Starting installation of NVIDIA driver version 460.73.01 for Linux kernel version 4.18.0-305.10.2.el8_4.x86_64

+ echo -e 'Starting installation of NVIDIA driver version 460.73.01 for Linux kernel version 4.18.0-305.10.2.el8_4.x86_64\n'
+ exec
+ flock -n 3
+ echo 653899
+ trap 'echo '\''Caught signal'\''; exit 1' HUP INT QUIT PIPE TERM
+ trap _shutdown EXIT
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
+ local nvidia_modeset_refs=0
Stopping NVIDIA persistence daemon...
+ echo 'Stopping NVIDIA persistence daemon...'
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
+ '[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
Unloading NVIDIA driver kernel modules...
+ echo 'Unloading NVIDIA driver kernel modules...'
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ nvidia_modeset_refs=0
+ rmmod_args+=("nvidia-modeset")
+ (( ++nvidia_deps ))
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ nvidia_uvm_refs=0
+ rmmod_args+=("nvidia-uvm")
+ (( ++nvidia_deps ))
+ '[' -f /sys/module/nvidia/refcnt ']'
+ nvidia_refs=155
+ rmmod_args+=("nvidia")
+ '[' 155 -gt 2 ']'
+ echo 'Could not unload NVIDIA driver kernel modules, driver is in use'
Could not unload NVIDIA driver kernel modules, driver is in use
+ return 1
+ exit 1
+ _shutdown
+ _unload_driver
+ rmmod_args=()
+ local rmmod_args
+ local nvidia_deps=0
+ local nvidia_refs=0
+ local nvidia_uvm_refs=0
Stopping NVIDIA persistence daemon...
+ local nvidia_modeset_refs=0
+ echo 'Stopping NVIDIA persistence daemon...'
+ '[' -f /var/run/nvidia-persistenced/nvidia-persistenced.pid ']'
+ '[' -f /var/run/nvidia-gridd/nvidia-gridd.pid ']'
+ echo 'Unloading NVIDIA driver kernel modules...'
Unloading NVIDIA driver kernel modules...
+ '[' -f /sys/module/nvidia_modeset/refcnt ']'
+ nvidia_modeset_refs=0
+ rmmod_args+=("nvidia-modeset")
+ (( ++nvidia_deps ))
+ '[' -f /sys/module/nvidia_uvm/refcnt ']'
+ nvidia_uvm_refs=0
+ rmmod_args+=("nvidia-uvm")
+ (( ++nvidia_deps ))
+ '[' -f /sys/module/nvidia/refcnt ']'
+ nvidia_refs=155
+ rmmod_args+=("nvidia")
+ '[' 155 -gt 2 ']'
+ echo 'Could not unload NVIDIA driver kernel modules, driver is in use'
Could not unload NVIDIA driver kernel modules, driver is in use
+ return 1
+ return 1

@shivamerla

@mattpielm This has been fixed with v1.8.x. Please use v1.8.2, which was released last week to the RH catalog. By the way, you don't need to edit proxy variables manually on the driver DaemonSet; the GPU Operator will automatically inject them if they are set in the cluster-wide proxy.
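
For reference, the cluster-wide proxy values the operator picks up live on the cluster Proxy object, e.g. (a sketch; output depends on your cluster configuration):

$ oc get proxy/cluster -o jsonpath='{.spec}'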
