
Trouble attaching volume #884

Closed · yvan opened this issue Mar 28, 2019 · 36 comments

Comments

@yvan

yvan commented Mar 28, 2019

I'm having an issue where I get multi-attach errors when I try to attach a PVC. This was already brought up 6 days ago in #615; I'm reopening it here per the instructions on that thread.

What happened:

Pods cannot attach a PVC because it is attached somewhere else (though it should not be).

What I expect to happen:

Pods should be able to bind the PVC.

How to reproduce:

Not sure, as I don't know why PVCs that are not in use would be attached, or seen as attached, by k8s.

k8s version:

Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.0", GitCommit:"ddf47ac13c1a9483ea035a79cd7c10005ff21a6d", GitTreeState:"clean", BuildDate:"2018-12-03T21:04:45Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.5", GitCommit:"51dd616cdd25d6ee22c83a858773b607328a18ec", GitTreeState:"clean", BuildDate:"2019-01-16T18:14:49Z", GoVersion:"go1.10.7", Compiler:"gc", Platform:"linux/amd64"}

azure region:

west europe

kubectl describe pod hub-7476649468-qfj75 -n res-jhub:

Name:               hub-7476649468-qfj75
Namespace:          res-jhub
Priority:           0
PriorityClassName:  <none>
Node:               aks-agentpool-57634498-0/10.237.233.4
Start Time:         Thu, 28 Mar 2019 10:12:33 +0000
Labels:             app=jupyterhub
                    component=hub
                    hub.jupyter.org/network-access-proxy-api=true
                    hub.jupyter.org/network-access-proxy-http=true
                    hub.jupyter.org/network-access-singleuser=true
                    pod-template-hash=7476649468
                    release=res-jhub
Annotations:        checksum/config-map: c9a28304bf7ebba72288eca12557c9ef656850d388cb6ddc9131ba46476eec32
                    checksum/secret: XXX
Status:             Pending
IP:
Controlled By:      ReplicaSet/hub-7476649468
Containers:
  hub:
    Container ID:
    Image:         jupyterhub/k8s-hub:0.8.0
    Image ID:
    Port:          8081/TCP
    Host Port:     0/TCP
    Command:
      jupyterhub
      --config
      /srv/jupyterhub_config.py
      --upgrade-db
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     200m
      memory:  512Mi
    Environment:
      PYTHONUNBUFFERED:              1
      HELM_RELEASE_NAME:             res-jhub
      POD_NAMESPACE:                 res-jhub (v1:metadata.namespace)
      CONFIGPROXY_AUTH_TOKEN:        <set to the key 'proxy.token' in secret 'hub-secret'>                            Optional: false
      AAD_TENANT_ID:                 <set to the key 'XXX' in secret 'tenant-id-secret'>                Optional: false
      AAD_CLIENT_ID:                 <set to the key 'XXX' in secret 'client-id-secret'>                Optional: false
      AAD_CLIENT_SECRET:             <set to the key 'XX' in secret 'client-id-secret-secret'>  Optional: false
      KUBERNETES_PORT_443_TCP_ADDR:  ds-cluster-1c087a4c.hcp.westeurope.azmk8s.io
      KUBERNETES_PORT:               tcp://ds-cluster-1c087a4c.hcp.westeurope.azmk8s.io:443
      KUBERNETES_PORT_443_TCP:       tcp://ds-cluster-1c087a4c.hcp.westeurope.azmk8s.io:443
      KUBERNETES_SERVICE_HOST:       ds-cluster-1c087a4c.hcp.westeurope.azmk8s.io
    Mounts:
      /etc/jupyterhub/config/ from config (rw)
      /etc/jupyterhub/secret/ from secret (rw)
      /srv/jupyterhub from hub-db-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from hub-token-dqmsr (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      hub-config
    Optional:  false
  secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  hub-secret
    Optional:    false
  hub-db-dir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  hub-db-dir
    ReadOnly:   false
  hub-token-dqmsr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  hub-token-dqmsr
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason              Age                 From                               Message
  ----     ------              ----                ----                               -------
  Normal   Scheduled           37m                 default-scheduler                  Successfully assigned res-jhub/hub-7476649468-qfj75 to aks-agentpool-57634498-0
  Warning  FailedAttachVolume  37m                 attachdetach-controller            Multi-Attach error for volume "pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9" Volume is already exclusively attached to one node and can't be attached to another
  Warning  FailedMount         88s (x16 over 35m)  kubelet, aks-agentpool-57634498-0  Unable to mount volumes for pod "hub-7476649468-qfj75_res-jhub(fd95bd73-5141-11e9-bb1e-166ce5dfcaae)": timeout expired waiting for volumes to attach or mount for pod "res-jhub"/"hub-7476649468-qfj75". list of unmounted volumes=[hub-db-dir]. list of unattached volumes=[config secret hub-db-dir hub-token-dqmsr]
  Warning  FailedAttachVolume  12s (x26 over 37m)  attachdetach-controller            AttachVolume.Attach failed for volume "pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9" : Attach volume "kubernetes-dynamic-pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9" to instance "/subscriptions/XXXSUBIDXXX/resourceGroups/XXX/providers/Microsoft.Compute/virtualMachines/aks-agentpool-57634498-0" failed with compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk 'kubernetes-dynamic-pvc-d1bf467f-43dd-11e9-aff5-9a447838f109' to VM 'aks-agentpool-57634498-0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again." Target="dataDisks"

How many disks are mounting into one VM in parallel:

kubectl get pvc -n res-jhub

NAME                           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
claim-res1                     Bound    pvc-9d33792c-41a9-11e9-978a-be1137772178   25Gi       RWO            default        19d
claim-res2                     Bound    pvc-4452b002-44a9-11e9-aff5-9a447838f109   25Gi       RWO            default        16d
claim-res3                     Bound    pvc-7d700c3e-3b39-11e9-93d5-dee1946e6ce9   25Gi       RWO            default        28d
claim-res4                     Bound    pvc-7b7976d7-3a46-11e9-93d5-dee1946e6ce9   25Gi       RWO            default        29d
hub-db-dir                     Bound    pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9   1Gi        RWO            default        29d
jupyterhub-shared-res-volume   Bound    pvc-cf599c83-44ee-11e9-aff5-9a447838f109   50Gi       RWX            azurefile      15d

The hub pod (whose describe output is posted above) mounts the 1Gi hub-db-dir claim.
Every user pod that tries to spawn mounts one of claim-res(1-4) and also mounts jupyterhub-shared-res-volume, which is an azurefile volume.
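For reference, one way to check how many data disks Azure actually reports as attached to a given node is to query the VM directly; a minimal sketch, assuming the MC_* node resource group name that appears later in this thread:

az vm show -g MC_risc-ml_ds-cluster_westeurope -n aks-agentpool-57634498-0 --query "storageProfile.dataDisks[].{name:name, lun:lun}" -o table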

what vms:

4 nodes/vms that have spec: Standard D16s v3 (16 vcpus, 64 GB memory)

There is no disk, CPU, or memory pressure shown in the node descriptions.

Other Similar Issues:

#477
#615

Not sure if it's related, but my image puller also seems to be failing because it is looking for a file in the kubelet folder that it expects but cannot find.

Error: open /var/lib/kubelet/pods/5da90308-5141-11e9-bb1e-166ce5dfcaae/etc-hosts: no such file or directory
@yvan
Author

yvan commented Mar 28, 2019

@andyzhangx any thoughts? I'm a bit uncomfortable just force-detaching my PVCs.

@yvan
Author

yvan commented Mar 28, 2019

One of my PVCs that failed to attach, described:

kubectl describe pvc claim-resX -n res-jhub

Name:          claim-resX
Namespace:     res-jhub
StorageClass:  default
Status:        Bound
Volume:        pvc-7b7976d7-3a46-11e9-93d5-dee1946e6ce9
Labels:        app=jupyterhub
               chart=jupyterhub-0.8.0
               component=singleuser-storage
               heritage=jupyterhub
               release=res-jhub
Annotations:   hub.jupyter.org/username: XXX
               pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-class: default
               volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/azure-disk
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      25Gi
Access Modes:  RWO
Events:        <none>
Mounted By:    <none>

@andyzhangx
Contributor

@yvan could you check the status of VM aks-agentpool-57634498-0 and disk kubernetes-dynamic-pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9?
You may also try az vm update -g <group> -n <name> to update that VM's state as a workaround.
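A minimal sketch of checking both from the Azure CLI, assuming the MC_* node resource group used later in this thread:

az vm show -g MC_risc-ml_ds-cluster_westeurope -n aks-agentpool-57634498-0 --query provisioningState -o tsv
az disk show -g MC_risc-ml_ds-cluster_westeurope -n kubernetes-dynamic-pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9 --query "{state:diskState, attachedTo:managedBy}" -o json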

@yvan
Author

yvan commented Mar 28, 2019

kubectl get no

NAME                       STATUS   ROLES   AGE   VERSION
aks-agentpool-57634498-0   Ready    agent   34d   v1.12.5

There is no such PVC as:

kubernetes-dynamic-pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9

This seems to refer to:

hub-db-dir                      Bound    pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9   1Gi        RWO            default        29d

All my PVCs always have status Bound, even when they are not in use by a user or an app. That never caused an issue like this before; I just started experiencing this this morning.

@andyzhangx
Contributor

@yvan, I mean could you go to the Azure portal to check the status of VM aks-agentpool-57634498-0 and disk kubernetes-dynamic-pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9?

@yvan
Author

yvan commented Mar 28, 2019

There's a problem with the node aks-agentpool-57634498-0: instead of status 'Running', the portal shows its status as:

Unknown

I actually see no such data disk (kubernetes-dynamic-pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9) in the portal. Here are all my disks with names of the form 'kubernetes-dynamic-pvc', grouped by node:

aks-agentpool-57634498-0

kubernetes-dynamic-pvc-5f0150b2-3633-11e9-93d5-dee1946e6ce9
kubernetes-dynamic-pvc-3ad459ae-3629-11e9-93d5-dee1946e6ce9
kubernetes-dynamic-pvc-d1bf467f-43dd-11e9-aff5-9a447838f109

aks-agentpool-57634498-1

kubernetes-dynamic-pvc-7d700c3e-3b39-11e9-93d5-dee1946e6ce9

aks-agentpool-57634498-2

kubernetes-dynamic-pvc-4452b002-44a9-11e9-aff5-9a447838f109

aks-agentpool-57634498-3

kubernetes-dynamic-pvc-8a085d64-367c-11e9-93d5-dee1946e6ce9
kubernetes-dynamic-pvc-75a4463a-4fb4-11e9-bb1e-166ce5dfcaae

There is one with a VERY similar name on aks-agentpool-57634498-1, but it differs in the 0d7740b9-3a43 part. That node has status 'Running.'

@andyzhangx
Contributor

Could you run az vm update -g <group> -n aks-agentpool-57634498-0 to update your VM? If that does not work, you may file an Azure support ticket to fix it.

@yvan
Author

yvan commented Mar 28, 2019

I gave it a go, the result:

az vm update -g MC_risc-ml_ds-cluster_westeurope -n aks-agentpool-57634498-0

Cannot attach data disk 'kubernetes-dynamic-pvc-d1bf467f-43dd-11e9-aff5-9a447838f109' to VM 'aks-agentpool-57634498-0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again.

@andyzhangx
Contributor

Could you help find pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9 by:

  • finding the PV name:
    kubectl get pvc pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9
  • kubectl get pv PV-NAME -o yaml

You would get the full resource path of that disk; check whether that disk exists or not.
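A shorter path to the same information, as a sketch (assuming the in-tree azureDisk volume source that shows up later in this thread): pull the diskURI out of the PV and feed it to the CLI.

kubectl get pv PV-NAME -o jsonpath='{.spec.azureDisk.diskURI}'
az disk show --ids <diskURI-from-previous-command> --query "{name:name, state:diskState, attachedTo:managedBy}" -o json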

@yvan
Author

yvan commented Mar 28, 2019

Ok so it exists if I show all namespaces:

kubectl get pvc --all-namespaces

NAMESPACE   NAME                                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
...
res-jhub    hub-db-dir                            Bound    pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9   1Gi        RWO            default        29d
...

But if I check the namespace where it should be I get:

kubectl get pvc pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9 -n res-jhub

Error from server (NotFound): persistentvolumeclaims "pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9" not found

@ellieayla

@yvan Your PV has "pvc" in the name, creating some confusion. Contrast kubectl get pvc hub-db-dir -n res-jhub and kubectl get pv pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9

@yvan
Author

yvan commented Mar 29, 2019

The names are just generated by jupyterhub. I agree it's mildly annoying. At the end of the day I want to understand why this happened and care a lot less about the names.

kubectl get pvc hub-db-dir -n res-jhub

NAME         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
hub-db-dir   Bound    pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9   1Gi        RWO            default        30d
kubectl get pv pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9 -n res-jhub

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                 STORAGECLASS   REASON   AGE
pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9   1Gi        RWO            Delete           Bound    res-jhub/hub-db-dir   default                 30d

@andyzhangx
Contributor

@yvan could you run kubectl get pv pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9 -o yaml -n res-jhub and get the Azure resource path of that disk, and then check whether that Azure disk exists or not?

@yvan
Author

yvan commented Mar 29, 2019

Ok I think this is what you wanted to locate:

kubectl get pv pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9 -o yaml -n res-jhub

...
spec:
  accessModes:
  - ReadWriteOnce
  azureDisk:
    cachingMode: ReadOnly
    diskName: kubernetes-dynamic-pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9
    diskURI: /subscriptions/XXX/resourceGroups/XXX/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-0
...

I checked to see if the diskURI exists and found 2 disks with similar names:

kubernetes-dynamic-pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9 (disk state attached)
kubernetes-dynamic-pvc-7d700c3e-3b39-11e9-93d5-dee1946e6ce9 (disk state attached)

@andyzhangx
Contributor

andyzhangx commented Mar 29, 2019

Could you check which node disk kubernetes-dynamic-pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9 is attached to in the Azure portal? Your original error is:

AttachVolume.Attach failed for volume "pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9" : Attach volume "kubernetes-dynamic-pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9" to instance "/subscriptions/XXXSUBIDXXX/resourceGroups/XXX/providers/Microsoft.Compute/virtualMachines/aks-agentpool-57634498-0" failed with compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="AttachDiskWhileBeingDetached" Message="Cannot attach data disk 'kubernetes-dynamic-pvc-d1bf467f-43dd-11e9-aff5-9a447838f109' to VM 'aks-agentpool-57634498-0' because the disk is currently being detached or the last detach operation failed. Please wait until the disk is completely detached and then try again or delete/detach the disk explicitly again." Target="dataDisks"

@yvan
Author

yvan commented Mar 29, 2019

1 - The disk is attached to node aks-agentpool-57634498-0.

Also, mysteriously, a bunch of disk pressure messages (that have definitely not been popping up over the last 14-22 days) have appeared in the event log for my 4th node:

kubectl describe no aks-agentpool-57634498-3

Events:
  Type     Reason                 Age                From                               Message
  ----     ------                 ----               ----                               -------
  Warning  EvictionThresholdMet   16m (x4 over 14d)  kubelet, aks-agentpool-57634498-3  Attempting to reclaim ephemeral-storage
  Normal   NodeHasDiskPressure    16m (x4 over 14d)  kubelet, aks-agentpool-57634498-3  Node aks-agentpool-57634498-3 status is now: NodeHasDiskPressure
  Warning  FreeDiskSpaceFailed    13m                kubelet, aks-agentpool-57634498-3  failed to garbage collect required amount of images. Wanted to free 2705995366 bytes, but freed 0 bytes
  Normal   NodeHasNoDiskPressure  11m (x7 over 22d)  kubelet, aks-agentpool-57634498-3  Node aks-agentpool-57634498-3 status is now: NodeHasNoDiskPressure

This seems related to pulling images. There are no events for aks-agentpool-57634498-0. Maybe related to kubernetes/kubernetes#32542.

@yvan
Author

yvan commented Apr 25, 2019

Small update: in the end I just waited a day and the cluster eventually cleared up the resources. It seems connected to a broader service outage/issue mounting disks on AKS.

@Gangareddy

A similar issue arose in my cluster: a pod could not start because the PV corresponding to its PVC was attached to another node and the disk (PV) could not be detached. This happened after changing the service principal on all my nodes. It seems very non-deterministic in nature, yet it happens occasionally in aks-engine generated clusters.

@andyzhangx
Contributor

@Gangareddy you also need to change the service principal on the master node, otherwise the disk detach operation cannot succeed. In that case, you can manually detach the disk from the agent node, and k8s will automatically attach the disk PV to the new node.
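A manual detach from the Azure CLI would look roughly like this (a sketch only; the resource group, VM, and disk names are placeholders taken from earlier in this thread, so substitute your own):

az vm disk detach -g MC_risc-ml_ds-cluster_westeurope --vm-name aks-agentpool-57634498-0 -n kubernetes-dynamic-pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9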

@Gangareddy

@andyzhangx: I have updated the service principal on the master nodes as well. I thought manually detaching the disk would complicate the self-healing nature of AKS. However, I was able to create additional disks from snapshots of the disks that were stuck on the VM, and changed my PVs (persistent volumes) to use the newly created disks from those snapshots. But I wonder: is a VM reboot necessary after changing the service principal on the VM?

@andyzhangx
Contributor

@Gangareddy please follow this guide to reset the service principal: https://docs.microsoft.com/en-us/azure/aks/update-credentials. On the agent nodes, you only need to restart the kubelet:

sudo systemctl daemon-reload
sudo systemctl restart kubelet
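For the AKS-managed path, that guide boils down to something like the following (a sketch with placeholder values; check the linked doc for the current syntax):

az aks update-credentials -g <resource-group> -n <cluster-name> --reset-service-principal --service-principal <new-app-id> --client-secret <new-client-secret>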

@mcurry-brierley

mcurry-brierley commented Oct 23, 2019

Unable to attach PVCs to a basic K8s deployment in Azure. When is K8s going to be production ready? This is just sad.
I am on the newest stable build in Azure, and I have tried standard/premium storage and a more powerful node; the PVCs just time out and ruin the deploy.

This cluster is brand new created today.

@andyzhangx
Contributor

@mcurry-brierley could you provide more details about this issue? e.g.

  • pod description with error details
  • k8s version
  • VMSS or VMAS (recently we hit a disk attach/detach issue in VMSS; the team is hotfixing it)

@mcurry-brierley

mcurry-brierley commented Oct 25, 2019 via email

@andyzhangx
Contributor

@mcurry-brierley I would say you may have hit the VMSS disk attach issue that has only been happening in the last two weeks; the VMSS team is hotfixing it. If you set up a VM Availability Set based cluster, I am pretty sure it won't have this issue. Just ping me on Slack if you hit such an issue.
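Creating an availability-set-based cluster from the CLI would look roughly like this (a sketch with placeholder names; the --vm-set-type flag depends on your az CLI version):

az aks create -g <resource-group> -n <cluster-name> --node-count 3 --vm-set-type AvailabilitySet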

@vijaygos

vijaygos commented Oct 28, 2019

@andyzhangx, we seem to be hit by this in our production environments as well. I'll ping you on Teams (internal) to see what we can do about this problem. In general, I agree with @mcurry-brierley on how annoying this is. We have had way too many issues with VMSS in the last 6 months, and I am really tempted to track their team down and ask them for their SLA and where they have been with respect to it over that period.

@davestephens

@vijaygos @andyzhangx Has there been any progress on this? VMSS are totally unusable with AKS, which obviously means multiple nodepools are out of the window.

Following a lengthy support call with Andy Wu in your support team, he advised I give up on my previous cluster and create a new one. As a test I deleted a pod to reschedule it elsewhere, and I'm already seeing the "Volume is already exclusively attached to one node and can't be attached to another" issue - and the cluster is literally an hour old with nothing else running on it.

Surely loads of people must be seeing this, is it ever going to be fixed?

@andyzhangx
Contributor

@davestephens Your error "Volume is already exclusively attached to one node and can't be attached to another" looks like a different issue; that message comes from the k8s volume controller and means the volume was not unmounted from the previous node. Could you open a new issue and provide more details, e.g. AKS version, VMAS or VMSS, full pod describe output, etc.?
The VMSS issue was resolved last weekend.

@mcurry-brierley

@vijaygos and I have a point. This is a production product?
Asking me to completely remove and redeploy a Kubernetes cluster just because your cloud isn't properly configured is not acceptable. Also, attaching a PVC is a BASE FUNCTION. It not working for ANY amount of time is UNACCEPTABLE. Don't offer services if they are not enterprise grade.

@langecode

We are actually experiencing something similar here. We had a failing instance in a VMSS-based cluster. After deleting the instance, it seems the Kubernetes control plane still sees the disks as attached. Looking in the Azure portal (or using the Azure CLI), the disks are unattached; however, starting up the pod we get the following status:

Events:
  Type     Reason              Age                From                                      Message
  ----     ------              ----               ----                                      -------
  Normal   Scheduled           6m3s               default-scheduler                         Successfully assigned monitoring/prometheus-prometheus-operator-prometheus-0 to aks-default-79370661-vmss000006
  Warning  FailedAttachVolume  6m3s               attachdetach-controller                   Multi-Attach error for volume "pvc-135c092c-fed0-11e9-b544-fa7d139c501a" Volume is already exclusively attached to one node and can't be attached to another
  Warning  FailedMount         106s (x2 over 4m)  kubelet, aks-default-79370661-vmss000006  Unable to mount volumes for pod "prometheus-prometheus-operator-prometheus-0_monitoring(35a44efd-054a-11ea-bebb-c2acabd91d50)": timeout expired waiting for volumes to attach or mount for pod "monitoring"/"prometheus-prometheus-operator-prometheus-0". list of unmounted volumes=[prometheus-prometheus-operator-prometheus-db]. list of unattached volumes=[prometheus-prometheus-operator-prometheus-db config config-out prometheus-prometheus-operator-prometheus-rulefiles-0 prometheus-operator-prometheus-token-qzk8s]

It's like Kubernetes has cached the information that this disk is attached to a node which is no longer part of the cluster. We are running the latest non-preview version of AKS, 1.14.8.
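For anyone debugging the same thing, one way to see which volumes the control plane still believes are attached to each node is to dump the nodes' volumesAttached status (a sketch):

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.volumesAttached[*].name}{"\n"}{end}'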

@mustermania

Hi all, I'm not sure if this is the same issue or not, but I performed a k8s version upgrade on our non-prod cluster today. During that, one of the nodes died and caused problems. After restarting that node, the service that runs on it wouldn't redeploy and is stuck in a perpetual Multi-Attach error for volume "pvc-77113250-15a7-11e9-ad5d-0a58ac1f0bbd" Volume is already used by pod(s) jenkins-695975cbb7-pmvvv. However, when I describe that PV, it says it is mounted to the correct, new pod. Additionally, the PVC's storage disk is associated with the correct node.

@jabba2324

jabba2324 commented Dec 3, 2019

I'm also experiencing this problem intermittently when making deployments:

The following error is given:

Multi-Attach error for volume "pvc-e9b72e86-129a-11ea-9a02-9abdbf393c78" Volume is already used by pod(s) prometheus-server-5c8b68f5cd-qrskq

Unable to mount volumes for pod "prometheus-server-7b887899b7-l95n2_monitoring(6784bec0-15b3-11ea-9a02-9abdbf393c78)": timeout expired waiting for volumes to attach or mount for pod "monitoring"/"prometheus-server-7b887899b7-l95n2". 

list of unmounted volumes=[storage-volume]. list of unattached volumes=[config-volume storage-volume prometheus-server-token-vqcc9] 

We aren't in production yet, but we're quite hesitant to go live until this issue is resolved.
Thanks

@andyzhangx
Contributor

I'm also experiencing this problem intermittently when making deployments:

The following error is given:

Multi-Attach error for volume "pvc-e9b72e86-129a-11ea-9a02-9abdbf393c78" Volume is already used by pod(s) prometheus-server-5c8b68f5cd-qrskq

Unable to mount volumes for pod "prometheus-server-7b887899b7-l95n2_monitoring(6784bec0-15b3-11ea-9a02-9abdbf393c78)": timeout expired waiting for volumes to attach or mount for pod "monitoring"/"prometheus-server-7b887899b7-l95n2". 

list of unmounted volumes=[storage-volume]. list of unattached volumes=[config-volume storage-volume prometheus-server-token-vqcc9] 

We aren't in production yet, but we're quite hesitant to go live until this issue is resolved.
Thanks

That error info is from the k8s volume controller, which means the volume was not unmounted from the previous node. Did the volume attach succeed eventually?
BTW, the original VMSS disk issue is already fixed.

@andyzhangx
Contributor

andyzhangx commented Jan 6, 2020

Back to this question again, there are two kinds of Multi-Attach error issues:

  • Multi-Attach error for volume "..." Volume is already used by pod(s) ...
    This happens when a Deployment using an RWO disk PV is updated and the new pod is scheduled before the old pod has released the volume. The workaround is to make the rolling update replace the old pod first:

    strategy:
      rollingUpdate:
        maxSurge: 0
        maxUnavailable: 1
      type: RollingUpdate

  • Multi-Attach error for volume "pvc-0d7740b9-3a43-11e9-93d5-dee1946e6ce9" Volume is already exclusively attached to one node and can't be attached to another (fixed in Nov. 2019)
    This is a transient issue; we already fixed it on AKS in Nov. 2019, and we have also added a disk attach/detach self-healing feature.

I will close this issue. Let me know if you have any questions.

@ItayZviCohen

Happened to me as well when deploying several Helm charts. I created a cluster without VMSS and the problem was solved.
