VolumePopulator scheduling issue - CSI democrati-csi local-hostpath using the volsync volumepopulator. #1255

Closed
tesshuflower opened this issue May 9, 2024 · 4 comments

Comments

@tesshuflower
Contributor

Taken from conversation in #1019.

If I'm understanding this correctly, this is about a storage driver that creates volume snapshots that are not portable across nodes.

When the volumepopulator is used to create a pvc from snapshot and volumebinding mode is WaitForFirstConsumer, the PVC may get assigned to a node that is not compatible with the volumesnapshot, and restoring the snapshot fails.

> > 👋 this issue also happens with CSI democratic-csi local-hostpath using the volsync volumepopulator.

democratic-csi/democratic-csi#329
seems to be a time-based race condition.

@danielsand I don't think this issue was specifically about the volumepopulator - would you be able to explain the scenario where you're hitting the issue?

The linked issue wasn't about the volumepopulator;
democratic-csi local-hostpath + volume snapshots + volsync didn't work for some folks.

It's just a reference for what is currently running on my end and what is working. (CSI and volume snapshots work as they should.)

Volumepopulator is failing at random currently on my setup.
The wrong node gets picked by the volume populator and WaitForFirstConsumer is specified.

Will circle back when I push the topic again.

Originally posted by @danielsand in #1019 (comment)

@tesshuflower
Contributor Author

@danielsand please update if I've misunderstood, but I tried to put a summary above.

I'm guessing your issue is tied to the volumepopulator and that you have a volumesnapshot that is not portable across nodes.

What happens with the volumepopulator when you have a storageclass that uses WaitForFirstConsumer is that the volumepopulator will not do anything until the PVC gets assigned to a node. The volumepopulator itself doesn't get involved with node assignment.

Essentially, when a consumer wants to use the PVC, it will get scheduled to a node, and at that point the volume populator will try to provision a temp PVC with the snapshot contents on the same node that has been assigned to your original volumepopulator PVC. I think this can be an issue if the scheduler has chosen a node that is not compatible with your volume snapshot.
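To make the node assignment concrete: with WaitForFirstConsumer, Kubernetes records the scheduler's choice on the PVC in the `volume.kubernetes.io/selected-node` annotation. The fragment below is only a sketch of what such a PVC might look like once a consumer pod has been scheduled. The PVC name, node name, storage class name, and ReplicationDestination name are all hypothetical, and whether the populator keys off exactly this annotation when placing the temp PVC is an assumption here.

```yaml
# Sketch only: a volumepopulator PVC after the scheduler has picked a node.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-data                      # hypothetical PVC name
  annotations:
    # Set by the scheduler once a consumer pod is placed (WaitForFirstConsumer).
    volume.kubernetes.io/selected-node: node-b   # hypothetical node name
spec:
  storageClassName: local-hostpath         # hypothetical class using WaitForFirstConsumer
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  dataSourceRef:
    apiGroup: volsync.backube
    kind: ReplicationDestination
    name: my-replication-destination       # hypothetical name
```

If the snapshot only exists on a different node (say `node-a`), a temp PVC pinned to `node-b` would be unable to restore from it, which matches the failure described above.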

You may be able to work around this by using a storageclass with Immediate for your volumepopulator PVC, at least as a test. This way the volumepopulator would immediately attempt to provision a PVC from the volumesnapshot - at that point the storage driver may set the node assignment since the volumesnapshot requires a specific node.
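As a rough illustration of that test, the manifests below sketch a storage class with `volumeBindingMode: Immediate` and a volumepopulator PVC that uses it. The class name, PVC name, size, and ReplicationDestination name are hypothetical, and the provisioner string is an assumption that should be replaced with whatever name your democratic-csi deployment actually registers.

```yaml
# Sketch only: storage class with Immediate binding for testing the volumepopulator.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-hostpath-immediate           # hypothetical name
provisioner: org.democratic-csi.local-hostpath   # assumption: use your driver's real name
volumeBindingMode: Immediate               # provision right away, before any consumer pod exists
reclaimPolicy: Delete
---
# Volumepopulator PVC that references a VolSync ReplicationDestination.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-data                      # hypothetical name
spec:
  storageClassName: local-hostpath-immediate
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi                         # hypothetical size
  dataSourceRef:
    apiGroup: volsync.backube
    kind: ReplicationDestination
    name: my-replication-destination       # hypothetical name
```

With this setup the consumer pod would be scheduled after the volume exists, so the scheduler has to follow the volume's node rather than the other way around. Whether the driver actually pins the volume to the snapshot's node is something to verify on your cluster.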

@danielsand

danielsand commented May 21, 2024

@tesshuflower kudos for pushing this, and please assign the ticket to me.
I'm working on the topic again this week and will try your proposal.
I will try to provide more solid input after some trial && error.

@danielsand

danielsand commented May 24, 2024

@tesshuflower I spent 2 days on the issue.

The state I found in February (?), where it sometimes worked and sometimes didn't, is no longer reproducible.
In fact, snapshots currently are not created at all because the volumes are not created; they stay in a Pending state, without much in the way of logs or errors from any of the involved CSI components.

(each of them is pointing to the next one in logs... with no visible error)

As of yesterday, csi-snapshotter 8.0.0 has been released
(again with breaking changes and a rework of internal hooks...)

The validating logic for VolumeSnapshots, VolumeSnapshotContents, VolumeGroupSnapshots, and
VolumeGroupSnapshotContents has been replaced by CEL validation rules. The validating webhook
is now only being used for VolumeSnapshotClasses and VolumeGroupSnapshotClasses to ensure
that there's at most one class per CSI Driver. The validation webhook is deprecated and will be removed in the next release. (kubernetes-csi/external-snapshotter#1091, @leonardoce)

I will stop right here and wait to see what comes next... the haystack is just too big.

kudos to you @tesshuflower for your commitment and effort.

From my side the ticket is obsolete and can be closed.

cheers

@tesshuflower
Contributor Author

Thanks @danielsand, and thanks for the info about the external-snapshotter. I do need to look into updating the volsync tests to use this latest release. I'll close this issue for now, but please re-open it if you encounter this again going forward.
