
Scaling issue: kube apiserver throttles external-provisioner when 100 PVCs created at same time #68

Closed
saad-ali opened this issue Apr 3, 2018 · 16 comments


@saad-ali
Member

saad-ali commented Apr 3, 2018

Reported by @cduchesne

The CSI external-provisioner has scaling issues.

When 100 PVCs are created at the same time, the CSI external-provisioner hammers the kube apiserver with requests and gets throttled, causing all sorts of issues.

@sbezverk
Contributor

sbezverk commented Apr 3, 2018

@cduchesne Please try the same test using this PR for external provisioner:
#66
I am curious if you see the same scalability issue.

@orainxiong

@sbezverk, I have hit the same issue. After getting throttled, it can take over 30 minutes to create a volume. I will try your PR.

@orainxiong

@sbezverk

The problem is still not resolved. If I understand correctly, the root cause is that external-provisioner talks to the kube-apiserver excessively.

Every time external-provisioner provisions or deletes volumes, a sudden surge of requests hits the kube-apiserver and the external-provisioner gets throttled. As a result, all operations take unpredictably long and produce errors.

We can go through the provisioning logic to see more details.

  • Watch both Events and the PersistentVolumeClaim to detect whether the provisioning operation succeeded or failed:
	successCh, err := ctrl.watchProvisioning(claim, stopCh)

	pvcWatch, err := ctrl.claimSource.Watch(options)
	if err != nil {
		return nil, err
	}

	failWatch, err := ctrl.getPVCEventWatch(claim, v1.EventTypeWarning, "ProvisioningFailed")
	if err != nil {
		pvcWatch.Stop()
		return nil, err
	}

	successWatch, err := ctrl.getPVCEventWatch(claim, v1.EventTypeNormal, "ProvisioningSucceeded")
	if err != nil {
		failWatch.Stop()
		pvcWatch.Stop()
		return nil, err
	}
  • List the PersistentVolumeClaim to get the leader election record and determine whether another provisioner already holds the lock:
	oldLeaderElectionRecord, err := le.config.Lock.Get()
  • Update the PersistentVolumeClaim's annotation to acquire an exclusive lock before performing the provisioning operation:
	if err = le.config.Lock.Update(leaderElectionRecord); err != nil {
		glog.Errorf("Failed to update lock %s: %v", le.config.Lock.Describe(), err)
		return false
	}

Once external-provisioner gets throttled, a single Events request can take over 10 seconds:

I0606 09:48:19.214596       1 request.go:480] Throttling request took 12.833541313s, request: GET:https://10.96.0.1:443/api/v1/namespaces/default/events?fieldSelector=involvedObject.name%3Dqcfs-pvc-only-114%2CinvolvedObject.namespace%3Ddefault%2CinvolvedObject.kind%3DPersistentVolumeClaim%2CinvolvedObject.uid%3D516852a4-6927-11e8-b1b9-525400e717a6%2Ctype%3DWarning%2Creason%3DProvisioningFailed
I0606 09:48:29.614493       1 request.go:480] Throttling request took 10.389247175s, request: GET:https://10.96.0.1:443/api/v1/namespaces/default/events?fieldSelector=involvedObject.namespace%3Ddefault%2CinvolvedObject.kind%3DPersistentVolumeClaim%2CinvolvedObject.uid%3D516852a4-6927-11e8-b1b9-525400e717a6%2CinvolvedObject.name%3Dqcfs-pvc-only-114%2Ctype%3DNormal%2Creason%3DProvisioningSucceeded
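These "Throttling request took" messages come from client-go's client-side rate limiter, which is configured through rest.Config. A minimal sketch of where those limits live (the values shown are roughly the library defaults, not the provisioner's actual settings):

	// Sketch: the client-side rate limiter behind the "Throttling request took"
	// messages. client-go limits requests per client via rest.Config; once the
	// limiter is saturated, every further request waits in a queue.
	package main

	import (
		"k8s.io/client-go/kubernetes"
		"k8s.io/client-go/rest"
	)

	func newClient() (*kubernetes.Clientset, error) {
		cfg, err := rest.InClusterConfig()
		if err != nil {
			return nil, err
		}
		// Defaults are roughly QPS=5 and Burst=10; with several watches and
		// event GETs per PVC, 100 concurrent provisions queue up for many
		// seconds, which is exactly what the log lines above show.
		cfg.QPS = 5
		cfg.Burst = 10
		return kubernetes.NewForConfig(cfg)
	}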

Worse, because of the timeout policies of the LeaderElector and of provisioning, external-provisioner retries periodically and produces even more requests, which makes things worse.

@saad-ali
Member Author

saad-ali commented Jun 6, 2018

Thanks for surfacing this @orainxiong.

Problem sounds like the external-provisioner (and probably external-attacher and driver-registrar) opens connections to the kube API server without thinking twice. If you provision a lot of volumes at once, the kube API server gets mad and throttles everything.

We should do the following:

  • Create a worker pool so that no more than N operations can be carried out against the API server at once (see the sketch after this list).
  • Allow N to be configured via a flag to the binary.
  • Incorporate the Kubernetes informer cache into sidecar containers so that reads/writes are optimized.
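
For illustration, a minimal sketch of the first two bullets: a fixed-size worker pool whose size N comes from a flag, so at most N provisioning operations talk to the API server at once. provisionClaim and the flag name are placeholders, not the real external-provisioner code:

	// Illustrative worker pool: at most N goroutines perform API-heavy work.
	// The binary is expected to call flag.Parse() before starting the workers.
	package main

	import (
		"flag"
		"sync"

		v1 "k8s.io/api/core/v1"
	)

	var workerThreads = flag.Int("worker-threads", 10, "max concurrent provisioning operations (N)")

	// provisionClaim is a hypothetical stand-in for the real provisioning call.
	func runWorkers(claims <-chan *v1.PersistentVolumeClaim, provisionClaim func(*v1.PersistentVolumeClaim) error) {
		var wg sync.WaitGroup
		for i := 0; i < *workerThreads; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				for claim := range claims {
					// One claim at a time per worker, so the number of
					// in-flight API calls is bounded by --worker-threads.
					_ = provisionClaim(claim)
				}
			}()
		}
		wg.Wait()
	}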

@orainxiong

@saad-ali Many thanks for your reply.

Incorporate the Kubernetes informer cache into sidecar containers so that reads/writes are optimized

As you mentioned, external-provisioner should leverage the informer cache to avoid talking directly to the kube-apiserver. Perhaps we could use the issue "CSI attach/detach should use shared informer" for reference.
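
For example, a rough sketch of reading PVCs through a shared informer cache instead of issuing GETs against the API server (the resync period and names are illustrative only):

	// Illustrative only: serve PVC reads from a shared informer's local cache.
	package main

	import (
		"time"

		"k8s.io/client-go/informers"
		"k8s.io/client-go/kubernetes"
	)

	func startPVCInformer(client kubernetes.Interface, stopCh <-chan struct{}) {
		factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
		pvcLister := factory.Core().V1().PersistentVolumeClaims().Lister()

		factory.Start(stopCh)
		factory.WaitForCacheSync(stopCh)

		// Reads now hit the local cache instead of the API server.
		_, _ = pvcLister.PersistentVolumeClaims("default").Get("qcfs-pvc-only-114")
	}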

If possible, I would like to ask @saad-ali for some views on the granularity of the LeaderElector lock. The lock implemented with ProvisionPVCLock is very fine-grained and allows multiple external-provisioners to work in parallel to improve efficiency. However, the improvement comes at the cost of more complex logic and more overhead on the kube-apiserver. From my point of view, the disadvantage outweighs the advantage. I am wondering why we don't follow the conventional approach of the existing control plane, which allows one controller to be active and the others passive.

@saad-ali
Member Author

saad-ali commented Jun 8, 2018

Spoke with Jan. He agrees with @orainxiong: the major source of issues is probably the leader election logic -- each provisioner tries to write an annotation to become leader, and if unsuccessful, each retries after some time.

We should focus on making leader election scalable.

@saad-ali
Member Author

saad-ali commented Jun 8, 2018

A temporary workaround may be to disable per-PV leader election and instead use per-provisioner leader election.
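
For illustration, per-provisioner leader election with client-go's leaderelection package would look roughly like this (current client-go uses a Lease lock, while Endpoints/ConfigMap locks were common at the time; the lock name and timings here are made up):

	// Sketch: one election per provisioner process instead of one lock per PVC.
	package main

	import (
		"context"
		"time"

		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
		"k8s.io/client-go/kubernetes"
		"k8s.io/client-go/tools/leaderelection"
		"k8s.io/client-go/tools/leaderelection/resourcelock"
	)

	func runWithLeaderElection(client kubernetes.Interface, id string, run func(context.Context)) {
		lock := &resourcelock.LeaseLock{
			LeaseMeta:  metav1.ObjectMeta{Namespace: "kube-system", Name: "csi-external-provisioner"},
			Client:     client.CoordinationV1(),
			LockConfig: resourcelock.ResourceLockConfig{Identity: id},
		}
		leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
			Lock:          lock,
			LeaseDuration: 15 * time.Second,
			RenewDeadline: 10 * time.Second,
			RetryPeriod:   2 * time.Second,
			Callbacks: leaderelection.LeaderCallbacks{
				OnStartedLeading: run,          // only the current leader provisions volumes
				OnStoppedLeading: func() {},    // stop work when leadership is lost
			},
		})
	}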

@wongma7
Contributor

wongma7 commented Jun 13, 2018

Yes, n PVCs * m provisioners is unacceptably bad. The original code is ancient; I will copy how HA is done in kube. It is only like this for the rare case where someone wants multiple provisioner instances (pointing to different storage sources with the same properties) to serve the same storage class. But that isn't a common use case, and this is a lazy and inefficient way of achieving it anyway.

An alternative was sharding, but I don't think it's possible.

@orainxiong

@saad-ali Thanks for your reply.

As we discussed above, this issue is mainly due to leader election and overuse of the API server. I try to fix the problem through the following steps, which require substantial changes to external-provisioner and external-storage. I am not sure whether this is right or not; looking forward to your suggestions.

  • Implement leader election per provisioner, as kube-controller-manager does. Meanwhile, the external-provisioner-runner role should be granted additional permissions to access configmaps, endpoints, and secrets.
  • Delete the function lockProvisionClaimOperation and related code within external-storage.
    Note: storage provisioners should perform leader election themselves rather than relying on external-storage.

But these are not enough: both external-storage and external-provisioner should use the informer cache instead of talking directly to the API server.

There is a PR showing more details, which mainly involves the following files:

  • cmd/csi-provisioner/csi-provisioner.go
  • pkg/controller/controller.go
  • vendor/github.com/kubernetes-incubator/external-storage/lib/controller/controller.go

These changes significantly reduce the time spent provisioning and deleting. In my environment, the time to provision 100 CSI PVCs dropped from more than 50 minutes to 100 seconds.

@vladimirvivien
Member

Where are we with this? Is @orainxiong's PR the workaround we want to pursue?

@vladimirvivien
Member

@orainxiong can you please create a proper PR so we can comment on it? Thanks.

@orainxiong

@vladimirvivien Here is PR #104

If you have any questions, please let me know. Many thanks.

@orainxiong

orainxiong commented Jul 29, 2018

@saad-ali @sbezverk @wongma7

I have taken some time these days to identify exactly which functions are time-consuming when provisioning a bunch of volumes, and to find a workaround.

I hope it works.

With the existing tools net/http/pprof and go-torch, we are able to generate flame graphs and figure out the related code paths more accurately.
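
For reference, exposing the profiling endpoint is just a blank import of net/http/pprof plus an HTTP listener; the port here is arbitrary:

	// Minimal pprof endpoint; go-torch / `go tool pprof` fetch profiles from it.
	package main

	import (
		"log"
		"net/http"
		_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
	)

	func main() {
		go func() {
			log.Println(http.ListenAndServe("localhost:6060", nil))
		}()
		// ... the provisioner's normal work would run here ...
		select {}
	}

A CPU profile can then be collected with `go tool pprof http://localhost:6060/debug/pprof/profile` and rendered as a flame graph by go-torch.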

[flame graph: external-provisioner CPU profile during provisioning]

From the flame graph above, we can see that the function addClaim takes over 35% of CPU time during the sampling period.

More specifically, the function controller.(*ProvisionController).watchProvisioning takes about 27% of CPU time, and the function leaderelection.(*LeaderElector) takes about 9%. Both are called by addClaim.

[flame graph details: watchProvisioning and leaderelection call paths]

After changing the leader lock from per-PVC to per-provisioner and using the informer cache to avoid talking to the API server directly, as we discussed before, the workaround has a significant effect.

In the following flame graph, the function addClaim has been reduced to about 14% of CPU time with the same test case.
[flame graph: CPU profile after the fix]

The external-provisioner is able to finish provisioning 100 PVCs (1GiB, RWO) created at once within 60 seconds.

There is still some room left to optimize, but I think it is probably enough.

@humblec
Contributor

humblec commented Jul 30, 2018

@orainxiong nice summary and illustration of the issue! The leader lock change looks to be a decent fix (at least as an interim) for now.

@jsafrane
Contributor

We removed per-PVC leader election a while ago and I think this issue is fixed in 1.0. I created 100 PVCs with the host path CSI driver and all PVs were created within ~50 seconds (using a single VM with 2 CPUs and local-up-cluster.sh), without any API throttling.

1000 PVCs were provisioned in ~8.5 minutes.
