
Scaling issue: kube apiserver throttles external-provisioner when 100 PVCs created at same time #68

Closed
saad-ali opened this issue Apr 3, 2018 · 16 comments


@saad-ali
Member

saad-ali commented Apr 3, 2018

Reported by @cduchesne

The CSI external-provisioner has scaling issues.

When 100 PVCs are created at the same time, the CSI external-provisioner hammers the kube apiserver with requests and gets throttled, causing all sorts of issues.

@sbezverk
Contributor

sbezverk commented Apr 3, 2018

@cduchesne Please try the same test using this PR for external provisioner:
#66
I am curious if you see the same scalability issue.

@orainxiong

@sbezverk, I have hit the same issue. After getting throttled, it can take over 30 minutes to create a volume. I will try your PR.

@orainxiong

@sbezverk

The problem is still not resolved. If I understand correctly, the root cause is that external-provisioner talks to the kube-apiserver excessively.

Every time external-provisioner provisions or deletes volumes, a sudden surge of requests hits the kube-apiserver and the external-provisioner gets throttled. As a result, all operations take unpredictably long and produce errors.

We can go through the provisioning logic to see more details.

  • Watch both Events and the PersistentVolumeClaim to detect whether the provisioning operation succeeded or failed:
	successCh, err := ctrl.watchProvisioning(claim, stopCh)

	pvcWatch, err := ctrl.claimSource.Watch(options)
	if err != nil {
		return nil, err
	}

	failWatch, err := ctrl.getPVCEventWatch(claim, v1.EventTypeWarning, "ProvisioningFailed")
	if err != nil {
		pvcWatch.Stop()
		return nil, err
	}

	successWatch, err := ctrl.getPVCEventWatch(claim, v1.EventTypeNormal, "ProvisioningSucceeded")
	if err != nil {
		failWatch.Stop()
		pvcWatch.Stop()
		return nil, err
	}
  • List the PersistentVolumeClaim to get the leader election record and determine whether another provisioner already holds the lock:
	oldLeaderElectionRecord, err := le.config.Lock.Get()
  • Update the PersistentVolumeClaim's annotation to acquire an exclusive lock before performing the provisioning operation:
	if err = le.config.Lock.Update(leaderElectionRecord); err != nil {
		glog.Errorf("Failed to update lock %s: %v", le.config.Lock.Describe(), err)
		return false
	}

Once external-provisioner gets throttled, a single Events request can take over 10 seconds:

I0606 09:48:19.214596       1 request.go:480] Throttling request took 12.833541313s, request: GET:https://10.96.0.1:443/api/v1/namespaces/default/events?fieldSelector=involvedObject.name%3Dqcfs-pvc-only-114%2CinvolvedObject.namespace%3Ddefault%2CinvolvedObject.kind%3DPersistentVolumeClaim%2CinvolvedObject.uid%3D516852a4-6927-11e8-b1b9-525400e717a6%2Ctype%3DWarning%2Creason%3DProvisioningFailed
I0606 09:48:29.614493       1 request.go:480] Throttling request took 10.389247175s, request: GET:https://10.96.0.1:443/api/v1/namespaces/default/events?fieldSelector=involvedObject.namespace%3Ddefault%2CinvolvedObject.kind%3DPersistentVolumeClaim%2CinvolvedObject.uid%3D516852a4-6927-11e8-b1b9-525400e717a6%2CinvolvedObject.name%3Dqcfs-pvc-only-114%2Ctype%3DNormal%2Creason%3DProvisioningSucceeded
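These "Throttling request took" messages come from client-go's client-side rate limiter, which is configured through rest.Config. A minimal sketch of where those limits live (the values shown are roughly the library defaults, not the provisioner's actual settings):

	// Sketch: the client-side rate limiter behind the "Throttling request took"
	// messages. client-go limits requests per client via rest.Config; once the
	// limiter is saturated, every further request waits in a queue.
	package main

	import (
		"k8s.io/client-go/kubernetes"
		"k8s.io/client-go/rest"
	)

	func newClient() (*kubernetes.Clientset, error) {
		cfg, err := rest.InClusterConfig()
		if err != nil {
			return nil, err
		}
		// Defaults are roughly QPS=5 and Burst=10; with several watches and
		// event GETs per PVC, 100 concurrent provisions queue up for many
		// seconds, which is exactly what the log lines above show.
		cfg.QPS = 5
		cfg.Burst = 10
		return kubernetes.NewForConfig(cfg)
	}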

Worse, because of the timeout policies of the LeaderElector and of provisioning, external-provisioner retries periodically and produces even more requests, which makes things worse.

@saad-ali
Member Author

saad-ali commented Jun 6, 2018

Thanks for surfacing this @orainxiong.

Problem sounds like the external-provisioner (and probably external-attacher and driver-registrar) opens connections to the kube API server without thinking twice. If you provision a lot of volumes at once, the kube API server gets mad and throttles everything.

We should do the following:

  • Create a worker pool so that no more than N operations can be carried out against the API server at once (see the sketch after this list).
  • Allow N to be configured via a flag to the binary.
  • Incorporate the Kubernetes informer cache into sidecar containers so that reads/writes are optimized.
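
For illustration, a minimal sketch of the first two bullets: a fixed-size worker pool whose size N comes from a flag, so at most N provisioning operations talk to the API server at once. provisionClaim and the flag name are placeholders, not the real external-provisioner code:

	// Illustrative worker pool: at most N goroutines perform API-heavy work.
	// The binary is expected to call flag.Parse() before starting the workers.
	package main

	import (
		"flag"
		"sync"

		v1 "k8s.io/api/core/v1"
	)

	var workerThreads = flag.Int("worker-threads", 10, "max concurrent provisioning operations (N)")

	// provisionClaim is a hypothetical stand-in for the real provisioning call.
	func runWorkers(claims <-chan *v1.PersistentVolumeClaim, provisionClaim func(*v1.PersistentVolumeClaim) error) {
		var wg sync.WaitGroup
		for i := 0; i < *workerThreads; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				for claim := range claims {
					// One claim at a time per worker, so the number of
					// in-flight API calls is bounded by --worker-threads.
					_ = provisionClaim(claim)
				}
			}()
		}
		wg.Wait()
	}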

@orainxiong

@saad-ali Many thanks for your reply.

Incorporate the Kubernetes informer cache into sidecar containers so that reads/writes are optimized

As you mentioned, external-provisioner should leverage the informer cache to avoid talking directly to the kube-apiserver. Perhaps we could use the issue "CSI attach/detach should use shared informer" for reference.
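
For example, a rough sketch of reading PVCs through a shared informer cache instead of issuing GETs against the API server (the resync period and names are illustrative only):

	// Illustrative only: serve PVC reads from a shared informer's local cache.
	package main

	import (
		"time"

		"k8s.io/client-go/informers"
		"k8s.io/client-go/kubernetes"
	)

	func startPVCInformer(client kubernetes.Interface, stopCh <-chan struct{}) {
		factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
		pvcLister := factory.Core().V1().PersistentVolumeClaims().Lister()

		factory.Start(stopCh)
		factory.WaitForCacheSync(stopCh)

		// Reads now hit the local cache instead of the API server.
		_, _ = pvcLister.PersistentVolumeClaims("default").Get("qcfs-pvc-only-114")
	}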

If possible, I would like to ask @saad-ali for some views on the granularity of the LeaderElector lock. The lock implemented with ProvisionPVCLock is very fine-grained and allows multiple external-provisioners to work in parallel to improve efficiency. However, the improvement comes at the cost of more complex logic and more overhead on the kube-apiserver. From my point of view, the disadvantage outweighs the advantage. I am wondering why we don't follow the conventional approach of the existing control plane, which allows one controller to be active and the others passive.

@saad-ali
Member Author

saad-ali commented Jun 8, 2018

Spoke with Jan. He agrees with @orainxiong: the major source of issues is probably the leader election logic -- each provisioner tries to write an annotation to become leader, and if unsuccessful, each retries after some time.

We should focus on making leader election scalable.

@saad-ali
Member Author

saad-ali commented Jun 8, 2018

A temporary workaround may be to disable per-PV leader election and instead use per-provisioner leader election.
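
For illustration, per-provisioner leader election with client-go's leaderelection package would look roughly like this (current client-go uses a Lease lock, while Endpoints/ConfigMap locks were common at the time; the lock name and timings here are made up):

	// Sketch: one election per provisioner process instead of one lock per PVC.
	package main

	import (
		"context"
		"time"

		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
		"k8s.io/client-go/kubernetes"
		"k8s.io/client-go/tools/leaderelection"
		"k8s.io/client-go/tools/leaderelection/resourcelock"
	)

	func runWithLeaderElection(client kubernetes.Interface, id string, run func(context.Context)) {
		lock := &resourcelock.LeaseLock{
			LeaseMeta:  metav1.ObjectMeta{Namespace: "kube-system", Name: "csi-external-provisioner"},
			Client:     client.CoordinationV1(),
			LockConfig: resourcelock.ResourceLockConfig{Identity: id},
		}
		leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
			Lock:          lock,
			LeaseDuration: 15 * time.Second,
			RenewDeadline: 10 * time.Second,
			RetryPeriod:   2 * time.Second,
			Callbacks: leaderelection.LeaderCallbacks{
				OnStartedLeading: run,          // only the current leader provisions volumes
				OnStoppedLeading: func() {},    // stop work when leadership is lost
			},
		})
	}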

@wongma7
Contributor

wongma7 commented Jun 13, 2018

Yes, n PVCs * m provisioners is unacceptably bad. The original code is ancient; I will copy how HA is done in kube. It is only like this for the rare case where someone wants multiple provisioner instances (pointing to different storage sources with the same properties) to serve the same storage class. But that isn't a common use case, and this is a lazy and inefficient way of achieving it anyway.

An alternative was sharding, but I don't think it's possible.

@orainxiong

@saad-ali Thanks for your reply.

As we discussed above, this issue is mainly due to leader election and overuse of the API server. I try to fix the problem through the following steps, which require substantial changes to external-provisioner and external-storage. I am not sure whether this is right or not; looking forward to your suggestions.

  • Implement leader election per provisioner, as kube-controller-manager does. Meanwhile, the external-provisioner-runner role should be granted additional permissions to access configmaps, endpoints, and secrets.
  • Delete the function lockProvisionClaimOperation and related code within external-storage.
    Note: storage provisioners should perform leader election themselves rather than relying on external-storage.

But these are not enough: both external-storage and external-provisioner should use the informer cache instead of talking directly to the API server.

There is a PR showing more details, which mainly involves the following files:

  • cmd/csi-provisioner/csi-provisioner.go
  • pkg/controller/controller.go
  • vendor/github.com/kubernetes-incubator/external-storage/lib/controller/controller.go

These changes significantly reduce the time spent provisioning and deleting. In my environment, the time to provision 100 CSI PVCs dropped from more than 50 minutes to 100 seconds.

@vladimirvivien
Member

Where are we with this? Is @orainxiong's PR the workaround we want to pursue?

@vladimirvivien
Member

@orainxiong can you please create a proper PR so we can comment on it? Thanks.

@orainxiong

@vladimirvivien Here is PR #104

If you have any questions, please let me know. Many thanks.

@orainxiong

orainxiong commented Jul 29, 2018

@saad-ali @sbezverk @wongma7

I have taken some time these days to identify exactly which functions are time-consuming when provisioning a bunch of volumes, and to find a workaround.

I hope it works.

With the existing tools net/http/pprof and go-torch, we are able to generate flame graphs and figure out the related code paths more accurately.
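
For reference, exposing the profiling endpoint is just a blank import of net/http/pprof plus an HTTP listener; the port here is arbitrary:

	// Minimal pprof endpoint; go-torch / `go tool pprof` fetch profiles from it.
	package main

	import (
		"log"
		"net/http"
		_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
	)

	func main() {
		go func() {
			log.Println(http.ListenAndServe("localhost:6060", nil))
		}()
		// ... the provisioner's normal work would run here ...
		select {}
	}

A CPU profile can then be collected with `go tool pprof http://localhost:6060/debug/pprof/profile` and rendered as a flame graph by go-torch.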

[flame graph: external-provisioner CPU profile during provisioning]

From the flame graph above, we can see that the function addClaim takes over 35% of CPU time during the sampling period.

More specifically, the function controller.(*ProvisionController).watchProvisioning takes about 27% of CPU time, and the function leaderelection.(*LeaderElector) takes about 9%. Both are called by addClaim.

[flame graph details: watchProvisioning and leaderelection call paths]

After changing the leader lock from per-PVC to per-provisioner and using the informer cache to avoid talking to the API server directly, as we discussed before, the workaround has a significant effect.

In the following flame graph, the function addClaim has been reduced to about 14% of CPU time with the same test case.
[flame graph: CPU profile after the fix]

The external-provisioner is able to finish provisioning 100 PVCs (1GiB, RWO) created at once within 60 seconds.

There is still some room left to optimize, but I think it is probably enough.

@humblec
Contributor

humblec commented Jul 30, 2018

@orainxiong nice summary and illustration of the issue! The leader lock change looks to be a decent fix (at least as an interim) for now.

@jsafrane
Contributor

We removed per-PVC leader election a while ago and I think this issue is fixed in 1.0. I created 100 PVCs with the host path CSI driver and all PVs were created within ~50 seconds (using a single VM with 2 CPUs and local-up-cluster.sh), without any API throttling.

1000 PVCs were provisioned in ~8.5 minutes.
