Ensure peer-pods deployment is complete w.r.to webhook and VM resource limits #1976

bpradipt · 2024-08-03T07:37:14Z

Currently, the default operator-based deployment doesn't deploy the complete stack:

Mutating webhook is not deployed: This affects resource management of peer pod VMs
Node extended resources are not advertised: This affects resource accounting and management of peer pod VMs

The following diagram shows the high level resource accounting and management for peer-pods

Ref: old deck on the resource accounting and management for peer-pods - https://docs.google.com/presentation/d/1GWNgQdRC5WxrXz_0XCW3DGIfzQHkO4MaN-8BlRPuTDc/edit#slide=id.g13a9839f269_0_0

The node extended resources are advertised by the peerpodconfig-ctrl. The earlier intention was to use peerpodconfig-ctrl to deploy all the required components for cloud-api-adaptor, but we are not yet there. This delay in implementation also gives us an opportunity to re-think the right approach.
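For reference, here is a minimal sketch of what advertising a node extended resource involves at the API level, whichever component ends up doing it. Kubernetes lets a client add an entry to a node's `status.capacity` with a JSON Patch. The resource name `kata.peerpods.io/vm` and the limit of 10 are assumptions for illustration, and the helper names are hypothetical. This is not the actual peerpodconfig-ctrl code.

```python
import json

# Assumed extended resource name; the thread itself does not state it.
PEER_POD_RESOURCE = "kata.peerpods.io/vm"


def escape_json_pointer(token: str) -> str:
    """Escape a JSON Pointer reference token (RFC 6901): '~' -> '~0', '/' -> '~1'."""
    return token.replace("~", "~0").replace("/", "~1")


def capacity_patch(resource: str, limit: int) -> list[dict]:
    """Build a JSON Patch that sets status.capacity[resource] on a Node."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return [
        {
            "op": "add",
            "path": "/status/capacity/" + escape_json_pointer(resource),
            # Extended resource quantities are whole numbers, sent as strings.
            "value": str(limit),
        }
    ]


if __name__ == "__main__":
    # Illustrative limit of 10 peer pod VMs for one node.
    print(json.dumps(capacity_patch(PEER_POD_RESOURCE, 10)))
```

A patch like this would be sent as a PATCH request to the node's `status` subresource with content type `application/json-patch+json`. The component sending it needs RBAC permission to patch `nodes/status`, which is one thing to keep in mind if this moves into cloud-api-adaptor.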

A few questions that come to mind:

Remove peerpodconfig-ctrl and advertise the node resources via cloud-api-adaptor itself?
Should we focus on the new design for managing the VMs - RFC: Simpler management of VM instances and PeerPod objects #1534 ?

Additionally, there is the issue of deploying all the components via the operator. Some initial work has been done, but it has created issues in the past with the release and test workflow, so this needs to be revisited as well.

I'm starting this issue to kick off the discussion so that we can address this important problem for the 0.10.0 release.

cc @yoheiueda @mkulke @stevenhorsman @snir911 @huoqifeng

mkulke · 2024-08-05T09:53:08Z

Few questions that comes to my mind:

Remove peerpodconfig-ctrl and advertise the node resources via cloud-api-adaptor itself?

Should we focus on the new design for managing the VMs - RFC: Simpler management of VM instances and PeerPod objects #1534 ?

Can we look at those questions individually or are they inherently coupled? I think that, with regard to cloud resource management, any robust solution will have to look at state management outside of the daemonset pod's memory. It could be a k8s controller-based solution like in the linked RFC. An alternative would be to use a persistent database and a control loop in the daemonset (afaiu that's what the GARM server does to manage resources). I'd lean towards the controller-based solution.
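To make the "state outside the pod's memory" point concrete, here is a rough sketch of a single reconcile step. It assumes the desired state (which pods need a VM) is read from a persistent store such as PeerPod objects or a database, and the actual state (which instances exist and who owns them) is read from the cloud API. All names are hypothetical, and this is neither the design in the RFC nor how GARM works internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    create: frozenset  # pod IDs that still need a VM
    delete: frozenset  # instance IDs whose owning pod is gone


def reconcile(desired_pods: set, vm_owner: dict) -> Plan:
    """Compare persisted desired state with observed cloud state.

    desired_pods: pod IDs that should have a VM (read from the persistent store).
    vm_owner: cloud instance ID -> pod ID it was created for (read from the cloud API).
    """
    pods_with_vm = set(vm_owner.values())
    create = frozenset(desired_pods - pods_with_vm)
    delete = frozenset(
        vm for vm, pod in vm_owner.items() if pod not in desired_pods
    )
    return Plan(create=create, delete=delete)


if __name__ == "__main__":
    plan = reconcile({"pod-a", "pod-b"}, {"vm-1": "pod-a", "vm-2": "pod-c"})
    print(sorted(plan.create), sorted(plan.delete))
```

Because both inputs are re-read on every pass, a restart of the daemonset pod would not lose track of instances. Orphaned VMs show up in `delete` on the next loop.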

bpradipt · 2024-08-07T16:59:08Z

Few questions that comes to my mind:

Remove peerpodconfig-ctrl and advertise the node resources via cloud-api-adaptor itself?

Should we focus on the new design for managing the VMs - RFC: Simpler management of VM instances and PeerPod objects #1534 ?

Can we look at those questions individually or are they inherently coupled?

We can look at them individually.

bpradipt · 2024-09-06T14:10:58Z

Raised a PR to remove peerpodconfig-ctrl #2027

On its own, this PR is not of much use unless the webhook is also deployed as part of the install, so that there is a maximum limit on the number of cloud instances that cloud-api-adaptor can create.
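To illustrate why the webhook matters for the limit, here is a simplified sketch of the kind of mutation involved: each peer pod is made to request one unit of an extended resource, so the scheduler will not place more such pods on a node than the capacity advertised there. The resource name `kata.peerpods.io/vm` and the runtime class `kata-remote` are assumptions, `mutate_pod` is a hypothetical helper, and the real webhook may differ in the details.

```python
import copy

# Assumed names, for illustration only.
PEER_POD_RESOURCE = "kata.peerpods.io/vm"
PEER_POD_RUNTIME_CLASS = "kata-remote"


def mutate_pod(pod: dict) -> dict:
    """Return a pod that requests one peer pod VM if it uses the peer pod runtime class.

    Pods with any other runtime class are returned unchanged.
    The input dict is never modified.
    """
    if pod.get("spec", {}).get("runtimeClassName") != PEER_POD_RUNTIME_CLASS:
        return pod
    mutated = copy.deepcopy(pod)
    containers = mutated["spec"].get("containers", [])
    if not containers:
        return mutated
    resources = containers[0].setdefault("resources", {})
    # For extended resources, requests and limits must be equal if both are set.
    for key in ("requests", "limits"):
        resources.setdefault(key, {})[PEER_POD_RESOURCE] = "1"
    return mutated


if __name__ == "__main__":
    pod = {
        "spec": {
            "runtimeClassName": PEER_POD_RUNTIME_CLASS,
            "containers": [{"name": "app"}],
        }
    }
    print(mutate_pod(pod)["spec"]["containers"][0]["resources"])
```

Without this mutation, pods never consume the extended resource, so advertising a capacity on the node has no effect on how many instances get created.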

This was referenced Aug 3, 2024

Webhook: Move all extended resources into one file #1095

Closed

Use operator to update PeerPod components #1214

Closed

Deploy webhook, peerpod and peerpodconfig controllers via make target #1260

Closed
