Ensure peer-pods deployment is complete w.r.to webhook and VM resource limits #1976

Open
bpradipt opened this issue Aug 3, 2024 · 3 comments

Comments

@bpradipt
Member

bpradipt commented Aug 3, 2024

Currently, the default operator-based deployment doesn't deploy the complete stack:

  • Mutating webhook is not deployed: This affects resource management of peer pod VMs
  • Node extended resources are not advertised: This affects resource accounting and management of peer pod VMs
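As a rough illustration of the second point: Kubernetes lets a component advertise an extended resource by sending a JSON Patch to the node's `status` subresource. The sketch below only builds that patch body. The resource name `kata.peerpods.io/vm` and the count of 10 are assumptions chosen for illustration, not values taken from this issue.

```python
import json


def escape_json_pointer(token: str) -> str:
    """Escape a JSON Pointer token (RFC 6901): '~' becomes '~0', '/' becomes '~1'."""
    return token.replace("~", "~0").replace("/", "~1")


def build_capacity_patch(resource_name: str, count: int) -> list:
    """Build a JSON Patch that adds an extended resource to a node's capacity."""
    return [
        {
            "op": "add",
            "path": "/status/capacity/" + escape_json_pointer(resource_name),
            "value": str(count),  # resource quantities are sent as strings
        }
    ]


# Hypothetical resource name and VM limit, for illustration only.
patch = build_capacity_patch("kata.peerpods.io/vm", 10)
print(json.dumps(patch))
```

A patch like this would be sent as an HTTP `PATCH` to `/api/v1/nodes/<node-name>/status` with `Content-Type: application/json-patch+json`. Whichever component ends up owning this step (peerpodconfig-ctrl today) has to perform the equivalent call.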

The following diagram shows the high-level resource accounting and management for peer-pods:

[Image: diagram of resource accounting and management for peer-pods]

Ref: old deck on the resource accounting and management for peer-pods - https://docs.google.com/presentation/d/1GWNgQdRC5WxrXz_0XCW3DGIfzQHkO4MaN-8BlRPuTDc/edit#slide=id.g13a9839f269_0_0

The node extended resources are advertised by the peerpodconfig-ctrl. The earlier intention was to use peerpodconfig-ctrl to deploy all the required components for cloud-api-adaptor, but we are not yet there. This delay in implementation also gives us an opportunity to re-think the right approach.

A few questions that come to mind:

  1. Remove peerpodconfig-ctrl and advertise the node resources via cloud-api-adaptor itself?
  2. Should we focus on the new design for managing the VMs - RFC: Simpler management of VM instances and PeerPod objects #1534 ?

Additionally, there is the issue of deploying all the components via the operator. Some initial work has happened, but it has created issues in the past with the release and test workflow, so this needs to be revisited as well.

I'm starting this issue to kickstart the discussion so that we can address this important problem for the 0.10.0 release.

cc @yoheiueda @mkulke @stevenhorsman @snir911 @huoqifeng

@mkulke
Contributor

mkulke commented Aug 5, 2024

A few questions that come to mind:

  1. Remove peerpodconfig-ctrl and advertise the node resources via cloud-api-adaptor itself?
  2. Should we focus on the new design for managing the VMs - RFC: Simpler management of VM instances and PeerPod objects #1534 ?

Can we look at those questions individually, or are they inherently coupled? I think that, with regard to cloud resource management, any robust solution will have to keep state outside of the daemonset pod's memory. It could be a k8s controller-based solution like in the linked RFC. An alternative would be to use a persistent database and a control loop in the daemonset (afaiu that's what the GARM server does to manage resources). I'd lean towards the controller-based solution.
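To make the "state outside the pod's memory" point concrete, here is a minimal sketch of one reconcile step. It assumes the desired set of VMs can be read from persisted objects (for example PeerPod resources) and the actual set can be listed from the cloud provider. The function name and the instance IDs are hypothetical.

```python
def reconcile(desired: set, actual: set) -> tuple:
    """Compare desired and actual instance IDs; return (to_create, to_delete).

    Both inputs are re-read from external sources on every pass, so no
    in-memory bookkeeping has to survive a restart of the daemonset pod.
    """
    to_create = sorted(desired - actual)  # declared but not running yet
    to_delete = sorted(actual - desired)  # running but no longer declared (leaked)
    return to_create, to_delete


# Hypothetical IDs: "vm-c" is a leaked instance left over from a crashed pod.
to_create, to_delete = reconcile({"vm-a", "vm-b"}, {"vm-b", "vm-c"})
print(to_create, to_delete)
```

The same comparison works whether the desired state lives in the Kubernetes API (the controller approach) or in a persistent database (the control-loop approach). The difference is only where `desired` is read from.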

@bpradipt
Member Author

bpradipt commented Aug 7, 2024

A few questions that come to mind:

  1. Remove peerpodconfig-ctrl and advertise the node resources via cloud-api-adaptor itself?
  2. Should we focus on the new design for managing the VMs - RFC: Simpler management of VM instances and PeerPod objects #1534 ?

Can we look at those questions individually or are they inherently coupled?

We can look at them individually.

@bpradipt
Member Author

bpradipt commented Sep 6, 2024

Raised a PR to remove peerpodconfig-ctrl #2027

This PR in isolation is not of much use unless the webhook is also deployed as part of the install to ensure there is a max limit to the number of cloud instances that can be created by cloud-api-adaptor.
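For context, here is a simplified sketch of how a mutating webhook can enforce such a limit: it adds a request for one unit of a node extended resource to each peer pod, so the scheduler stops placing pods once the advertised capacity is used up. The runtime class `kata-remote` and the resource name `kata.peerpods.io/vm` are assumptions here, and the real webhook may also adjust CPU and memory entries.

```python
import copy

PEERPOD_RUNTIME_CLASS = "kata-remote"  # assumed runtime class name
VM_RESOURCE = "kata.peerpods.io/vm"    # assumed extended resource name


def mutate_pod(pod: dict) -> dict:
    """Return a copy of the pod that requests one peer pod VM, if applicable."""
    mutated = copy.deepcopy(pod)
    spec = mutated.get("spec", {})
    if spec.get("runtimeClassName") != PEERPOD_RUNTIME_CLASS:
        return mutated  # not a peer pod: leave untouched
    containers = spec.get("containers", [])
    if containers:
        # One VM per pod, so the request is attached to the first container only.
        resources = containers[0].setdefault("resources", {})
        for key in ("requests", "limits"):
            resources.setdefault(key, {})[VM_RESOURCE] = "1"
    return mutated


pod = {"spec": {"runtimeClassName": "kata-remote", "containers": [{"name": "app"}]}}
print(mutate_pod(pod)["spec"]["containers"][0]["resources"])
```

Without the node advertising a capacity for the same resource, pods mutated this way would stay unschedulable, which is why the webhook and the resource advertisement need to be deployed together.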
