Proposal for Support of Pod Scheduling Readiness #3581

Merged: 1 commit, Sep 6, 2024

ykcai-daniel
Contributor

Proposal to add support for Pod Scheduling Readiness as mentioned in #3555

@volcano-sh-bot
Contributor

Welcome @ykcai-daniel!

It looks like this is your first PR to volcano-sh/volcano.

Thank you, and welcome to Volcano. 😃

@volcano-sh-bot volcano-sh-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jul 10, 2024
@ykcai-daniel
Contributor Author

@Monokaix Please have a look first. Will fix the CI problems soon.

@Monokaix
Member

Monokaix commented Jul 11, 2024

You can sign off your commit with `git commit -s` :)

# Pod Scheduling Readiness
## Motivation

Pod Scheduling Readiness is a beta feature in Kubernetes v1.27. Users expect Volcano to be aware of it. By specifying/removing a Pod's `.spec.schedulingGates`, which is an array of strings, users can control when a Pod is ready to be considered for scheduling. For Pods with non-empty `schedulingGates`, it will only be removed by kube-scheduler once all the gates are removed.
Member

it will only be removed by kube-scheduler once all the gates are removed.

Actually, the schedulingGates field is usually removed by external controllers; kube-scheduler only checks this field to determine whether to schedule the Pod.
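
The division of labor described in this comment can be sketched in Go. This is a minimal illustration, not Volcano or kube-scheduler code: `PodSchedulingGate` is redeclared locally to mirror the shape of the Kubernetes type so the example compiles on its own, while `removeGate`, `isSchedulingGated`, and the gate name `example.com/quota-approved` are hypothetical.

```go
package main

import "fmt"

// PodSchedulingGate mirrors the shape of the Kubernetes type of the same
// name; it is redeclared here so the sketch compiles without k8s.io/api.
type PodSchedulingGate struct {
	Name string
}

// removeGate returns gates without the entry called name. This is the step an
// external controller (hypothetical here) would perform before patching the Pod.
func removeGate(gates []PodSchedulingGate, name string) []PodSchedulingGate {
	kept := make([]PodSchedulingGate, 0, len(gates))
	for _, g := range gates {
		if g.Name != name {
			kept = append(kept, g)
		}
	}
	return kept
}

// isSchedulingGated is the scheduler-side check: a Pod is held back while at
// least one gate remains. The scheduler never removes gates itself.
func isSchedulingGated(gates []PodSchedulingGate) bool {
	return len(gates) > 0
}

func main() {
	gates := []PodSchedulingGate{{Name: "example.com/quota-approved"}}
	fmt.Println(isSchedulingGated(gates)) // true

	gates = removeGate(gates, "example.com/quota-approved")
	fmt.Println(isSchedulingGated(gates)) // false
}
```

The scheduler-side function only reads the gate list, which matches the point made above: removal is the responsibility of whoever added the gate.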


The following example illustrates why support for Scheduling Readiness is needed in Volcano. Suppose we have implemented an external quota manager responsible for reviewing all incoming pod requests for capacity/quota requirements. Only once these requests receive approval from the quota manager are they considered eligible for scheduling. The Pod `schedulingGates` feature can be handy when implementing this functionality
Member

Please add a full stop at the end of this passage.

2. **Plugins**: Scheduler plugins such as proportion register functions that determine whether a PodGroup is enqueueable or allocatable. These functions calculate the resources already used in the cluster from the state of each PodGroup. For example, the proportion plugin considers a job enqueueable when `sum(Inqueue)+sum(Running)+cur_job<TotalResource+elastic`. Scheduling gated jobs do not occupy resources in the cluster, so keeping them in `SchGated` rather than `Inqueue` excludes them from the used-resource sum, which reflects the actual situation.
3. **Actions**: Transitions between `Inqueue` and `SchGated` happen first in the allocation action. After that, other actions skip the gated pods.
4. **Controllers**: One alternative design of state transition is to directly transition from `Pending` to `SchGated`. However, this is not possible because Controllers only create Pods for a job once it is inqueued (see [delayed-pod-creation](./delay-pod-creation.md)). We can only know a Pod is scheduling gated in the Inqueue state. Therefore, we need to transition to SchGated from Inqueue.
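
To make the effect of the inequality in item 2 concrete, here is a minimal Go sketch. It assumes a single resource dimension (for example milli-CPU), and the `queueUsage` type, the `enqueueable` function, and all numbers are hypothetical; the real proportion plugin works on multi-dimensional resources.

```go
package main

import "fmt"

// queueUsage is a hypothetical single-dimension view of a queue (for example
// milli-CPU); the real plugin compares multi-dimensional resource lists.
type queueUsage struct {
	inqueue int64 // resources requested by Inqueue jobs
	running int64 // resources held by Running jobs
	total   int64 // TotalResource in the formula
	elastic int64 // elastic term in the formula
}

// enqueueable applies the inequality quoted in the proposal:
// sum(Inqueue) + sum(Running) + cur_job < TotalResource + elastic.
func enqueueable(q queueUsage, curJob int64) bool {
	return q.inqueue+q.running+curJob < q.total+q.elastic
}

func main() {
	const gatedJob = 4000 // request of a job whose pods are all gated

	// Gated job counted as Inqueue: 4000 + 4000 + 3000 = 11000, not < 10000.
	counted := queueUsage{inqueue: gatedJob, running: 4000, total: 10000}
	fmt.Println(enqueueable(counted, 3000)) // false

	// Gated job left out of the sum: 0 + 4000 + 3000 = 7000 < 10000.
	excluded := queueUsage{inqueue: 0, running: 4000, total: 10000}
	fmt.Println(enqueueable(excluded, 3000)) // true
}
```

With the gated job counted as inqueue, a new job is rejected even though the cluster has room for it; leaving the gated job out of the sum admits it.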
Member

Therefore, we need to transition to SchGated from Inqueue.

We should clarify that a pg transitions from Pending to Inqueue only when it is first created, and that once the pods of that pg are created, the pg transitions from Inqueue to SchGated.

Contributor Author

Maybe we should also emphasize atomicity of Pod creation: Once the state of a pg is transitioned from Pending to Inqueue, I think all Pods of that pg will be created. In other words, the scheduler will never see a Job with partially created Pods. If this is not the case, the following situation might occur:

Consider a Job with four pods: p1, p2, p3, p4.
Job enqueued -> p1, p2 created with gates -> p1, p2 gates removed -> Job allocated to node -> p3, p4 created with gates
We will end up with a Running Job that still has scheduling gates.

Member

Good catch!

## Granularity of Pod Scheduling Readiness
Pod scheduling readiness is a field in the spec of a pod. However, Volcano schedules Jobs, which consist of many tasks, each corresponding to a Pod. It is possible that some of these pods are scheduling gated while others are not. To align the granularity of Pod Scheduling Readiness with that of a Job, we treat scheduling gates as a property of the Job: as long as there is one gated pod, the job is scheduling gated. This is consistent with Volcano workloads: most Volcano Jobs need to run as a whole and cannot run partially.
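
The job-level rule in this paragraph ("as long as there is one gated pod, the job is scheduling gated") can be sketched as follows. The `task` type and `jobSchedulingGated` function are hypothetical stand-ins for illustration, not Volcano's actual data structures.

```go
package main

import "fmt"

// task is a hypothetical stand-in for one Pod of a Volcano Job.
type task struct {
	name  string
	gates []string // names of the Pod's remaining scheduling gates
}

// jobSchedulingGated reports whether any task still carries a gate. Under the
// job-level rule, one gated task is enough to gate the whole job.
func jobSchedulingGated(tasks []task) bool {
	for _, t := range tasks {
		if len(t.gates) > 0 {
			return true
		}
	}
	return false
}

func main() {
	job := []task{
		{name: "p1"},
		{name: "p2", gates: []string{"example.com/quota-approved"}},
	}
	fmt.Println(jobSchedulingGated(job)) // true

	job[1].gates = nil
	fmt.Println(jobSchedulingGated(job)) // false
}
```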
Member

Maybe we can add that only Volcano jobs have this limitation; for a normal Pod, Deployment, StatefulSet, etc., it is still a pod-level gating feature.

@Monokaix
Member

Another part we should consider is observability. When a pod is scheduling gated, what is the behavior of the pod and its pg? Should we report events to let users know what happened? Maybe we can also refer to kube-scheduler to decide whether we should report events and how frequently.

@volcano-sh-bot volcano-sh-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 15, 2024
@ykcai-daniel
Contributor Author

The proposal will be updated based on the new design. The main changes include:

  • No longer add a new PodGroup state. Instead, update the plugins so that inqueue resources are computed correctly.
  • In the allocation and preempt actions, skip scheduling gated tasks instead of the entire job.
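
The second change, skipping gated tasks rather than the whole job, can be sketched as a filter applied before an action considers a job's tasks. This is an illustrative assumption about the shape of the logic; the `task` type and `schedulableTasks` function are hypothetical, and the actual implementation is in #3658.

```go
package main

import "fmt"

// task is a hypothetical stand-in for one Pod of a Volcano Job.
type task struct {
	name  string
	gated bool // true while the Pod still has scheduling gates
}

// schedulableTasks returns the tasks an action may consider. Gated tasks are
// skipped one by one; their ungated siblings in the same job are kept.
func schedulableTasks(tasks []task) []task {
	var out []task
	for _, t := range tasks {
		if t.gated {
			continue // skip only this task, not the entire job
		}
		out = append(out, t)
	}
	return out
}

func main() {
	job := []task{{"p1", false}, {"p2", true}, {"p3", false}}
	for _, t := range schedulableTasks(job) {
		fmt.Println(t.name) // prints p1, then p3
	}
}
```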

@volcano-sh-bot volcano-sh-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 16, 2024
@ykcai-daniel
Contributor Author

@Monokaix design doc updated according to latest implementation #3658

@ykcai-daniel
Contributor Author

/assign @wpeng102

4. **Controllers**: K8S native resources like Deployment support template-level removal of scheduling gates. In other words, if a Deployment has scheduling gates in its pod template, patching the Deployment to remove the gates causes all its pods to be deleted and recreated without gates. For Vcjob, however, the job controller currently cannot detect changes in the PodTemplate and so cannot support this feature. That said, removing scheduling gates from a PodTemplate is uncommon, and the Pod Scheduling Gates feature is usually used at pod level. What's more, scheduling gates are often added by webhooks rather than in the Job template (more details in the K8S KEP). Therefore, we choose not to align this behavior with K8S.

## Limitations
1. **Vcjob support removing scheduling gates in template**: As mentioned above, currently, if we remove the scheduling gates field in a Vcjob, the gates of its pods are not removed. This behavior of Vcjob is different from native K8S resources.
Member

These are limitations, not our expectations. Please modify them :)

@volcano-sh-bot volcano-sh-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed retest-not-required-docs-only size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 6, 2024
Signed-off-by: ykcai-daniel <1155141377@link.cuhk.edu.hk>
@volcano-sh-bot volcano-sh-bot added retest-not-required-docs-only size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 6, 2024
@Monokaix
Member

Monokaix commented Sep 6, 2024

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 6, 2024
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: william-wang


@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 6, 2024
@volcano-sh-bot volcano-sh-bot merged commit 9ea246b into volcano-sh:master Sep 6, 2024
14 checks passed