[cherry-pick for release 1.9] cherry-pick bugfixs #3464

Merged
merged 24 commits into from
May 16, 2024
24 commits
a642181
queue realcapability change to min dimension of queue capability and …
flyingfang Nov 21, 2023
266fd73
sort victims in reclaim action
lowang-bh Apr 6, 2024
434112f
feat: add ut common framework
lowang-bh Mar 9, 2024
c1d84aa
fix: ut didn't really run because init kubeclient failed; remove gomo…
lowang-bh Mar 9, 2024
de7b48b
refact reclaim ut with common test framework
lowang-bh Mar 24, 2024
b40464d
fix ut: reclaimee pod is randomly evicted
lowang-bh Apr 12, 2024
ae6a2e0
test: add case about reclaimee job sorting
lowang-bh Apr 13, 2024
8bc2f06
export the logs of Volcano components when the e2e test workflow fails
panoswoo Apr 25, 2024
bb3ed7c
move device-sharing logics from predicates to standalone deviceshare …
archlitchi Jan 29, 2024
2f6943e
fix: FilterNode err miss predicateStatus
googs1025 May 1, 2024
70e6cfb
ignore PredicateFn err info for preempt & reclaim scheduler plugin
May 9, 2024
322416f
capacity plugin implementation
Monokaix Nov 16, 2023
ae02178
Add capacity plugin e2e test
Monokaix May 9, 2024
20745a3
make generate-yaml to update crd
Monokaix May 10, 2024
dcc33be
Add user guide for capacity plugin
Monokaix May 11, 2024
14a8581
Remove list and watch secret in controller ClusterRole since it is no…
lekaf974 May 7, 2024
6ef29c5
fixing the panic when there is a data race for IgnoredDevicesList
belo4ya Feb 16, 2024
669f543
update go to 1.21
hwdef Feb 18, 2024
e910bc1
Fix queue metrics when there are no jobs in it
Monokaix May 13, 2024
d4dc70e
update e2e kind and kubectl ersion to adapt k8s 1.29
Apr 11, 2024
0ffc9fa
migrate vgpu to external
Monokaix Apr 19, 2024
2fb2c13
fix: task-topology plugin cannot handle the tasks whose name contains…
Feb 6, 2024
3dbc5af
update Score for deviceshare plugin
archlitchi May 15, 2024
1218c9e
Signed-off-by: yangqz <yangqz@tydic.com>
May 16, 2024
2 changes: 1 addition & 1 deletion .github/workflows/code_verify.yaml
@@ -18,7 +18,7 @@ jobs:
       - name: Install Go
         uses: actions/setup-go@v4
         with:
-          go-version: 1.20.x
+          go-version: 1.21.x

       - name: Checkout code
         uses: actions/checkout@v3
6 changes: 3 additions & 3 deletions .github/workflows/e2e_parallel_jobs.yaml
@@ -16,7 +16,7 @@ jobs:
       - name: Install Go
         uses: actions/setup-go@v4
         with:
-          go-version: 1.20.x
+          go-version: 1.21.x

       - name: Install musl
         run: |
@@ -31,8 +31,8 @@ jobs:

       - name: Install dependences
         run: |
-          GO111MODULE="on" go install sigs.k8s.io/kind@v0.15.0
-          curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.23.0/bin/linux/amd64/kubectl && sudo install kubectl /usr/local/bin/kubectl
+          GO111MODULE="on" go install sigs.k8s.io/kind@v0.21.0
+          curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.29.0/bin/linux/amd64/kubectl && sudo install kubectl /usr/local/bin/kubectl
       - name: Checkout code
         uses: actions/checkout@v3

6 changes: 3 additions & 3 deletions .github/workflows/e2e_scheduling_actions.yaml
@@ -16,7 +16,7 @@ jobs:
       - name: Install Go
         uses: actions/setup-go@v4
         with:
-          go-version: 1.20.x
+          go-version: 1.21.x

       - name: Install musl
         run: |
@@ -31,8 +31,8 @@ jobs:

       - name: Install dependences
         run: |
-          GO111MODULE="on" go install sigs.k8s.io/kind@v0.15.0
-          curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.23.0/bin/linux/amd64/kubectl && sudo install kubectl /usr/local/bin/kubectl
+          GO111MODULE="on" go install sigs.k8s.io/kind@v0.21.0
+          curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.29.0/bin/linux/amd64/kubectl && sudo install kubectl /usr/local/bin/kubectl
       - name: Checkout code
         uses: actions/checkout@v3

6 changes: 3 additions & 3 deletions .github/workflows/e2e_scheduling_basic.yaml
@@ -16,7 +16,7 @@ jobs:
       - name: Install Go
         uses: actions/setup-go@v4
         with:
-          go-version: 1.20.x
+          go-version: 1.21.x

       - name: Install musl
         run: |
@@ -31,8 +31,8 @@ jobs:

       - name: Install dependences
         run: |
-          GO111MODULE="on" go install sigs.k8s.io/kind@v0.15.0
-          curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.23.0/bin/linux/amd64/kubectl && sudo install kubectl /usr/local/bin/kubectl
+          GO111MODULE="on" go install sigs.k8s.io/kind@v0.21.0
+          curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.29.0/bin/linux/amd64/kubectl && sudo install kubectl /usr/local/bin/kubectl
       - name: Checkout code
         uses: actions/checkout@v3

6 changes: 3 additions & 3 deletions .github/workflows/e2e_sequence.yaml
@@ -16,7 +16,7 @@ jobs:
       - name: Install Go
         uses: actions/setup-go@v4
         with:
-          go-version: 1.20.x
+          go-version: 1.21.x

       - name: Install musl
         run: |
@@ -31,8 +31,8 @@ jobs:

       - name: Install dependences
         run: |
-          GO111MODULE="on" go install sigs.k8s.io/kind@v0.15.0
-          curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.23.0/bin/linux/amd64/kubectl && sudo install kubectl /usr/local/bin/kubectl
+          GO111MODULE="on" go install sigs.k8s.io/kind@v0.21.0
+          curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.29.0/bin/linux/amd64/kubectl && sudo install kubectl /usr/local/bin/kubectl
       - name: Checkout code
         uses: actions/checkout@v3

2 changes: 1 addition & 1 deletion .github/workflows/e2e_spark.yaml
@@ -50,7 +50,7 @@ jobs:
       - name: Install Go
         uses: actions/setup-go@v4
         with:
-          go-version: 1.20.x
+          go-version: 1.21.x
       - name: Set up Docker Buildx
         id: buildx
         uses: docker/setup-buildx-action@v2
6 changes: 3 additions & 3 deletions .github/workflows/e2e_vcctl.yaml
@@ -16,7 +16,7 @@ jobs:
       - name: Install Go
         uses: actions/setup-go@v4
         with:
-          go-version: 1.20.x
+          go-version: 1.21.x

       - name: Install musl
         run: |
@@ -31,8 +31,8 @@ jobs:

       - name: Install dependences
         run: |
-          GO111MODULE="on" go install sigs.k8s.io/kind@v0.15.0
-          curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.23.0/bin/linux/amd64/kubectl && sudo install kubectl /usr/local/bin/kubectl
+          GO111MODULE="on" go install sigs.k8s.io/kind@v0.21.0
+          curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.29.0/bin/linux/amd64/kubectl && sudo install kubectl /usr/local/bin/kubectl
       - name: Checkout code
         uses: actions/checkout@v3

2 changes: 1 addition & 1 deletion .github/workflows/fossa.yml
@@ -12,7 +12,7 @@ jobs:
       - uses: actions/checkout@v3
       - uses: actions/setup-go@v4
         with:
-          go-version: 1.20.x
+          go-version: 1.21.x
       - run: go version
       # Runs a set of commands to initialize and analyze with FOSSA
       - name: run FOSSA analysis
2 changes: 1 addition & 1 deletion .github/workflows/licenses_lint.yaml
@@ -16,7 +16,7 @@ jobs:
       - name: Install Go
         uses: actions/setup-go@v4
         with:
-          go-version: 1.20.x
+          go-version: 1.21.x
       - name: Checkout code
         uses: actions/checkout@v3
       - name: generate license mirror
2 changes: 1 addition & 1 deletion .github/workflows/release.yaml
@@ -18,7 +18,7 @@ jobs:
       - name: Install Go
         uses: actions/setup-go@v4
         with:
-          go-version: 1.20.x
+          go-version: 1.21.x

       - name: Install musl
         run: |
10 changes: 10 additions & 0 deletions config/crd/volcano/bases/scheduling.volcano.sh_queues.yaml
@@ -82,6 +82,16 @@ spec:
                   x-kubernetes-int-or-string: true
                 description: ResourceList is a set of (resource name, quantity) pairs.
                 type: object
+              deserved:
+                additionalProperties:
+                  anyOf:
+                  - type: integer
+                  - type: string
+                  pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
+                  x-kubernetes-int-or-string: true
+                description: The amount of resources configured by the user. This
+                  part of resource can be shared with other queues and reclaimed back.
+                type: object
               extendClusters:
                 description: extendCluster indicate the jobs in this Queue will be
                   dispatched to these clusters.
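
The `pattern` in the `deserved` schema above constrains string values, while `x-kubernetes-int-or-string` also admits plain integers. As a rough local illustration (not how the API server validates objects), the sketch below applies the same regular expression to a few values. The helper names `is_valid_quantity` and `invalid_entries` are hypothetical and exist only for this example.

```python
import re

# Pattern copied verbatim from the `deserved` schema in the CRD diff.
QUANTITY_PATTERN = re.compile(
    r"^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))"
    r"(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$"
)


def is_valid_quantity(value):
    """Hypothetical helper: mimic the int-or-string check for one value."""
    if isinstance(value, bool):  # bool is a subclass of int in Python
        return False
    if isinstance(value, int):
        return True
    return isinstance(value, str) and QUANTITY_PATTERN.fullmatch(value) is not None


def invalid_entries(deserved):
    """Return the resource names whose values would not match the schema."""
    return [name for name, value in deserved.items() if not is_valid_quantity(value)]


print(invalid_entries({"cpu": 6, "memory": "24Gi"}))
print(invalid_entries({"cpu": 2, "memory": "8GB"}))
```

Under this check, `8Gi` and `500m` are accepted, while `8GB` is rejected because `B` is not part of any allowed suffix.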
10 changes: 10 additions & 0 deletions config/crd/volcano/v1beta1/scheduling.volcano.sh_queues.yaml
@@ -81,6 +81,16 @@ spec:
                 x-kubernetes-int-or-string: true
               description: ResourceList is a set of (resource name, quantity) pairs.
               type: object
+            deserved:
+              additionalProperties:
+                anyOf:
+                - type: integer
+                - type: string
+                pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
+                x-kubernetes-int-or-string: true
+              description: The amount of resources configured by the user. This part
+                of resource can be shared with other queues and reclaimed back.
+              type: object
             extendClusters:
               description: extendCluster indicate the jobs in this Queue will be dispatched
                 to these clusters.
191 changes: 191 additions & 0 deletions docs/user-guide/how_to_use_capacity_plugin.md
@@ -0,0 +1,191 @@
# Capacity Plugin User Guide

## Introduction

The capacity plugin is a replacement for the proportion plugin. Instead of deriving each queue's deserved resources from its weight, it implements elastic queue capacity management, that is, a mechanism for borrowing and lending resources between queues, by letting you specify the amount of deserved resources for each resource dimension of a queue.

A queue can use the idle resources of other queues. When those queues later submit jobs, they can reclaim the resources they lent out, up to their own deserved amount. For more detail, see [Capacity scheduling design](../design/capacity-scheduling.md).

## Environment setup

### Install volcano

Refer to the [Install Guide](https://github.com/volcano-sh/volcano/blob/master/installer/README.md) to install Volcano.

After installation, update the scheduler configuration:

```shell
kubectl edit cm -n volcano-system volcano-scheduler-configmap
```

Make sure the capacity plugin is enabled and the proportion plugin is removed.

Note: the capacity and proportion plugins conflict with each other and cannot be used together.

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
    - plugins:
      - name: drf
        enablePreemptable: false
      - name: predicates
      - name: capacity # add this field and remove proportion plugin.
      - name: nodeorder
      - name: binpack
```
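
The rule that capacity must be enabled and must not coexist with proportion can be checked mechanically before applying the ConfigMap. The following is a minimal sketch, assuming the plugin tiers have been transcribed by hand into Python data (a real check would parse `volcano-scheduler.conf`); `plugin_names` and `config_problems` are hypothetical helpers, not part of Volcano.

```python
# Plugin tiers transcribed by hand from the ConfigMap above (assumption: no
# YAML parser is used here, so this list must be kept in sync manually).
TIERS = [
    {"plugins": [{"name": "priority"}, {"name": "gang"}, {"name": "conformance"}]},
    {"plugins": [{"name": "drf"}, {"name": "predicates"}, {"name": "capacity"},
                 {"name": "nodeorder"}, {"name": "binpack"}]},
]


def plugin_names(tiers):
    """Flatten all tiers into one list of plugin names, in order."""
    return [plugin["name"] for tier in tiers for plugin in tier["plugins"]]


def config_problems(tiers):
    """Hypothetical check: return human-readable problems, empty if none."""
    names = plugin_names(tiers)
    problems = []
    if "capacity" not in names:
        problems.append("capacity plugin is not enabled")
    if "capacity" in names and "proportion" in names:
        problems.append("capacity and proportion cannot be used together")
    return problems


print(config_problems(TIERS))
```

An empty list means the configuration satisfies both conditions described in this section.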

## Config queue's deserved resources

Assume your Kubernetes cluster has two nodes and two queues named queue1 and queue2. Each node has 4 CPU and 16Gi memory, so the cluster has 8 CPU and 32Gi memory in total. Each node reports the following allocatable resources:

```yaml
allocatable:
  cpu: "4"
  memory: 16Gi
  pods: "110"
```

Configure queue1's deserved field with 2 CPU and 8Gi memory.

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue1
spec:
  reclaimable: true
  deserved: # set the deserved field.
    cpu: 2
    memory: 8Gi
```

Configure queue2's deserved field with 6 CPU and 24Gi memory.

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue2
spec:
  reclaimable: true
  deserved: # set the deserved field.
    cpu: 6
    memory: 24Gi
```
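
In this example the deserved values of the two queues add up exactly to the cluster total (2 + 6 = 8 CPU, 8Gi + 24Gi = 32Gi). The guide does not say whether that is required, so the sketch below only makes the arithmetic explicit and flags any dimension where the deserved sum would exceed capacity. Memory is written in Gi as a plain integer to avoid quantity parsing, and `total` is a hypothetical helper.

```python
# Values from the example; memory is expressed in Gi to keep the arithmetic simple.
NODES = [{"cpu": 4, "memory_gi": 16}, {"cpu": 4, "memory_gi": 16}]
DESERVED = {
    "queue1": {"cpu": 2, "memory_gi": 8},
    "queue2": {"cpu": 6, "memory_gi": 24},
}


def total(resource_maps):
    """Sum a collection of {dimension: amount} maps dimension by dimension."""
    result = {}
    for resources in resource_maps:
        for dimension, amount in resources.items():
            result[dimension] = result.get(dimension, 0) + amount
    return result


cluster = total(NODES)
deserved = total(DESERVED.values())
# Dimensions where the queues together are promised more than the cluster has.
overcommitted = [d for d in deserved if deserved[d] > cluster.get(d, 0)]
print(cluster, deserved, overcommitted)
```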

## Submit pods to each queue

First, submit a deployment named demo-1 to queue1 with replicas=8, where each pod requests 1 CPU and 4Gi memory. Because queue2 is idle, queue1 can use the whole cluster's resources, and all 8 pods reach the Running state.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-1
spec:
  selector:
    matchLabels:
      app: demo-1
  replicas: 8
  template:
    metadata:
      labels:
        app: demo-1
      annotations:
        scheduling.volcano.sh/queue-name: "queue1" # set the queue
    spec:
      schedulerName: volcano
      containers:
      - name: nginx
        image: nginx:1.14.2
        resources:
          requests:
            cpu: 1
            memory: 4Gi
        ports:
        - containerPort: 80
```

Expected result:

```shell
$ kubectl get po
NAME                      READY   STATUS    RESTARTS   AGE
demo-1-7bc649f544-2wjg7   1/1     Running   0          5s
demo-1-7bc649f544-cvsmr   1/1     Running   0          5s
demo-1-7bc649f544-j5lzp   1/1     Running   0          5s
demo-1-7bc649f544-jvlbx   1/1     Running   0          5s
demo-1-7bc649f544-mzgg2   1/1     Running   0          5s
demo-1-7bc649f544-ntrs2   1/1     Running   0          5s
demo-1-7bc649f544-nv424   1/1     Running   0          5s
demo-1-7bc649f544-zd6d9   1/1     Running   0          5s
```

Then submit a deployment named demo-2 to queue2 with replicas=8, where each pod also requests 1 CPU and 4Gi memory.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-2
spec:
  selector:
    matchLabels:
      app: demo-2
  replicas: 8
  template:
    metadata:
      labels:
        app: demo-2
      annotations:
        scheduling.volcano.sh/queue-name: "queue2" # set the queue
    spec:
      schedulerName: volcano
      containers:
      - name: nginx
        image: nginx:1.14.2
        resources:
          requests:
            cpu: 1
            memory: 4Gi
        ports:
        - containerPort: 80
```

Because queue1 is occupying queue2's resources, queue2 reclaims its deserved resources of 6 CPU and 24Gi memory. Each pod of demo-2 requests 1 CPU and 4Gi memory, so 6 pods of demo-2 end up in the Running state, and 6 of demo-1's pods are evicted.

Finally, there are 2 Running pods in demo-1 (which belongs to queue1) and 6 Running pods in demo-2 (which belongs to queue2), which matches each queue's deserved resources. The remaining pods of both deployments stay Pending.

```shell
$ kubectl get po
NAME                      READY   STATUS    RESTARTS   AGE
demo-1-7bc649f544-4vvdv   0/1     Pending   0          37s
demo-1-7bc649f544-c6mds   0/1     Pending   0          37s
demo-1-7bc649f544-j5lzp   1/1     Running   0          14m
demo-1-7bc649f544-mzgg2   1/1     Running   0          14m
demo-1-7bc649f544-pqdgk   0/1     Pending   0          37s
demo-1-7bc649f544-tx6wp   0/1     Pending   0          37s
demo-1-7bc649f544-wmshq   0/1     Pending   0          37s
demo-1-7bc649f544-wrhrr   0/1     Pending   0          37s
demo-2-6dfb86c49b-2jvgm   0/1     Pending   0          37s
demo-2-6dfb86c49b-dnjzv   1/1     Running   0          37s
demo-2-6dfb86c49b-fzvmp   1/1     Running   0          37s
demo-2-6dfb86c49b-jlf69   1/1     Running   0          37s
demo-2-6dfb86c49b-k62f7   1/1     Running   0          37s
demo-2-6dfb86c49b-k9b9v   1/1     Running   0          37s
demo-2-6dfb86c49b-rpzvg   0/1     Pending   0          37s
demo-2-6dfb86c49b-zch7w   1/1     Running   0          37s
```
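
The pod counts in this walkthrough follow from simple arithmetic on the numbers above. The sketch below is a deliberately simplified steady-state model, not the scheduler's actual algorithm: it assumes a queue may borrow up to the whole cluster while the other queue is idle, and is limited to its deserved share once both queues are busy. All names are hypothetical, and memory is written in Gi as a plain integer.

```python
POD_REQUEST = {"cpu": 1, "memory_gi": 4}      # per-pod request of demo-1 and demo-2
CLUSTER = {"cpu": 8, "memory_gi": 32}         # two nodes with 4 CPU and 16Gi each
DESERVED = {
    "queue1": {"cpu": 2, "memory_gi": 8},
    "queue2": {"cpu": 6, "memory_gi": 24},
}
REPLICAS = 8


def pods_that_fit(capacity, request):
    """Number of whole pods that fit into capacity on every dimension."""
    return min(capacity[dim] // amount for dim, amount in request.items())


# Phase 1: queue2 is idle, so demo-1 can borrow up to the whole cluster.
running_before = min(REPLICAS, pods_that_fit(CLUSTER, POD_REQUEST))

# Phase 2: both queues are busy, so each is limited to its deserved share.
running_after = {
    queue: min(REPLICAS, pods_that_fit(share, POD_REQUEST))
    for queue, share in DESERVED.items()
}
evicted_from_demo1 = running_before - running_after["queue1"]

print(running_before, running_after, evicted_from_demo1)
```

With the example values, this model gives 8 running demo-1 pods before demo-2 is submitted, then 2 for queue1 and 6 for queue2, which agrees with the `kubectl get po` output above.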
