Scalability tests #1931

trasc · 2024-03-29T14:41:38Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add a scalability test to detect regressions in Kueue's overall scheduling performance.

Which issue(s) this PR fixes:

Relates to #1912

Special notes for your reviewer:

This should be the first stage of this change. Once this is merged and we have a CI job set up, we should refine the generator config and the expected metric values.

Does this PR introduce a user-facing change?

Added scalability test for scheduling performance

k8s-ci-robot · 2024-03-29T14:41:41Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

netlify · 2024-03-29T14:41:56Z

✅ Deploy Preview for kubernetes-sigs-kueue canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | `f3a7d05` |
| 🔍 Latest deploy log | https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/661fd89884e7b600089b4b59 |

alculquicondor · 2024-04-02T15:28:31Z

/assign @mimowo

trasc · 2024-04-04T12:26:59Z

/test all

mimowo · 2024-04-08T12:38:51Z

ack

mimowo

Generally I would like to steer the first iteration of the PR in this direction:

- Drop the custom code that tracks individual workloads, events, and queues with controllers.
- Track only metrics, scraped at 1s intervals by default. A report is saved for a subset of metrics (gauges and counters only) as CSV time series. I'm OK with the metrics scraping as a follow-up.

I think this approach will be much easier to maintain going forward. It will also create a healthy incentive to add metrics, which will be used on production clusters.
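For illustration only, the metrics-only report described above could be sketched roughly as follows. This is not code from the PR: the sample payload, the metric names and values in it, and the helper names (`parse_exposition`, `write_rows`, `scrape_loop`) are all assumptions, and the parser handles only a simple subset of the Prometheus text exposition format (no trailing timestamps).

```python
import csv
import io
import time

# Made-up sample in the Prometheus text exposition format.
# Metric names and values are illustrative assumptions.
SAMPLE = """\
# TYPE kueue_pending_workloads gauge
kueue_pending_workloads{cluster_queue="cq-a",status="active"} 12
# TYPE kueue_admitted_workloads_total counter
kueue_admitted_workloads_total{cluster_queue="cq-a"} 340
# TYPE kueue_admission_wait_time_seconds histogram
kueue_admission_wait_time_seconds_bucket{cluster_queue="cq-a",le="1"} 300
kueue_admission_wait_time_seconds_sum{cluster_queue="cq-a"} 95.5
kueue_admission_wait_time_seconds_count{cluster_queue="cq-a"} 340
"""

KEPT_TYPES = {"gauge", "counter"}


def parse_exposition(text):
    """Return {series: value} for gauge and counter samples only."""
    types = {}
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("# TYPE "):
            # "# TYPE <name> <kind>"
            _, _, name, kind = line.split(None, 3)
            types[name] = kind
            continue
        if line.startswith("#"):
            continue
        # Assumes "<series> <value>" with no trailing timestamp.
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        # Histogram sample names (_bucket, _sum, _count) have no TYPE
        # entry of their own, so they are skipped here.
        if types.get(name) in KEPT_TYPES:
            samples[series] = float(value)
    return samples


def write_rows(out, timestamp, samples):
    """Append one CSV row per series: timestamp, series, value."""
    writer = csv.writer(out)
    for series in sorted(samples):
        writer.writerow([timestamp, series, samples[series]])


def scrape_loop(fetch, out, interval_s=1.0, rounds=3):
    """Call fetch() every interval_s seconds and record the samples."""
    for _ in range(rounds):
        write_rows(out, time.time(), parse_exposition(fetch()))
        time.sleep(interval_s)
```

In a real run, `fetch` would read the metrics endpoint of the process under test; passing a callable keeps the sketch independent of any particular HTTP client.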

test/scalability/README.md

test/scalability/default_rangespec.yaml

test/scalability/checker/checker_test.go

test/scalability/default_generator_config.yaml

test/scalability/default_rangespec.yaml

test/scalability/runner/generator/generator.go

test/scalability/runner/recorder/recorder.go

trasc · 2024-04-10T13:02:53Z

Generally I would like to steer the first iteration of the PR in this direction:

- Drop the custom code that tracks individual workloads, events, and queues with controllers.
- Track only metrics, scraped at 1s intervals by default. A report is saved for a subset of metrics (gauges and counters only) as CSV time series. I'm OK with the metrics scraping as a follow-up.

I think this approach will be much easier to maintain going forward. It will also create a healthy incentive to add metrics, which will be used on production clusters.

Let's go with the set of features already developed and add new functionality in follow-ups.

mimowo · 2024-04-10T13:16:36Z

Let's go with the set of features already developed and add new functionality in follow-ups.

I would like to drop the bits indicated in the comments to simplify the code, there is a maintenance cost if we commit them.

trasc · 2024-04-10T18:07:10Z

/hold

trasc · 2024-04-11T10:45:47Z

Relying on Kueue's internal metrics in this scenario is problematic because the metrics:

- Add extra CPU consumption in minimalkueue: serving the scrapes costs extra CPU and makes the critical path (scheduling) less prominent in the overall consumption.
- Are inflexible: no workload-class-specific measurements can be extracted. In my opinion, being able to track the time to admission for specific types of workloads (for example, with different priorities) is important.
- Are not very precise:
  - Scraping at 1s intervals while the workloads we are using run for at most 1s (90% of them well under) can miss important hot spots.
  - With some exceptions, the timings computed for the metrics are based on timestamps stored in the k8s API (creation, condition last transition), which have one-second resolution.

It should also be mentioned that the runner components that extract the measurements have additional responsibilities: deciding when a workload is declared finished, managing workload eviction, and determining when the scenario execution has ended.
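The two precision concerns above can be illustrated with a small sketch. Everything in it is hypothetical: the timestamps, the gauge values, and the helper names are made up for illustration and are not taken from Kueue or from this PR.

```python
import math


def truncated_wait(created_s, admitted_s):
    """Wait time computed from timestamps truncated to whole seconds."""
    return math.floor(admitted_s) - math.floor(created_s)


def value_at(events, t):
    """Value of a gauge at time t; the gauge holds its last set value.

    events is a list of (time_s, value) pairs sorted by time.
    """
    value = 0
    for ts, v in events:
        if ts > t:
            break
        value = v
    return value


def sampled_peak(events, interval_s, end_s):
    """Highest gauge value seen by a scraper sampling every interval_s."""
    steps = int(end_s / interval_s)
    return max(value_at(events, i * interval_s) for i in range(steps + 1))


# Hypothetical pending-workloads gauge: two short bursts, both of which
# start and end between consecutive 1s scrapes.
EVENTS = [(0.0, 0), (0.2, 50), (0.7, 3), (1.4, 40), (1.9, 2)]

# A 0.1s wait that straddles a second boundary is reported as 1s ...
short_wait = truncated_wait(10.95, 11.05)
# ... while a 0.9s wait inside one second is reported as 0s.
long_wait = truncated_wait(10.05, 10.95)

true_peak = max(v for _, v in EVENTS)
seen_peak = sampled_peak(EVENTS, 1.0, 2.0)
```

With these made-up numbers the scraper samples the gauge at 0s, 1s, and 2s and never observes either burst, which is the kind of missed hot spot the first precision point refers to.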

mimowo

LGTM overall, thanks for the cleanup.

test/scalability/runner/main.go

Makefile

test/scalability/README.md

trasc · 2024-04-16T13:30:43Z

LGTM overall, thanks for the cleanup.

We should not look at it as a cleanup; it is just a split into two PRs.

mimowo · 2024-04-16T14:10:35Z

We should not look at it as a cleanup; it is just a split into two PRs.

Right, but let's park the second part until we play more with the tool, and have some idea if this is really needed, because it adds significant complexity to the code.

mimowo · 2024-04-16T14:11:51Z

/lgtm
/approve

/assign @alculquicondor
For the top-level Makefile

k8s-ci-robot · 2024-04-16T14:11:57Z

LGTM label has been added.

Git tree hash: 7798498d85c2e3dc8dce9db1952b349225ae70d3

Makefile

alculquicondor · 2024-04-16T15:52:05Z

Makefile

+endif
+
+ifdef SCALABILITY_KUEUE_LOGS
+SCALABILITY_EXTRA_ARGS +=  --withLogs=true --logToFile=true


what is the file that it goes to?

bin/run-scalability/minimalkueue.err.log and bin/run-scalability/minimalkueue.out.log

(Just a side note: 1820 was really good. From a log-size point of view alone, it brought bin/run-scalability/minimalkueue.err.log down from around 3GB to under 100MB.)

FYI @gabesaba, as the person who discovered the issue :)

oh, actually, @gabesaba's was #1897
It was probably a combination of both.

alculquicondor · 2024-04-16T15:56:22Z

test/scalability/README.md

@@ -0,0 +1,64 @@
+# Scalability test
+
+Is a test meant to detect regressions int the Kueue's overall scheduling capabilities. 


maybe we should put all of this inside test/performance/scheduling?

The existing tests in test/performance are scalability tests as well, just a different level.

Another potential directory structure could be:

test/performance (for this tool) and test/performance_e2e (for the cl2 based things).

We can, but let's come up with a precise naming scheme, not only for the code location but also for the artifacts and make targets; otherwise it will be hard to follow the terminology.

Yes, the target names should match. Maybe we can put all the targets for "performance" inside their own Makefile, so it would look like:

make test/performance/jobs test/performance/scheduling

But we can leave them for a follow up.

Let's go with the follow-up then

test/scalability/README.md

alculquicondor · 2024-04-16T16:06:50Z

test/scalability/default_generator_config.yaml

@@ -0,0 +1,31 @@
+- className: cohort


There are no templates for the objects, right? They are all generated in code?

In short, yes. We can extend the schema for this file if needed, but to keep it simple for now it is better to just "hardcode" the generation of namespaces, LQs, ResourceFlavors ....

alculquicondor · 2024-04-17T12:57:03Z

/approve

k8s-ci-robot · 2024-04-17T12:57:10Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mimowo, trasc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [alculquicondor]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

test/scalability/README.md

test/scalability/checker/checker_test.go

test/scalability/default_rangespec.yaml

test/scalability/minimalkueue/main.go

test/scalability/runner/controller/controller.go

alculquicondor · 2024-04-17T14:44:28Z

/lgtm

k8s-ci-robot · 2024-04-17T14:44:33Z

LGTM label has been added.

Git tree hash: d263b36621e23db124057908d6673e1576c52513

* [testing] Make RestConfigToKubeConfig an util function.
* [scalability] Initial implementation.
* Review remarks.
* Review Remarks
* Review Remarks

alculquicondor · 2024-05-08T17:49:33Z

/release-note-edit

Added scalability test for scheduling performance

/kind feature

alculquicondor · 2024-05-29T20:25:24Z

/remove-kind feature
/kind cleanup

k8s-ci-robot requested review from alculquicondor and denkensk March 29, 2024 14:41

k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 29, 2024

trasc force-pushed the scalability-tests branch from 96f452d to b38e745 Compare April 2, 2024 13:59

k8s-ci-robot assigned mimowo Apr 2, 2024

trasc force-pushed the scalability-tests branch 2 times, most recently from 29752d1 to bc15fd0 Compare April 4, 2024 12:26

trasc force-pushed the scalability-tests branch from bc15fd0 to 9ecbb13 Compare April 4, 2024 14:30

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 5, 2024

trasc changed the title ~~Scalability tests~~ Scalabilit tests Apr 5, 2024

trasc marked this pull request as ready for review April 5, 2024 11:56

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 5, 2024

k8s-ci-robot requested a review from mimowo April 5, 2024 11:56

trasc force-pushed the scalability-tests branch from 8d87598 to effbb88 Compare April 5, 2024 14:01

trasc changed the title ~~Scalabilit tests~~ Scalability tests Apr 9, 2024

mimowo reviewed Apr 10, 2024


k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 10, 2024

mimowo reviewed Apr 16, 2024


Review remarks.

0b1d04d

k8s-ci-robot assigned alculquicondor Apr 16, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 16, 2024

alculquicondor reviewed Apr 16, 2024


Review Remarks

b8767b0

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 17, 2024

k8s-ci-robot requested review from alculquicondor and mimowo April 17, 2024 06:38

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 17, 2024

alexandear reviewed Apr 17, 2024


Review Remarks

f3a7d05

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 17, 2024

k8s-ci-robot merged commit c691eef into kubernetes-sigs:main Apr 17, 2024
14 checks passed

k8s-ci-robot added this to the v0.7 milestone Apr 17, 2024

trasc deleted the scalability-tests branch April 17, 2024 15:02

vsoch pushed a commit to researchapps/kueue that referenced this pull request Apr 18, 2024

Scalability tests (kubernetes-sigs#1931)

a9a7ca9

* [testing] Make RestConfigToKubeConfig an util function.
* [scalability] Initial implementation.
* Review remarks.
* Review Remarks
* Review Remarks

k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label May 8, 2024

k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. and removed kind/feature Categorizes issue or PR as related to a new feature. labels May 29, 2024
