-
Notifications
You must be signed in to change notification settings - Fork 234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scalability tests #1931
Scalability tests #1931
Conversation
Skipping CI for Draft Pull Request. |
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
/assign @mimowo |
29752d1
to
bc15fd0
Compare
/test all |
ack |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally I would like to steer the first iteration of the PR in this direction:
- drop custom code to track individual workloads, events, queues with controllers
- we only track metrics, scraped in 1s intervals by default. A report is saved for a subset of metrics, only guage and counter, in csv as time series. I'm ok with the metrics scraping as a follow up.
I think this approach will be much easier to maintain going forward. It will also create a healthy incentive to add metrics, which will be used on production clusters.
Let's go with the set of features already developed and add new functionality in follow-ups. |
I would like to drop the bits indicated in the comments to simplify the code, there is a maintenance cost if we commit them. |
/hold |
Relying on kueue's internal metrics in this scenario is problematic because:
It needs to be mentioned that the runner components that are extracting the measurements have additional attributes, (managing when a workload is declared finished, managing workload's eviction , determine when the scenario execution has ended.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall, thanks for the cleanup.
We should not look at it as a cleanup, is just a split in two PRs. |
Right, but let's park the second part until we play more with the tool, and have some idea if this is really needed, because it adds significant complexity to the code. |
/lgtm /assign @alculquicondor |
LGTM label has been added. Git tree hash: 7798498d85c2e3dc8dce9db1952b349225ae70d3
|
endif | ||
|
||
ifdef SCALABILITY_KUEUE_LOGS | ||
SCALABILITY_EXTRA_ARGS += --withLogs=true --logToFile=true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the file that it goes to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bin/run-scalability/minimalkueue.err.log
and bin/run-scalability/minimalkueue.out.log
(Just a side note, 1820 was really good, only from log size POV it got the size of bin/run-scalability/minimalkueue.err.log
from around 3GB to under 100MB )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI @gabesaba, as the person who discovered the issue :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test/scalability/README.md
Outdated
@@ -0,0 +1,64 @@ | |||
# Scalability test | |||
|
|||
Is a test meant to detect regressions int the Kueue's overall scheduling capabilities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should put all of this inside test/performance/scheduling?
The existing tests in test/performance are scalability tests as well, just a different level.
Another potential directory structure could be:
test/performance
(for this tool) and test/performance_e2e
(for the cl2 based things).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can but let's come up with a precise naming scheme, not only for the code location but also the artifacts and make targets, otherwise it will be hard to follow the terminology.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the target names should match. Maybe we can put all the targets for "performance" inside its own Makefile. so it would look like:
make test/performance/jobs test/performance/scheduling
But we can leave them for a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's go with the follow-up then
@@ -0,0 +1,31 @@ | |||
- className: cohort |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no templates for the objects, right? They are all generated in code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In short yes, we can extend the schema for this file if needed, but to keep it simple for now is better to just "hardcode" the genaration of namespace, LQs, ResourceFlavor ....
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, mimowo, trasc The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm |
LGTM label has been added. Git tree hash: d263b36621e23db124057908d6673e1576c52513
|
* [testing] Make RestConfigToKubeConfig an util function. * [scalability] Initial implementation. * Review remarks. * Review Remarks * Review Remarks
/release-note-edit
/kind feature |
/remove-kind feature |
What type of PR is this?
/kind feature
What this PR does / why we need it:
Add scalability test to be able to detect regressions in Kueue's overall scheduling performance.
Which issue(s) this PR fixes:
Relates to #1912
Special notes for your reviewer:
This should be the fist stage of this change, once this is merged and we have a CI job set up, we should refine the generator config and the expected metrics value.
Does this PR introduce a user-facing change?