Fix transitions of Requeued condition #2063
Conversation
Hi @mbobrovskyi. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign
// If a deactivated workload is re-activated we need to set the WorkloadRequeued condition to true if already unset.
if workload.IsUnsetRequeuedByDeactivation(&wl) {
	workload.SetRequeuedCondition(&wl, "ReactivateWorkload", "The workload is reactivated")
Can we have an API constant for "ReactivateWorkload"?
Added
cc @mimowo
ack
/test pull-kueue-test-scheduling-perf-main
LGTM overall; I'm still wondering whether we handle the QueueStopped scenario properly.
pkg/workload/workload.go
Outdated
if cond == nil || cond.Status != metav1.ConditionFalse || cond.Reason != kueue.WorkloadEvictedByPodsReadyTimeout {
	return nil, false
}
return cond, true
Suggested change:
return cond != nil && cond.Status == metav1.ConditionFalse && cond.Reason == kueue.WorkloadEvictedByPodsReadyTimeout
as above, much easier to read
Done
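The refactor the reviewer suggests (returning a single boolean expression instead of a (condition, bool) pair) can be sketched in a self-contained way. The Condition type and constants below are simplified stand-ins for metav1.Condition and the kueue API, not the actual types:

```go
package main

import "fmt"

// ConditionStatus and Condition are hypothetical stand-ins for
// metav1.ConditionStatus and metav1.Condition.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

type Condition struct {
	Type   string
	Status ConditionStatus
	Reason string
}

// Stand-in for the kueue constant of the same name.
const WorkloadEvictedByPodsReadyTimeout = "PodsReadyTimeout"

// isRequeuedByPodsReadyTimeout returns true only when the condition
// exists, is False, and was set by the PodsReady timeout -- the single
// boolean expression suggested in place of the (cond, bool) pair.
func isRequeuedByPodsReadyTimeout(cond *Condition) bool {
	return cond != nil && cond.Status == ConditionFalse && cond.Reason == WorkloadEvictedByPodsReadyTimeout
}

func main() {
	evicted := &Condition{Type: "Requeued", Status: ConditionFalse, Reason: WorkloadEvictedByPodsReadyTimeout}
	fmt.Println(isRequeuedByPodsReadyTimeout(evicted)) // true
	fmt.Println(isRequeuedByPodsReadyTimeout(nil))     // false
}
```

The short-circuiting of `&&` makes the nil check safe, which is why the single expression reads more clearly than the guarded early return.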
if condition, isUnset := workload.IsUnsetRequeuedByPodsReadyTimeout(&wl); condition != nil && isUnset {
	var requeueAfter time.Duration
	if wl.Status.RequeueState != nil && wl.Status.RequeueState.RequeueAt != nil {
		requeueAfter = time.Until(wl.Status.RequeueState.RequeueAt.Time)
Compute the time using r.clock; it should be available on the main branch. It allows us to use a fakeClock for testing.
Done
apis/kueue/v1beta1/workload_types.go
Outdated
const (
	// WorkloadRequeuedByReactivation indicates that the workload was requeued
	// because spec.active is set to true.
	WorkloadRequeuedByReactivation = "ReactivateWorkload"
Suggested change:
WorkloadReactivated = "WorkloadReactivated"
It is a reason. It will be clear that it is Requeued from the condition type.
Done
@@ -329,6 +329,12 @@
	WorkloadEvictedByDeactivation = "InactiveWorkload"
)

const (
nit: put together with other workload reasons
Done
@@ -1832,6 +1895,150 @@
	})
})

var _ = ginkgo.Describe("Job controller interacting with scheduler when waitForPodsReady enabled", ginkgo.Ordered, ginkgo.ContinueOnFailure, func() {
I fear that there is high potential for flakiness here.
The ideal solution is to manipulate the passing of time in the test, if you can pass a fake time to every controller.
OTOH, we are not too concerned with what the scheduler is doing, so we could potentially remove it from the equation. What we are mostly concerned with is the interaction between the job controller and the workload reconciler.
@mimowo do you have any suggestions on how to improve this test?
With the current solution I don't think we can manipulate the passing of time in the test. Inside workload_controller we are using a separate, encapsulated waitForPodsReadyConfig.
With the scheduler I think it's a more realistic case.
I think the test itself is fine, but I also believe there is a potential for flakes in the assert "checking the workload is evicted", because historically 1s was not enough for some asserts, and we use BackoffBaseSeconds: 1. I think we could proactively set BackoffBaseSeconds to 2 or 3, but I'm also ok to wait until we observe flakes here.
Integration passed 1/5 /test pull-kueue-test-integration-main
Integration passed 2/5 /test pull-kueue-test-integration-main
Integration passed 3/5 /test pull-kueue-test-integration-main
Integration passed 4/5 /test pull-kueue-test-integration-main
Integration passed 5/5 /test pull-kueue-test-integration-main
/lgtm
If @mimowo has any suggestions, we can apply them in a follow-up.
LGTM label has been added. Git tree hash: e5287e4c21b2efecc94d6bf8fd1dec28022ab3e2
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor, mbobrovskyi. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
if requeueAfter > 0 {
	return reconcile.Result{RequeueAfter: requeueAfter}, nil
}
workload.SetRequeuedCondition(&wl, kueue.WorkloadBackoffFinished, "The workload backoff was finished", true)
It looks inconsistent that we don't clear wl.Status.RequeueState for the PodsReady timeout, but we clear it on L168 in the case of Reactivation. In particular, I'm wondering if this could lead to incrementing RequeueState.Count here on the second requeue. Please verify and fix if this is the case.
I'm not sure that we need to clear wl.Status.RequeueState in this case, because we need to know how many attempts we have made, to deactivate the workload if we exceed the limit here.
func (r *WorkloadReconciler) triggerDeactivationOrBackoffRequeue(ctx context.Context, wl *kueue.Workload) (bool, error) {
	...
	// If requeuingBackoffLimitCount equals null, the workload is repeatedly and endlessly re-queued.
	requeuingCount := ptr.Deref(wl.Status.RequeueState.Count, 0) + 1
	if r.waitForPodsReady.requeuingBackoffLimitCount != nil && requeuingCount > *r.waitForPodsReady.requeuingBackoffLimitCount {
		wl.Spec.Active = ptr.To(false)
		if err := r.client.Update(ctx, wl); err != nil {
			return false, err
		}
		r.recorder.Eventf(wl, corev1.EventTypeNormal, kueue.WorkloadEvictedByDeactivation,
			"Deactivated Workload %q by reached re-queue backoffLimitCount", klog.KObj(wl))
		return true, nil
	}
	...
}
Or do you mean to clear just the RequeueAt field?
"I'm not sure that we need to clear wl.Status.RequeueState in this case, because we need to know how many attempts we have"
Ah, got it, we need to preserve the count so that we can increment it if the PodsReady timeout is exceeded the next time.
"Or you mean to clear just RequeueAt field?"
I guess the RequeueAt field is harmless, but it might be confusing, since the workload is already requeued, so indeed, I would suggest clearing it.
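The agreed follow-up (clear RequeueAt while preserving Count, so the backoff limit can still be enforced on the next eviction) could look roughly like this. RequeueState here is a minimal stand-in for the kueue type, and clearRequeueAt is a hypothetical helper, not the actual fix:

```go
package main

import "fmt"

// RequeueState is a simplified stand-in for kueue's RequeueState;
// RequeueAt stands in for the *metav1.Time field.
type RequeueState struct {
	Count     *int32
	RequeueAt *string
}

// clearRequeueAt drops only the RequeueAt timestamp, preserving Count
// so a later PodsReady-timeout eviction can still increment it toward
// requeuingBackoffLimitCount.
func clearRequeueAt(rs *RequeueState) {
	if rs != nil {
		rs.RequeueAt = nil
	}
}

func main() {
	count := int32(2)
	at := "2024-05-01T12:00:00Z"
	rs := &RequeueState{Count: &count, RequeueAt: &at}
	clearRequeueAt(rs)
	fmt.Println(*rs.Count, rs.RequeueAt == nil) // 2 true
}
```

Keeping Count while dropping RequeueAt is exactly the distinction the thread settles on: the count is state the backoff-limit check depends on, while the timestamp is stale once the workload is requeued.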
	return ctrl.Result{}, err
}
// If stopped cluster queue is started we need to set the WorkloadRequeued condition to true.
if isDisabledRequeuedByClusterQueueStopped(&wl) && ptr.Deref(cq.Spec.StopPolicy, kueue.None) == kueue.None {
I would suspect it does not matter which StopPolicy led to setting Requeued: false for ClusterQueueStopped. In particular, if stopPolicy: HoldAndDrain, then I believe we would set Requeued: false here. However, the extra check ptr.Deref(cq.Spec.StopPolicy, kueue.None) == kueue.None would prevent us from setting Requeued: true.
Oh, I think the implementation is correct (as discussed offline); we only set Requeued: true if stopPolicy: None.
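For illustration, the check discussed above can be sketched as a stand-alone snippet. All type and function names below are simplified stand-ins for the kueue API (ptr.Deref is emulated inline), not the actual controller code:

```go
package main

import "fmt"

// StopPolicy mirrors the kueue ClusterQueue StopPolicy values
// (simplified stand-in for illustration).
type StopPolicy string

const (
	None         StopPolicy = "None"
	Hold         StopPolicy = "Hold"
	HoldAndDrain StopPolicy = "HoldAndDrain"
)

// shouldRestoreRequeued mirrors the discussed condition: flip Requeued
// back to true only when it was unset by a ClusterQueue stop AND the
// queue's StopPolicy has returned to None (a nil policy defaults to
// None, like ptr.Deref(cq.Spec.StopPolicy, kueue.None)).
func shouldRestoreRequeued(requeuedFalseByCQStop bool, policy *StopPolicy) bool {
	effective := None
	if policy != nil {
		effective = *policy
	}
	return requeuedFalseByCQStop && effective == None
}

func main() {
	hold := HoldAndDrain
	fmt.Println(shouldRestoreRequeued(true, nil))   // true: nil defaults to None
	fmt.Println(shouldRestoreRequeued(true, &hold)) // false: queue still draining
}
```

This matches the resolution of the thread: while the queue is still Hold or HoldAndDrain, the workload stays un-requeued, and only a fully restarted queue (StopPolicy back to None) restores Requeued: true.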
@mbobrovskyi please verify and address (if needed) in a follow-up the following comments:
@mbobrovskyi thanks for handling the comments, so the only follow-up would be to clear RequeueAt.
Fixed in #2143.
/release-note-edit
Because this fixes an unreleased feature.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #2038
Special notes for your reviewer:
Does this PR introduce a user-facing change?