Fix transitions of Requeued condition #2063
Conversation
Hi @mbobrovskyi. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign
// If a deactivated workload is re-activated we need to set the WorkloadRequeued condition to true if already unset.
if workload.IsUnsetRequeuedByDeactivation(&wl) {
	workload.SetRequeuedCondition(&wl, "ReactivateWorkload", "The workload is reactivated")
Can we have an API constant for "ReactivateWorkload"?
Added
cc @mimowo
ack
/test pull-kueue-test-scheduling-perf-main
LGTM overall; I'm still wondering whether we handle the QueueStopped scenario properly.
pkg/workload/workload.go
Outdated
if cond == nil || cond.Status != metav1.ConditionFalse || cond.Reason != kueue.WorkloadEvictedByPodsReadyTimeout {
	return nil, false
}
return cond, true
Suggested change:
return cond != nil && cond.Status == metav1.ConditionFalse && cond.Reason == kueue.WorkloadEvictedByPodsReadyTimeout
as above, much easier to read
Done
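The refactor the reviewer suggests (returning a single boolean expression instead of a (condition, bool) pair) can be sketched in a self-contained way. The Condition type and constants below are simplified stand-ins for metav1.Condition and the kueue API, not the actual types:

```go
package main

import "fmt"

// ConditionStatus and Condition are hypothetical stand-ins for
// metav1.ConditionStatus and metav1.Condition.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

type Condition struct {
	Type   string
	Status ConditionStatus
	Reason string
}

// Stand-in for the kueue constant of the same name.
const WorkloadEvictedByPodsReadyTimeout = "PodsReadyTimeout"

// isRequeuedByPodsReadyTimeout returns true only when the condition
// exists, is False, and was set by the PodsReady timeout -- the single
// boolean expression suggested in place of the (cond, bool) pair.
func isRequeuedByPodsReadyTimeout(cond *Condition) bool {
	return cond != nil && cond.Status == ConditionFalse && cond.Reason == WorkloadEvictedByPodsReadyTimeout
}

func main() {
	evicted := &Condition{Type: "Requeued", Status: ConditionFalse, Reason: WorkloadEvictedByPodsReadyTimeout}
	fmt.Println(isRequeuedByPodsReadyTimeout(evicted)) // true
	fmt.Println(isRequeuedByPodsReadyTimeout(nil))     // false
}
```

The short-circuiting of `&&` makes the nil check safe, which is why the single expression reads more clearly than the guarded early return.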
if condition, isUnset := workload.IsUnsetRequeuedByPodsReadyTimeout(&wl); condition != nil && isUnset {
	var requeueAfter time.Duration
	if wl.Status.RequeueState != nil && wl.Status.RequeueState.RequeueAt != nil {
		requeueAfter = time.Until(wl.Status.RequeueState.RequeueAt.Time)
Compute the time using r.clock; it should be available on the main branch. It allows us to use a fakeClock for testing.
Done
apis/kueue/v1beta1/workload_types.go
Outdated
const (
	// WorkloadRequeuedByReactivation indicates that the workload was requeued
	// because spec.active is set to true.
	WorkloadRequeuedByReactivation = "ReactivateWorkload"
Suggested change:
WorkloadReactivated = "WorkloadReactivated"
It is a reason. It will be clear that it is Requeued from the condition type.
Done
@@ -329,6 +329,12 @@
	WorkloadEvictedByDeactivation = "InactiveWorkload"
)

const (
nit: put together with other workload reasons
Done
@@ -1832,6 +1895,150 @@
	})
})

var _ = ginkgo.Describe("Job controller interacting with scheduler when waitForPodsReady enabled", ginkgo.Ordered, ginkgo.ContinueOnFailure, func() {
I fear that there is high potential for flakiness here.
The ideal solution is to manipulate the passing of time in the test, if you can pass a fake time to every controller.
OTOH, we are not too concerned with what the scheduler is doing, so we could potentially remove it from the equation. What we are mostly concerned with is the interaction between the job controller and the workload reconciler.
@mimowo do you have any suggestions on how to improve this test?
With the current solution I don't think we can manipulate the passing of time in the test. Inside workload_controller we are using a separate, encapsulated waitForPodsReadyConfig.
With the scheduler I think it's a more realistic case.
I think the test itself is fine, but I also believe there is a potential for flakes in the assert "checking the workload is evicted", because historically 1s was not enough for some asserts, and we use BackoffBaseSeconds: 1. I think we could proactively set BackoffBaseSeconds to 2 or 3, but I'm also ok to wait until we observe flakes here.
Integration passed 1/5 /test pull-kueue-test-integration-main
Integration passed 2/5 /test pull-kueue-test-integration-main
Integration passed 3/5 /test pull-kueue-test-integration-main
Integration passed 4/5 /test pull-kueue-test-integration-main
Integration passed 5/5 /test pull-kueue-test-integration-main
/lgtm
If @mimowo has any suggestions, we can apply them in a follow-up.
LGTM label has been added. Git tree hash: e5287e4c21b2efecc94d6bf8fd1dec28022ab3e2
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor, mbobrovskyi. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
if requeueAfter > 0 {
	return reconcile.Result{RequeueAfter: requeueAfter}, nil
}
workload.SetRequeuedCondition(&wl, kueue.WorkloadBackoffFinished, "The workload backoff was finished", true)
It looks inconsistent that we don't clear wl.Status.RequeueState for the PodsReady timeout, but we clear it on L168 in the case of Reactivation. In particular, I'm wondering if this could lead to incrementing RequeueState.Count here on the second requeue. Please verify and fix if this is the case.
I'm not sure that we need to clear wl.Status.RequeueState in this case, because we need to know how many attempts we have made, to deactivate the workload if we exceed the limit here.
func (r *WorkloadReconciler) triggerDeactivationOrBackoffRequeue(ctx context.Context, wl *kueue.Workload) (bool, error) {
	...
	// If requeuingBackoffLimitCount equals null, the workload is repeatedly and endlessly re-queued.
	requeuingCount := ptr.Deref(wl.Status.RequeueState.Count, 0) + 1
	if r.waitForPodsReady.requeuingBackoffLimitCount != nil && requeuingCount > *r.waitForPodsReady.requeuingBackoffLimitCount {
		wl.Spec.Active = ptr.To(false)
		if err := r.client.Update(ctx, wl); err != nil {
			return false, err
		}
		r.recorder.Eventf(wl, corev1.EventTypeNormal, kueue.WorkloadEvictedByDeactivation,
			"Deactivated Workload %q by reached re-queue backoffLimitCount", klog.KObj(wl))
		return true, nil
	}
	...
}
Or do you mean to clear just the RequeueAt field?
"I'm not sure that we need to clear wl.Status.RequeueState in this case, because we need to know how many attempts we have"
Ah, got it, we need to preserve the count so that we can increment it if the PodsReady timeout is exceeded the next time.
"Or you mean to clear just RequeueAt field?"
I guess the RequeueAt field is harmless, but it might be confusing, since the workload is already requeued, so indeed, I would suggest clearing it.
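The agreed follow-up (clear RequeueAt while preserving Count, so the backoff limit can still be enforced on the next eviction) could look roughly like this. RequeueState here is a minimal stand-in for the kueue type, and clearRequeueAt is a hypothetical helper, not the actual fix:

```go
package main

import "fmt"

// RequeueState is a simplified stand-in for kueue's RequeueState;
// RequeueAt stands in for the *metav1.Time field.
type RequeueState struct {
	Count     *int32
	RequeueAt *string
}

// clearRequeueAt drops only the RequeueAt timestamp, preserving Count
// so a later PodsReady-timeout eviction can still increment it toward
// requeuingBackoffLimitCount.
func clearRequeueAt(rs *RequeueState) {
	if rs != nil {
		rs.RequeueAt = nil
	}
}

func main() {
	count := int32(2)
	at := "2024-05-01T12:00:00Z"
	rs := &RequeueState{Count: &count, RequeueAt: &at}
	clearRequeueAt(rs)
	fmt.Println(*rs.Count, rs.RequeueAt == nil) // 2 true
}
```

Keeping Count while dropping RequeueAt is exactly the distinction the thread settles on: the count is state the backoff-limit check depends on, while the timestamp is stale once the workload is requeued.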
	return ctrl.Result{}, err
}
// If stopped cluster queue is started we need to set the WorkloadRequeued condition to true.
if isDisabledRequeuedByClusterQueueStopped(&wl) && ptr.Deref(cq.Spec.StopPolicy, kueue.None) == kueue.None {
I would suspect it does not matter which StopPolicy led to setting Requeued: false for ClusterQueueStopped. In particular, if stopPolicy: HoldAndDrain, then I believe we would set Requeued: false here. However, the extra check ptr.Deref(cq.Spec.StopPolicy, kueue.None) == kueue.None would prevent us from setting Requeued: true.
Oh, I think the implementation is correct (as discussed offline); we only set Requeued: true if stopPolicy: None.
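For illustration, the check discussed above can be sketched as a stand-alone snippet. All type and function names below are simplified stand-ins for the kueue API (ptr.Deref is emulated inline), not the actual controller code:

```go
package main

import "fmt"

// StopPolicy mirrors the kueue ClusterQueue StopPolicy values
// (simplified stand-in for illustration).
type StopPolicy string

const (
	None         StopPolicy = "None"
	Hold         StopPolicy = "Hold"
	HoldAndDrain StopPolicy = "HoldAndDrain"
)

// shouldRestoreRequeued mirrors the discussed condition: flip Requeued
// back to true only when it was unset by a ClusterQueue stop AND the
// queue's StopPolicy has returned to None (a nil policy defaults to
// None, like ptr.Deref(cq.Spec.StopPolicy, kueue.None)).
func shouldRestoreRequeued(requeuedFalseByCQStop bool, policy *StopPolicy) bool {
	effective := None
	if policy != nil {
		effective = *policy
	}
	return requeuedFalseByCQStop && effective == None
}

func main() {
	hold := HoldAndDrain
	fmt.Println(shouldRestoreRequeued(true, nil))   // true: nil defaults to None
	fmt.Println(shouldRestoreRequeued(true, &hold)) // false: queue still draining
}
```

This matches the resolution of the thread: while the queue is still Hold or HoldAndDrain, the workload stays un-requeued, and only a fully restarted queue (StopPolicy back to None) restores Requeued: true.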
@mbobrovskyi please verify and address (if needed) in a follow-up the following comments:
@mbobrovskyi thanks for handling the comments, so the only follow-up would be to clear RequeueAt.
Fixed in #2143.
/release-note-edit
Because this fixes an unreleased feature.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #2038
Special notes for your reviewer:
Does this PR introduce a user-facing change?