-
Notifications
You must be signed in to change notification settings - Fork 53
Fix deletion of elastic task resource requests #379
Conversation
} | ||
podSpec.Containers[idx].Resources = *resources | ||
} | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the problematic line.
@@ -720,7 +720,7 @@ func TestBuildResourcePytorchV1WithZeroWorker(t *testing.T) { | |||
assert.Error(t, err) | |||
} | |||
|
|||
func TestParasElasticConfig(t *testing.T) { | |||
func TestParseElasticConfig(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scope creep: fix typo
Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
5f9842a
to
331a60c
Compare
@yubofredwang I somehow can't tag you as reviewer but could you please sanity check this? Thanks |
Codecov Report
@@ Coverage Diff @@
## master #379 +/- ##
==========================================
- Coverage 62.96% 62.94% -0.03%
==========================================
Files 154 154
Lines 13019 13016 -3
==========================================
- Hits 8198 8193 -5
- Misses 4208 4210 +2
Partials 613 613
Flags with carried forward coverage won't be shown. Click here to find out more.
|
@fg91 just want to make sure I understand the logic here. Basically for the pytorch, MPI, and tensorflow plugins there are three states resources for each replica can be:
This introduces issues because in the elastic plugin the resources will always be Setting the resources for each replica in flytekit to empty lists for the elastic plugin would achieve the same result right? |
Your understanding is correct that there are these three possible states for resources for each replica and that this is the current behaviour.
It would, yes, I first considered doing this, but I personally find the current logic here odd: Is there a scenario where we would ever want to end up with a pod without any resource requests despite having set them in the task decorator's From the perspective of the elastic plugin I find it unintuitive that in order for the resources set in Considering the function This is why I personally think we should address this in the backend and not in I'd be interested in your thoughts on this @yubofredwang |
The reason of the code logic is that I assume when user does not specify resources at all. They want to use the default resources user set for that namespace. Considering the link here How about having an extra check like this
|
When not specifying any resources at all in a task, e.g. @task
def train() -> str:
print("Trained") the respective pod should have the task defaults configured on the Flyte platform side , not defaults configured e.g. via a In our case the task above runs with the platform defaults we configured in the helm values resources:
limits:
cpu: "1"
memory: 2Gi
requests:
cpu: "1"
memory: 2Gi I would expect the same resources for this task: @task(
task_config=Elastic(
nnodes=2,
nproc_per_node=2,
)
)
def train() -> str:
print("Trained") When using flyte 1.8.0 (helm chart, resources: {} In my opinion this is not the expected behaviour. |
Signed-off-by: Fabio Graetz <fabiograetz@googlemail.com>
TL;DR
Currently, the following elastic task's pods will not have any resource requests (1.8 release):
This PR fixes this.
Type
Are all requirements met?
Complete description
The kubeflow plugin was recently refactored in flytekit (flyteorg/flytekit#1627) as well as in flyteplugins (#345).
After the refactoring, users can set e.g. different resources for master and worker replicas of
PyTorch
tasks:If the user overrides one of the resource requests in the
task_config
, in the backend we override the resources configured on the task level here.Now, in a
@task(task_config=PyTorch()..)
task, when the user doesn't override the resources, we sendtask_models.Resources(requests=[], limits=[])
(which is different fromNone/nil
, see here) to the backend as part of the worker spec. We go into thisif
statement but not into this one within the first. As a consequence, in this case we don't overwrite the resource request set in@task(requests=Resources())
. This makes sense.However, the
Elastic
task config inflytekitplugins
currently doesn't allow overwriting the resource request since elastic pytorch training jobs don't have a differentiation between master and worker replicas. There are only workers which means we can simply use the normal@task(requests=Resources())
to specify their resources.This, however, means that in this case we send
None/nil
to the backend as part of the worker specs resource request. Here we go into theelse
case and simply delete what the user set as@task(requests=Resources())
.This seems wrong and this PR changes this.
Tracking Issue
NA
fixes https://github.com/flyteorg/flyte/issues/
Follow-up issue
NA