This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Feat: Configure elastic training in pytorch plugin #343

Merged
merged 5 commits into from
Apr 24, 2023

Conversation

Member

@fg91 fg91 commented Apr 10, 2023

TL;DR

This PR modifies the PyTorch plugin so that it sets an ElasticPolicy on the Kubeflow PyTorchJob when a user configures torch elastic training (torchrun) in the task decorator:

from flytekit import task
from flytekitplugins.kfpytorch import Elastic

@task(
    task_config=Elastic(
        replicas=4,
        nproc_per_node=4,
        ...
    ),
    ...
)
def train(...):
    ...

See this issue for motivation and more details.
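
As a rough illustration of the mapping described above, here is a minimal sketch of how task-level elastic settings could be turned into the `elasticPolicy` section of a PyTorchJob spec. This is not the plugin's actual code (the plugin itself is written in Go). `ElasticConfig` and `to_elastic_policy` are hypothetical names, the optional fields and their defaults are assumptions, and the output keys follow my reading of the Kubeflow training operator's ElasticPolicy type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ElasticConfig:
    """Hypothetical stand-in for the elastic settings given in the task decorator."""

    replicas: int
    nproc_per_node: int
    # Assumed optional knobs; when unset, the job is pinned to `replicas` workers.
    min_replicas: Optional[int] = None
    max_replicas: Optional[int] = None
    max_restarts: int = 0
    rdzv_backend: str = "c10d"


def to_elastic_policy(cfg: ElasticConfig, rdzv_id: str) -> dict:
    """Build a dict shaped like a PyTorchJob `elasticPolicy` block (assumed field names)."""
    min_r = cfg.min_replicas if cfg.min_replicas is not None else cfg.replicas
    max_r = cfg.max_replicas if cfg.max_replicas is not None else cfg.replicas
    if not 1 <= min_r <= max_r:
        raise ValueError(f"invalid replica bounds: min={min_r}, max={max_r}")
    if cfg.nproc_per_node < 1:
        raise ValueError("nproc_per_node must be at least 1")
    return {
        "minReplicas": min_r,
        "maxReplicas": max_r,
        "nProcPerNode": cfg.nproc_per_node,
        "maxRestarts": cfg.max_restarts,
        "rdzvBackend": cfg.rdzv_backend,
        "rdzvId": rdzv_id,
    }


if __name__ == "__main__":
    # Mirrors the decorator example: 4 replicas with 4 processes each.
    print(to_elastic_policy(ElasticConfig(replicas=4, nproc_per_node=4), rdzv_id="demo-run"))
```

With only `replicas` set, the sketch pins both bounds to the same value, so the job is launched through torchrun but does not scale up or down. Whether the real plugin makes the same choice is not stated in this PR description.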

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

Tracking Issue

Fixes flyteorg/flyte#3614

Follow-up issue

Fabio Grätz added 2 commits April 22, 2023 20:03
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
Signed-off-by: Fabio Grätz <fabiogratz@googlemail.com>
Member Author

fg91 commented Apr 23, 2023

Tests are failing since flyteidl needs to be updated first.

@fg91 fg91 marked this pull request as ready for review April 23, 2023 11:01
@fg91 fg91 requested a review from kumare3 April 23, 2023 11:01
@fg91 fg91 self-assigned this Apr 23, 2023
@fg91 fg91 added the enhancement New feature or request label Apr 23, 2023
Signed-off-by: Ketan Umare <ketan.umare@gmail.com>
Signed-off-by: Ketan Umare <ketan.umare@gmail.com>

codecov bot commented Apr 24, 2023

Codecov Report

Merging #343 (74ea839) into master (1f39163) will increase coverage by 1.43%.
The diff coverage is 100.00%.

❗ Current head 74ea839 differs from pull request most recent head c8de6e2. Consider uploading reports for the commit c8de6e2 to get more accurate results

@@            Coverage Diff             @@
##           master     #343      +/-   ##
==========================================
+ Coverage   62.64%   64.07%   +1.43%     
==========================================
  Files         148      148              
  Lines       12397    10072    -2325     
==========================================
- Hits         7766     6454    -1312     
+ Misses       4036     3023    -1013     
  Partials      595      595              
Flag Coverage Δ
unittests ?


Impacted Files Coverage Δ
...o/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go 80.00% <100.00%> (+7.65%) ⬆️

... and 130 files with indirect coverage changes


Signed-off-by: Ketan Umare <ketan.umare@gmail.com>
Development

Successfully merging this pull request may close these issues.

[Core feature] Support torch elastic training/torchrun
2 participants