Kubeflow Pipeline distributed training support

kfp-dist-train provides utilities that work with Kubeflow Pipeline (and its Argo Workflows backend) so you can write distributed training code directly with the Kubeflow Pipeline SDK.

Get Started

  1. Set up a Kubeflow environment (for example, using https://github.com/alauda/kubeflow-chart).
  2. Upload the example kfp-dist-train.ipynb into a Notebook instance, or set up local pipeline submission.
  3. Execute the example to submit a workflow; the number of workers can be configured in the Kubeflow web UI. A minimal submission sketch follows this list.
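
The notebook itself is not reproduced here. The sketch below only assumes the standard Kubeflow Pipeline SDK v2 and shows how a pipeline with a num_workers run parameter could be defined and submitted; the component body and the names used are placeholders, not this project's actual code.

```python
# Minimal sketch, assuming the standard Kubeflow Pipeline SDK v2; component and
# pipeline names are placeholders, not the contents of kfp-dist-train.ipynb.
import kfp
from kfp import dsl, compiler


@dsl.component(base_image="python:3.10")
def train(num_workers: int):
    # Placeholder training step; the real distributed logic lives in the notebook.
    print(f"launching training with {num_workers} workers")


@dsl.pipeline(name="dist-train-example")
def dist_train_pipeline(num_workers: int = 2):
    # num_workers is a run parameter, so it can also be changed from the
    # Kubeflow web UI when starting a run.
    train(num_workers=num_workers)


if __name__ == "__main__":
    # Either compile the pipeline and upload the package through the web UI ...
    compiler.Compiler().compile(dist_train_pipeline, "dist_train_pipeline.yaml")

    # ... or submit directly from a Notebook instance inside the cluster.
    client = kfp.Client()  # assumes an in-cluster KFP endpoint
    client.create_run_from_pipeline_func(dist_train_pipeline,
                                         arguments={"num_workers": 4})
```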

Roadmap

  • support a kfpdist.component(dist=True) decorator as a wrapper around dsl.component (a hypothetical usage sketch follows this list)
  • support the parameter server strategy
  • support PyTorch
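
The decorator in the first item is not implemented yet; the lines below are purely a hypothetical sketch of how it might be used once it exists, mirroring dsl.component.

```python
# Hypothetical only: kfpdist.component(dist=True) is a roadmap item and does not
# exist yet; this sketch just illustrates the intended usage mirroring dsl.component.
import kfpdist  # hypothetical import of this project's package


@kfpdist.component(dist=True)  # hypothetical decorator from the roadmap
def train(learning_rate: float):
    # With dist=True, the step would presumably run as a multi-worker
    # distributed job rather than a single container.
    ...
```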
