Distributed training with Ray

This repository contains both TensorFlow and PyTorch examples.

The timeHistory.py file is a timer callback for TensorFlow distributed training that is not included in Ray. If you want to set up the timer, see the Add Timer section below.

Run the TensorFlow example with:

python tensorflow_train_example.py -n 2 --hadoop_conf $HADOOP_CONF_DIR/

-n is the number of nodes you want to use.

--hadoop_conf is the path to the YARN configuration directory. If you want to run locally, just omit this argument.

You can also add --batch_size to set the batch size; the default value is 128.
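
For reference, the flags above could be parsed roughly as sketched below. The flag names (-n, --hadoop_conf, --batch_size) come from the commands in this README; the destination names and help texts are only illustrative and may differ from the actual scripts.

# Hedged sketch of the command-line handling described above.
import argparse

parser = argparse.ArgumentParser(description="Ray distributed training example")
parser.add_argument("-n", "--num_nodes", type=int, default=2,
                    help="number of nodes to use")
parser.add_argument("--hadoop_conf", type=str, default=None,
                    help="path to the YARN configuration directory; omit to run locally")
parser.add_argument("--batch_size", type=int, default=128,
                    help="training batch size (default: 128)")
args = parser.parse_args()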

Run the PyTorch example with:

python train_example.py -n 2 --hadoop_conf $HADOOP_CONF_DIR/

-n is the number of nodes you want to use.

--hadoop_conf is the path to the YARN configuration directory. If you want to run locally, just omit this argument.

Note that these two scripts do not use init_spark_on_local; if you need it, you can change the code manually, as sketched below.
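
Switching between YARN and local mode is roughly the branch sketched here. This assumes the Analytics Zoo RayOnSpark API (init_spark_on_yarn, init_spark_on_local, and RayContext); the exact import paths, argument names, and the conda environment name are assumptions and may need adjusting for your Analytics Zoo version.

# Hedged sketch: start Spark (and then Ray on top of it) either on YARN or
# locally, depending on whether --hadoop_conf was supplied.
# Import paths and argument names follow Analytics Zoo's RayOnSpark API and
# may need adjusting for your version.
from zoo import init_spark_on_yarn, init_spark_on_local
from zoo.ray import RayContext

if args.hadoop_conf:
    # Run on a YARN cluster using the given Hadoop/YARN config directory.
    sc = init_spark_on_yarn(hadoop_conf=args.hadoop_conf,
                            conda_name="ray_env",   # assumed conda env name
                            num_executors=args.num_nodes)
else:
    # Run everything on the local machine instead.
    sc = init_spark_on_local(cores="*")

ray_ctx = RayContext(sc=sc)
ray_ctx.init()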

Add Timer

You can copy the timeHistory.py file into PATH TO RAY/python/ray/experimental/sgd/tf and modify tf_runner.py as follows:

from ray.experimental.sgd.tf.timeHistory import TimeHistory
...

    def step(self):
        ...
        # Attach the timer callback so per-batch times are recorded during fit().
        time_callback = TimeHistory()
        history = self.model.fit(self.train_dataset, **fit_default_config,
                                 callbacks=[time_callback])
        if history is None:
            stats = {}
        else:
            logger.info(time_callback.batch_time)
            stats = {"train_" + k: v[-1] for k, v in history.history.items()}
            # Report the mean time per batch alongside the training metrics.
            stats["batch_time"] = sum(time_callback.batch_time) / len(time_callback.batch_time)
        ...
...
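
The patch above only relies on timeHistory.py exposing a TimeHistory Keras callback with a batch_time list. A minimal sketch of such a callback could look like the following; the actual file in this repository may record additional information.

# Hedged sketch of a per-batch timer callback.
import time
from tensorflow.keras.callbacks import Callback


class TimeHistory(Callback):
    def on_train_begin(self, logs=None):
        # Per-batch durations, read back in tf_runner.py's step().
        self.batch_time = []

    def on_batch_begin(self, batch, logs=None):
        self.batch_start = time.time()

    def on_batch_end(self, batch, logs=None):
        self.batch_time.append(time.time() - self.batch_start)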
