Support data parallelism with a GPU cluster #369
Horovod with PyTorch vs. PyTorch DistributedDataParallel
PyTorch DistributedDataParallel and Horovod distributed training benchmarks
libtorch provides a thin wrapper for NCCL/Gloo, ProcessGroupNCCL. It has two advantages compared with using NCCL directly:
Here is an example:

#include <c10d/FileStore.hpp>
#include <c10d/ProcessGroupGloo.hpp>
#include <cstdlib>
#include <memory>
#include <vector>

using namespace ::c10d;

int main(int argc, char** argv) {
  int rank = atoi(getenv("RANK"));
  int size = atoi(getenv("SIZE"));
  auto store = std::make_shared<FileStore>("/tmp/c10d_example", size);
  ProcessGroupGloo pg(store, rank, size);

  // Create some tensors.
  const auto ntensors = 10;
  std::vector<at::Tensor> tensors;
  for (auto i = 0; i < ntensors; i++) {
    auto x =
        at::ones({1000, 16 * (i + 1)}, at::TensorOptions(at::CPU(at::kFloat)));
    tensors.push_back(x);
  }

  // Kick off an asynchronous AllReduce for every tensor.
  std::vector<std::shared_ptr<ProcessGroup::Work>> pending;
  for (auto i = 0; i < ntensors; i++) {
    std::vector<at::Tensor> tmp = {tensors[i]};
    pending.push_back(pg.allreduce(tmp));
  }

  // Wait for all the work to complete.
  for (auto& work : pending) {
    work->wait();
  }
}
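To try the example, one would typically start one process per rank, all pointing at the same FileStore path, e.g. run the binary with RANK=0 SIZE=2 in one shell and RANK=1 SIZE=2 in another (the store path /tmp/c10d_example must be shared by the processes, and the exact launch command depends on how the example is built).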
We want to train a distributed MNIST example; the following is an MVP (Minimum Viable Product) for that target:
After we complete the above, we could do more optimizations, including:
Any progress on distributed training?
Data Parallelism
Data parallelism replicates the model on every device; each replica generates gradients independently, and the gradients are communicated at each iteration to keep the model replicas consistent.
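As a concrete illustration, below is a minimal sketch of a single data-parallel step built on the c10d ProcessGroupGloo example earlier in this thread. The function name and the toy least-squares gradient are made up for illustration; a real implementation would run the framework's forward/backward pass and iterate over all model parameters.

#include <ATen/ATen.h>
#include <c10d/ProcessGroupGloo.hpp>
#include <vector>

// One data-parallel iteration: every replica computes a gradient on its own
// shard of data, the gradients are summed with AllReduce, averaged, and the
// same update is applied on every replica, keeping the model copies consistent.
void data_parallel_step(c10d::ProcessGroupGloo& pg, int world_size,
                        at::Tensor& weight, const at::Tensor& input,
                        const at::Tensor& target, double lr) {
  // Toy forward/backward: gradient of a least-squares loss on the local shard.
  auto pred = input.matmul(weight);
  auto grad = input.t().matmul(pred - target) / input.size(0);

  // Sum the local gradients across all replicas (in place).
  std::vector<at::Tensor> grads = {grad};
  pg.allreduce(grads)->wait();

  // Average, then apply the identical SGD update everywhere.
  grad.div_(world_size);
  weight.sub_(grad * lr);
}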
The following is a survey of options for supporting data parallelism in GoTorch.
Solutions
NCCL and Gloo
NCCL provides Broadcast and AllReduce C APIs. We could wrap them in Go and use them directly in GoTorch.
Gloo is another collective communications library, which supports both CPU and GPU.
NCCL's GPU performance is better than Gloo's.
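For reference, using the raw NCCL C API directly looks roughly like the sketch below, where a single process drives several local GPUs through ncclCommInitAll; a Go binding would wrap calls such as ncclAllReduce via cgo. The device count and buffer size are arbitrary illustration values.

#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>

int main() {
  const int ndev = 2;           // number of local GPUs (illustrative)
  const size_t count = 1024;    // elements per buffer
  std::vector<ncclComm_t> comms(ndev);
  std::vector<float*> sendbuf(ndev), recvbuf(ndev);
  std::vector<cudaStream_t> streams(ndev);

  // One communicator per local GPU; nullptr means "use devices 0..ndev-1".
  ncclCommInitAll(comms.data(), ndev, nullptr);

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaMalloc(&sendbuf[i], count * sizeof(float));
    cudaMalloc(&recvbuf[i], count * sizeof(float));
    cudaMemset(sendbuf[i], 0, count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // Sum the buffers across all GPUs; the group calls make the per-device
  // AllReduce submissions non-blocking until ncclGroupEnd.
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i) {
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  }
  ncclGroupEnd();

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaFree(sendbuf[i]);
    cudaFree(recvbuf[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}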
PyTorch Distributed Package
It performs additional optimizations, including bucketing small gradients into larger tensors and overlapping communication with computation.
Please refer to this paper for more details.
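The bucketing idea can be sketched with the c10d API from the earlier example: copy many (possibly tiny) gradients into one flat buffer, run a single AllReduce on it, and scatter the reduced values back. The helper below is only an illustration of the concept; real DistributedDataParallel buckets by size and triggers the reductions from autograd hooks during the backward pass.

#include <ATen/ATen.h>
#include <c10d/ProcessGroupGloo.hpp>
#include <vector>

// Reduce a list of gradients with one collective call instead of one per tensor.
void allreduce_bucketed(c10d::ProcessGroupGloo& pg,
                        std::vector<at::Tensor>& grads) {
  // Flatten all gradients into a single contiguous bucket.
  std::vector<at::Tensor> flat;
  for (auto& g : grads) flat.push_back(g.reshape({-1}));
  auto bucket = at::cat(flat);

  // One AllReduce for the whole bucket.
  std::vector<at::Tensor> tmp = {bucket};
  pg.allreduce(tmp)->wait();

  // Copy the reduced values back into the original gradient tensors.
  int64_t offset = 0;
  for (auto& g : grads) {
    auto n = g.numel();
    g.copy_(bucket.narrow(0, offset, n).view_as(g));
    offset += n;
  }
}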
Horovod
Horovod is a distributed deep learning training framework for TensorFlow, Keras, and PyTorch. Horovod calls NCCL or Gloo underneath.
Horovod also does many optimizations for communication. It uses PyTorch's hook mechanism to overlap communication and computation.
Horovod also supports elastic training.
Elastic training depends on the Gloo library, so GPU performance may suffer a little.
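The overlap itself can be illustrated with the asynchronous Work handles from the earlier c10d example: start the AllReduce for a gradient that is already available and keep computing while it is in flight, waiting only when the reduced value is needed. Hook-based systems such as Horovod and DistributedDataParallel automate exactly this during the backward pass; the hand-rolled two-step function below is just a stand-in, and its names are made up for illustration.

#include <ATen/ATen.h>
#include <c10d/ProcessGroupGloo.hpp>
#include <vector>

at::Tensor overlapped_step(c10d::ProcessGroupGloo& pg,
                           at::Tensor& grad_last_layer,
                           const at::Tensor& activations) {
  // The gradient of the last layer is ready first; start reducing it right away.
  std::vector<at::Tensor> bucket = {grad_last_layer};
  auto work = pg.allreduce(bucket);  // asynchronous: returns immediately

  // ... meanwhile, keep doing backward work for earlier layers
  // (a stand-in computation here).
  auto grad_first_layer = activations.sum(0);

  // Block only when the reduced gradient is actually needed, then average it.
  work->wait();
  grad_last_layer.div_(pg.getSize());
  return grad_first_layer;
}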
An interesting observation: people who want to run TensorFlow with an AllReduce distributed strategy tend to choose Horovod, whereas people who want to run PyTorch with an AllReduce distributed strategy tend to use torch.DistributedDataParallel directly.
Summary
So, let's make a summary:
Note 1
Key points to improve performance:
Note 2
Both Horovod and PyTorch support the Gloo backend, so we could add elastic training later with either solution.