Improve documentation and add distributed.rst #84

Open · wants to merge 1 commit into base: thd
2 changes: 1 addition & 1 deletion docs/source/autograd.rst
@@ -40,7 +40,7 @@ All :class:`Variable` s keep track of in-place operations applied to them, and
 if the implementation detects that a variable was saved for backward in one of
 the functions, but it was modified in-place afterwards, an error will be raised
 once backward pass is started. This ensures that if you're using in-place
-functions and not seing any errors, you can be sure that the computed gradients
+functions and not seeing any errors, you can be sure that the computed gradients
 are correct.


103 changes: 103 additions & 0 deletions docs/source/distributed.rst
@@ -0,0 +1,103 @@
.. module:: torch.distributed

Modes
=====

There are two modes to control the way calculations are distributed: master-worker mode and process-group mode.
Collaborator: the m-w mode and the p-g mode

Each of them is designed to resemble interfaces and APIs familiar to PyTorch users.

The process-group mode API is made to look like the MPI API.
Collaborator: I don't think the MPI API is a great reference in a tutorial - maybe a little explanation of the characteristics instead?

It is based on the idea that every machine in the network should queue jobs for itself.
Its interface is designed to give the programmer more flexibility than the master-worker mode, which naturally makes its API a little more complex.

.. TODO: code examples
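
A minimal sketch of what process-group-style code could look like: every node runs the same script, queries its own rank, and queues its own sends and receives. The `init_process_group` call and its arguments are assumptions about the API being designed here, not its final form::

    import torch
    import torch.distributed as dist

    # Assumed initialization call; it is expected to read the environment
    # variables described below (RANK, WORLD_SIZE, ...).
    dist.init_process_group(backend='tcp')

    rank = dist.get_rank()
    t = torch.zeros(5)
    if rank == 0:
        t += 1
        # Node 0 queues a send for itself; node 1 queues the matching receive.
        dist.send(t, dst=1)
    elif rank == 1:
        dist.recv(t, src=0)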

The master-worker mode, on the other hand, is dedicated to users familiar with Nvidia CUDA.
Collaborator: I think it's dedicated to users familiar with PyTorch's CUDA API

The API is simple and makes it easy to start using it right away.
This model does not scale well due to the bottleneck in the master node, which is responsible for all the planning and for queuing jobs in all the worker nodes.
Collaborator: does not scale well -> is not designed for great scalability?

Nevertheless, it is sufficient for networks of a few machines.

.. TODO: code examples
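
The master-worker API itself is not documented yet, but the style it aims for is the one PyTorch users already know from `torch.cuda`: a single script drives all devices, and placement is controlled explicitly. The snippet below is plain `torch.cuda` code shown only as a point of reference, not the master-worker API::

    import torch

    # One process (the "master") places work on devices explicitly; in the
    # master-worker mode, worker nodes play a role similar to GPUs here.
    x = torch.randn(4, 4).cuda()      # tensor on the default GPU
    with torch.cuda.device(1):
        y = torch.randn(4, 4).cuda()  # tensor on GPU 1
    z = x * 2                         # computed on the device holding x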

Distributed computing
=====================

In order to configure connections for a program using THD, it is required to set a couple of environment variables.
Collaborator: I think THD is the C++ library, which doesn't concern the users

Depending on the data channel implementation and the selected mode, it might be required to set a different subset of those variables.

Environment variables
^^^^^^^^^^^^^^^^^^^^^
.. TODO: Explain the usage of the environment variables.

Here is a list of the environment variables which configure THD:

- `RANK`
  It is used to differentiate nodes within the network from each other.
  It should be unique and in the range `0` to `WORLD_SIZE - 1`.

- `WORLD_SIZE`
  It is used to pass the information about the size of the network (the number of nodes) to the nodes.

- `MASTER_PORT`
  This variable should be set to the number of the port which worker nodes use to establish a connection with the master node.
  It needs to be set on the master node (worker nodes ignore it).

- `MASTER_ADDR`
  This variable should be set to the address of the master node and the value of `MASTER_PORT`, separated by a colon.
  For example, `server15:29500` or `192.168.70.15:29500`.
  It needs to be set on worker nodes (the master node ignores it).
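
When these variables are used (i.e. when not relying on an MPI launcher), a script can read them directly to adjust per-node behaviour. The snippet below is plain Python and only illustrates the convention::

    import os

    # RANK and WORLD_SIZE are exported by the launcher or by hand
    # (see the examples below).
    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])

    # A common pattern: only the node with rank 0 prints progress messages.
    if rank == 0:
        print('running on %d nodes' % world_size)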

How to launch a distributed program?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Let's assume we have a network of 3 machines called node0, node1 and node2.
Our PyTorch script is located in `~jim/project` and is called `training.py`.
The MPI examples were tested with the MPICH implementation; note that the `mpirun` launcher in particular differs between MPI implementations.

Process-group mode
------------------

MPI
~~~

::

    # On node0.
    mpirun -hosts node0,node1,node2 -n 3 python ~jim/project/training.py

The world size (number of nodes) has to be passed to `mpirun` using the `-n` option.
It is required by MPI to know how many processes it should spawn on the hosts provided with the `-hosts` option.

It is not required to set any THD-specific environment variables when using MPI.

.. Warn the user that mpirun might differ between MPI implementations.
Remember that `mpirun` is not standardized by MPI.
Even if you are not using the MPICH implementation, it should be pretty straightforward to set everything up.

TCP & Gloo
~~~~~~~~~~

::

    # On node0:
    export WORLD_SIZE=3 RANK=0 MASTER_PORT=29500
    python ~jim/project/training.py

    # On node1:
    export WORLD_SIZE=3 RANK=1 MASTER_ADDR=node0:29500
    python ~jim/project/training.py

    # On node2:
    export WORLD_SIZE=3 RANK=2 MASTER_ADDR=node0:29500
    python ~jim/project/training.py

node0 acts as the master node here.
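
To make the launch commands concrete, a minimal `training.py` could look roughly like the sketch below. The initialization call is an assumption about the API under development; the collective itself (`all_reduce`) sums a tensor across all three nodes::

    import torch
    import torch.distributed as dist

    # Assumed to pick up RANK, WORLD_SIZE and MASTER_ADDR/MASTER_PORT from
    # the environment variables exported above.
    dist.init_process_group(backend='tcp')

    rank = dist.get_rank()
    grads = torch.ones(10) * rank   # stand-in for locally computed gradients
    dist.all_reduce(grads)          # every node ends up with the sum: 0 + 1 + 2
    if rank == 0:
        print(grads[0])             # prints 3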

.. TODO: Make a distinction between the p-g mode and the m-w mode.

API
===
.. TODO: Document the API.

.. ............................................................................
.. General notes:
TODO: Fix formatting.
TODO: Focus on p-g mode.
2 changes: 1 addition & 1 deletion docs/source/multiprocessing.rst
@@ -83,6 +83,6 @@ the current process group, and will keep track of all shared memory allocations.
 Once all processes connected to it exit, it will wait a moment to ensure there
 will be no new connections, and will iterate over all shared memory files
 allocated by the group. If it finds that any of them still exist, they will be
-deallocated. We've tested this method and it prooved to be robust to various
+deallocated. We've tested this method and it proved to be robust to various
 failures. Still, if your system has high enough limits, and ``file_descriptor``
 is a supported strategy, we do not recommend switching to this one.