garrett4wade committed Sep 3, 2024
1 parent f19ed97 commit 9baae20
Showing 2 changed files with 77 additions and 35 deletions.
36 changes: 1 addition & 35 deletions docs/source/arch.rst
@@ -146,8 +146,7 @@
uses a heuristic allocation mode provided by ReaL. If users wish to make
manual allocations for distributed experiments, they can refer to
`local/ppo_manual.sh
<https://github.com/openpsi-project/ReaLHF/tree/main/examples/scripts/local/ppo_manual.sh>`_
-to properly set device mesh strings. Customized Algorithms
-=====================
+to properly set device mesh strings.

Customized algorithms typically involve implementing a new interface
file and a new experiment configuration file so that the experiment can
@@ -318,39 +317,6 @@
APIs for the ``search`` allocation mode. Currently not functional.
Parallelism Strategy
====================

Suppose we have a cluster with the dimensions (N, M), where N is the
number of nodes and M is the number of GPUs per node. ReaL will launch
N * M model worker processes, each exclusively occupying a GPU. These
processes will share a global PyTorch process group, and each MFC will
create several sub-groups on their respective device meshes.

For example, suppose N=4, M=8, and we have MFC 1 occupying the first two
nodes, MFC 2 occupying the last three nodes, and MFC 3 occupying the
first node. ReaL will first create process groups on their device meshes
after creating the global group. Next, ReaL will create data, tensor,
and pipeline parallel groups within each sub-group, similar to
Megatron-LM. These groups will be kept in `constants.py
<https://github.com/openpsi-project/ReaLHF/tree/main/realhf/base/constants.py>`_
as per-process global constants.

In the above example, the first node is shared by MFC 1 and 2. When
different MFCs are executed on the same GPU, ReaL switches the process
group by using a ``model_scope`` context defined in `constants.py
<https://github.com/openpsi-project/ReaLHF/tree/main/realhf/base/constants.py>`_.
The model name is provided by the MFC. Within the scope, the 3D
parallelism groups specifically refer to the groups of this MFC.

In summary, there are three levels of process groups in ReaL. The first
level is the data/tensor/pipeline parallel group for a specific MFC. The
intermediate level is the rank within the MFC's sub-group. The outermost
level is the global rank across all nodes in the global group. The
conversion from the first level to the intermediate level is handled by
the ``ProcessTopology`` class in `topology.py
<https://github.com/openpsi-project/ReaLHF/tree/main/realhf/base/topology.py>`_,
and the conversion from the intermediate level to the outermost level is
managed by the ``rank_mapping`` dictionary in `constants.py
<https://github.com/openpsi-project/ReaLHF/tree/main/realhf/base/constants.py>`_.

- `constants.py
<https://github.com/openpsi-project/ReaLHF/tree/main/realhf/base/constants.py>`_

76 changes: 76 additions & 0 deletions docs/source/impl.rst
@@ -79,6 +79,82 @@
configurations define an ``rpcs`` property, which is first processed by
the ``initial_setup`` method in ``realhf/experiments/common/common.py``
and then passed to the ``ExperimentConfig`` object to build the graph.

**********************
Parallelism Strategy
**********************

Suppose we have a cluster with dimensions (N, M), where N is the
number of nodes and M is the number of GPUs per node. ReaL will launch
N * M model worker processes, each exclusively occupying a GPU. These
processes will share a global PyTorch process group, and each MFC will
create sub-groups on their respective device meshes. Furthermore, the
data, tensor, and pipeline parallel groups are created within each
sub-group, similar to Megatron-LM. These groups will be kept in
`constants.py
<https://github.com/openpsi-project/ReaLHF/tree/main/realhf/base/constants.py>`_
as per-process global constants.
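
To make the rank layout concrete, the following sketch computes the list of global ranks from which an MFC's sub-group would be built (e.g. the ranks one would pass to ``dist.new_group``). The helper name ``mesh_subgroup_ranks`` and the node-major launch order are assumptions for illustration, not ReaL's actual code:

```python
def mesh_subgroup_ranks(gpus_per_node, node_indices):
    """Global ranks covered by a device mesh spanning the given nodes.

    Assumes worker processes are launched node-major, so
    global_rank = node_index * gpus_per_node + local_gpu_index.
    """
    return [n * gpus_per_node + g
            for n in node_indices
            for g in range(gpus_per_node)]


# With 2 nodes of 8 GPUs each, an MFC whose device mesh is the second
# node would form its sub-group from global ranks 8..15:
print(mesh_subgroup_ranks(8, [1]))  # [8, 9, 10, 11, 12, 13, 14, 15]
```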

When different MFCs are executed on the same GPU (i.e., their device
meshes overlap), ReaL switches the process group by using a
``model_scope`` context defined in `constants.py
<https://github.com/openpsi-project/ReaLHF/tree/main/realhf/base/constants.py>`_.
The model name is provided by the MFC. Within the scope, the 3D
parallelism groups specifically refer to the groups of this MFC.
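
The scope-switching mechanism can be illustrated with a toy registry. ``register_model``, ``tensor_parallel_group``, and the storage layout below are hypothetical stand-ins sketching the idea, not the actual API of ``constants.py``:

```python
import contextlib

_groups = {}           # model name -> that MFC's parallel groups
_current_model = None  # set and restored by model_scope

def register_model(name, groups):
    _groups[name] = groups

@contextlib.contextmanager
def model_scope(name):
    """Select which model's process groups the accessors refer to."""
    global _current_model
    prev, _current_model = _current_model, name
    try:
        yield
    finally:
        _current_model = prev  # restore on exit, so scopes can nest

def tensor_parallel_group():
    # Within a model_scope, this refers specifically to that MFC's group.
    return _groups[_current_model]["tp"]

register_model("actor", {"tp": "actor-tp-group"})
register_model("critic", {"tp": "critic-tp-group"})
with model_scope("actor"):
    print(tensor_parallel_group())  # actor-tp-group
```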

In summary, there are three levels of process groups in ReaL. The
innermost level is the data/tensor/pipeline parallel group for a
specific MFC. The intermediate level is each MFC's sub-group. The
outermost level is the global PyTorch process group created by
``dist.init_process_group``. The conversion from the innermost level to
the intermediate level is handled by the ``ProcessTopology`` class in
`topology.py
<https://github.com/openpsi-project/ReaLHF/tree/main/realhf/base/topology.py>`_,
and the conversion from the intermediate level to the outermost level is
managed by the ``rank_mapping`` dictionary in `constants.py
<https://github.com/openpsi-project/ReaLHF/tree/main/realhf/base/constants.py>`_.
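
A toy version of the two conversions may help. It assumes sub-group ranks flatten the (pipe, data, tensor) coordinate row-major with the tensor axis fastest (consistent with the example that follows); the function and variable names are illustrative, not ReaL's API:

```python
def coord_to_subgroup_rank(pp, dp, tp, dp_size, tp_size):
    """Innermost -> intermediate: flatten a 3D coordinate row-major."""
    return (pp * dp_size + dp) * tp_size + tp

# Intermediate -> outermost: rank mapping for an MFC occupying the
# second node of a 2x8 cluster (sub-group rank i sits on global rank 8+i).
rank_mapping = {i: 8 + i for i in range(8)}

# The worker at (pp=1, dp=0, tp=1) of an MFC with DP=2, TP=2, PP=2:
sub_rank = coord_to_subgroup_rank(1, 0, 1, dp_size=2, tp_size=2)
print(sub_rank, rank_mapping[sub_rank])  # 5 13
```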

For example, suppose we have a 2x8 device mesh and two MFCs. MFC#1 occupies
the last 1x8 GPUs, aka the second node, and MFC#2 occupies all 2x8 GPUs.
MFC#1 has a parallel strategy of (DP=2,TP=2,PP=2), and MFC#2 has a
parallel strategy of (DP=4,TP=4,PP=1). Denote the GPUs on the first node
as [g0, ..., g7] and the GPUs on the second node as [g8, ..., g15]. The
following process groups will be created:

- The global group: [g0, g1, g2, ..., g15], aka all GPUs.

- MFC#1's sub-group: [g8, g9, g10, g11, g12, g13, g14, g15], aka the
second node.

- MFC#2's sub-group: [g0, g1, g2, ..., g15], aka all GPUs. This is a
virtual group; ReaL simply reuses the global group for collective
operations on this sub-group.

- MFC#1's 4 pipeline parallel groups: [g8, g12], [g9, g13], [g10, g14],
[g11, g15].

- MFC#1's 4 tensor parallel groups: [g8, g9], [g10, g11], [g12, g13],
[g14, g15].

- MFC#1's 4 data parallel groups: [g8, g10], [g9, g11], [g12, g14],
[g13, g15].

- MFC#2's pipeline parallel group: [g0, g1, ..., g15]. This is also a
virtual group.

- MFC#2's 4 tensor parallel groups: [g0, g1, g2, g3], [g4, g5, g6, g7],
[g8, g9, g10, g11], [g12, g13, g14, g15].

- MFC#2's 4 data parallel groups: [g0, g4, g8, g12], [g1, g5, g9, g13],
[g2, g6, g10, g14], [g3, g7, g11, g15].
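
The group lists above can be reproduced mechanically. The sketch below enumerates the (pipe, data, tensor) grid with the tensor axis fastest, which matches the groupings in this example; it is an illustration of the layout, not ReaL's actual group-construction code:

```python
from itertools import product

def parallel_groups(gpus, dp, tp, pp):
    """Enumerate 3D-parallel groups over `gpus`, tensor axis fastest."""
    grid = {}
    for rank, (p, d, t) in enumerate(product(range(pp), range(dp), range(tp))):
        grid[(p, d, t)] = gpus[rank]
    groups = {"pp": [], "dp": [], "tp": []}
    for d, t in product(range(dp), range(tp)):   # vary the pipe axis
        groups["pp"].append([grid[(p, d, t)] for p in range(pp)])
    for p, t in product(range(pp), range(tp)):   # vary the data axis
        groups["dp"].append([grid[(p, d, t)] for d in range(dp)])
    for p, d in product(range(pp), range(dp)):   # vary the tensor axis
        groups["tp"].append([grid[(p, d, t)] for t in range(tp)])
    return groups

mfc1 = parallel_groups([f"g{i}" for i in range(8, 16)], dp=2, tp=2, pp=2)
print(mfc1["tp"])  # [['g8', 'g9'], ['g10', 'g11'], ['g12', 'g13'], ['g14', 'g15']]
```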

The rank mapping from MFC#1 to the global group is

.. code:: python

   {0: 8, 1: 9, 2: 10, 3: 11, 4: 12, 5: 13, 6: 14, 7: 15}

and the rank mapping of MFC#2 is the identity mapping.

************************
Runtime Infrastructure
************************
