[Fix] fix the 2d ring attn when using multiple machine #6071
base: main
Conversation
```python
inter_ring_group = group
world_size = dist.get_world_size()
rank = dist.get_rank()
groups = int(world_size / sp_size)
```
If you do `world_size / sp_size`, it will loop over all tp * pp * dp ranks instead of just tp ranks?
I don't think so: the tp axis is first and the sp axis comes next, as in the HybridPlugin class. `groups` means the number of sp groups. If sp_size = 8, there are 8 GPUs in one sp group, and if you use two machines for training (world_size = 16), then groups = 2. So it loops over the pp * dp ranks.
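A minimal sketch of the arithmetic above (the numbers are the ones from the comment; names are illustrative, not from the PR's code):

```python
# With sp_size GPUs per sequence-parallel group, world_size // sp_size
# gives the number of sp groups, i.e. the remaining pp * dp ranks.
world_size = 16  # two machines, 8 GPUs each
sp_size = 8      # 8 GPUs in one sp group

groups = world_size // sp_size
print(groups)  # 2 sp groups, one per machine
```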
```python
groups = int(world_size / sp_size)

if tp_size > 1:
    for group_id in range(groups):
```
Maybe it should look like this?
```python
world_size = 16
sp_size = 4
outer_ring_size = 2
inner_ring_size = sp_size // outer_ring_size
tp_size = 2
total_inner_rings = world_size // (outer_ring_size * tp_size)  # loop through groups of size (inner_ring * tp)
total_inner_size = tp_size * inner_ring_size
for j in range(total_inner_rings):
    # inside each group, duplicate tp group inner_sp_size times
    for k in range(tp_size):
        print(f"inner ring ranks: {list(range(k + j * total_inner_size, (j + 1) * total_inner_size, tp_size))}")
    print("---------------------------------")

sp_tp_size = total_inner_size * outer_ring_size
n_groups = world_size // sp_tp_size  # dp * pp
print(f"n_groups: {n_groups}")
for i in range(n_groups):
    start = i * sp_tp_size
    end = (i + 1) * sp_tp_size
    for j in range(outer_ring_size):
        for k in range(tp_size):
            print(f"inter ring ranks: {list(range(start + k + j * tp_size, end, total_inner_size))}")
```
Yes, in the example I use 4 GPUs on each of two nodes; see https://arxiv.org/pdf/2406.18485.
However, I believe there is an issue with the current algorithm, as such a rank grouping cannot proceed to the next communication step. Additionally, the CI tests only cover 4 GPUs, so this PR may need to be closed for now.
If we use this config:
inner group: [0, 1], [2, 3], [4, 5], [6, 7]
inter group: [0, 4], [1, 5], [2, 6], [3, 7]
the communication can proceed and training works, so it seems the current algorithm requires the number of inner groups to equal the number of inter groups.
However, that is wrong: the paper does not impose such a requirement; the inner group size can be set to 4 and the inter group size to 2.
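The working config above can be sketched as follows (a hypothetical reconstruction of the grouping pattern, not the PR's actual code):

```python
# Inner groups are contiguous pairs of ranks; inter groups pair each
# rank r with r + world_size // 2 on the other node.
world_size = 8
inner_ring_size = 2

inner_groups = [list(range(s, s + inner_ring_size))
                for s in range(0, world_size, inner_ring_size)]
inter_groups = [[r, r + world_size // 2] for r in range(world_size // 2)]
print(inner_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(inter_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```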
🚨 Issue number
fixed #6017
📝 What does this PR do?
The double_ring_groups need to take the tp groups into account, since the tp axis is the first axis.
Also, the ranks in double_ring_groups need to be transformed into global ranks.
For example, when using the first four cards of two machines (eight cards in total) for ring attention, the ranks of the inner ring groups would be [0, 2], [1, 3], [4, 6], [5, 7], while the ranks of the inter ring groups would be [0, 4], [1, 5], [2, 6], [3, 7].
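The global ranks in this example can be reproduced with a short sketch (variable names are illustrative, chosen to match the 2 machines x 4 GPUs, tp_size = 2 setup described above):

```python
# 2 machines, 4 GPUs used per machine, tp axis first with tp_size = 2.
# Inner rings stay within a machine and stride over the tp axis;
# inter rings pair each rank with its counterpart on the other machine.
tp_size = 2
node_size = 4    # GPUs used per machine
num_nodes = 2
world_size = node_size * num_nodes

inner_groups = [[base + k, base + k + tp_size]
                for base in range(0, world_size, node_size)
                for k in range(tp_size)]
inter_groups = [[r, r + node_size] for r in range(node_size)]
print(inner_groups)  # [[0, 2], [1, 3], [4, 6], [5, 7]]
print(inter_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```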
Results: