This repository has been archived by the owner on Aug 7, 2024. It is now read-only.

[FSDP2] precompute scale after optimizer.step for dynamic scaling #266

Closed
wants to merge 32 commits into from

Conversation

weifengpy
Contributor

@weifengpy weifengpy commented May 21, 2024

Goal: improve float8 all-gather perf in FSDP2 by precomputing scales for all float8 params with a single all-reduce

updated README for API usage: call precompute_float8_dynamic_scale_for_fsdp inside the training loop after the optimizer step

from float8_experimental.fsdp_utils import precompute_float8_dynamic_scale_for_fsdp
# inside the training loop
model(input).sum().backward()
optim.step()
precompute_float8_dynamic_scale_for_fsdp(model)

unit test: pytest -s test/test_fsdp2/test_fsdp2_eager.py -k test_transformer_parity_dynamic

FSDP pre-forward: shortened from 3ms to 1.8ms by doing 1 all-reduce instead of N small all-reduces
[profiler traces: FSDP pre-forward before and after]

Pre-computing amax: shortened from 5ms to 1.7ms, by switching from torch._foreach_abs + torch.max(a) to torch._foreach_norm(weights, ord=math.inf)
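For reference, a small sketch (not the PR's code) of why the fused `_foreach_norm` call produces the same amax values as the per-tensor abs/max it replaces:

```python
import math

import torch

# four dummy "weights"; in the PR these are the sharded float8 linear weights
weights = [torch.randn(16, 16) for _ in range(4)]

# previous approach: one foreach_abs pass plus one max reduction per weight
abs_weights = torch._foreach_abs(weights)
amaxes_slow = [torch.max(a) for a in abs_weights]

# new approach: a single fused foreach op, since the inf-norm equals max(abs(w))
amaxes_fast = torch._foreach_norm(weights, ord=math.inf)

for slow, fast in zip(amaxes_slow, amaxes_fast):
    torch.testing.assert_close(slow, fast)
```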

[profiler traces: amax pre-compute before and after]

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 21, 2024
@weifengpy weifengpy marked this pull request as draft May 23, 2024 23:16
requires_grad=tensor.requires_grad,
)

def __init__(self, tensor: torch.Tensor, mm_config: ScaledMMConfig):
self._tensor = tensor
self._mm_config = mm_config
# Optional cache for pre-computed fp8 data/scale
self._fp8_data: Optional[torch.Tensor] = None
Contributor

One major requirement for tensor subclasses that I don't think is respected here: __tensor_flatten__ and __tensor_unflatten__ must properly convey every inner tensor on the subclass.

So when we call __tensor_flatten__ on this subclass, if either of _fp8_data/scale/amax are set to valid tensors, they need to be returned there (and similarly __tensor_unflatten__ needs to handle them as extra args)
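For context, a hedged sketch of that contract on a toy wrapper subclass (illustrative names, dispatch handling omitted; not the PR's class):

```python
from typing import Optional

import torch

class CachedWeight(torch.Tensor):
    """Toy wrapper subclass: one required inner tensor plus an optional cache."""

    @staticmethod
    def __new__(cls, tensor: torch.Tensor, cache: Optional[torch.Tensor] = None):
        return torch.Tensor._make_wrapper_subclass(
            cls,
            tensor.shape,
            dtype=tensor.dtype,
            device=tensor.device,
            requires_grad=tensor.requires_grad,
        )

    def __init__(self, tensor: torch.Tensor, cache: Optional[torch.Tensor] = None):
        self._tensor = tensor
        self._cache = cache  # e.g. a precomputed fp8 data/scale/amax tensor

    def __tensor_flatten__(self):
        # every inner tensor that is currently set must be named here
        names = ["_tensor"]
        if self._cache is not None:
            names.append("_cache")
        return names, None

    @staticmethod
    def __tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride):
        # and accepted back here, including the optional one when present
        return CachedWeight(inner_tensors["_tensor"], inner_tensors.get("_cache"))

w = CachedWeight(torch.randn(4, 4), cache=torch.tensor(1.0))
names, _ = w.__tensor_flatten__()
assert names == ["_tensor", "_cache"]
```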

Contributor Author

thanks for pointing this out! This saves me a lot of debugging time. I can give it a try by including _fp8_data/scale/amax in __tensor_flatten__ and __tensor_unflatten__

Contributor Author

torch.compile works after patching pytorch/pytorch#127431
will compare traces in 2nd PR

weifengpy and others added 8 commits May 30, 2024 00:30
@weifengpy weifengpy changed the title [DO NOT LAND] precast after optimizer.step and dump profiler traces [FSDP2] pre-compute amax after optimizer.step for dynamic scaling Jun 6, 2024
def compute_amaxes(weights: List[DTensor]):
max_weights = torch._foreach_norm(weights, ord=math.inf)
amax_tensor = torch.vstack(max_weights)
amax_tensor = torch.clamp(amax_tensor, EPS) # R
Contributor Author
@weifengpy weifengpy Jun 6, 2024

torch.clamp calls all_reduce here (Partial -> Replicate). I avoided clamping again in amax_to_scale by passing clamp_amax=False

Contributor

So you are relying on torch.clamp to run the all-reduce implicitly from changing sharding from partial to replicate?

If this fragments the code, could we just all-reduce the amax tensor and then leave the clamp to amax_to_scale? I agree the current way is faster since we are doing one clamp for all amaxes, but in case float8 folks are not happy with this fragmentation, this seems like another way.
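To make the two options concrete, a hedged sketch (assuming a 1-D mesh and that `amax_tensor` carries a Partial placement from the sharded inf-norms; `EPS` stands in for the library's constant):

```python
import torch
from torch.distributed._tensor import DTensor, Replicate

EPS = 1e-12  # assumed epsilon; float8_experimental defines its own EPS

def clamp_with_implicit_all_reduce(amax_tensor: DTensor) -> DTensor:
    # variant in the PR: clamping a Partial-placed DTensor applies the epsilon
    # floor and implicitly all-reduces it to Replicate in one step
    return torch.clamp(amax_tensor, EPS)

def all_reduce_explicitly(amax_tensor: DTensor) -> DTensor:
    # alternative suggested above: redistribute (all-reduce) explicitly and
    # leave the clamp to amax_to_scale, at the cost of a second clamp later
    return amax_tensor.redistribute(placements=[Replicate()])
```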

Contributor Author

thanks for the suggestions. I can collect feedback from float8 folks if they have a preference

Contributor

can we just comment with what is going on? I think it's fine as long as the code is easy to understand and there is no magic.

Contributor

agreed

@@ -190,9 +191,20 @@ def __repr__(self):
return f"WeightWithDynamicFloat8CastTensor(tensor={self._tensor}, mm_config={self._mm_config})"

def fsdp_pre_all_gather(self, mesh):
Contributor Author

if _pre_computed_amax is set, we skip tensor_to_amax and go directly to amax_to_scale
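Roughly, the branch being described, as a hedged sketch (the helper names follow float8_experimental's utils, but the exact signatures here are assumptions):

```python
import torch
from float8_experimental.float8_utils import amax_to_scale, tensor_to_amax

def scale_for_all_gather(weight) -> torch.Tensor:
    # sketch of the fsdp_pre_all_gather decision, not the actual diff
    if weight._pre_computed_amax is not None:
        # precomputed path: amax was already clamped after optimizer.step,
        # so only the amax -> scale conversion remains
        return amax_to_scale(
            weight._pre_computed_amax,
            torch.float8_e4m3fn,
            weight.dtype,
            clamp_amax=False,
        )
    # eager path: compute amax from the local shard as before
    amax = tensor_to_amax(weight._tensor)
    return amax_to_scale(amax, torch.float8_e4m3fn, weight.dtype)
```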

@weifengpy weifengpy marked this pull request as ready for review June 6, 2024 08:21
@weifengpy weifengpy requested review from vkuzo, awgu and drisspg June 6, 2024 08:21
Contributor
@awgu awgu left a comment

This seems reasonable to me! I want to check with float8 folks on the amax_to_scale change.

@vkuzo
Contributor

vkuzo commented Jun 6, 2024

nice! Can we include the intended user API in the PR summary?

@@ -151,6 +151,7 @@ def __new__(cls, tensor: torch.Tensor, mm_config: ScaledMMConfig):
def __init__(self, tensor: torch.Tensor, mm_config: ScaledMMConfig):
self._tensor = tensor
self._mm_config = mm_config
self._pre_computed_amax = None
Contributor

does this need to be added to __tensor_flatten__?

can we add some comments on intended usage of this?

Contributor
@drisspg drisspg Jun 7, 2024

+1 on adding to flatten/unflatten and comments/intended usage

Contributor Author

done

@@ -322,3 +328,34 @@ def inner_func():
for child in fp8_layers:
# Set a flag to signal amaxes/scales are ready
child.amax_and_scale_synced = True


def precompute_float8_amax(module: nn.Module) -> None:
Contributor

can we put this in distributed_utils.py?

I think the function name should include that this is intended for FSDP2 with float8 all-gather

Contributor Author

moving to fsdp_utils.py according to PR #310

Contributor Author

indicating fsdp by renaming to precompute_float8_amax_for_fsdp

weights: List[DTensor] = [float8_linear.weight for float8_linear in float8_linears]

def compute_amaxes(weights: List[DTensor]):
max_weights = torch._foreach_norm(weights, ord=math.inf)
Contributor

maybe add a comment that this is equivalent to max(abs(w))?

Contributor Author

done

"""
scale = torch.empty_like(amax, dtype=torch.float32)
if float8_dtype in FP8_TYPES:
res = torch.finfo(float8_dtype).max / torch.clamp(amax, min=EPS)
if clamp_amax:
Contributor

nit: I think if you have this on a separate line
amax = clamp(amax, eps) if clamp_amax else amax

makes the logic a lil easier to follow
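For concreteness, a hedged sketch of `amax_to_scale` with that nit applied (`EPS` and the fp8 dtype set stand in for the library's own constants; not the exact merged code):

```python
import torch

EPS = 1e-12                                           # assumed epsilon
FP8_TYPES = {torch.float8_e4m3fn, torch.float8_e5m2}  # assumed fp8 dtype set

def amax_to_scale(amax, float8_dtype, orig_dtype, clamp_amax=True):
    # the suggestion above: the conditional clamp sits on its own line
    amax = torch.clamp(amax, min=EPS) if clamp_amax else amax
    if float8_dtype not in FP8_TYPES:
        raise ValueError(f"Unsupported float8_dtype: {float8_dtype}")
    res = torch.finfo(float8_dtype).max / amax
    if orig_dtype is torch.float16:
        # keep the scale representable when the original dtype is fp16
        res = torch.clamp(res, max=torch.finfo(torch.float16).max)
    return res.to(torch.float32)
```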

@facebook-github-bot
Contributor

@weifengpy has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -130,20 +146,38 @@ def unwrap(t):
)

def __tensor_flatten__(self):
return ["_tensor"], self._mm_config
if self._precomputed_amax:
return ["_tensor", "_precomputed_amax"], self._mm_config
Contributor

does having Optional[torch.Tensor] as a subclass field work with torch.compile? Or do we not care about torch.compile in this code path?

Contributor Author
@weifengpy weifengpy Jul 10, 2024

torch.compile assumes every tensor returned from __tensor_flatten__ is not None, so I added an if-else to make torch.compile work. I verified it in pytorch/pytorch#129457

):
"""Converts the amax value of a tensor to the fp8 scale.
Args:
amax: The amax value of the tensor.
float8_dtype: The float8 dtype.
orig_dtype: The original dtype of the tensor.
clamp_amax: default is True. False for FSDP fp8 all-gather since FSDP applied `torch.clamp` during pre-compute after optimizer.step
Contributor

this is a bit confusing. How about precomputing the scale instead so we don't have to have gotchas like this?

Contributor Author

good suggestion! I changed the API to precompute the scale, and it shows another 9% speed-up in the unit test vs precomputing amax

fsdp_pre_all_gather is also greatly simplified because of using self._precomputed_scale
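For illustration, a hedged sketch of how the hook simplifies once the scale itself is cached (free-function form with a made-up `compute_dynamic_scale` fallback; the real hook lives on the tensor subclass and has a richer return convention):

```python
import torch

def compute_dynamic_scale(tensor: torch.Tensor) -> torch.Tensor:
    # stand-in for the eager per-parameter path: abs-max -> scale
    amax = torch.clamp(tensor.abs().max(), min=1e-12)
    return torch.finfo(torch.float8_e4m3fn).max / amax

def fsdp_pre_all_gather_sketch(weight: torch.Tensor, precomputed_scale=None):
    # with a precomputed (already all-reduced and clamped) scale there is
    # nothing left to compute or communicate inside the hook
    scale = precomputed_scale if precomputed_scale is not None else compute_dynamic_scale(weight)
    fp8_data = (weight * scale).to(torch.float8_e4m3fn)  # simplified cast
    return (fp8_data,), (scale,)
```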

@weifengpy weifengpy marked this pull request as draft July 10, 2024 21:15
weifengpy and others added 5 commits July 10, 2024 14:16
@weifengpy weifengpy changed the title [FSDP2] precompute amax after optimizer.step for dynamic scaling [FSDP2] precompute scale after optimizer.step for dynamic scaling Jul 10, 2024
@weifengpy weifengpy marked this pull request as ready for review July 10, 2024 23:43
@facebook-github-bot
Contributor

@weifengpy has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@weifengpy weifengpy requested a review from vkuzo July 10, 2024 23:43
README.md Outdated
y.sum().backward()
optimizer.step()

# specific to fsdp2 + float8 with dynamic scaling
Contributor

should we say that this is specific to FSDP2 with float8 all-gather turned on? Also, maybe we can show how to turn that on, since I don't think it's documented in the README yet? Can be a follow-up PR.

Contributor Author

> should we say that this is specific to FSDP2 with float8 all-gather turned on?

will change it in this PR

> maybe we can show how to turn that on, since I don't think it's documented in the README yet

good catch. will polish README again after landing changes in torchtitan to turn on/off fp8 all-gather

for m in module.modules()
):
raise NotImplementedError("Only supports delayed scaling")
float8_linears: List[Float8Linear] = [
Contributor

is this expensive for real models? if yes, maybe we can offer an option to precompute this?

Contributor

My intuition is that this should be pretty fast as the number of nn.Modules in the model is usually at most in the thousands and this is pure Python overhead. @weifengpy you can check the traces you have if you see any noticeable gaps from this.

Contributor Author
@weifengpy weifengpy Jul 11, 2024

just checked the profiler traces. It's roughly 0.15ms of CPU overhead (5% of precompute_float8_dynamic_scale_for_fsdp and a tiny portion of one training loop). No CUDA kernels are launched.

thus I am keeping it as is for now for simplicity
[profiler trace screenshot]

]
weights: List[DTensor] = [float8_linear.weight for float8_linear in float8_linears]

def compute_scales(weights: List[DTensor]):
Contributor

optional nit: maybe move outside to prevent nested functions?

Contributor

curious what is the downside of nested functions

@weifengpy By the way, this was originally a nested function just so that we could try to torch.compile it effectively in the scales = compute_scales(weights) line. Does it still need to be a separate function for torch.compile reasons? If so, we should probably add a comment before the def compute_scales mentioning that it is separate for torch.compile; otherwise, we can consider inlining the function.

Contributor Author
@weifengpy weifengpy Jul 11, 2024

I will remove the nested function to make the code easier to read. I profiled the unit test and precompute_float8_scale_for_fsdp takes 1.9ms, a tiny portion of the overall training loop, so there is no obvious reason to speed it up with torch.compile yet. I can bring back the nested function if we need torch.compile again.

[profiler trace screenshot]
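Putting the thread together, a hedged end-to-end sketch of what precompute_float8_dynamic_scale_for_fsdp does (approximate: the import path, the Float8Linear filter, and the e4m3 dtype here are assumptions based on the discussion, not the merged file):

```python
import math
from typing import List

import torch
import torch.nn as nn
from torch.distributed._tensor import DTensor
from float8_experimental.float8_linear import Float8Linear  # assumed import path

EPS = 1e-12  # assumed; the library defines its own EPS

@torch.no_grad()
def precompute_float8_dynamic_scale_for_fsdp(module: nn.Module) -> None:
    # collect the FSDP2-sharded float8 weights (DTensors wrapping the subclass)
    weights: List[DTensor] = [
        m.weight
        for m in module.modules()
        if isinstance(m, Float8Linear) and isinstance(m.weight, DTensor)
    ]
    if not weights:
        return
    # one fused abs-max per weight: the inf-norm equals max(abs(w))
    amax_tensor = torch.vstack(torch._foreach_norm(weights, ord=math.inf))
    # clamping the Partial-placed DTensor also performs the single all-reduce
    amax_tensor = torch.clamp(amax_tensor, EPS)
    scale_tensor = torch.finfo(torch.float8_e4m3fn).max / amax_tensor
    # stash each local scale on the wrapped weight for fsdp_pre_all_gather to reuse
    for weight, scale in zip(weights, torch.split(scale_tensor, 1)):
        weight._local_tensor._precomputed_scale = scale._local_tensor
```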

Contributor
@vkuzo vkuzo left a comment

looks great! had some final comments. thanks for doing this!

from float8_experimental.float8_utils import EPS


def precompute_float8_scale_for_fsdp(module: nn.Module) -> None:
Contributor

Should we add a @torch.no_grad() decorator on this?

Contributor Author

good catch. adding @torch.no_grad()

Comment on lines 15 to 16
Calculate scale for all float8 parameters after optimizer step
It performs a single all-reduce instead of many all-reduces for each parameter
Contributor

suggestion:

Suggested change:
- Calculate scale for all float8 parameters after optimizer step
- It performs a single all-reduce instead of many all-reduces for each parameter
+ Calculate scale dynamically for all float8 parameters.
+ This should be run after the optimizer step. It performs a single all-reduce to compute the
+ amaxes for all float8 weights.

"""
Calculate scale for all float8 parameters after optimizer step
It performs a single all-reduce instead of many all-reduces for each parameter
Exmaple usage:
Contributor

nit (typo):

Suggested change:
- Exmaple usage:
+ Example usage:

@vkuzo I assume that there are no docs builds for float8_experimental, so this example is for users who will read the code itself?

Otherwise, we might need to check the formatting -- I recall the format for examples being a bit different.

float8_linear.weight._local_tensor._precomputed_scale = scale._local_tensor
else:
warnings.warn(
"Calling precompute_float8_weights without any weights using FSDP fp8 all-gather!"
Contributor

function name in the warning needs to be updated

I am okay with not including this warning by the way. This was also to help debugging to make sure we actually found weights.

Contributor Author

got you. I am removing the warnings for simplicity

@weifengpy weifengpy marked this pull request as draft July 11, 2024 20:55
@weifengpy weifengpy marked this pull request as ready for review July 11, 2024 21:53
@facebook-github-bot
Contributor

@weifengpy has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@weifengpy merged this pull request in 6cba2ae.

weifengpy added a commit to pytorch/torchtitan that referenced this pull request Jul 16, 2024
we have landed fp8 all-gather optimizations in float8_experimental
pytorch-labs/float8_experimental#266

this PR proposes torchtitan changes. also includes fp8 in CI
```
from float8_experimental.fsdp_utils import precompute_float8_dynamic_scale_for_fsdp
# inside the training loop
model(input).sum().backward()
optim.step()
precompute_float8_dynamic_scale_for_fsdp(model)
```

FSDP2 fp8 all-gather is added to CI
```
CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear
CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear --training.enable_fsdp_fp8_all_gather
CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear --training.enable_fsdp_fp8_all_gather --training.precompute_float8_dynamic_scale_for_fsdp
```

TP fp8 all-gather is locally tested. will add it to CI after
uploading a new tokenizer with vocab size 2560 (divisible by 16)
```
CONFIG_FILE="./train_configs/llama3_8b.toml" NGPU=4 ./run_llama_train.sh --training.enable_fp8_linear --training.data_parallel_degree 1 --training.tensor_parallel_degree 4
CONFIG_FILE="./train_configs/llama3_8b.toml" NGPU=4 ./run_llama_train.sh --training.enable_fp8_linear --training.data_parallel_degree 2 --training.tensor_parallel_degree 2
```

precompute scales after optimizer.step
<img width="319" alt="Screenshot 2024-07-12 at 5 11 14 PM"
src="https://github.com/user-attachments/assets/1c55bd89-9183-42ca-9445-23f3b95e0817">

FSDP2 pre-all-gather does not have any small all-reduces
<img width="794" alt="Screenshot 2024-07-12 at 5 13 04 PM"
src="https://github.com/user-attachments/assets/1a00dc70-a8ca-4ce1-a93c-316f22efdb08">

TODO
* upload tokenizer with vocab size 2560 to enable CI on TP fp8
all-gather
* torch.compile complains about fp8
* add delayed scaling and brainstorm about best config option to express
fp8
* compare perf between delayed scaling and dynamic scaling
https://github.com/pytorch-labs/float8_experimental/pull/312/files