
add iter time tracking via cuda events, add data loading times, add columnar display to show both, show avg iter & data loading times at end of training #87

Merged · 6 commits · Feb 26, 2024

Conversation

lessw2020 (Contributor) commented on Feb 25, 2024

This PR adds basic perf timing and display for per-iter and final-average iteration times (in part based on Andrew's comment about having to open the trace to compare iter timing).

  1. The tracking list is housed in TrainState, but I do not save it as part of the state dict, as I view this as useful but not saveable info.

  2. Iter times are tracked after data loading is done each iter and after the optimizer step. The idea is to make this timing cover only the model training iter (not data loading or post-iter metrics calcs).

  3. A 'time' column is now displayed at each iter along with the usual loss and lr.

  4. At the end of training, assuming more than 3 iters were run, the average iter time is calculated by ignoring the first three iters (consider these warmup, especially as the CUDA caching allocator gets warmed up) and displayed.

  5. Based on @tianyu-l's feedback, I have added data loading times as well.
    I used the same timeit.default_timer() from timeit to be consistent
    (CPU side, so no syncs needed :)

  6. After fiddling with printf width formatting options, I added a beautiful aligned columnar display for the per-iter updates (a rough sketch of the overall flow follows below the screenshots):
Now:
Screenshot 2024-02-26 at 9 39 25 AM

Before:
Screenshot 2024-02-26 at 8 39 46 AM
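
A minimal sketch of how the pieces described above could fit together, assuming hypothetical names (the TrainState fields, logger, loop structure, and column widths here are illustrative, not the exact code in this PR):

```python
from dataclasses import dataclass, field
from timeit import default_timer as timer


@dataclass
class TrainState:
    step: int = 0
    # timing lists live on TrainState but are intentionally not part of the state dict
    iter_times: list = field(default_factory=list)
    data_load_times: list = field(default_factory=list)


def train_loop(train_state, data_iter, model, optimizer, num_steps, logger):
    for step in range(num_steps):
        # time data loading on the CPU side (no device sync needed here)
        data_load_start = timer()
        batch = next(data_iter)
        train_state.data_load_times.append(timer() - data_load_start)

        # time only the training work for this iter (not data loading or metrics)
        # NOTE: as the review below points out, a device sync or CUDA events are
        # needed for this to reflect actual GPU time
        iter_start = timer()
        loss = model(batch).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        train_state.iter_times.append(timer() - iter_start)

        # fixed-width fields give the aligned columnar per-iter display
        logger.info(
            f"step: {step:>5}  loss: {loss.item():>8.4f}  "
            f"iter time: {train_state.iter_times[-1]:>7.4f}s  "
            f"data load: {train_state.data_load_times[-1]:>7.4f}s"
        )

    # report averages at the end, skipping the first three iters as warmup
    if len(train_state.iter_times) > 3:
        avg_iter = sum(train_state.iter_times[3:]) / (len(train_state.iter_times) - 3)
        avg_data = sum(train_state.data_load_times[3:]) / (len(train_state.data_load_times) - 3)
        logger.info(f"avg iter time: {avg_iter:.4f}s | avg data load time: {avg_data:.4f}s")
```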

facebook-github-bot added the CLA Signed label on Feb 25, 2024
tianyu-l (Contributor) left a comment

This would be useful! Some comments:

  1. I feel it could be useful to have another version where data loading time is included, so that not only do we know the marginal time spent on each iter, but by the difference we also know how much time is used/wasted on data loading. It could be especially valuable to our future exploration of more scalable data loading solutions.
  2. According to https://docs.python.org/3/library/timeit.html, time.perf_counter() and timeit.default_timer() are doing the same thing under the hood. Let's consolidate them into one; either is fine.
  3. Optionally we can log this to TensorBoard as well, although I assume it shouldn't fluctuate too much other than the first couple of iters.
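
For what it's worth, point 2 can be checked directly; since Python 3.3, timeit.default_timer is simply an alias for time.perf_counter:

```python
import time
import timeit

# timeit.default_timer and time.perf_counter are the same function object,
# so consolidating on either one reads the same monotonic clock
assert timeit.default_timer is time.perf_counter
```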

train.py Outdated
@@ -207,6 +211,11 @@ def main(job_config: JobConfig):
# updates the scale for next iteration
scaler.update()

# training iteration complete
iter_end_time = perf_counter()
yifuwang (Contributor) commented on Feb 26, 2024

Is it guaranteed that a device synchronization has already taken place at this point? I might be missing something, but I don't see anything that guarantees a synchronization between .backward() and iter_end_time = perf_counter(). Any chance we should be measuring with cuda events here instead?
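
For reference, a minimal sketch of the CUDA-event timing being suggested here (names are illustrative):

```python
import torch

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)

start_event.record()
# ... forward, backward, and optimizer step enqueued on the current stream ...
end_event.record()

# elapsed_time() requires both events to have completed on the device
torch.cuda.synchronize()
iter_time_s = start_event.elapsed_time(end_event) / 1000.0  # elapsed_time() returns milliseconds
```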

lessw2020 (Contributor, author) replied:

loss.item() is a CPU-GPU sync point, so I was thinking the loss calc above would sync. But that may be incorrect, so agreed: I'll update to use CUDA events to guarantee the timing. Thanks for flagging this!

Another contributor commented:

I think adding a torch.cuda.synchronize() before iter_time_end = perf_counter() should be good. I agree loss.item() is a sync point, but I think it is only called a few lines after :/

https://github.com/pytorch/torchtrain/blob/eafcee6b5d7156ec2db833c693987927b3698075/train.py#L214-L223
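
A minimal sketch of that suggestion, using the variable name from the diff above:

```python
import torch
from time import perf_counter

# block the host until all queued GPU work for this iter has finished,
# so the host-side timestamp actually reflects the end of the iteration
torch.cuda.synchronize()
iter_end_time = perf_counter()
```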

lessw2020 (Contributor, author) replied:

Thanks @yifuwang, @awgu and @tianyu-l for all the feedback here!
To update: I went ahead and moved to CUDA events in order to ideally get max precision, and have added the cuda synchronize.
I also tested moving the loss.item() higher to use that as a sync point, but it seemed cleaner to just stick with pure CUDA events and the sync.
Anyway, all tested and looks good.
Of interest, the net times were currently the same with and without the sync for eager, but that would go out the window as soon as we start using torch.compile, so we definitely do want to keep the sync here.

lessw2020 changed the title from "add iter time tracking and display, avg iter time at end of training" to "add iter time tracking and display via cuda events, add data loading times, show avg iter and data loading times at end of training" on Feb 26, 2024
lessw2020 changed the title from "add iter time tracking and display via cuda events, add data loading times, show avg iter and data loading times at end of training" to "add iter time tracking via cuda events, add data loading times, add columnar display to show both, show avg iter & data loading times at end of training" on Feb 26, 2024
tianyu-l (Contributor) left a comment

LGTM, thanks for adding these time metrics!

lessw2020 merged commit ae85e97 into pytorch:main on Feb 26, 2024
4 checks passed
lessw2020 deleted the add_perf_iter_timing branch on February 26, 2024 18:16
lessw2020 added a commit that referenced this pull request Apr 18, 2024
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024