Realising a factor 20-30 speedup on GPU #1803

Closed
APJansen opened this issue Aug 30, 2023 · 3 comments · Fixed by #1818
Labels: enhancement, escience, help wanted

Comments

@APJansen (Collaborator) commented Aug 30, 2023

Last week @goord and I started looking at TensorBoard profiles of the code running on a GPU. We found and resolved several performance bottlenecks, resulting in a total speedup of a factor of 20-30 compared to the current state of the trvl-mask-layers branch of #1788.
As a result, we are able to do a full 17k epoch run of the NNPDF40_nnlo_as_01180_100 runcard with 100 replicas within half an hour.

We have this running, so the time quoted is the actual start-to-end wall time (to be precise, it took 19 minutes, 9 of which are spent loading the data and building the model, etc.).
Most of it still requires a lot of cleanup to be integrated properly, though. Currently it crashes just after the fit, simply because the appropriate changes haven't been made there yet.

Factors contributing to speedup

In no particular order, the factors contributing to the speedup are:

  1. Rewriting the FK table contractions. Several things here (a rough sketch of both changes follows this list):
  • We have restructured the PDF so that the replicas are in the first rather than the last axis. I'm no expert, but as I understand it, the values in the last axis are contiguous in memory, and since it is the x and flavour axes that are contracted, it's beneficial to have those last.
  • We've rewritten the masking from using tf.boolean_mask, where the output shape depends on the values (i.e. the number of Trues), to a precomputed matrix multiplication. So the FK table for a DY experiment is now of shape (n, x, f, x, f).
  2. Having a single PDF with all the replicas inside, rather than 100 separate PDFs, as in Multi Replica PDF #1782. We saw that the 30% speedup observed there (which becomes much more significant given all the other speedups) is mainly due to kernel loading. That is, all PDFs were computed separately; now the computation is not only done in parallel, but you also avoid the overhead of launching a new computation on the GPU every time, which was quite significant.
  3. After 1 and 2 were implemented, there was a huge gap between every step, almost as long as a step itself, in which the GPU was idle. I eventually found out it was due to some TensorFlow overhead that is incurred every epoch, and currently every epoch means every step. This can be avoided entirely by the stupid-looking trick of copying the input grids as many times as the number of steps you want to run, and then training for 1 "epoch" with a batch size of 1 (sketched after this list). This almost completely removes the gaps, while doing the same number of steps. (If the memory this takes is an issue, we can limit it to, say, 1k copies, at the cost of worrying about what to do when the desired number of epochs is not divisible by this.)
  4. The validation losses are computed from scratch every step, starting from x. This repeats the computation of the training model up to and including the computation of the observables. If this were rewritten to start from the observables, the cost would drop to essentially 0 (from about 30% now). (This hasn't been started; in the timing above I cheated by skipping this computation.) I think this would also improve readability: one model with 3 losses, rather than 3 models.
  5. Even more is possible; for instance, even now ~20% of every step is spent restructuring the FK tables, doing the same work every step. I wasn't able to fix this yet. Perhaps some more tinkering with index orders and contraction orders along the lines of 1 can fix this.
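
To make point 1 (and the multi-replica PDF tensor of point 2) concrete, here is a minimal sketch of what the reworked contraction could look like. All names, shapes and sizes below are illustrative, not the actual n3fit code:

```python
import tensorflow as tf

# Illustrative sizes, not the real ones.
n_rep, n_x, n_f, n_dat = 100, 50, 14, 30

# PDF with the replica axis first, so the contracted x/flavour axes are last
# and contiguous in memory: shape (replicas, x, flavours).
pdf = tf.random.uniform((n_rep, n_x, n_f))

# DIS-like FK table of shape (data, x, flavours), contracted with one PDF.
fk_dis = tf.random.uniform((n_dat, n_x, n_f))
obs_dis = tf.einsum("nxf,rxf->rn", fk_dis, pdf)            # (replicas, data)

# DY-like FK table of shape (n, x, f, x, f), contracted with two PDFs.
fk_dy = tf.random.uniform((n_dat, n_x, n_f, n_x, n_f))
obs_dy = tf.einsum("nxfyg,rxf,ryg->rn", fk_dy, pdf, pdf)   # (replicas, data)

# Masking as a precomputed matrix multiplication instead of tf.boolean_mask:
# the output shape is now fixed ahead of time instead of depending on the
# values of the mask.
mask = tf.constant([True, False, True] * (n_dat // 3))      # illustrative mask
mask_matrix = tf.cast(tf.boolean_mask(tf.eye(n_dat), mask), obs_dis.dtype)
masked_obs = tf.einsum("mn,rn->rm", mask_matrix, obs_dis)   # (replicas, kept data)
# In practice the mask matrix can be folded into the FK table once, before
# training, so no masking op at all appears in the training step.
```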
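
And a rough sketch of the epoch-to-batch trick from point 3, on a toy Keras model; the model and names are made up, in the real fit the input would be the x grids fed to the PDF model:

```python
import numpy as np
import tensorflow as tf

n_steps = 1_000   # could be the full 17k, or capped (e.g. 1k) to limit memory

# Stand-in for the single input the model normally sees once per epoch.
x_grid = np.random.rand(1, 50, 1).astype("float32")
y_dummy = np.zeros((1, 1), dtype="float32")

model = tf.keras.Sequential([tf.keras.layers.Flatten(), tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

# Instead of model.fit(..., epochs=n_steps, batch_size=1), which pays the
# per-epoch Keras overhead n_steps times, tile the input n_steps times and
# train for a single "epoch": the same number of gradient steps, but the
# per-epoch overhead is paid only once.
x_tiled = np.repeat(x_grid, n_steps, axis=0)
y_tiled = np.repeat(y_dummy, n_steps, axis=0)
model.fit(x_tiled, y_tiled, epochs=1, batch_size=1, shuffle=False)
```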

Steps remaining

Unfortunately I'll have very little time in the next month to work on this (holidays and other projects). Below I'll list the steps necessary to integrate it, and where help could be useful.

  1. Get the trvl-mask-layers branch to pass the tests.
  2. The changes in point 1 above are in a branch gpu-go-brrr (;P), off of trvl-mask-layers. This can be merged into it.
  3. Once merged, they should be tested and reviewed.
  • What would be super useful here is to have a list of runcards to test, together with the actual results from master. Since this branch is already much faster, it's a lot easier if we only need to run them here and already have something to compare against.
  4. The rewriting from epochs to batches is independent of all the other changes. If anyone wants to pick that up, that'd be great. I started it in Go from one step being an epoch to one step being a batch #1802. UPDATE: this is done, it just needs testing.
  5. The rewriting of the 3 "models" into one with 3 losses (by which, btw, I don't necessarily mean putting that part in an actual keras.Loss or similar, as I'm not sure whether that's efficient; just that we don't repeat the computations), see the sketch after this list. I think this is also relatively independent of the rest. If anyone wants to do this, that'd be great; it doesn't have the highest payout/effort ratio of all these, so I can also do it myself after the last point. UPDATE: WIP in Avoiding duplicated computations by having a single observable model #1855.
  6. The multi-replica PDF: this is the most work, and the most specialized, so I think it's best if I focus on this. UPDATE: turned into its own issue, Multi Replica PDF #1880; see there for updated progress.
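
For the single-model / three-losses idea in point 5, here is a hedged sketch of what I mean: compute the observables once per step and derive the training, validation and experimental χ² from that same tensor, instead of recomputing everything from x three times. All names (observables, inv_covmat, tr_mask, vl_mask) are placeholders, not existing n3fit objects:

```python
import tensorflow as tf

def chi2_losses(observables, data, inv_covmat, tr_mask, vl_mask):
    """observables: (replicas, ndata) tensor, computed once per step.
    data: (ndata,) central values; inv_covmat: (ndata, ndata);
    tr_mask / vl_mask: boolean (ndata,) training / validation masks."""

    def chi2(mask):
        # chi2 per replica, restricted to the masked data points
        diff = tf.boolean_mask(observables - data, mask, axis=1)
        cov = tf.boolean_mask(tf.boolean_mask(inv_covmat, mask, axis=0), mask, axis=1)
        return tf.einsum("ri,ij,rj->r", diff, cov, diff)

    return {
        "training": chi2(tr_mask),
        "validation": chi2(vl_mask),
        "experimental": chi2(tf.ones_like(tr_mask)),  # all data points
    }
```

The point is only that all three losses reuse the same observables tensor; how exactly this is wired into the Keras model (as outputs, metrics, or a custom training step) is open.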

TensorBoard profile

Here is the TensorBoard profile with all these improvements; it may be nice to see:
[screenshot: TensorBoard profile of a training step after the improvements]

@APJansen added the enhancement and help wanted labels on Aug 30, 2023
@scarlehoff (Member)

This looks fantastic!

Just one question: what effect does it have on the CPU?

@APJansen (Collaborator, Author) commented Sep 5, 2023

Good question. I haven't tested as much, but it seems point 1 by itself actually slows it down by a factor of 2 for a single replica. Points 3 and 4, and probably 2 as well, can only speed it up. I saw some comments, I think by you, on einsum not being as efficient on the CPU; not sure if that's still the case, but that may be it.

So for the clean implementation it may be necessary to add some branching, perhaps reverting to the old version in key places when the number of replicas is 1 (assuming it only makes sense to run with 1 replica on the CPU).
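
For illustration, the kind of branching I mean could look roughly like this (the function and shapes are made up, not the actual code):

```python
import tensorflow as tf

def contract_fktable(fktable, pdf):
    """fktable: (ndata, x, flavours); pdf: (replicas, x, flavours)."""
    if pdf.shape[0] == 1:
        # single-replica / CPU path: drop the replica axis, as in the old code
        return tf.einsum("nxf,xf->n", fktable, pdf[0])[tf.newaxis]
    # multi-replica path, tuned for the GPU
    return tf.einsum("nxf,rxf->rn", fktable, pdf)
```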

@scarlehoff (Member) commented Sep 8, 2023

> assuming it only makes sense to run with 1 replica on the CPU

I think this is a good assumption. I don't know what computers people have access to, but in my experience it is more convenient to run many small jobs rather than one big one on clusters (mainly due to queues, and not only thinking about nnpdf).

@Radonirinaunimi linked a pull request on Oct 16, 2023 that will close this issue
@APJansen reopened this on Dec 4, 2023