
Go from one step being an epoch to one step being a batch #1802

Closed

Conversation

@APJansen (Collaborator) commented Aug 30, 2023

As part of the changes mentioned in issue #1803, this is a very simple change that results in a factor 2 speedup.

It simply copies the input x grids epochs times along the batch axis and calls one step a batch rather than an epoch. This avoids some TensorFlow overhead that, once the other improvements mentioned there are in place, takes up nearly 50% of the total training time.

The current state is that the fit runs, but it crashes downstream, so some changes still need to be made there (perhaps just undoing, after the fit, the changes made just before the fit?).

If anyone wants to take this up, please do.

To illustrate what this does, here is a TensorBoard profile without this change:
[TensorBoard profile screenshot: large gaps between training steps]
These gaps are almost completely removed by this PR.
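For concreteness, here is a minimal, self-contained sketch of the trick with toy shapes and names (x_grid, y_data, n_epochs and the tiny model are illustrative stand-ins, not the actual n3fit objects):

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Toy stand-ins: a single "sample" holding the whole x grid, one target, a trivial model.
n_epochs = 1000
x_grid = np.random.rand(1, 50, 1).astype("float32")  # leading batch dimension of 1
y_data = np.random.rand(1, 50, 1).astype("float32")

inp = keras.Input(shape=(50, 1))
model = keras.Model(inp, keras.layers.Dense(1)(inp))
model.compile(optimizer="adam", loss="mse")

# Before: one training iteration per Keras epoch, paying the per-epoch overhead every time.
# model.fit(x_grid, y_data, epochs=n_epochs, batch_size=1)

# This PR's trick: repeat the single point n_epochs times along the batch axis and run a
# single epoch of n_epochs batches, so each batch plays the role of one former epoch.
x_repeated = tf.repeat(x_grid, n_epochs, axis=0)
y_repeated = tf.repeat(y_data, n_epochs, axis=0)
model.fit(x_repeated, y_repeated, epochs=1, batch_size=1)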

@APJansen added the "help wanted" (Extra attention is needed) label on Aug 30, 2023
@APJansen (Collaborator, Author) commented Sep 4, 2023

Tested this on the basic runcard with 1 replica on the CPU. It's about 10% faster even there. It now runs fully; the results are identical to master when looking at the validation chi2, although the training chi2 is completely different. But since after hundreds of epochs the validation is still identical, it must be a bug in the computation of this training chi2 rather than an actual difference in the training.

@APJansen (Collaborator, Author) commented Sep 6, 2023

The issue was that for the training losses, Keras computes a running average over batches. I added a function in between that corrects the logs to give the current batch's losses.
There is still something wrong, though. The relative difference in training chi2 starts out at 10^-8, so that could be round-off error from taking and "un-taking" the average over batches. But after 1000 epochs it has grown to order 1, whereas these numerical errors shouldn't even accumulate, since each computation is independent. Probably Keras takes a weighted average biased towards more recent batches, or something similar. (The validation remains identical to the last digit.)
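For the record, the correction amounts to undoing Keras's cumulative within-epoch mean: if m_n is the running average reported after n batches, the loss of batch n alone is n*m_n - (n-1)*m_{n-1}. A sketch of that bookkeeping (not the actual function added in this PR):

class LossUnaverager:
    """Recover per-batch losses from the running within-epoch average Keras puts in `logs`."""

    def __init__(self):
        self._previous_means = {}

    def correct(self, batch_index, logs):
        n = batch_index + 1  # Keras batch indices start at 0
        corrected = {}
        for key, mean_n in logs.items():
            mean_prev = self._previous_means.get(key, 0.0)
            # l_n = n * m_n - (n - 1) * m_{n-1}
            corrected[key] = n * mean_n - (n - 1) * mean_prev
            self._previous_means[key] = mean_n
        return corrected

Note that this subtraction amplifies round-off roughly in proportion to n, so some growth of the reporting error with the number of batches is expected from the un-averaging alone.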

Apart from this I think there are 2 points remaining:

  1. Is the extra memory use ever an issue? I haven't had any problems, but it does seem very stupid to do it like this. Perhaps it can be rewritten in terms of a data generator that just returns the single datapoint epochs times (see the sketch after this list).
  2. Whether and how to change terminology. Perhaps the cleanest would be to use "steps" rather than "epochs" and implement a step as a batch. But doing that consistently would require a lot of changes, including to the runcard syntax, and I imagine that is not wanted. We can also keep using "epochs" everywhere except in the changes here. I can only see this causing issues when implementing a new callback, but since that would probably start from the existing ones, where now only on_batch_end is used, it will probably be fine.
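A sketch of the generator idea from point 1, reusing the toy x_grid, y_data and model from the first sketch above; Keras's fit accepts a plain Python generator as long as steps_per_epoch is given:

def repeat_single_point(x, y, n_times):
    """Yield the same single (x, y) point n_times, without materializing any copies."""
    for _ in range(n_times):
        yield x, y

model.fit(
    repeat_single_point(x_grid, y_data, n_epochs),
    epochs=1,
    steps_per_epoch=n_epochs,
)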

@scarlehoff (Member) commented:

> Is the extra memory use ever an issue?

Yes. When running many replicas in parallel, increasing the memory usage drastically reduces how many you can actually fit on a cluster. If the gain on CPU is only ~10% it might not be worth it. We might want to have some branching, even a "run_in_batches=True" option in the runcard.

> Whether and how to change terminology.

Better not to change the terminology. The on_X_end callbacks have to change because those are internal to TensorFlow, but we now use "epoch" to mean "training iteration" (take into account that most people in the collaboration don't actually touch the code, and "epoch" is now part of the vocabulary). The people touching this part of the code are only about 3, including you, and they are all aware of the change :P

Comment on lines 168 to 172
# This looks stupid, but it's actually faster, as it avoids some Tensorflow overhead
# every epoch. Each step is now a batch rather than an epoch
for k, v in x_params.items():
    x_params[k] = tf.repeat(v, epochs, axis=0)
y = [tf.repeat(yi, epochs, axis=0) for yi in y]
@scarlehoff (Member) commented Sep 6, 2023

Suggested change:

- # This looks stupid, but it's actually faster, as it avoids some Tensorflow overhead
- # every epoch. Each step is now a batch rather than an epoch
+ # Instead of running # epochs, run #epochs batches in a single epoch
+ # this avoids some Tensorflow overhead and it's actually faster
  for k, v in x_params.items():
      x_params[k] = tf.repeat(v, epochs, axis=0)
  y = [tf.repeat(yi, epochs, axis=0) for yi in y]

I'd say this is a clever trick, not a stupid one.

RE the memory usage, I wonder whether there's a way of tricking TensorFlow into passing a tensor of length 1 in the batch dimension while making it believe there are actually many.
It depends on how the batch-taking is implemented in TensorFlow, but maybe we can do something like:

class FakeTensor(Tensor):
    def get_batch(self, i):
        if i < self._epochs:
            return self._true_tensor
        return end_signal
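For reference, a stock-TensorFlow way to get this kind of "virtual" repetition without a custom tensor class is tf.data, whose repeat is lazy and does not copy the data (a sketch using the toy names from the first example above, not necessarily what was tried below):

import tensorflow as tf

# The single point is stored once; .repeat() re-yields it lazily n_epochs times,
# and each dataset element is treated as one batch by model.fit.
dataset = tf.data.Dataset.from_tensors((x_grid, y_data)).repeat(n_epochs)
model.fit(dataset, epochs=1)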

@APJansen (Collaborator, Author) commented:

I implemented something like this; it works, but unfortunately it seems to completely negate the benefit of the copying. No idea why.

@scarlehoff (Member) commented:

I think you might need to actually trick TensorFlow for that to work.

However, I think the timings are quite OK as they are, so we can leave this on the back burner for the time being. No need to over-optimize when there are many other bigger problems in the way.

@APJansen (Collaborator, Author) commented:

Well, they seem OK here, but that's just because the rest is slow ;P It's about a factor 2 speedup once the other optimizations in issue #1803 are implemented.
But sure, it can wait until the rest is done. I hoped to fix this quickly before I go on holiday, but of course it's always a bit more tricky than you expect. I think the simplest is to just make, say, 1000 copies and train for epochs/1000 epochs.

@scarlehoff (Member) commented:

> It's about a factor 2 speedup once the other optimizations in issue #1803 are implemented.

But is it also a factor 2 on CPU? Because if it is only an improvement for GPU, I'd say it's better to branch there. GPU and CPU are different enough devices that I think some branching is OK, like the eigen/tensordot thing, and on CPU the memory growth can hurt more than the 10% gain you quoted before.

@APJansen (Collaborator, Author) commented:

I don't know; it reduces the gap between steps from ~50 to ~3 ms, so it depends on how much effect the other refactorings have on the CPU. If it's still not significant after that, we can always default to the old behavior when the number of replicas is 1.

@scarlehoff (Member) commented:

> The issue was that for the training losses, Keras computes a running average over batches. I added a function in between that corrects the logs to give the current batch's losses.
> There is still something wrong, though. The relative difference in training chi2 starts out at 10^-8, so that could be round-off error from taking and "un-taking" the average over batches. But after 1000 epochs it has grown to order 1, whereas these numerical errors shouldn't even accumulate, since each computation is independent. Probably Keras takes a weighted average biased towards more recent batches, or something similar. (The validation remains identical to the last digit.)

Thinking about this: you said that the validation remains identical, but what about the weights of the neural networks after a complete fit? If those are the same, then it is clear that everything stays the same and the difference is only in the reporting, which is OK. For the final training loss (i.e., the one that we write in the .json file) we can evaluate the loss manually and write that.
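Evaluating the final training loss manually would be a one-liner on the un-repeated inputs (toy names from the first sketch again):

# Recompute the training loss directly instead of trusting the batch-averaged Keras logs.
final_training_loss = model.evaluate(x_grid, y_data, batch_size=1, verbose=0)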

@APJansen (Collaborator, Author) commented Sep 8, 2023

> Better not to change the terminology. The on_X_end callbacks have to change because those are internal to TensorFlow, but we now use "epoch" to mean "training iteration" (take into account that most people in the collaboration don't actually touch the code, and "epoch" is now part of the vocabulary). The people touching this part of the code are only about 3, including you, and they are all aware of the change :P

Haha ok, I agree. I just put a comment at the top of the callbacks module.

> Thinking about this: you said that the validation remains identical, but what about the weights of the neural networks after a complete fit? If those are the same, then it is clear that everything stays the same and the difference is only in the reporting, which is OK. For the final training loss (i.e., the one that we write in the .json file) we can evaluate the loss manually and write that.

The weights must also be the same. I didn't check, but literally every digit of every one of the first 1000 epochs (I didn't train for longer) is identical; it would be quite the coincidence if the weights were different ;P I'm not sure it's harmless though: I think we want more than the final loss to understand what the model is doing. My current approach for correcting this doesn't work, not just because it's off (which I still don't understand), but also because it relies on being run every epoch, which is what I was testing with, whereas usually it only runs every 100 epochs.

@scarlehoff (Member) commented:

> Yes. When running many replicas in parallel, increasing the memory usage drastically reduces how many you can actually fit on a cluster. If the gain on CPU is only ~10% it might not be worth it. We might want to have some branching, even a "run_in_batches=True" option in the runcard.

There's a solution that would work in both cases, without the branching. We could put the data in the first layer as a fixed layer (shared between all replicas); the input could then be a very long list of None, which should have no effect on the memory.
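A sketch of what "data in a fixed first layer" could look like, with the toy shapes from the first example (not the actual n3fit layers): the x grid lives inside the graph as a constant shared by all replicas, and the real model input is a throw-away scalar, so feeding epochs of them costs essentially no memory:

import numpy as np
import tensorflow as tf
from tensorflow import keras

x_constant = tf.constant(x_grid)  # the real grid, stored once inside the graph

dummy_in = keras.Input(shape=(1,))
# The Lambda ignores the dummy's value and just broadcasts the stored grid to the batch size.
grid = keras.layers.Lambda(lambda d: tf.repeat(x_constant, tf.shape(d)[0], axis=0))(dummy_in)
out = keras.layers.Dense(1)(grid)
fixed_data_model = keras.Model(dummy_in, out)
fixed_data_model.compile(optimizer="adam", loss="mse")

# "a very long list of None": n_epochs dummy inputs of negligible size.
dummies = np.zeros((n_epochs, 1), dtype="float32")
# (The targets would need a similar treatment; they are simply repeated here for brevity.)
fixed_data_model.fit(dummies, tf.repeat(y_data, n_epochs, axis=0), epochs=1, batch_size=1)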

@APJansen (Collaborator, Author) commented:

So the approach I've chosen is:

  • for 1 replica (in practice, CPU), don't change anything;
  • for multiple replicas, copy up to 100 times, and if the number of epochs isn't divisible by 100, try 10 and log a warning.

The reason for the first point is that it's not worth it there: it comes with the cost of correcting the training logs, which would slow it down overall.
For the latter, using 100 copies leaves only a 0.5% speedup on the table (after the other refactorings to come), so this seems like a good tradeoff.
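A sketch of that copy-count choice (the final fallback, when the number of epochs is divisible by neither 100 nor 10, is my assumption and is not specified above):

import logging

log = logging.getLogger(__name__)

def choose_num_copies(epochs, num_replicas):
    """Illustrative version of the copy-count logic described above."""
    if num_replicas == 1:
        return 1  # single replica (in practice CPU): keep the old behaviour
    if epochs % 100 == 0:
        return 100
    if epochs % 10 == 0:
        log.warning("Number of epochs not divisible by 100, using 10 copies instead")
        return 10
    # Assumed fallback, not stated in the comment above.
    log.warning("Number of epochs divisible by neither 100 nor 10, not batching")
    return 1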

I did this mostly by creating a class CallbackStep, which all the other callbacks now inherit from; it takes care of these conversions between epochs and batches and calls an on_step_end that the subclasses define.

I've tested that for 1 replica the results are identical, and for multiple replicas only the training chi2s differ slightly, by up to 0.1%. This comes from converting the logs back from an average to a single-step loss, so it doesn't affect the training at all.

I still need to check the performance, but maybe I'll hold off on that until enough other refactorings are merged that the gain becomes substantial.
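As a rough picture of the CallbackStep idea (a sketch, not the PR's actual class): subclasses implement only on_step_end, and the base class maps either Keras epochs or Keras batches onto "steps", depending on how many copies are packed into one Keras epoch:

from tensorflow.keras.callbacks import Callback

class CallbackStep(Callback):
    """Base class sketch: subclasses define on_step_end and need not know whether one
    training step is implemented as a Keras epoch or as a batch."""

    def __init__(self, steps_per_epoch=1):
        super().__init__()
        self.steps_per_epoch = steps_per_epoch  # copies packed into one Keras epoch
        self._current_epoch = 0

    def on_epoch_begin(self, epoch, logs=None):
        self._current_epoch = epoch

    def on_epoch_end(self, epoch, logs=None):
        # Old behaviour (1 replica): one step per Keras epoch.
        if self.steps_per_epoch == 1:
            self.on_step_end(epoch, logs)

    def on_batch_end(self, batch, logs=None):
        # New behaviour: one step per batch, translated back to a global step count
        # so subclasses still see "epochs" counting up as before.
        if self.steps_per_epoch > 1:
            self.on_step_end(self._current_epoch * self.steps_per_epoch + batch, logs)

    def on_step_end(self, step, logs=None):
        raise NotImplementedError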

@APJansen (Collaborator, Author) commented:

I have revived this in #1939 (I was having trouble rebasing and didn't want to waste time if I could just cherry-pick), so I think this can be closed.

@APJansen closed this on Feb 13, 2024