
Disable jit compilation in tf > 2.16 #2135

Merged: 2 commits merged into master from disable_xla_tf216 on Jul 28, 2024
Conversation

@scarlehoff (Member) commented Jul 25, 2024

This is enough to run fits on GPU with TF > 2.16, at least on my system: I can run 120 replicas (it took a few iterations of CUDA drivers to make it work).

This also disables XLA on CPU (if it is active, which I don't think it is by default at the moment):

--tf_xla_cpu_global_jit=false           bool    Enables global JIT compilation for CPU via SessionOptions.

I would keep running with 2.15, since we know for sure that one works and I've only tested this on one system so far (Python 3.12, TF 2.17, RTX 3070).

I don't notice any performance degradation, but this GPU only has 8 GB of RAM, so I may already have been memory-bottlenecked before.

Note that before TF 2.16, XLA compilation was disabled by default. The funny thing is that if you enable it in TF 2.15 we also see some problems (even on CPU).
I don't know whether this is a fundamental problem with XLA (in which case there is nothing we can do) or whether there is a problem in our code that prevents XLA from working.
If someone wants to investigate, the best starting point is probably TF 2.15 with JIT_COMPILE=True, because from 2.16 onwards it will run but then crash due to the memory leak.
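
For concreteness, a minimal sketch (not the actual diff of this PR) of the mechanisms being discussed, assuming the relevant knobs are the `TF_XLA_FLAGS` environment variable, `tf.config.optimizer.set_jit` and the `jit_compile` argument of `Model.compile`:

```python
import os

# Hypothetical sketch, not the PR diff: mirror the --tf_xla_cpu_global_jit flag
# through the TF_XLA_FLAGS environment variable before TensorFlow is imported.
os.environ.setdefault("TF_XLA_FLAGS", "--tf_xla_cpu_global_jit=false")

import tensorflow as tf

# Disable XLA auto-clustering at the graph-optimizer level.
tf.config.optimizer.set_jit(False)

# With Keras 3 (bundled from TF 2.16 onwards) jit_compile defaults to "auto",
# so it is also switched off explicitly when compiling the model.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse", jit_compile=False)
```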

@scarlehoff scarlehoff added the n3fit Issues and PRs related to n3fit label Jul 25, 2024
@scarlehoff scarlehoff marked this pull request as ready for review July 25, 2024 12:11
@Cmurilochem Cmurilochem self-assigned this Jul 25, 2024
@scarlehoff scarlehoff force-pushed the disable_xla_tf216 branch 2 times, most recently from e7fd0ca to 6498f0f on July 25, 2024 16:41
@scarlehoff (Member, Author) commented Jul 25, 2024

Here's a report:

https://vp.nnpdf.science/8vyjWCvZRvSzu0JGbLbErQ==/

The computer used to run this fit was
Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz, 16 GB of RAM
RTX 3070 8 GB

Took about 60 minutes for 120 replicas.

I'll merge once the tests are passing (there has been some trouble with the CERN server and some PDFs were not downloaded from LHAPDF), since this changes nothing.

Also, I'm relatively proud of this stability, taking into account the number of things that have changed between these two fits (including the fact that after 4.0.9 we are no longer backwards compatible by law) :)

[Screenshot 2024-07-25 at 18:44:53]

@RoyStegeman (Member)

Do you know why the training length distribution looks so different?

@scarlehoff (Member, Author)

The specific pattern looks more different than it really is, because by chance the fits stopped in a way that made the distribution flatter. But the number of fits that arrive at the last bin (which is the only relevant one) is ~25 vs ~17.

That said, for some reason, and with only N=2 fits, the GPU seems to produce flatter distributions. I don't know whether there's a reason for it: https://vp.nnpdf.science/WbBCvsjfQV-6ncIQ3GhCVw==/

or whether it is related to the change in the positivity datasets.

@RoyStegeman (Member)

I should have done a CPU fit with that runcard and master at some point, but let me quickly redo it to be sure.

@RoyStegeman (Member)

We already discussed it, but for later reference let me put here the results of the fit with the current version of the master branch, but on CPU and using the same runcard as in your fit: https://vp.nnpdf.science/C0EQblACS2qzaMbpNLuePQ==/

As expected, the TL distribution looks different, but it would be interesting to see how the TL distribution changes on GPU if the pseudodata sampling seed is changed, since it is the same in both of your fits.

@scarlehoff (Member, Author) commented Jul 26, 2024

Good eye!

Indeed, there was a problem with the multi-replica stopping where a replica could become active again. Here's the same fit with this corrected:

https://vp.nnpdf.science/RWnazARaSb6TbKMtkxONGA==

It doesn't seem to have an effect on the fit (in this report even the seed for the pseudodata is different), but it's good that it has been caught :)

@Cmurilochem in the same way, this should not impact your hyperopt runs, but if you need to run new ones it's probably better to use this branch.
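
For context, a hypothetical sketch (not the actual n3fit stopping code) of the invariant the fix restores: once a replica has been stopped by the patience criterion, its "active" flag must never flip back on.

```python
import numpy as np

# Hypothetical sketch, not the n3fit implementation: per-replica early stopping
# in which a replica that has been stopped can never become active again.
class MultiReplicaStopping:
    def __init__(self, n_replicas, patience):
        self.patience = patience
        self.best_loss = np.full(n_replicas, np.inf)
        self.counter = np.zeros(n_replicas, dtype=int)
        self.active = np.ones(n_replicas, dtype=bool)

    def update(self, val_loss):
        """val_loss: array of per-replica validation losses for this epoch."""
        improved = self.active & (val_loss < self.best_loss)
        self.best_loss = np.where(improved, val_loss, self.best_loss)
        # Reset the patience counter only for replicas that improved; stopped
        # replicas keep their counter frozen.
        self.counter = np.where(improved, 0, self.counter + self.active.astype(int))
        # The invariant: 'active' is AND-ed with its previous value, so it is
        # monotonically non-increasing and a stopped replica stays stopped.
        self.active &= self.counter < self.patience
        return self.active
```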

@scarlehoff (Member, Author)

For good measure, here is a fit using the same seeds as the first one: https://vp.nnpdf.science/CyjXLYyARtKfWK1iUnJrYg==/

This is ready to merge

@RoyStegeman RoyStegeman added the run-fit-bot Starts fit bot from a PR. label Jul 27, 2024

Greetings from your nice fit 🤖 !
I have good news for you, I just finished my tasks:

Check the report carefully, and please buy me a ☕ , or better, a GPU 😉!

@RoyStegeman (Member) left a comment


Also with the same runcard as your first fit (240726-jcm-004 wasn't uploaded, so I couldn't compare to that): https://vp.nnpdf.science/rr476H7nT9GjOb6WBasw7Q==/

It took about 2:15 h and 25 GB on a V100.

Anyway, looks good to me

@RoyStegeman RoyStegeman merged commit 68f5c66 into master Jul 28, 2024
8 checks passed
@RoyStegeman RoyStegeman deleted the disable_xla_tf216 branch July 28, 2024 11:12
@scarlehoff (Member, Author) commented Jul 28, 2024

Also with the same runcard as your first (240726-jcm-004 wasn't uploaded so couldn't compare to that)

They are the same.

I'm surprised my puny RTX was faster, though. I wonder why.

@goord (Collaborator) commented Jul 29, 2024

From the XLA docs:

All operations must have inferrable shapes
XLA needs to be able to infer the shapes for all of the operations it compiles, given the inputs to the computation. So a model function that produces a Tensor with an unpredictable shape will fail with an error when run. (In this example, the shape of the output of tf.expand_dims depends on random_dim_size, which cannot be inferred given x, y and z.)

Note that because XLA is a JIT compiler, the shapes can vary across runs, as long as they can be inferred given the inputs to the cluster. So this example is fine.


Could it be that the masking layers give rise to 'unpredictable shape' tensors?
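
For illustration, a minimal, hypothetical example (not taken from n3fit) of the kind of data-dependent shape XLA cannot infer; whether the masking layers actually hit this case is exactly the open question:

```python
import tensorflow as tf

# Hypothetical sketch: tf.boolean_mask returns a tensor whose shape depends on
# the *values* of the mask, not just on the input shapes, so XLA cannot infer it.
@tf.function(jit_compile=True)
def masked_sum(x, mask):
    return tf.reduce_sum(tf.boolean_mask(x, mask))

x = tf.range(8, dtype=tf.float32)
mask = x > 3.0

# Eagerly or without jit_compile this runs fine; with XLA it may fail to
# compile, with the exact behaviour depending on the TF/XLA version.
print(masked_sum(x, mask))
```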

@scarlehoff (Member, Author)

Then it should crash at compilation time (so either way it is a bug on their side). If, for fun, you want to look further into this, I suggest #2137 as a starting point:
code that is compatible with both PyTorch and TensorFlow is the most explicit thing you can get (and indeed, to make it compatible with PyTorch I had to add the output shape in a few places where TensorFlow was able to infer it but PyTorch wasn't).
