
Parallel replicas with varying tr-vl masks #1788

Merged (33 commits) on Feb 22, 2024

Conversation

@goord (Collaborator) commented Aug 9, 2023:

This is a continuation of pull request #1661, which implements support for parallel replicas with same-trvl-seed=false. The branch has been migrated to this repository and the latest master has been merged in.

if is_hashable(obj):
    return obj
elif isinstance(obj, Mapping):
    return frozenset([(immute(k), immute(v)) for k, v in obj.items()])
Contributor review comment:

The dict keys are already hashable.

Suggested change:
-    return frozenset([(immute(k), immute(v)) for k, v in obj.items()])
+    return frozenset([(k, immute(v)) for k, v in obj.items()])
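
For context, here is a self-contained sketch of what the freezer under discussion could look like with the reviewer's suggestion applied. This is a hypothetical reconstruction: only fragments of the actual validphys implementation are quoted above, so the exact branches are illustrative.

from collections.abc import Mapping, Sequence, Set
from typing import Any

def is_hashable(obj: Any) -> bool:
    try:
        hash(obj)
    except TypeError:
        return False
    return True

def immute(obj: Any):
    # Hashable objects (str, numbers, tuples of hashables, frozensets, ...)
    # are returned as-is, which also stops the recursion.
    if is_hashable(obj):
        return obj
    if isinstance(obj, Mapping):
        # Dict keys are already hashable, so only the values need freezing.
        return frozenset((k, immute(v)) for k, v in obj.items())
    if isinstance(obj, Set):
        return frozenset(immute(ele) for ele in obj)
    if isinstance(obj, Sequence):
        return tuple(immute(ele) for ele in obj)
    raise TypeError(f"Cannot make {type(obj)} hashable")

For example, immute({"datasets": ["NMC", "SLACP"]}) yields a frozenset of (key, tuple) pairs that can be used directly as an lru_cache key.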

Review thread on validphys2/src/validphys/utils.py (outdated, resolved).
def wrapped(*args, **kwargs):
    args = tuple([immute(arg) for arg in args])
    kwargs = {k: immute(v) for k, v in kwargs.items()}
    return func(*args, **kwargs)
Contributor review comment:

I feel this is rather dangerous as a general decorator, as it changes the meaning of the inputs in unexpected ways.

It may make more sense to do this closer to the runcard input, in which case the freezer becomes:

def freeze(obj):
    if isinstance(obj, list):
        return tuple(freeze(ele) for ele in obj)
    if isinstance(obj, dict):
        return frozenset((k, freeze(v)) for k, v in obj.items())
    if isinstance(obj, set):
        return frozenset(obj)
    return obj

Collaborator (author) reply:

Right, but where in the code should the freezer be called?

@Zaharid (Contributor) replied Sep 14, 2023:

An input to a vp action is either:

  • Some input coming from the runcard, which is not hashable in general.
  • The output of a parsing or a production rule (like a PDF object), which really needs to be hashable already.
  • The output of another action, which you really don't care about being hashable.

Hence you really only need to make sure the runcard inputs are hashable, which would not change the semantics too much. And the best place to do that is close to where you are parsing the input. It is something I had around for reportengine "1.0", which is why I had the above function lying around.
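
As a rough illustration of "close to where you are parsing the input" (entirely hypothetical: parse_runcard is a made-up entry point, and the freezer is the one quoted above, reproduced so the snippet runs on its own), the idea would be to freeze the raw runcard mapping once when it is read, rather than wrapping every action:

import yaml

def freeze(obj):
    # The freezer suggested above.
    if isinstance(obj, list):
        return tuple(freeze(ele) for ele in obj)
    if isinstance(obj, dict):
        return frozenset((k, freeze(v)) for k, v in obj.items())
    if isinstance(obj, set):
        return frozenset(obj)
    return obj

def parse_runcard(path):
    # Hypothetical parsing hook: everything downstream then receives hashable inputs.
    with open(path) as stream:
        raw = yaml.safe_load(stream)
    return {key: freeze(value) for key, value in raw.items()}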

def immute(obj: Any):
    # So that we don't infinitely recurse since frozenset and tuples
    # are Sequences.
    if is_hashable(obj):
Contributor review comment:

Computing hashes of potentially deeply nested objects can be expensive. Given that you are going to fail anyway, you might as well do nothing and deal with effectively the same error.

Collaborator (author) reply:

Indeed we could skip the hashing check. As a side note: I have no idea why tests are currently failing in this branch...

@scarlehoff (Member) commented:

Looking through all the cuts classes, the lru_caches that you added should be fine. Below I've added a small script that tests that they are actually loaded independently.

  1. load_with_cuts: the cuts variable will only be understood as being the same when it really is the same (either it comes from the same path, is internal to the same dataset, or is defined by the same similaritycuts), so it should be fine.
  2. dataset_t0_predictions: for a given dataset and a given PDF set these should be exactly the same.
  3. loaded_commondata_with_cuts: same as (1.).
  4. produce_rules: it's only used with use_cuts=internal (and then they are what they are) or when filter_rules is given. And if filter_rules are given then, again, they will be defining the cuts, and if they are the same the rules should be the same.

So it seems safe to me. We might want to do a comparefits report when we know we have different cuts to make sure that stays the same. Several examples here (that use different kinds of cuts):

https://vp.nnpdf.science/sK8VI_HITF-hrg2Cx-_lPQ==/
https://vp.nnpdf.science/GMXWE_wgS-Os5pbbsNA4gA==/
https://vp.nnpdf.science/rm4Vqm7zQwW8m6v5O1c_4A==/

However, some comments:

Point (2.) suggests to me that the lru_cache could instead be put at the level of the covariance matrix, since this is the only thing the t0 predictions are used for, and for a given list of datasets, cuts and a given t0 set the covariance matrix should be the same. This might be trickier to implement though, and if this already gives a good speed-up it is maybe not worth it.

from validphys.api import API
from validphys.convolution import central_predictions

# Same dataset and theory, but cuts taken from two different fits.
dsname = "NMCPD_dw_ite"
ds = API.dataset(dataset_input={"dataset": dsname}, theoryid=400, use_cuts="fromfit", fit="230718-jcmwfaser-001")
ds2 = API.dataset(dataset_input={"dataset": dsname}, theoryid=400, use_cuts="fromfit", fit="230725-jcm-nnpdf40-002")
pdf = API.pdf(pdf="NNPDF40_nnlo_as_01180")
cv = central_predictions(ds, pdf)
cv2 = central_predictions(ds2, pdf)
# Different cuts give a different number of points, so the cached datasets were not mixed up.
print(cv.shape[0] != cv2.shape[0])
> True

I've manually changed the cuts in both fits to make sure they were different.
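
Coming back to point (2.): a very rough sketch of what caching at the level of the covariance matrix could look like. All names and the signature are hypothetical (the real t0 covmat construction in validphys takes richer objects than plain strings); the point is only that lru_cache can key on a hashable (datasets, cuts, t0 set) combination.

from functools import lru_cache

import numpy as np

def _build_t0_covmat(dataset_names, cuts_id, t0pdf_name):
    # Stand-in for the expensive t0-prediction and covmat construction.
    rng = np.random.default_rng(0)
    a = rng.normal(size=(10, 10))
    return a @ a.T

@lru_cache(maxsize=8)
def cached_t0_covmat(dataset_names: tuple, cuts_id: str, t0pdf_name: str):
    # All arguments are hashable, so the covmat is built only once per combination.
    return _build_t0_covmat(dataset_names, cuts_id, t0pdf_name)

cached_t0_covmat(("NMC", "SLACP"), "internal", "NNPDF40_nnlo_as_01180")
cached_t0_covmat(("NMC", "SLACP"), "internal", "NNPDF40_nnlo_as_01180")
print(cached_t0_covmat.cache_info().hits)  # 1: the second call is a cache hit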

Re: the tests, I think you are seeing the problem that was fixed in #1805.

@goord (Collaborator, author) commented Oct 24, 2023:

It looks like something is still wrong. I did a test with a single epoch and 10 parallel replicas, and it seems to me that the difference between the baseline (sequential replicas) and the parallel case grows with the replica number:

grep -i "Best fit for replica" master/seq-cpu/nnpdf-cpu-10.log
[INFO]: Best fit for replica #1, chi2=12155.363 (tr=15396.849, vl=3370.149)
[INFO]: Best fit for replica #2, chi2=52.119 (tr=56.119, vl=78.265)
[INFO]: Best fit for replica #3, chi2=44.637 (tr=49.594, vl=67.921)
[INFO]: Best fit for replica #4, chi2=1759.277 (tr=1103.458, vl=3076.298)
[INFO]: Best fit for replica #5, chi2=181.414 (tr=190.806, vl=254.887)
[INFO]: Best fit for replica #6, chi2=880.913 (tr=1119.586, vl=281.592)
[INFO]: Best fit for replica #7, chi2=9869.230 (tr=12839.809, vl=1296.612)
[INFO]: Best fit for replica #8, chi2=136.152 (tr=144.090, vl=194.078)
[INFO]: Best fit for replica #9, chi2=351.785 (tr=424.573, vl=156.781)
[INFO]: Best fit for replica #10, chi2=70873.164 (tr=14250.999, vl=225323.688)

grep -i "Best fit for replica" trvl-mask/seq-cpu/nnpdf-cpu-10.log 
[INFO]: Best fit for replica #1, chi2=12155.471 (tr=15396.991, vl=3370.150)
[INFO]: Best fit for replica #2, chi2=52.119 (tr=56.119, vl=78.265)
[INFO]: Best fit for replica #3, chi2=44.637 (tr=49.594, vl=67.921)
[INFO]: Best fit for replica #4, chi2=1759.277 (tr=1103.458, vl=3076.298)
[INFO]: Best fit for replica #5, chi2=181.414 (tr=190.806, vl=254.887)
[INFO]: Best fit for replica #6, chi2=880.913 (tr=1119.586, vl=281.592)
[INFO]: Best fit for replica #7, chi2=9869.230 (tr=12839.809, vl=1296.612)
[INFO]: Best fit for replica #8, chi2=136.152 (tr=144.090, vl=194.078)
[INFO]: Best fit for replica #9, chi2=351.789 (tr=424.577, vl=156.781)
[INFO]: Best fit for replica #10, chi2=70875.047 (tr=14250.945, vl=225331.047)

grep -i "Best fit for replica" trvl-mask/par-cpu/nnpdf-cpu-10.log 
[INFO]: Best fit for replica #1, chi2=12155.734 (tr=15381.599, vl=3383.293)
[INFO]: Best fit for replica #2, chi2=52.516 (tr=56.810, vl=78.048)
[INFO]: Best fit for replica #3, chi2=43.484 (tr=47.829, vl=69.136)
[INFO]: Best fit for replica #4, chi2=1679.792 (tr=1526.910, vl=1633.075)
[INFO]: Best fit for replica #5, chi2=200.910 (tr=216.563, vl=258.565)
[INFO]: Best fit for replica #6, chi2=2451.295 (tr=3158.492, vl=608.226)
[INFO]: Best fit for replica #7, chi2=2967.282 (tr=3846.834, vl=671.785)
[INFO]: Best fit for replica #8, chi2=1429.978 (tr=1877.560, vl=221.299)
[INFO]: Best fit for replica #9, chi2=553.733 (tr=642.818, vl=245.547)
[INFO]: Best fit for replica #10, chi2=123811.469 (tr=144675.531, vl=52737.551)

I expect some roundoff errors, but certainly not a systematic increase per replica. BTW, for the sequential mode the chi2 values are identical to the master branch.

edit: a quick check with the basic parallel runcard shows no significant differences across replicas, so the problem appears to occur only for the NNPDF4.0 runcard...

@Radonirinaunimi (Member) commented:

As already mentioned, I don't expect the numbers at a given epoch to be fully consistent/close. In the parallel-replicas case, even a very small difference at the start of the fit could lead to fluctuations during the training. It is not the case that all the changes here keep the numerical values exactly the same. So the proper way to compare the results is by running proper fits. At the very least, for now we should compare: master sequential vs trvl-mask sequential, and master sequential vs trvl-mask parallel.

Does this make sense?

@RoyStegeman (Member) commented Oct 24, 2023:

Well, I agree with @goord that the pattern is curious. If all the seeds are the same at the start of each replica, then the result should be the same for the sequential and parallel cases. It may be that there is some numerical instability, or perhaps the seeds are not the same because e.g. the optimizer/NN-initialization/trvl-split/whatever seed doesn't get reset for the parallel case(?). But then I still don't understand why not only the first but also the second and third replicas agree quite well, while this agreement has clearly deteriorated by replicas 9 and 10. If the seeds were not the same, I'd expect the deterioration at replica 2 to be equivalent to all later differences...

Also, the edit saying that it's a problem for the NNPDF4.0 runcard but not for the basic runcard is odd. Is debug always False when doing the comparison for the basic runcard?

@scarlehoff (Member) commented Oct 24, 2023:

In this case, however, this could point to a bug in how the fit is being stopped. The first replica is stopped correctly, the second one ends a bit farther away from the optimal point, and so on and so forth.
For instance, more than the numerical difference, I'd be worried that for replica 8 chi_tr < chi_vl in the sequential case but chi_tr > chi_vl in the parallel case.

The random seeds are the other thing that comes to mind; as @RoyStegeman said, it would make sense that they are the same for the 1st replica but then different for the rest. But I also fail to see why they should be more different for replica 5 than for replica 2 (and the first few replicas are very close...).

The basic_runcard has same_trvl_per_replica: True; did you remove that option? Because if you didn't, that might be pointing to the source of the bug.

@goord (Collaborator, author) commented Oct 24, 2023:

Well, my first suspicion was that somewhere along the losses or observables tensors somehow get accumulated along the replica dimension, but then again the basic runcard seems to reproduce the sequential fit...

@RoyStegeman (Member) commented:

But are all the important settings the same for the basic runcard and the NNPDF4.0 runcard? I.e. are the changes between the basic runcard you use and the NNPDF4.0 runcard limited to the choice of datasets, seeds (same seeds, just different values), and preprocessing, while keys such as debug and same_trvl_per_replica are the same in the two cases? If they are all the same, do you again find the divergence of results if all you do is add more datasets to the basic runcard?

I'd like to understand what the important difference is between the runcards that causes the different behaviour.

@goord (Collaborator, author) commented Oct 24, 2023:

Both have DIS and DY datasets, positivity constraints, and same_trvl_per_replica set to false to test the new functionality. The NNPDF4.0 runcard has integrability layers though, which the basic runcard doesn't have.

@goord (Collaborator, author) commented Oct 30, 2023:

Regarding the single-epoch reproducibility test:

  • On our laptops, both sequential and parallel modes seem to produce the same result (modulo small numerical noise).
  • On the Snellius cluster, even the master branch (with the same tr/vl split for all replicas) gives different results for the single-epoch sequential and parallel fits with 10 replicas.

(edit) Using TensorFlow 2.10 instead of 2.11 solves the reproducibility issue on the cluster too.
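
Not part of this PR, but for reference: a minimal, hypothetical reproducibility probe that can help separate TF-version nondeterminism from genuine bugs in the masking code. It uses tf.keras.utils.set_random_seed (available from TF 2.7) and tf.config.experimental.enable_op_determinism (available from TF 2.9); the toy model and data are made up.

import numpy as np
import tensorflow as tf

def deterministic_run(seed=0):
    tf.keras.utils.set_random_seed(seed)            # seeds the Python, NumPy and TF RNGs
    tf.config.experimental.enable_op_determinism()  # forces deterministic kernels
    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(4, activation="tanh"), tf.keras.layers.Dense(1)]
    )
    model.compile(optimizer="adam", loss="mse")
    x = np.random.default_rng(seed).normal(size=(128, 3))
    y = x.sum(axis=1, keepdims=True)
    model.fit(x, y, epochs=1, batch_size=16, verbose=0)
    return model.predict(x, verbose=0)

# Two runs with the same seed should be bit-identical on a deterministic stack.
print(np.array_equal(deterministic_run(0), deterministic_run(0)))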

@Radonirinaunimi (Member) commented:

> (edit) Using TensorFlow 2.10 instead of 2.11 solves the reproducibility issue on the cluster too

Great! Did you understand why so?

PS: Could you please run black on the modified files?

@scarlehoff mentioned this pull request on Nov 17, 2023.
@goord (Collaborator, author) commented Nov 28, 2023:

Over the weekend I did an extensive set of regression tests between master (rev. db8b790) and trvl-mask-layers branch (latest rev.). To my untrained eye, results look more or less identical:

Runcard comparisons (100 replicas):

  • NNPDF4.0, master vs. trvl-mask-layers, sequential CPU: https://vp.nnpdf.science/I6rdrAFsQ6Gah3fJ1Vf2rQ==
  • NNPDF4.0, master vs. trvl-mask-layers, parallel CPU: https://vp.nnpdf.science/IX_nDzOGSfaclCah74K-rQ==
  • NNPDF4.0, master vs. trvl-mask-layers, parallel GPU: https://vp.nnpdf.science/3JOhq50GRqizFGwZPdRfUQ==
  • feature-scaling, master vs. trvl-mask-layers, sequential CPU: TBD
  • feature-scaling, master vs. trvl-mask-layers, parallel CPU: https://vp.nnpdf.science/vRpWyCzARWKKs8U3PR733Q==
  • feature-scaling, master vs. trvl-mask-layers, parallel GPU: https://vp.nnpdf.science/atG4x7_xQNSP7bl0xJSSGg==
  • flavour-basis, master vs. trvl-mask-layers, sequential CPU: TBD
  • flavour-basis, master vs. trvl-mask-layers, parallel CPU: https://vp.nnpdf.science/RD_jmIzrQtKZ3qfoLikrlw==
  • flavour-basis, master vs. trvl-mask-layers, parallel GPU: https://vp.nnpdf.science/cc60ST-4QFGzIk9SKGV0vw==

@Radonirinaunimi linked an issue on Nov 28, 2023 that may be closed by this pull request.
@Radonirinaunimi (Member) commented:

Thanks a lot @goord for these comparisons! They look great; as you said, the results are statistically identical. I also see that you've found a way around the TF-version-related reproducibility issue (?). I guess that after some clean-ups (and black) this is finally ready for reviews?

@goord (Collaborator, author) commented Nov 29, 2023:

TODO: the observation data rotation is currently not working correctly, because it applies a masked rotation matrix after the masking layer. This should change to applying the unmasked rotation to the observation data before the masking.
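
A toy illustration of why the ordering matters (hypothetical sizes and NumPy instead of the actual n3fit layers): rotating the full observation vector and then masking is not equivalent to masking first and applying a rotation matrix whose rows and columns have been masked.

import numpy as np

rng = np.random.default_rng(0)
ndata = 5
data = rng.normal(size=ndata)
rotation = np.linalg.qr(rng.normal(size=(ndata, ndata)))[0]  # some orthogonal rotation
mask = np.array([True, False, True, True, False])

# What the TODO asks for: rotate the unmasked observations, then mask.
rotate_then_mask = (rotation @ data)[mask]

# The problematic order, schematically: mask first, then apply the masked rotation.
mask_then_rotate = rotation[np.ix_(mask, mask)] @ data[mask]

print(np.allclose(rotate_then_mask, mask_then_rotate))  # False in general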

@scarlehoff (Member) commented:

When you try to run DIS_diagonal_l2reg_example.yml it might be that not all datasets declared there exist anymore. You might need to remove some of them to test the runcard (as mentioned earlier today... it would be good to have all examples tested...)

@Cmurilochem mentioned this pull request on Dec 11, 2023.
@goord (Collaborator, author) commented Dec 21, 2023:

Refactored the LossInvCovmat to handle the diagonal case in an optimized way. I get no significant numerical changes after 4000 epochs.

I was running the black tool, but I get a whole bunch of files I didn't touch that need reformatting... Should I ignore those?
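
As a generic sanity check of the diagonal optimization mentioned above (a NumPy sketch, not the actual LossInvCovmat code): when the inverse covariance matrix is diagonal, the chi2 reduces to a weighted sum of squared residuals, so the full matrix contraction can be skipped.

import numpy as np

rng = np.random.default_rng(1)
ndata = 100
residuals = rng.normal(size=ndata)
diag = rng.uniform(0.5, 2.0, size=ndata)       # diagonal of the inverse covariance
invcovmat = np.diag(diag)

chi2_full = residuals @ invcovmat @ residuals  # generic quadratic form
chi2_diag = np.sum(diag * residuals**2)        # optimized diagonal-only path

print(np.isclose(chi2_full, chi2_diag))  # True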

@APJansen (Collaborator) commented:

Is this ready for the redo-regressions?

@scarlehoff (Member) replied:

No, we asked for a lot of stuff to be changed. I'd need to redo the review.

Resolved review threads:
  • n3fit/src/n3fit/layers/losses.py (outdated)
  • n3fit/src/n3fit/layers/losses.py (outdated)
  • n3fit/src/n3fit/model_gen.py (outdated)
  • n3fit/src/n3fit/model_trainer.py
  • n3fit/src/n3fit/model_trainer.py (outdated)
  • n3fit/src/n3fit/model_trainer.py (outdated)
  • n3fit/src/n3fit/model_trainer.py (outdated)
  • n3fit/src/n3fit/performfit.py (outdated)
  • validphys2/src/validphys/n3fit_data.py (outdated)
  • validphys2/src/validphys/n3fit_data.py (outdated)
APJansen and others added 5 commits on February 21, 2024 (co-authored by Juan M. Cruz-Martinez <juacrumar@lairen.eu>).
@APJansen (Collaborator) commented:

I've fixed what I could, but I also find anything related to the preprocessing of data very confusing.

@APJansen (Collaborator) commented:

I noticed I introduced a bug in 49904a3 (it took the number of replicas from the batch axis); I just pushed a fix.

@APJansen (Collaborator) commented:

Is this now ready for redoing the regressions (once tests pass)? It would be great if we can get this merged, and perhaps even #1936, so that we can turn on some faster hyperopt runs over the weekend to get some more data.

@scarlehoff (Member) left a review comment:

Just a final comment from me :P

Review thread on n3fit/src/n3fit/model_trainer.py (outdated, resolved).
@APJansen added the redo-regressions label (Recompute the regression data) on Feb 22, 2024.
@APJansen (Collaborator) commented:

@scarlehoff About the flattening, indeed I guess it was the covmat? You wrote that you would change your suggestion to a comment but then resolved it; should I add a comment there or not?

@scarlehoff merged commit faadba0 into master on Feb 22, 2024; 8 checks passed.
@scarlehoff deleted the trvl-mask-layers branch on February 22, 2024 at 12:32.
Labels: escience, redo-regressions (Recompute the regression data)

Linked issue that may be closed by this pull request: trvl-mask-layers parallel NNPDF4.0 fit broken

6 participants