
Parallel replicas with varying tr-vl masks #1788

Merged (33 commits) on Feb 22, 2024

Conversation

@goord (Collaborator) commented Aug 9, 2023:

This is a continuation of pull request #1661, which implements support for parallel replicas with same-trvl-seed=false. The branch has been migrated to this repository and the latest master has been merged in.

if is_hashable(obj):
    return obj
elif isinstance(obj, Mapping):
    return frozenset([(immute(k), immute(v)) for k, v in obj.items()])
Contributor review comment:

The dict keys are already hashable.

Suggested change:
-    return frozenset([(immute(k), immute(v)) for k, v in obj.items()])
+    return frozenset([(k, immute(v)) for k, v in obj.items()])
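
For context, here is a self-contained sketch of what the freezer under discussion could look like with the reviewer's suggestion applied. This is a hypothetical reconstruction: only fragments of the actual validphys implementation are quoted above, so the exact branches are illustrative.

from collections.abc import Mapping, Sequence, Set
from typing import Any

def is_hashable(obj: Any) -> bool:
    try:
        hash(obj)
    except TypeError:
        return False
    return True

def immute(obj: Any):
    # Hashable objects (str, numbers, tuples of hashables, frozensets, ...)
    # are returned as-is, which also stops the recursion.
    if is_hashable(obj):
        return obj
    if isinstance(obj, Mapping):
        # Dict keys are already hashable, so only the values need freezing.
        return frozenset((k, immute(v)) for k, v in obj.items())
    if isinstance(obj, Set):
        return frozenset(immute(ele) for ele in obj)
    if isinstance(obj, Sequence):
        return tuple(immute(ele) for ele in obj)
    raise TypeError(f"Cannot make {type(obj)} hashable")

For example, immute({"datasets": ["NMC", "SLACP"]}) yields a frozenset of (key, tuple) pairs that can be used directly as an lru_cache key.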

Review thread on validphys2/src/validphys/utils.py (outdated, resolved).
def wrapped(*args, **kwargs):
    args = tuple([immute(arg) for arg in args])
    kwargs = {k: immute(v) for k, v in kwargs.items()}
    return func(*args, **kwargs)
Contributor review comment:

I feel this is rather dangerous as a general decorator, as it changes the meaning of the inputs in unexpected ways.

It may make more sense to do this closer to the runcard input, in which case the freezer becomes:

def freeze(obj):
    if isinstance(obj, list):
        return tuple(freeze(ele) for ele in obj)
    if isinstance(obj, dict):
        return frozenset((k, freeze(v)) for k, v in obj.items())
    if isinstance(obj, set):
        return frozenset(obj)
    return obj

Collaborator (author) reply:

Right, but where in the code should the freezer be called?

@Zaharid (Contributor) replied Sep 14, 2023:

An input to a vp action is either:

  • Some input coming from the runcard, which is not hashable in general.
  • The output of a parsing or a production rule (like a PDF object), which really needs to be hashable already.
  • The output of another action, which you really don't care about being hashable.

Hence you really only need to make sure the runcard inputs are hashable, which would not change the semantics too much. And the best place to do that is close to where you are parsing the input. It is something I had around for reportengine "1.0", which is why I had the above function lying around.
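
As a rough illustration of "close to where you are parsing the input" (entirely hypothetical: parse_runcard is a made-up entry point, and the freezer is the one quoted above, reproduced so the snippet runs on its own), the idea would be to freeze the raw runcard mapping once when it is read, rather than wrapping every action:

import yaml

def freeze(obj):
    # The freezer suggested above.
    if isinstance(obj, list):
        return tuple(freeze(ele) for ele in obj)
    if isinstance(obj, dict):
        return frozenset((k, freeze(v)) for k, v in obj.items())
    if isinstance(obj, set):
        return frozenset(obj)
    return obj

def parse_runcard(path):
    # Hypothetical parsing hook: everything downstream then receives hashable inputs.
    with open(path) as stream:
        raw = yaml.safe_load(stream)
    return {key: freeze(value) for key, value in raw.items()}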

def immute(obj: Any):
    # So that we don't infinitely recurse since frozenset and tuples
    # are Sequences.
    if is_hashable(obj):
Contributor review comment:

Computing hashes of potentially deeply nested objects can be expensive. Given that you are going to fail anyway, you might as well do nothing and deal with effectively the same error.

Collaborator (author) reply:

Indeed we could skip the hashing check. As a side note: I have no idea why tests are currently failing in this branch...

@scarlehoff (Member) commented:

Looking through all the cuts classes, the lru_caches that you added should be fine. Below I've added a small script that tests that they are actually loaded independently.

  1. load_with_cuts: the cuts variable will only be understood as being the same when it really is the same (either it comes from the same path, is internal to the same dataset, or is defined by the same similaritycuts), so it should be fine.
  2. dataset_t0_predictions: for a given dataset and a given PDF set these should be exactly the same.
  3. loaded_commondata_with_cuts: same as (1.).
  4. produce_rules: it's only used with use_cuts=internal (and then they are what they are) or when filter_rules is given. And if filter_rules are given then, again, they will be defining the cuts, and if they are the same the rules should be the same.

So it seems safe to me. We might want to do a comparefits report when we know we have different cuts to make sure that stays the same. Several examples here (that use different kinds of cuts):

https://vp.nnpdf.science/sK8VI_HITF-hrg2Cx-_lPQ==/
https://vp.nnpdf.science/GMXWE_wgS-Os5pbbsNA4gA==/
https://vp.nnpdf.science/rm4Vqm7zQwW8m6v5O1c_4A==/

However, some comments:

Point (2.) suggests to me that the lru_cache could instead be put at the level of the covariance matrix, since this is the only thing the t0 predictions are used for, and for a given list of datasets, cuts and a given t0 set the covariance matrix should be the same. This might be trickier to implement though, and if this already gives a good speed-up it is maybe not worth it.

from validphys.api import API
from validphys.convolution import central_predictions

# Same dataset and theory, but cuts taken from two different fits.
dsname = "NMCPD_dw_ite"
ds = API.dataset(dataset_input={"dataset": dsname}, theoryid=400, use_cuts="fromfit", fit="230718-jcmwfaser-001")
ds2 = API.dataset(dataset_input={"dataset": dsname}, theoryid=400, use_cuts="fromfit", fit="230725-jcm-nnpdf40-002")
pdf = API.pdf(pdf="NNPDF40_nnlo_as_01180")
cv = central_predictions(ds, pdf)
cv2 = central_predictions(ds2, pdf)
# Different cuts give a different number of points, so the cached datasets were not mixed up.
print(cv.shape[0] != cv2.shape[0])
> True

I've manually changed the cuts in both fits to make sure they were different.
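
Coming back to point (2.): a very rough sketch of what caching at the level of the covariance matrix could look like. All names and the signature are hypothetical (the real t0 covmat construction in validphys takes richer objects than plain strings); the point is only that lru_cache can key on a hashable (datasets, cuts, t0 set) combination.

from functools import lru_cache

import numpy as np

def _build_t0_covmat(dataset_names, cuts_id, t0pdf_name):
    # Stand-in for the expensive t0-prediction and covmat construction.
    rng = np.random.default_rng(0)
    a = rng.normal(size=(10, 10))
    return a @ a.T

@lru_cache(maxsize=8)
def cached_t0_covmat(dataset_names: tuple, cuts_id: str, t0pdf_name: str):
    # All arguments are hashable, so the covmat is built only once per combination.
    return _build_t0_covmat(dataset_names, cuts_id, t0pdf_name)

cached_t0_covmat(("NMC", "SLACP"), "internal", "NNPDF40_nnlo_as_01180")
cached_t0_covmat(("NMC", "SLACP"), "internal", "NNPDF40_nnlo_as_01180")
print(cached_t0_covmat.cache_info().hits)  # 1: the second call is a cache hit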

Re: the tests, I think you are seeing the problem that was fixed in #1805.

@goord (Collaborator, author) commented Oct 24, 2023:

It looks like something is still wrong. I did a test with a single epoch and 10 parallel replicas, and it seems to me that the difference between the baseline (sequential replicas) and the parallel case grows with the replica number:

grep -i "Best fit for replica" master/seq-cpu/nnpdf-cpu-10.log
[INFO]: Best fit for replica #1, chi2=12155.363 (tr=15396.849, vl=3370.149)
[INFO]: Best fit for replica #2, chi2=52.119 (tr=56.119, vl=78.265)
[INFO]: Best fit for replica #3, chi2=44.637 (tr=49.594, vl=67.921)
[INFO]: Best fit for replica #4, chi2=1759.277 (tr=1103.458, vl=3076.298)
[INFO]: Best fit for replica #5, chi2=181.414 (tr=190.806, vl=254.887)
[INFO]: Best fit for replica #6, chi2=880.913 (tr=1119.586, vl=281.592)
[INFO]: Best fit for replica #7, chi2=9869.230 (tr=12839.809, vl=1296.612)
[INFO]: Best fit for replica #8, chi2=136.152 (tr=144.090, vl=194.078)
[INFO]: Best fit for replica #9, chi2=351.785 (tr=424.573, vl=156.781)
[INFO]: Best fit for replica #10, chi2=70873.164 (tr=14250.999, vl=225323.688)

grep -i "Best fit for replica" trvl-mask/seq-cpu/nnpdf-cpu-10.log 
[INFO]: Best fit for replica #1, chi2=12155.471 (tr=15396.991, vl=3370.150)
[INFO]: Best fit for replica #2, chi2=52.119 (tr=56.119, vl=78.265)
[INFO]: Best fit for replica #3, chi2=44.637 (tr=49.594, vl=67.921)
[INFO]: Best fit for replica #4, chi2=1759.277 (tr=1103.458, vl=3076.298)
[INFO]: Best fit for replica #5, chi2=181.414 (tr=190.806, vl=254.887)
[INFO]: Best fit for replica #6, chi2=880.913 (tr=1119.586, vl=281.592)
[INFO]: Best fit for replica #7, chi2=9869.230 (tr=12839.809, vl=1296.612)
[INFO]: Best fit for replica #8, chi2=136.152 (tr=144.090, vl=194.078)
[INFO]: Best fit for replica #9, chi2=351.789 (tr=424.577, vl=156.781)
[INFO]: Best fit for replica #10, chi2=70875.047 (tr=14250.945, vl=225331.047)

grep -i "Best fit for replica" trvl-mask/par-cpu/nnpdf-cpu-10.log 
[INFO]: Best fit for replica #1, chi2=12155.734 (tr=15381.599, vl=3383.293)
[INFO]: Best fit for replica #2, chi2=52.516 (tr=56.810, vl=78.048)
[INFO]: Best fit for replica #3, chi2=43.484 (tr=47.829, vl=69.136)
[INFO]: Best fit for replica #4, chi2=1679.792 (tr=1526.910, vl=1633.075)
[INFO]: Best fit for replica #5, chi2=200.910 (tr=216.563, vl=258.565)
[INFO]: Best fit for replica #6, chi2=2451.295 (tr=3158.492, vl=608.226)
[INFO]: Best fit for replica #7, chi2=2967.282 (tr=3846.834, vl=671.785)
[INFO]: Best fit for replica #8, chi2=1429.978 (tr=1877.560, vl=221.299)
[INFO]: Best fit for replica #9, chi2=553.733 (tr=642.818, vl=245.547)
[INFO]: Best fit for replica #10, chi2=123811.469 (tr=144675.531, vl=52737.551)

I expect some roundoff errors, but certainly not a systematic increase per replica. BTW, for the sequential mode the chi2 values are identical to the master branch.

edit: a quick check with the basic parallel runcard shows no significant differences across replicas, so the problem appears to occur only for the NNPDF4.0 runcard...

@Radonirinaunimi (Member) commented:

As already mentioned, I don't expect the numbers at a given epoch to be fully consistent/close. In the parallel-replicas case, even a very small difference at the start of the fit could lead to fluctuations during the training. It is not the case that all the changes here keep the numerical values exactly the same. So the proper way to compare the results is by running proper fits. At the very least, for now we should compare: master sequential vs trvl-mask sequential, and master sequential vs trvl-mask parallel.

Does this make sense?

@RoyStegeman (Member) commented Oct 24, 2023:

Well, I agree with @goord that the pattern is curious. If all the seeds are the same at the start of each replica, then the result should be the same for the sequential and parallel cases. It may be that there is some numerical instability, or perhaps the seeds are not the same because e.g. the optimizer/NN-initialization/trvl-split/whatever seed doesn't get reset for the parallel case(?). But then I still don't understand why not only the first but also the second and third replicas agree quite well, while this agreement has clearly deteriorated by replicas 9 and 10. If the seeds were not the same, I'd expect the deterioration at replica 2 to be equivalent to all later differences...

Also, the edit saying that it's a problem for the NNPDF4.0 runcard but not for the basic runcard is odd. Is debug always False when doing the comparison for the basic runcard?

@scarlehoff (Member) commented Oct 24, 2023:

In this case, however, this could point to a bug in how the fit is being stopped. The first replica is stopped correctly, the second one ends a bit farther away from the optimal point, and so on and so forth.
For instance, more than the numerical difference, I'd be worried that for replica 8 chi_tr < chi_vl in the sequential case but chi_tr > chi_vl in the parallel case.

The random seeds are the other thing that comes to mind; as @RoyStegeman said, it would make sense that they are the same for the 1st replica but then different for the rest. But I also fail to see why they should be more different for replica 5 than for replica 2 (and the first few replicas are very close...).

The basic_runcard has same_trvl_per_replica: True; did you remove that option? Because if you didn't, that might be pointing to the source of the bug.

@goord (Collaborator, author) commented Oct 24, 2023:

Well, my first suspicion was that somewhere along the losses or observables tensors somehow get accumulated along the replica dimension, but then again the basic runcard seems to reproduce the sequential fit...

@RoyStegeman (Member) commented:

But are all the important settings the same for the basic runcard and the NNPDF4.0 runcard? I.e. are the changes between the basic runcard you use and the NNPDF4.0 runcard limited to the choice of datasets, seeds (same seeds, just different values), and preprocessing, while keys such as debug and same_trvl_per_replica are the same in the two cases? If they are all the same, do you again find the divergence of results if all you do is add more datasets to the basic runcard?

I'd like to understand what the important difference is between the runcards that causes the different behaviour.

@goord (Collaborator, author) commented Oct 24, 2023:

Both have DIS and DY datasets, positivity constraints, and same_trvl_per_replica set to false to test the new functionality. The NNPDF4.0 runcard has integrability layers though, which the basic runcard doesn't have.

@goord (Collaborator, author) commented Oct 30, 2023:

Regarding the single-epoch reproducibility test:

  • On our laptops, both sequential and parallel modes seem to produce the same result (modulo small numerical noise).
  • On the Snellius cluster, even the master branch (with the same tr/vl split for all replicas) gives different results for the single-epoch sequential and parallel fits with 10 replicas.

(edit) Using TensorFlow 2.10 instead of 2.11 solves the reproducibility issue on the cluster too.
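
Not part of this PR, but for reference: a minimal, hypothetical reproducibility probe that can help separate TF-version nondeterminism from genuine bugs in the masking code. It uses tf.keras.utils.set_random_seed (available from TF 2.7) and tf.config.experimental.enable_op_determinism (available from TF 2.9); the toy model and data are made up.

import numpy as np
import tensorflow as tf

def deterministic_run(seed=0):
    tf.keras.utils.set_random_seed(seed)            # seeds the Python, NumPy and TF RNGs
    tf.config.experimental.enable_op_determinism()  # forces deterministic kernels
    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(4, activation="tanh"), tf.keras.layers.Dense(1)]
    )
    model.compile(optimizer="adam", loss="mse")
    x = np.random.default_rng(seed).normal(size=(128, 3))
    y = x.sum(axis=1, keepdims=True)
    model.fit(x, y, epochs=1, batch_size=16, verbose=0)
    return model.predict(x, verbose=0)

# Two runs with the same seed should be bit-identical on a deterministic stack.
print(np.array_equal(deterministic_run(0), deterministic_run(0)))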

@Radonirinaunimi (Member) commented:

> (edit) Using TensorFlow 2.10 instead of 2.11 solves the reproducibility issue on the cluster too

Great! Did you understand why so?

PS: Could you please run black on the modified files?

@scarlehoff mentioned this pull request on Nov 17, 2023.
@goord (Collaborator, author) commented Nov 28, 2023:

Over the weekend I did an extensive set of regression tests between master (rev. db8b790) and trvl-mask-layers branch (latest rev.). To my untrained eye, results look more or less identical:

Runcard comparisons (100 replicas):

  • NNPDF4.0, master vs. trvl-mask-layers, sequential CPU: https://vp.nnpdf.science/I6rdrAFsQ6Gah3fJ1Vf2rQ==
  • NNPDF4.0, master vs. trvl-mask-layers, parallel CPU: https://vp.nnpdf.science/IX_nDzOGSfaclCah74K-rQ==
  • NNPDF4.0, master vs. trvl-mask-layers, parallel GPU: https://vp.nnpdf.science/3JOhq50GRqizFGwZPdRfUQ==
  • feature-scaling, master vs. trvl-mask-layers, sequential CPU: TBD
  • feature-scaling, master vs. trvl-mask-layers, parallel CPU: https://vp.nnpdf.science/vRpWyCzARWKKs8U3PR733Q==
  • feature-scaling, master vs. trvl-mask-layers, parallel GPU: https://vp.nnpdf.science/atG4x7_xQNSP7bl0xJSSGg==
  • flavour-basis, master vs. trvl-mask-layers, sequential CPU: TBD
  • flavour-basis, master vs. trvl-mask-layers, parallel CPU: https://vp.nnpdf.science/RD_jmIzrQtKZ3qfoLikrlw==
  • flavour-basis, master vs. trvl-mask-layers, parallel GPU: https://vp.nnpdf.science/cc60ST-4QFGzIk9SKGV0vw==

@Radonirinaunimi linked an issue on Nov 28, 2023 that may be closed by this pull request.
@Radonirinaunimi (Member) commented:

Thanks a lot @goord for these comparisons! They look great; as you said, the results are statistically identical. I also see that you've found a way around the TF-version-related reproducibility issue (?). I guess that after some clean-ups (and black) this is finally ready for reviews?

@goord (Collaborator, author) commented Nov 29, 2023:

TODO: the observation data rotation is currently not working correctly, because it applies a masked rotation matrix after the masking layer. This should change to applying the unmasked rotation to the observation data before the masking.
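
A toy illustration of why the ordering matters (hypothetical sizes and NumPy instead of the actual n3fit layers): rotating the full observation vector and then masking is not equivalent to masking first and applying a rotation matrix whose rows and columns have been masked.

import numpy as np

rng = np.random.default_rng(0)
ndata = 5
data = rng.normal(size=ndata)
rotation = np.linalg.qr(rng.normal(size=(ndata, ndata)))[0]  # some orthogonal rotation
mask = np.array([True, False, True, True, False])

# What the TODO asks for: rotate the unmasked observations, then mask.
rotate_then_mask = (rotation @ data)[mask]

# The problematic order, schematically: mask first, then apply the masked rotation.
mask_then_rotate = rotation[np.ix_(mask, mask)] @ data[mask]

print(np.allclose(rotate_then_mask, mask_then_rotate))  # False in general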

@scarlehoff (Member) commented:

When you try to run DIS_diagonal_l2reg_example.yml it might be that not all datasets declared there exist anymore. You might need to remove some of them to test the runcard (as mentioned earlier today... it would be good to have all examples tested...)

@Cmurilochem mentioned this pull request on Dec 11, 2023.
@goord (Collaborator, author) commented Dec 21, 2023:

Refactored the LossInvCovmat to handle the diagonal case in an optimized way. I get no significant numerical changes after 4000 epochs.

I was running the black tool, but I get a whole bunch of files I didn't touch that need reformatting... Should I ignore those?
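
As a generic sanity check of the diagonal optimization mentioned above (a NumPy sketch, not the actual LossInvCovmat code): when the inverse covariance matrix is diagonal, the chi2 reduces to a weighted sum of squared residuals, so the full matrix contraction can be skipped.

import numpy as np

rng = np.random.default_rng(1)
ndata = 100
residuals = rng.normal(size=ndata)
diag = rng.uniform(0.5, 2.0, size=ndata)       # diagonal of the inverse covariance
invcovmat = np.diag(diag)

chi2_full = residuals @ invcovmat @ residuals  # generic quadratic form
chi2_diag = np.sum(diag * residuals**2)        # optimized diagonal-only path

print(np.isclose(chi2_full, chi2_diag))  # True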

@APJansen (Collaborator) commented:

Is this ready for the redo-regressions?

@scarlehoff (Member) replied:

No, we asked for a lot of stuff to be changed. I'd need to redo the review.

Resolved review threads:
  • n3fit/src/n3fit/layers/losses.py (outdated)
  • n3fit/src/n3fit/layers/losses.py (outdated)
  • n3fit/src/n3fit/model_gen.py (outdated)
  • n3fit/src/n3fit/model_trainer.py
  • n3fit/src/n3fit/model_trainer.py (outdated)
  • n3fit/src/n3fit/model_trainer.py (outdated)
  • n3fit/src/n3fit/model_trainer.py (outdated)
  • n3fit/src/n3fit/performfit.py (outdated)
  • validphys2/src/validphys/n3fit_data.py (outdated)
  • validphys2/src/validphys/n3fit_data.py (outdated)
APJansen and others added 5 commits on February 21, 2024 (co-authored by Juan M. Cruz-Martinez <juacrumar@lairen.eu>).
@APJansen (Collaborator) commented:

I've fixed what I could, but I also find anything related to the preprocessing of data very confusing.

@APJansen (Collaborator) commented:

I noticed I introduced a bug in 49904a3 (it took the number of replicas from the batch axis); I just pushed a fix.

@APJansen (Collaborator) commented:

Is this now ready for redoing the regressions (once tests pass)? It would be great if we can get this merged, and perhaps even #1936, so that we can turn on some faster hyperopt runs over the weekend to get some more data.

@scarlehoff (Member) left a review comment:

Just a final comment from me :P

Review thread on n3fit/src/n3fit/model_trainer.py (outdated, resolved).
@APJansen added the redo-regressions label (Recompute the regression data) on Feb 22, 2024.
@APJansen (Collaborator) commented:

@scarlehoff About the flattening, indeed I guess it was the covmat? You wrote that you would change your suggestion to a comment but then resolved it; should I add a comment there or not?

@scarlehoff merged commit faadba0 into master on Feb 22, 2024; 8 checks passed.
@scarlehoff deleted the trvl-mask-layers branch on February 22, 2024 at 12:32.
Labels: escience, redo-regressions (Recompute the regression data)

Linked issue that may be closed by this pull request: trvl-mask-layers parallel NNPDF4.0 fit broken

6 participants