
Hyperopt loss #1726

Merged
merged 20 commits into master from hyperopt_loss on Mar 7, 2024
Conversation

APJansen
Collaborator

@APJansen APJansen commented May 4, 2023

Improving hyperoptimization, experimenting with different hyperoptimization loss functions

Tasks done in this PR

  • Added a new HyperLoss class with built-in methods that can automatically perform statistics over replicas and then over folds. The user can select these statistics via replica_statistic and fold_statistic in the runcard's kfold section.
  • Added a new statistic over replicas, $\varphi^2_{k}$, that (in addition to $\chi^{2}$) can be selected via the loss_type option in the runcard's kfold section.
  • Addressed Printing Multi-statistics to hyperopt json files #1894. Regardless of the metric selected for Hyperopt to minimise ($\varphi^2$ or $\chi^{2}$), we also print in tries.json (specifically within the kfold_meta entry) a matrix (folds x replicas) of calculated $\chi^{2}$ values named hyper_losses_chi2 and a vector (folds) of $\varphi^2$ values named hyper_losses_phi2.

Description

The implemented HyperLoss is instantiated within ModelTrainer and later used in ModelTrainer.hyperparametrizable. The user must pass three parameters that are set in the runcard:

  • loss_type: The type of loss to be used. Options are chi2 or phi2.
  • replica_statistic: The statistic over replicas to be used within each fold. For loss_type = chi2, it can take the usual statistics: average, best_worst and std. Note: replica_statistic is inactive if loss_type = phi2, as $\varphi^2_{k}$ is by definition a statistic over replicas.
  • fold_statistic: The statistic over folds. Options are: average, best_worst and std.
  • As discussed by @juanrojochacon, the calculation of any $\varphi^2$ statistic over folds is done with the reciprocal of the chosen function. For example, for loss_type: phi2 and fold_statistic: average, the figure of merit to be minimised is actually:
    $$\Huge \left( \frac{1}{n_\text{fold}} \sum_{k=1}^{n_\text{fold}} \varphi^2_{k} \right)^{-1} $$

The current implementation of $\varphi^2_{k}$ is based on validphys functions. It is evaluated using only the experimental data within the held-out fold (as expected).
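For concreteness, a minimal numpy sketch of this figure of merit (hypothetical values, purely illustrative of the formula rather than the implemented code):

    import numpy as np

    # Hypothetical per-fold values of phi^2_k for n_fold = 4
    phi2_per_fold = np.array([0.8, 1.1, 0.9, 1.0])

    # loss_type: phi2 with fold_statistic: average -> reciprocal of the average over folds
    hyper_loss = 1.0 / phi2_per_fold.mean()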

Runcard examples

  • Default run: hyper_loss is set as the $\chi^2$ averaged over replicas and then over folds.
kfold:
      loss_type: chi2
      replica_statistic: average
      fold_statistic: average
      penalties:
        - saturation
        - patience
        - integrability
...
  • Setting $\varphi^2_{k}$ within each $k$-fold and then setting hyper_loss as the inverse of the maximum of $[\varphi^2_{1}, \varphi^2_{2}, ..., \varphi^2_{k}]$:
kfold:
      loss_type: phi2
      fold_statistic: best_worst
      penalties:
        - saturation
        - patience
        - integrability
...

Notes

It must be merged after #1788 as the current hyperopt_loss branch has been created from trvl-mask-layers.

@RoyStegeman RoyStegeman mentioned this pull request Aug 30, 2023
@APJansen
Collaborator Author

APJansen commented Sep 6, 2023

Description of what happens with hyperoptimization

For the benefit of my future self and my colleagues @goord and @Cmurilochem, I describe here exactly what happens during hyperoptimization. It mainly describes how things work in master; so far this branch only changes it in two places, as indicated.

Single Trial

A "trial" is the entire evaluation of one set of hyperparameters. One trial in the code is one execution of the function ModelTrainer.hyperparametrizable.

Folds

Trials are scored by their hyper_loss, a validation loss computed using k-folding.
This means that the datasets are partitioned into k (=4) parts. This is done manually to ensure that all folds cover the same kinematic range. It is configured in the runcard.

For each fold, the datasets in that fold are left out completely, and the model is trained as normal on the remaining k-1 folds.
Construction of the models for each fold happens here. The training and validation model both use the datasets in the k-1 remaining folds, and have a different randomized per-dataset mask. The experimental model uses the held out fold, without any mask. It is this model that determines the hyper_loss.

The model is trained on the k-1 folds (masked with the training mask).

Early exit 1
The training is said to have passed if any replica has a best epoch. This will only be false if, for every replica and every epoch, the validation chi2 is worse than at the previous epoch (technically, if passes here is always false). If this check fails, the trial is exited early and given an infinite loss.

Then the hyper losses are computed per replica as the loss on the held-out fold, and the penalties are added (this branch fixes this to be done per replica; it was being overcounted).

Early exit 2
If the mean per-replica loss is above a threshold set in the runcard (the threshold key under kfold), the trial is again exited early and given an infinite loss.

Loss

We get l_hyper, an array of losses of shape (folds, replicas).
This is passed into a HyperLoss object (implemented in this branch), which computes the specified statistics first over replicas, then over folds.
This is then put into the loss key of the returned dictionary.
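As a rough illustration of that reduction order (a numpy sketch with made-up numbers, not the actual HyperLoss code):

    import numpy as np

    # l_hyper: per-replica losses on the held-out folds, shape (folds, replicas)
    l_hyper = np.random.default_rng(0).uniform(1.0, 3.0, size=(4, 10))

    per_fold = l_hyper.mean(axis=1)      # replica_statistic: average, applied within each fold
    hyper_loss = per_fold.mean(axis=0)   # fold_statistic: average, applied over folds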

Managing trials

Results are saved in runcard/nnfit/replica_1/tries.json, for all replicas (other replica folders are empty). It is a list of dictionaries, one for each trial. The entire dictionary returned by hyperparametrizable is under the "result" key.

The trials start from the function hyper_scan_wrapper, called in performfit. The best parameters are printed, and it seems retraining a model with those parameters is to be done manually.

Inside hyper_scan_wrapper, the hyperopt library's fmin function is called.
The hyperoptimization algorithm used is a Tree-structured Parzen Estimator (TPE), described in section 4 of this paper.
It can handle hierarchical hyperparameters, e.g. the number of units needs to be chosen for each layer while the number of layers can itself vary. I didn't read it in any detail, but it does use previous results to suggest new trials.
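For reference, driving hyperopt with TPE typically looks roughly like this (a generic sketch of the library's API, not the actual hyper_scan_wrapper; the space and objective are placeholders):

    from hyperopt import STATUS_OK, Trials, fmin, hp, space_eval, tpe

    # Placeholder search space and objective; in n3fit the objective role is played by
    # ModelTrainer.hyperparametrizable, which returns a dict containing a "loss" key.
    space = {
        "nodes_per_layer": [hp.quniform(f"units_{i}", 5, 50, 1) for i in range(3)],
        "learning_rate": hp.loguniform("learning_rate", -9, -4),
    }

    def objective(params):
        return {"loss": sum(params["nodes_per_layer"]), "status": STATUS_OK}

    trials = Trials()
    best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=10, trials=trials)
    print(space_eval(space, best))  # maps the winning labels back to actual parameter values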

Possible issue

On a test with 2 trials varying only the nodes per layer, it tested (27,12,8) and then (30,36,8), but the best parameters printed were the default (25,20,8). I'm not sure what exactly happens in the final call to space_eval.

Some thoughts on improvements

Parallelizing trials, checkpointing

Trials can be run in parallel using MongoDB, as described here.
It seems this could also be used to enable continuing from a previous run as discussed in issue #1800.
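In hyperopt's documented workflow this amounts to swapping Trials for MongoTrials and launching workers separately (a sketch with a placeholder connection string and a trivial objective, not something implemented in this PR):

    from hyperopt import fmin, hp, tpe
    from hyperopt.mongoexp import MongoTrials

    def objective(x):
        # Trivial placeholder; in n3fit one trial would run a full k-fold fit.
        return x ** 2

    # Workers are started separately on each node, e.g.
    #   hyperopt-mongo-worker --mongo=localhost:27017/hyperopt_db --poll-interval=1
    trials = MongoTrials("mongo://localhost:27017/hyperopt_db/jobs", exp_key="exp_1")
    best = fmin(fn=objective, space=hp.uniform("x", -5, 5),
                algo=tpe.suggest, max_evals=50, trials=trials)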

K-folding

When using K-folding, we now have 3 classes of data at every fold:

  1. training data: in the k-1 folds that ARE used, seen during training
  2. validation data: in the k-1 folds that ARE used, not seen during training
  3. "hyper-validation" (not sure what to call it) data: in the held out fold

Neither 2 nor 3 is seen during training. The difference in how they are used is that 2 is used for "early exit 1" above, while 3 is used to compute the hyper loss. If that were all, I'd say 2 is wasted, as that condition is very unlikely to trigger even with one replica.
But actually 2 is also used for early stopping. So we do want to keep all 3, as combining 2 and 3 would introduce a bias.

We also discussed a long time ago that perhaps K-folding won't be necessary anymore. The point of it is to have a more accurate validation loss, but if we run with, say, 100 replicas, it is already more accurate. We would still need 3 groups of data to avoid the bias of joining 2 and 3; however, they could be split randomly per dataset (and per replica). The advantage would be that it would require only a single run per trial (slightly bigger, but still nearly a 4x speedup) and would simplify the code. I'm not sure, though, what the effect on the learning would be.

Code improvements

  • It would be nice to move as much as possible of ModelTrainer.hyperparametrizable into one of the hyperopt modules
  • Models are recreated from scratch every fold, while PDFs only need to be re-initialized. In fact, the PDFs could be generated only once per hyperopt run (then copied and re-initialized), and the observables only once per fold (then copied and re-initialized for each trial).
  • It seems that the best hyperparameters are just printed, and not saved anywhere?
  • Maybe we want to add (optionally) rerunning a training run without K-folding using the best parameters.

@Cmurilochem
Collaborator

Cmurilochem commented Nov 22, 2023

Implementation of a new metric

As a continuation of the work by @APJansen, I intend to add a new hyperoptimization metric to the HyperLoss class, $\varphi^{2}$ [Eq. (4.6) of the NNPDF3.0 paper and #1849]. $\varphi^{2}$ is defined for each $k$-fold as
$$\Huge \varphi_{\chi^2_k}^2 = \langle \chi^2_k [ \mathcal{T}[f_{\rm fit}], \mathcal{D} ] \rangle_{\rm rep} - \chi^2_k [ \langle \mathcal{T}[f_{\rm fit}] \rangle_{\rm rep}, \mathcal{D} ] $$

where the first term represents our usual averaged-over-replicas hyper loss, $\chi^2_k$, calculated from the dataset used in the fit ($\mathcal{D}$) and the theory predictions of each fitted PDF replica ($f_{\rm fit}$). The second term involves the same hyper loss, but now evaluated using the theory predictions of the averaged-over-replicas PDF.

Refactoring HyperLoss class

After a fruitful discussion with @goord, we agreed that, before implementing $\varphi^{2}$, it would be wise to refactor the HyperLoss class a bit. The idea is to give HyperLoss the responsibility of properly calculating the hyper losses by passing it all the information needed to do so (a hypothetical sketch of such an interface is given after the list), e.g.:

  • models["experimental"]
  • pdf_models
  • self.experimental["output"]
  • chosen statistics (average, best_worst, std or phi) over replicas and folds
  • penalties (saturation, patience, and/or integrability) to be added
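A hypothetical sketch of what such an interface could look like (all names, arguments and defaults here are assumptions for illustration, not the implemented API):

    # Hypothetical interface sketch only; not the code in this PR.
    class HyperLoss:
        def __init__(self, loss_type="chi2", replica_statistic="average",
                     fold_statistic="average", penalties=None):
            self.loss_type = loss_type
            self.replica_statistic = replica_statistic
            self.fold_statistic = fold_statistic
            self.penalties = penalties or []

        def compute_loss(self, experimental_model, pdf_models, experimental_output):
            """Return the loss for one fold: the chosen statistic over replicas
            (or phi2), with the selected penalties added."""
            raise NotImplementedError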

@APJansen
Collaborator Author

We talked again about refactoring this, and we decided to keep the changes minimal for now, just to allow implementation of the phi2 loss. We can refactor the code about hyperoptimization inside model_trainer later, once we have things running.

We made some assumptions, listed below; are these correct @scarlehoff?
The computation of the phi2 itself is also still missing; that part is still quite tricky.
For now we assumed that there is a function compute_phi2 in vpinterface.py that takes n3pdfs and a dictionary of data and covmats, but we didn't implement anything. We extract these from self.experimental["output"], but we still need to convert them to the arguments that results expects, i.e. DataSetSpec etc.
Is that right? You mentioned in #1849 that you would come back to it; I think some help here would be useful if you know more.

Assumptions

Previously, HyperLoss was implemented under the assumption that we can first compute all losses per replica and per fold, and at the end compute a statistic first over the replicas and then over the folds.
This is not true for phi2, as the chi2 of the central PDF cannot be obtained from any statistic over the chi2's of the individual PDFs.

So now we assume instead that the losses are first computed per fold, so some aggregation over replicas has already happened. This is done in the training loop, as the trial is exited early if the loss is too high.
Then we assume that after a trial we just take some statistic over the folds, as before. We assume this is done for chi2 as well as for phi2.

And then there are also the penalties, which are currently all computed per replica. We assumed for now that this is general enough, and that we always want to take an average over the per-replica penalties and add it to the loss.

@scarlehoff
Member

You mentioned in #1849 that you would come back to it, I think some help here would be useful if you know more.

Ah, since there were no questions I thought that the example I put there was enough. Not sure what to add (i.e., not sure what the showstoppers are?)

For now we assumed that there is a function compute_phi2 in vpinterface.py that takes n3pdfs and a dictionary of data and covmat, but we didn't implement anything. We extract these from self.experimental["output"], but still need to convert these to the arguments that result expects, so DataSetSpec etc.

The thing is that, while n3fit converts everything into dictionaries, everything originally comes from DataSetSpecs, so you can also pass those to the input. I can work on that later on, but first I'd rather merge everything else (I'm going to test #1818 today).

Regarding the assumptions:

Previously HyperLoss was implemented with the assumption that we can first compute all losses per replica and per fold, and at the end compute first a statistic over the replicas, and then over the folds.
This is not true for phi2, as the chi2 of the central pdf cannot be obtained by any statistic over the chi2's of individual pdfs.

No. But at the end of the fold (when you have an ensemble of models) you can easily create another model which is the average of all the previous models (ideally discarding outliers) over the replica axis. With that you can compute the phi.
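In Keras terms this could be as simple as appending a mean over the replica axis to the multi-replica model (a hypothetical sketch; the output shape and axis index are assumptions, and outlier removal is omitted):

    import tensorflow as tf

    def central_model(multi_replica_model, replica_axis=1):
        # Assumes the multi-replica model outputs shape (batch, replicas, ndata);
        # the "central" model is its average over the replica axis.
        averaged = tf.keras.layers.Lambda(
            lambda x: tf.reduce_mean(x, axis=replica_axis)
        )(multi_replica_model.output)
        return tf.keras.Model(inputs=multi_replica_model.inputs, outputs=averaged)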

And then there are also penalties, which currently are all computed per replica. We assumed for now that this is general enough, and that we always want to take an average over the per-replica penalties, and add this to the loss.

Yes. Let's not care about the penalties for the time being. I don't think they are needed.

@Cmurilochem
Collaborator

Cmurilochem commented Nov 30, 2023

Start implementation of $\varphi^{2}$

Continuing on #1849, I am proceeding with a provisional implementation of compute_phi2 in order to calculate $\varphi^{2}_k$ within each $k$-fold.

Implementation ideas (also in #1849)

  • With the help of @scarlehoff I realised that validphys passes an experiments_data variable to performfit. It contains a list of DataGroupSpec objects, one for each group of experimental data, that can be used directly by validphys.results functions to obtain $\varphi$.
  • I am now passing experiments_data as an argument when instantiating ModelTrainer inside performfit.
  • With the self.experiments_data and self.exp_info attributes of ModelTrainer, I create inside ModelTrainer.hyperparametrizable a list of tuples holding the validphys.core.DataGroupSpec representation of each group of experimental data and the associated covariance matrices.
  • This list and the N3PDF object are then used by HyperLoss.compute_loss (designed by @APJansen) and passed as arguments to compute_phi2 in vpinterface.py. This allows us to calculate the sum of $\varphi^2$ by resorting to the built-in validphys.results functions: results, abs_chi2_data and phi_data.

Questions

  • The current calculation of $\varphi^2_{k}$ for each $k$-fold uses all input experimental data sets. Would the correct calculation include only the experimental data of the held-out fold?
  • I am now using the multi-replica N3PDF objects to obtain $\varphi^2_{k}$ as expected by results. I noticed that there was some discussion yesterday about _set_central_value. This has also been discussed by @scarlehoff above. It is still not very clear to me what the difference would be between results calculated using validphys central values and those obtained by setting the central PDF to PDF_0 as in _set_central_value.
  • Need to discuss what to do with penalties.
  • We added new options to the runcard's kfold: section that select the type of loss ($\varphi^2_{k}$ or $\chi^2_{k}$) and the type of replica/fold statistic, e.g.,
kfold:
      loss: phi2
      replica_statistic: average
      fold_statistic: average
      target: average
      .....

I see some problems here. First, phi2 is a replica statistic in its own right and is not compatible with replica_statistic: average. This could be dealt with internally by adding some constraints and warnings. Second, I am not entirely sure whether fold_statistic and target represent the same thing; I suspect so, but would appreciate any discussion on that.

@scarlehoff, @RoyStegeman, @APJansen and @goord, any thoughts on this?

How $\varphi$ is calculated by validphys

1. results function

When passing a validphys.core.DataGroupSpec, an N3PDF object, the experimental covmat and its sqrt_covmat,

res = results(groupdataset, n3pdfs, covmat, sqrt_covmat(covmat))

the results function will output a tuple composed of:

  • an instance of DataResult which holds the relevant information for a given group exp data set
  • an instance of ThPredictionsResult which holds the corresponding theory predictions from the multi-replica PDF. The attribute ThPredictionsResult.rawdata stores such theory predictions for each replica.

2. abs_chi2_data function

This function uses the output of results (the above tuple) to calculate:

  • all_chi2: this corresponds to the usual chi2 calculated by the sum of the squared differences between the predicted ThPredictionsResult.rawdata and central experimental observables for each replica and group exp dataset. all_chi2 is then used to instantiate ThPredictionsResult.stats_class which in our case is always N3Stats ; see also here.
  • central_chi2: Differently from all_chi2, this corresponds to the chi2 calculated by the sum of squared differences between ThPredictionsResult.central_value, i.e., average of the predictions of each replica PDF, and the central experimental observables for each group exp dataset.

These calculated quantities are returned, together with the number of experimental data points, in the form of a namedtuple named Chi2Data.

3. phi_data function

Finally, phi_data just uses the output of abs_chi2_data. It calculates $\varphi$ by the following steps (a small numpy sketch follows the list):

  1. averaging all_chi2 over replicas using N3Stats.error_members().mean()
  2. subtracting central_chi2 from <all_chi2> and dividing the result by the number of experimental data points within the group dataset
  3. taking the sqrt of [(<all_chi2> - central_chi2)/number of data points]
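In terms of arrays, those steps amount to something like the following (a numpy sketch with hypothetical numbers, not the validphys code itself):

    import numpy as np

    # Hypothetical inputs for one group of experimental data
    all_chi2 = np.array([120.0, 131.0, 125.5, 128.0])    # chi2 per replica
    central_chi2 = 118.0                                  # chi2 of the replica-averaged prediction
    ndata = 100                                           # number of data points in the group

    mean_chi2 = all_chi2.mean()                           # step 1: <all_chi2> over replicas
    phi = np.sqrt((mean_chi2 - central_chi2) / ndata)     # steps 2 and 3
    phi2 = phi ** 2                                       # the per-fold metric discussed above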

@Cmurilochem
Collaborator

Cmurilochem commented Dec 5, 2023

Question regarding added tests

@scarlehoff, I added new tests into test_hyperopt.py.

To test the calculation of $\varphi$, I am generating mocked pdf_models by providing random seeds; see here. However, even when providing the same seeds to pdfNN_layer_generator, I am not able to obtain the same test result when I run it on my local computer and in the CI/CD. Do you think this could be due to the different TensorFlow versions? If not, how could I guarantee that the weights/preprocessing exponents of my pdf_models are exactly the same?

@scarlehoff
Member

even when providing the same seeds ...

Make sure the seeds are also fixed for numpy. Note that when the fit is set to debug we do all this

os.environ.setdefault("HYPEROPT_FMIN_SEED", str(seed))
to ensure reproducibility. You might need some of these steps in your test as well.

However, while the tests should be robust on Linux (they empirically seem to be), I'm not so sure about Mac M1 (i.e., if you are locally running on a Mac and then trying to test in the CI... not sure what will happen).
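For a test, this typically means fixing every source of randomness before building the mock pdf_models, along these lines (a sketch; exactly which calls are needed depends on what the test exercises):

    import os
    import random

    import numpy as np
    import tensorflow as tf

    def fix_seeds(seed=42):
        # Hyperopt's own sampling
        os.environ.setdefault("HYPEROPT_FMIN_SEED", str(seed))
        # Python, numpy and TensorFlow random number generators
        random.seed(seed)
        np.random.seed(seed)
        tf.random.set_seed(seed)

Even with all seeds fixed, results may still differ across TensorFlow versions or hardware, consistent with the Linux/Mac caveat above.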

@Cmurilochem
Collaborator

even when providing the same seeds ...

Make sure the seeds are also fixed for numpy. Note that when the fit is set to debug we do all this

os.environ.setdefault("HYPEROPT_FMIN_SEED", str(seed))

to ensure reproducibility. You might need some of these steps in your test as well.
however while the tests should be robust in linux (they empirically seem to be) I'm not so sure about mac m1 (i.e., if you are locally running on a mac and then trying to test in the ci... not sure what will happen)

Thanks @scarlehoff. I have added a custom set_initial_state to be used together with generate_pdf in test_hyperopt.py. After some unsuccessful attempts, I decided to add only property-based tests for $\varphi$ for the moment.
I will come back to it later. For now, I will first try to address the problem of calculating $\varphi^{2}$ using only the experimental data of the held-out set.

Member

@RoyStegeman RoyStegeman left a comment

Can you add documentation for the different losses/statistics that can be optimised and how the runcard can be changed to use them?

@Cmurilochem Cmurilochem force-pushed the hyperopt_loss branch 2 times, most recently from 09f929d to f88fbb5 on March 6, 2024 20:51
@Cmurilochem
Collaborator

Can you add documentation for the different losses/statistics that can be optimised and how the runcard can be changed to use them?

Done. Just added docs in f88fbb5

Member

@RoyStegeman RoyStegeman left a comment

Thanks, the docs look good

Cmurilochem and others added 2 commits March 7, 2024 14:29
Co-authored-by: Roy Stegeman <roystegeman@live.nl>
Co-authored-by: Roy Stegeman <roystegeman@live.nl>
@RoyStegeman RoyStegeman merged commit 4b00608 into master Mar 7, 2024
7 checks passed
@RoyStegeman RoyStegeman deleted the hyperopt_loss branch March 7, 2024 21:51