Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge processes at runtime #267

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

pkausw
Copy link

@pkausw pkausw commented Jan 18, 2022

Suppose you have a datacard with a process class with some sub-processes. For example, in our analysis we have a data-driven control region (CR) that is derived as the difference between data_CR and some other processes proc1_CR and proc2_CR. Since the complete process in the CR is estimated during the fit, the processes need to be considered individually in the datacard for a correct parameterization. However, when interpreting the results and plotting everything, we're only interested in the total process in the CR and would like to correctly account for the post-fit correlations between the individual *_CR processes in order to derive sensible uncertainty bands.

This PR introduces an option to PostFitShapesFromWorkspace to merge processes at runtime. The processes can be defined with the syntax NAME='expr', where expr is the regular expression that is parsed to the process filter of the harvester instance.

Additionally, the PR introduces an option for additional printout during the calculation of the post-fit uncertainty bands. If activated, the option will print out the difference between the bin content of the nominal post-fit template and each toy used for the construction of the uncertainty band, for each bin respectively. This functionality can be used to understand abnormally large fluctuations in the uncertainty band and is intended for debugging purposes.

@pkausw
Copy link
Author

pkausw commented Mar 3, 2023

Hi, are there any updates regarding this PR? I think it might be related to issue #291

@anigamova anigamova linked an issue Mar 3, 2023 that may be closed by this pull request
4 tasks
@kcormi
Copy link
Collaborator

kcormi commented Mar 6, 2023

Thanks for this!

I had a few inline comments to the code, another couple things here:

Firstly, perhaps it is possible to simplify the code by packaging the loop over processes (or merged processes) into a function which is reused. Since now we have two loops over prefit (processes, then merged processes) and two loops over postfit (same as before). They are handling slightly different objects and with slightly different treatment for pre- and postfit-, so maybe its not so simple; but much of the code is replicated and it might be nice to compactify this.

Secondly, I think we would also want to add the merged processes into the loop over processes when the postfit/prefit yields are printed, right? https://github.com/cms-analysis/CombineHarvester/pull/267/files#diff-850d8a5178aa5366c05bacce2ec563fa72bfa4ce6ef5509f33b134d0dae89e5eL442

Comment on lines 178 to 184
TH1F rand_shape = this->GetShapeInternal(lookup);
for (int j = 1; j <= rand_shape.GetNbinsX(); ++j) {
tmp.Form("\tBin Content %i: %f\n", j, rand_shape.GetBinContent(j));
// std::cout << tmp.Data();
full_rand_shape_summary.at(j-1) = rand_shape_summary;
full_rand_shape_summary.at(j-1) += tmp.Data();
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like quite a lot of information (checking the updated shape for bin for every updated parameter value). I wonder if it is maybe a bit too much, and could be avoided? But perhaps this was useful in some debugging which I'm simply not aware of. It just seems it would be a lot to even read through for large models, and perhaps targeting more specific information might be better?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes definitely! I originally implemented this when we generated post-fit uncertainty bands for our ttHbb analysis and so some very large fluctuations in the bands. The intend there was really to find the reason for these fluctuations, which is why I wanted to output as much information as possible.

I think the comment by Andrew is valid, and it might be better to simply change to toys themselves into a dedicated root file. I can also update this PR accordingly if we want to have something like that from the start, or we can do it at some later point in time

std::fabs(rand_shape.GetBinContent(i) - shape.GetBinContent(i));
shape.SetBinError(i, err*err + shape.GetBinError(i));
std::fabs(rand_shape.GetBinContent(j) - shape.GetBinContent(j));
if (std::fabs(err/shape.GetBinContent(j)) >= 0.5 and verbose){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be over optimizing here, but perhaps some of these extra steps could be avoided by default when not using the verbose mode? Perhaps just an if( verbose) before this loop, and in other places, for example?

I'm also wondering if it would be useful to make this 0.5 threshold a configurable parameter?

post_shapes[bin][proc] = cmb_proc.GetShape();
} else {
post_shapes[bin][proc] =
sampling ? cmb_proc.GetShapeWithUncertainty(res, samples, verbose)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deafult behaviour of PostFitShapesFromWorkspace should be to use sampling, and now use of the actual --sampling flag prints out a warning. So the code should not check the --sampling flag, but rather the --no-sampling flag, and always do sampling unless --no-sampling has been set.

@ajgilbert
Copy link
Collaborator

Hi, I remembered I made some related changes a while ago, and never merged them. See #295 which I just opened. That adds the option to merge channels (not processes), so is in principle complementary.

I tend to agree with @kcormi that the changes to GetShapeWithUncertainty take some care - especially not to run any excess code if verbose is not true. I see the motivation to have access to the variations though for debugging. To make it more flexible, we could see if instead of building a big string internally with some arbitrary thresholds involved, we could instead add a way to return the full set of varied TH1s, saved to the output, which the user would be free to inspect with their own code.

@pkausw
Copy link
Author

pkausw commented Mar 6, 2023

Hi, thanks for the review and the comments! Indeed, I did most/some of these changes in the meanwhile in my local setup - I'm in the process of getting my local repository in shape to update this PR asap.

Regarding the additional debugging information: Yes, totally agree that this should be completely optional (which it is in my local version of the Harvester at this point). @ajgilbert , your comment about saving the toys themselves instead of a very long string is excellent I think - I'll have a look at this as soon as this PR is updated.

@pkausw
Copy link
Author

pkausw commented Mar 7, 2023

Hi! I am finally done with the update for this PR. In particular, I

  • merged with the main version of the CombineHarvester
  • restructured the c++ code
  • made the generation of the yields + uncertainties optional

Unfortunately, the PR is now a little messy since I changed quite a few things in one go (sorry). Let me know if you'd like me to change something more, sorry for the mess!

@pkausw pkausw changed the base branch from master to main March 7, 2023 18:52
…threshold parameter to report in verbose mode (not excessible right now)
@ajgilbert
Copy link
Collaborator

Thanks, this now has a lot of changes :-)
In general I think the restructuring is a good idea, I'll try to review it (may be delayed because of Moriond season)

@pkausw
Copy link
Author

pkausw commented Jun 1, 2023

Hi @anigamova , as discussed earlier, I've updated this PR. I have cherry-picked the updates from #295 and included the latest changes from main. Therefore, this PR should now enable to merge analysis bins and/or processes at runtime :) Let me know if you have any comments!

@anigamova
Copy link
Collaborator

Hi @pkausw,

It would be great to do what @ajgilbert suggested in PR #295, i.e. try to call GetShapeWithUncertainty method once, so that we don't generate 500 toys for each histogram but rather save the list of all of the shapes we want to produce and pass it to the GetShapeWithUncertainty function at the end and loop over the shapes for each toy once. I will also leave few comments where I think it would make sense to implement these changes with your PR.

cmb.GetShape();
} else {
shape_container[proc] =
no_sampling ? cmb.GetShapeWithUncertainty()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here to avoid sampling for each histogram you could call another method to update the list/map of all of shapes.

// std::cout << tmp.Data();
rand_shape_summary = tmp.Data();
// std::cout << rand_shape_summary.c_str() << std::endl;
TH1F rand_shape = this->GetShapeInternal(lookup);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the place where the loop through the list/map of shapes would go then

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PostFitShapesFromWorkspace development
4 participants