Read shapes in from pandas dataframe #548

Merged
merged 12 commits into from
Apr 7, 2022

shane-breeze

I kept converting pandas dataframes into TH1s for my binned shape fits, so I've built this conversion into combine; others might find it useful too (shapes can then be saved in human-readable CSV/JSON files, or even Excel spreadsheets).

Changed ShapeTools.py to interpret files with the extensions [".csv", ".json", ".html", ".pkl", ".xlsx", ".h5", ".parquet"] as pandas dataframes (see the pandas IO documentation). Any other extension is handled as before, i.e. as a ROOT file. Note that multiindexed dataframes are used: files of some extensions are first converted to a multiindex, with all but the last 2 columns used for indexing.
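
As a rough illustration of the indexing rule described above, the sketch below reads a flat CSV and builds a multiindex from all but the last 2 columns. The column names (`channel`, `process`, `systematic`, `bin`, `sum_w`, `sum_ww`) and the helper `read_flat_shapes` are assumptions made for this example; they are not necessarily what ShapeTools.py or the included example file use.

```python
import io

import pandas as pd

# Hypothetical CSV layout: these column names are illustrative only.
CSV_TEXT = """channel,process,systematic,bin,sum_w,sum_ww
bin1,signal,nominal,0,1.0,0.10
bin1,signal,nominal,1,2.0,0.20
bin1,background,nominal,0,5.0,0.50
bin1,background,nominal,1,4.0,0.40
"""


def read_flat_shapes(source):
    """Read a flat CSV and index it by every column except the last two."""
    df = pd.read_csv(source)
    index_columns = list(df.columns[:-2])
    # Sorting the index keeps partial lookups on the multiindex efficient.
    return df.set_index(index_columns).sort_index()


shapes = read_flat_shapes(io.StringIO(CSV_TEXT))
print(list(shapes.index.names))
# Select all bins of one channel/process/systematic combination.
print(shapes.loc[("bin1", "signal", "nominal"), :])
```

Formats that can store an index themselves (a pickled dataframe, for example) would presumably not need this conversion step.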

DataFrameWrapper.py adds a class that wraps a pandas dataframe and provides a Get method, which behaves like ROOT's TFile::Get in returning TH1s.
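
To make the idea of a TFile-like `Get` concrete, here is a minimal sketch. It is not the code in DataFrameWrapper.py: the class `FrameWrapperSketch`, the colon-separated key format, and the column names are all hypothetical, and a plain dataclass stands in for a TH1 so that the example runs without ROOT.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class SimpleHist:
    """Plain-Python stand-in for a TH1: bin contents and per-bin errors."""
    name: str
    contents: list
    errors: list


class FrameWrapperSketch:
    """Hypothetical wrapper exposing a TFile-like Get over a dataframe."""

    def __init__(self, df, value_col, variance_col):
        self.df = df.sort_index()
        self.value_col = value_col
        self.variance_col = variance_col

    def Get(self, key):
        # Assumed key format: index labels joined by ':' (every level but the bin).
        labels = tuple(key.split(":"))
        try:
            rows = self.df.loc[labels, :]
        except KeyError:
            return None  # mimic the null result TFile::Get gives for a missing object
        contents = rows[self.value_col].tolist()
        # Per-bin error is the square root of the variance column.
        errors = (rows[self.variance_col] ** 0.5).tolist()
        return SimpleHist(key, contents, errors)


frame = pd.DataFrame(
    {
        "channel": ["bin1", "bin1", "bin1", "bin1"],
        "process": ["signal", "signal", "background", "background"],
        "bin": [0, 1, 0, 1],
        "sum_w": [1.0, 2.0, 5.0, 4.0],
        "sum_ww": [0.25, 1.0, 4.0, 0.25],
    }
).set_index(["channel", "process", "bin"])

wrapper = FrameWrapperSketch(frame, "sum_w", "sum_ww")
signal = wrapper.Get("bin1:signal")
missing = wrapper.Get("bin1:no_such_process")
print(signal)
print(missing)
```

In the actual change the returned objects are TH1s, so the rest of combine can treat the dataframe like a ROOT file.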

An example csv file is included, and the following commands gave the same results (apart from file names and CPU time):

combine -M MultiDimFit --algo singles data/tutorials/shapes/simple-shapes-TH1.txt -v 4
combine -M MultiDimFit --algo singles data/tutorials/shapes/simple-shapes-df.txt -v 4

Excel spreadsheets depend on openpyxl and xlrd, parquet depends on pyarrow or fastparquet, and HDF depends on pytables (all can be installed through pip).
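
The optional dependencies listed above could be checked before reading a file, for instance with a small helper like the one below. This is only a sketch: `OPTIONAL_BACKENDS`, `module_available`, and `missing_backends` are hypothetical names and not part of the PR. It relies solely on the standard library, and uses the fact that PyTables is imported as `tables`.

```python
import importlib.util
import os

# Each extension maps to groups of modules; at least one module
# from every group must be importable for the format to be readable.
OPTIONAL_BACKENDS = {
    ".xlsx": [("openpyxl",), ("xlrd",)],
    ".parquet": [("pyarrow", "fastparquet")],
    ".h5": [("tables",)],  # PyTables is imported as "tables"
}


def module_available(name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def missing_backends(path, is_available=module_available):
    """List the dependency groups that are not satisfied for this file."""
    ext = os.path.splitext(path)[1].lower()
    return [
        group
        for group in OPTIONAL_BACKENDS.get(ext, [])
        if not any(is_available(name) for name in group)
    ]


for example in ("shapes.csv", "shapes.xlsx", "shapes.parquet", "shapes.h5"):
    print(example, missing_backends(example))
```

An empty list means that nothing is missing for that extension; the printed result depends on what is installed locally.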

This can be extended to unbinned fits by using a similar multiindex for the channel/process categorisation and one or more columns for the observable(s).

@nucleosynthesis
Contributor

Thanks @shane-breeze . Will this also support the use of the autoMCStats directive in the datacards?

@shane-breeze
Author

Yes. There's a variance column in the dataframe that fills the error of the histograms. Just tested the simple shapes card with autoMCStats and got the same results with TH1s and the dataframe.

@shane-breeze
Author

Added a cast of the index selection in the datacard to the index dtypes. The selection is read from the datacard as a string, but the dataframe index can be int, float, ...
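
The need for the cast can be seen in a small sketch: a label read from a datacard is a string such as "2016", which does not match an integer index level until it is converted. The helpers `cast_label` and `cast_selection` below are hypothetical and only illustrate the idea; the PR's own implementation may differ.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def cast_label(label, dtype):
    """Convert a string label to match the dtype of an index level."""
    if is_integer_dtype(dtype):
        return int(label)
    if is_float_dtype(dtype):
        return float(label)
    return label  # leave strings (and anything else) untouched


def cast_selection(index, labels):
    """Cast each label to the dtype of the index level at the same position."""
    return tuple(
        cast_label(label, level.dtype)
        for label, level in zip(labels, index.levels)
    )


index = pd.MultiIndex.from_tuples(
    [("signal", 2016, 0), ("signal", 2016, 1), ("signal", 2017, 0)],
    names=["process", "year", "bin"],
)
yields = pd.Series([1.0, 2.0, 3.0], index=index)

# The datacard provides strings; the "year" level holds integers.
selection = cast_selection(index, ("signal", "2016"))
print(selection)
print(yields.loc[selection].tolist())
```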

@nsmith-
Collaborator

nsmith- commented Mar 28, 2022

Just wondering about the label, what work needs to be done on this?

@amarini
Collaborator

amarini commented Apr 5, 2022

@nsmith- , the action items on this PR were the following in the last discussion:

  1. resolve the conflicts
  2. add a warning when pandas shapes are loaded that not all the features may be available

@amarini
Collaborator

amarini commented Apr 5, 2022

@shane-breeze , can you allow editing from maintainers?

@shane-breeze
Author

@amarini, I have unarchived my fork of this repository. It should now be writable.

@nsmith- nsmith- mentioned this pull request Apr 5, 2022
Conflicts:
	python/ShapeTools.py
@nsmith-
Collaborator

nsmith- commented Apr 5, 2022

not all the features may be available

autoMCStats is available (I checked it still works), so I think there is no missing feature. Unbinned data is anyway detected by the output of getShape so it shouldn't cause any datacard-level requirement to fail iiuc.

@hcombbot

hcombbot commented Apr 5, 2022

Pull Request Test.
Summary
========
Running options:
* MODE : cmssw
* COMBINE_TAG : 102x
* COMBINE_REPO : cms-analysis
* COMBINE_MERGE : shane-breeze/shapes-df
* GITHUB_PR : 548


Ratio to reference values:
--------
| comb_2019_hbb_boosted_standalone | comb_2019_hgg | comb_2019_hmm | comb_2019_htt | comb_2019_hww | comb_2019_tth_hbb | comb_2019_tth_hgg | comb_2019_tth_multilepton | comb_2019_vh_htt | comb_2019_vhbb | comb_2019_vhbb2017 |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |

You can find more detail at https://gitlab.cern.ch/cms-hcg/performances/ci/-/pipelines/3810969

@nsmith- nsmith- removed the needs work label Apr 6, 2022
@nsmith- nsmith- merged commit e8a7a2a into cms-analysis:102x Apr 7, 2022
nsmith- added a commit that referenced this pull request Apr 8, 2022
Merge pull request #548 from shane-breeze/shapes-df

Read shapes in from pandas dataframe
nsmith- added a commit that referenced this pull request Apr 8, 2022