Implement ability to define splits in metadata section of dataset card #5209

Closed
merveenoyan opened this issue Nov 7, 2022 · 9 comments
Labels
enhancement New feature or request

Comments

@merveenoyan
Contributor

merveenoyan commented Nov 7, 2022

Feature request

If you go to https://huggingface.co/datasets/inria-soda/tabular-benchmark/tree/main you will see a bunch of folders, each containing various CSV files. I’d like the dataset viewer to show all of these files instead of only one dataset, as it currently does (and I’d also like people to be able to load them as splits instead of loading them through data_files).
E.g. GLUE has various splits on the viewer, but it’s overkill to ask people to implement a loading script, so it would be better to let them define these in the README file instead.
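
For reference, the data_files workaround mentioned above looks roughly like the sketch below. The helper names `csv_load_kwargs` and `load_single_csv` are hypothetical, and the `clf_cat/compass.csv` path is taken from later comments in this thread. The actual load assumes the `datasets` library is installed and the Hub is reachable.

```python
def csv_load_kwargs(repo_id: str, folder: str, name: str) -> dict:
    """Build keyword arguments for datasets.load_dataset (hypothetical helper)."""
    return {"path": repo_id, "data_files": f"{folder}/{name}.csv"}


def load_single_csv(repo_id: str, folder: str, name: str):
    """Load one CSV file from a Hub dataset repository via data_files."""
    # Imported lazily so the helper above can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset(**csv_load_kwargs(repo_id, folder, name))


kwargs = csv_load_kwargs("inria-soda/tabular-benchmark", "clf_cat", "compass")
print(kwargs)
# To actually load the data (needs network access):
# ds = load_single_csv("inria-soda/tabular-benchmark", "clf_cat", "compass")
```

Every user has to know the file layout to write that call, which is the friction this request is about.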

Also pinging @polinaeterna @lhoestq @adrinjalali

@merveenoyan merveenoyan added the enhancement New feature or request label Nov 7, 2022
@polinaeterna
Contributor

@merveenoyan Do you want different files to be splits or configurations?

From what you specified in README.md, I assume that you want 4 configs corresponding to the directories "clf_cat", "clf_num", "reg_cat", "reg_num", and inside each config as many splits as there are CSV files,
so that if you run

load_dataset("inria-soda/tabular-benchmark", "clf_cat", split="compass")

you will generate the data only from the compass.csv file.
In this case, running load_dataset("inria-soda/tabular-benchmark", "clf_cat") without the split parameter will return a DatasetDict object with "KDDCup09_upselling", "cat_compass", "cat_covertype", ... "road_safety" keys (whose values are the splits, i.e. Dataset objects)

or
do you want each file to be a separate config? Like:

load_dataset("inria-soda/tabular-benchmark", "clf_cat_compass")   # returns DatasetDict with a single "train" split

or
maybe smth completely different? 😄

Anyway, I now have the impression that this is more a matter of automatically inferring configs from the repository structure than of providing parameters in the metadata YAML.

@merveenoyan
Contributor Author

@polinaeterna I want the latter where you can think of every CSV file as a config, like MNLI from GLUE.

@polinaeterna
Contributor

@merveenoyan @lhoestq I see two solutions to this case.

  1. Parse configurations automatically from directory names. That is, if you have a data structure like:
tabular-benchmark
  └─clf_cat_compass
      └─compass.csv
  └─clf_cat_cat_covertype
      └─covertype.csv
 ...
  └─reg_cat_house_sales
      └─house_sales.csv

you'll get "clf_cat_compass", "clf_cat_cat_covertype", ... "reg_cat_house_sales" configurations, each containing only the files from the corresponding directory.
+ this is a requested change that is needed in general and would solve other problems (see #4578); it would also help with #5213, which I'm currently working on
+ would allow users to do just load_dataset("inria-soda/tabular-benchmark", "clf_cat_compass"), no data_files param required
- in this specific case it would require restructuring the data: putting each file in a directory named after its config (to me personally it doesn't seem to be a big deal)

  2. More or less what we discussed before: add support for manually specifying parameters in the metadata. We can add a new metadata YAML field (say, "custom_configs_info"), so that we can provide something like:
---
...
dataset_info:
 ...  
custom_configs_info:
- config_name: reg_cat_house_sales
  data_files:
  - reg_cat/house_sales.csv
- config_name: clf_cat_compass
  data_files:
  - clf_cat/compass.csv
...
---

+ Would be useful not only for tabular data and not only for data_files parameter - any packaged dataset’s viewer can be customized to use specific, non-default parameters. @merveenoyan do you maybe have any other examples/use cases in mind where you want to provide any specific parameters to the viewer?
- I'm not sure here, but I assume it might require changes in the interaction with the viewer on the Hub side to parse these configurations, as they are not default configurations (not in the BUILDER_CONFIGS list). cc @severo But this can probably be solved on the datasets side too.
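
The first solution above (one configuration per top-level directory) could be sketched roughly as follows. This is only an illustration of the grouping idea, not the implementation in `datasets`; the function name `infer_configs` is hypothetical and the file list is hard-coded rather than read from the Hub.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def infer_configs(file_paths):
    """Group data files by their top-level directory (one config per directory)."""
    configs = defaultdict(list)
    for path in file_paths:
        parts = PurePosixPath(path).parts
        if len(parts) < 2:
            continue  # files at the repository root belong to no config here
        configs[parts[0]].append(path)
    # Sort for a deterministic result regardless of listing order.
    return {name: sorted(files) for name, files in sorted(configs.items())}


# Hypothetical repository listing, following the restructured layout above.
files = [
    "README.md",
    "clf_cat_compass/compass.csv",
    "reg_cat_house_sales/house_sales.csv",
]
inferred = infer_configs(files)
print(inferred)
```

With the current layout of the repository ("clf_cat/compass.csv" and so on), the same grouping would yield the four directory-level configs rather than one config per file, which is why this option requires restructuring the data.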

Overall, I would start from implementing the first solution since it's related to what I'm doing now and is super useful for datasets in general. And then if we agree that having more flexibility in providing parameters to the viewer is required, I can implement the second one. Let me know what you think :)

@lhoestq
Member

lhoestq commented Nov 9, 2022

We can add a new metadata YAML field (say, "custom_configs_info"), so that we can provide something like:

Love it! Some other ideas for naming the "custom_configs_info" field: "configs", "parameters", "config_args", "configurations"

it might require changes in the interaction with the viewer on the Hub side to parse these configurations, as they are not default configurations (not in the BUILDER_CONFIGS list)

If we update the get_dataset_config_names() function in inspect.py in datasets, we should be fine; that's what the viewer is using.
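
As a rough illustration of that entry point: `get_dataset_config_names` is an existing public function in `datasets`, while the wrapper name `list_configs` below is hypothetical. Calling it needs the library installed and network access, so the call itself is left commented out.

```python
def list_configs(repo_id: str) -> list:
    """Return the config names that `datasets` reports for a Hub dataset."""
    # Imported lazily so this sketch can be read without `datasets` installed.
    from datasets import get_dataset_config_names

    return get_dataset_config_names(repo_id)


# Example call (needs network access):
# print(list_configs("inria-soda/tabular-benchmark"))
```

Any config defined through metadata would have to show up in this list for the viewer to pick it up.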

Overall, I would start from implementing the first solution since it's related to what I'm doing now and is super useful for datasets in general. And then if we agree that having more flexibility in providing parameters to the viewer is required, I can implement the second one. Let me know what you think :)

Actually, I feel like the second solution covers the first use case you mentioned. If you implement the second solution, users would just have to add a few lines of YAML and their directories would be considered configurations, no? Maybe there's no need to implement two different pieces of logic to do the same thing.
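
A minimal sketch of how such metadata could be resolved, assuming the YAML block has already been parsed into a Python dict. The field name "custom_configs_info" is only the proposal from this thread, not a settled API, and the helper names `config_names` and `data_files_for` are hypothetical.

```python
# Parsed form of the YAML block proposed earlier in the thread.
metadata = {
    "custom_configs_info": [
        {"config_name": "reg_cat_house_sales", "data_files": ["reg_cat/house_sales.csv"]},
        {"config_name": "clf_cat_compass", "data_files": ["clf_cat/compass.csv"]},
    ]
}


def config_names(metadata: dict) -> list:
    """List the config names declared in the metadata, in declaration order."""
    return [entry["config_name"] for entry in metadata.get("custom_configs_info", [])]


def data_files_for(metadata: dict, config_name: str) -> list:
    """Look up the data files declared for one config."""
    for entry in metadata.get("custom_configs_info", []):
        if entry["config_name"] == config_name:
            return list(entry["data_files"])
    raise ValueError(f"Unknown config: {config_name!r}")


print(config_names(metadata))
print(data_files_for(metadata, "clf_cat_compass"))
```

Under this scheme a directory becomes a config simply by listing its files (or a pattern for them) under one entry, so no separate directory-inference logic is needed.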

@merveenoyan
Contributor Author

Is there any update on this? 🕵🏻

@polinaeterna
Contributor

@merveenoyan I haven't started working on this yet; I'm working on adding configs to packaged datasets instead (#5213), because that would both solve your issue and is a frequently requested feature.

Adding arbitrary parameters to the YAML will be my next task, I think!

@polinaeterna
Contributor

@merveenoyan ignore my comment above, I'm switching to this task now :D

@Jakeukalane

I want to be able to create folders in a model.

@mariosasko
Collaborator

Addressed in #5331

5 participants