Implement ability to define splits in metadata section of dataset card #5209

merveenoyan · 2022-11-07T13:27:16Z

Feature request

If you go to https://huggingface.co/datasets/inria-soda/tabular-benchmark/tree/main you will see a bunch of folders that contain various CSV files. I'd like the dataset viewer to show these files instead of only one dataset like it currently does (and also for people to be able to load them as splits instead of loading through data_files).
E.g. GLUE has various splits on the viewer, but it's overkill to ask people to implement a loading script, so it would be better to let them define these in the README file instead.

Also pinging @polinaeterna @lhoestq @adrinjalali

polinaeterna · 2022-11-08T14:18:11Z

@merveenoyan Do you want different files to be splits or configurations?

From what you specified in README.md, I assume that you want to have 4 configs corresponding to the directories "clf_cat", "clf_num", "reg_cat", "reg_num", and that inside each config you want as many splits as there are CSV files,
so if you run

load_dataset("inria-soda/tabular-benchmark", "clf_cat", split="compass")

you will generate the data only from the compass.csv file.
In this case, running load_dataset("inria-soda/tabular-benchmark", "clf_cat") without the split parameter will return a DatasetDict object with "KDDCup09_upselling", "cat_compass", "cat_covertype", ... "road_safety" keys (whose values are splits, i.e. Dataset objects)

or
do you want each file to be a separate config? Like:

load_dataset("inria-soda/tabular-benchmark", "clf_cat_compass")   # returns DatasetDict with a single "train" split

or
maybe smth completely different? 😄

Anyway, I now have the impression that this is more a matter of automatically inferring configs from the repository structure than of providing parameters in the metadata YAML.
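
To make the two readings above concrete, here is a minimal sketch in plain Python (it does not call `datasets` at all). The helper names `files_as_splits` and `files_as_configs` are hypothetical, and the file list is only an illustrative subset of the repository.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Illustrative subset of the repository layout (directory/file.csv).
FILES = [
    "clf_cat/compass.csv",
    "clf_cat/road_safety.csv",
    "reg_cat/house_sales.csv",
]


def files_as_splits(paths):
    """First reading: one config per directory, one split per CSV file."""
    configs = defaultdict(dict)
    for path in map(PurePosixPath, paths):
        configs[path.parent.name][path.stem] = str(path)
    return dict(configs)


def files_as_configs(paths):
    """Second reading: one config per CSV file, each with a single "train" split."""
    configs = {}
    for path in map(PurePosixPath, paths):
        configs[f"{path.parent.name}_{path.stem}"] = {"train": str(path)}
    return configs


if __name__ == "__main__":
    print(files_as_splits(FILES))
    print(files_as_configs(FILES))
```

With the first helper, "clf_cat" maps to several splits; with the second, every file becomes its own config name such as "clf_cat_compass".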

merveenoyan · 2022-11-08T18:36:38Z

@polinaeterna I want the latter where you can think of every CSV file as a config, like MNLI from GLUE.

polinaeterna · 2022-11-09T15:27:22Z

@merveenoyan @lhoestq I see two solutions to this case.

1. Parse configurations automatically from directory names. That is, if you have a data structure like:

tabular-benchmark
  └─clf_cat_compass
      └─compass.csv
  └─clf_cat_cat_covertype
      └─covertype.csv
 ...
  └─reg_cat_house_sales
      └─house_sales.csv

you'll get "clf_cat_compass", "clf_cat_cat_covertype", ... "reg_cat_house_sales" configurations that would contain only files from corresponding directories.
+ this is a requested change that is needed in general and would solve other problems (see #4578); it would also help with #5213, which I'm currently working on
+ would allow users to do just load_dataset("inria-soda/tabular-benchmark", "clf_cat_compass"), no data_files param required
- in this specific case it would require restructuring of the data - putting each file in a directory named as a config name (to me personally it doesn't seem to be a big deal)
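
A rough sketch of what inferring configs from directory names could look like, using only the standard library. This is not how `datasets` does (or would do) it; `infer_configs` and `build_example_repo` are hypothetical helpers, and the temporary directory just mimics the layout shown above with empty CSV files.

```python
import tempfile
from pathlib import Path


def infer_configs(root):
    """Map each top-level directory name to the sorted data files inside it."""
    root = Path(root)
    configs = {}
    for directory in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(
            f.relative_to(root).as_posix() for f in directory.rglob("*") if f.is_file()
        )
        if files:  # skip empty directories
            configs[directory.name] = files
    return configs


def build_example_repo(root):
    """Create the directory layout from the example above, with empty CSV files."""
    layout = {
        "clf_cat_compass": "compass.csv",
        "clf_cat_cat_covertype": "covertype.csv",
        "reg_cat_house_sales": "house_sales.csv",
    }
    for directory, filename in layout.items():
        (Path(root) / directory).mkdir(parents=True)
        (Path(root) / directory / filename).touch()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_example_repo(tmp)
        for name, files in infer_configs(tmp).items():
            print(name, files)
```

Each directory name becomes a config name, and only the files under that directory belong to it, which is why the data would need to be restructured as noted in the last point.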

2. More or less what we discussed before: add support for manually specifying parameters in the metadata. We can add a new metadata YAML field (say, "custom_configs_info"), so that we can provide something like:

---
...
dataset_info:
  ...
custom_configs_info:
- config_name: reg_cat_house_sales
  data_files:
  - reg_cat/house_sales.csv
- config_name: clf_cat_compass
  data_files:
  - clf_cat/compass.csv
...
---

+ Would be useful not only for tabular data and not only for data_files parameter - any packaged dataset’s viewer can be customized to use specific, non-default parameters. @merveenoyan do you maybe have any other examples/use cases in mind where you want to provide any specific parameters to the viewer?
- I'm not sure here, but I assume that it might require changes in the interaction with the viewer on the Hub side to parse these configurations, as they are not default configurations (not in the BUILDER_CONFIGS list). cc @severo But this can probably be solved on the datasets side too.
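
A minimal sketch of how such a field could be consumed once the YAML front matter has been parsed. The metadata is written directly as a Python dict to keep the example dependency-free; "custom_configs_info" is only the field name proposed here, and `config_names` / `data_files_for` are hypothetical helpers, not `datasets` functions.

```python
# What the parsed YAML front matter might look like as a Python dict.
# "custom_configs_info" is the proposed field name, not a settled one.
METADATA = {
    "custom_configs_info": [
        {"config_name": "reg_cat_house_sales", "data_files": ["reg_cat/house_sales.csv"]},
        {"config_name": "clf_cat_compass", "data_files": ["clf_cat/compass.csv"]},
    ]
}


def config_names(metadata, field="custom_configs_info"):
    """List the config names declared in the metadata, in declaration order."""
    return [entry["config_name"] for entry in metadata.get(field, [])]


def data_files_for(metadata, config_name, field="custom_configs_info"):
    """Return the data files declared for one config, or raise if it is unknown."""
    for entry in metadata.get(field, []):
        if entry["config_name"] == config_name:
            return list(entry["data_files"])
    available = config_names(metadata, field)
    raise ValueError(f"Unknown config {config_name!r}; available: {available}")


if __name__ == "__main__":
    print(config_names(METADATA))
    print(data_files_for(METADATA, "clf_cat_compass"))
```

The point of the sketch is that the files can stay where they are (reg_cat/, clf_cat/): the config name is decoupled from the directory name.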

Overall, I would start from implementing the first solution since it's related to what I'm doing now and is super useful for datasets in general. And then if we agree that having more flexibility in providing parameters to the viewer is required, I can implement the second one. Let me know what you think :)

lhoestq · 2022-11-09T17:19:15Z

> We can add new metadata yaml field (say, "custom_configs_info"), so that we can provide smth like:

Love it ! Some other ideas to name the "custom_configs_info" field: "configs", "parameters", "config_args", "configurations"

> it might require changes in interaction with the viewer on the hub side - to parse these configurations, as they are not default configurations (not in BUILDER_CONFIGS list)

If we update the get_dataset_config_names() function in datasets in inspect.py we should be fine - that's what the viewer is using

> Overall, I would start from implementing the first solution since it's related to what I'm doing now and is super useful for datasets in general. And then if we agree that having more flexibility in providing parameters to the viewer is required, I can implement the second one. Let me know what you think :)

Actually, I feel like the second solution includes the first use case you mentioned. If you implement the second solution, then users would just have to add a few lines of YAML and their directories would be considered configurations, no? Maybe there's no need to implement two different pieces of logic to do the same thing.
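
To illustrate the "few lines of YAML" idea, here is a small sketch that renders one metadata entry per directory. `render_configs_yaml` is a hypothetical helper, "custom_configs_info" is just the name floated above, and the `<directory>/*` glob is an assumption about how a data_files pattern could be written; nothing in this thread fixes that syntax.

```python
def render_configs_yaml(directories, field="custom_configs_info"):
    """Render one metadata entry per directory as YAML text."""
    lines = [f"{field}:"]
    for name in directories:
        lines.append(f"- config_name: {name}")
        lines.append("  data_files:")
        lines.append(f"  - {name}/*")  # assumed glob: every file in the directory
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_configs_yaml(["clf_cat_compass", "reg_cat_house_sales"]))
```

Under that assumption, the directory-based use case becomes a special case of the metadata-based one, so a single code path could serve both.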

merveenoyan · 2022-11-30T10:10:20Z

is there any update on this? 🕵🏻

polinaeterna · 2022-11-30T13:34:47Z

@merveenoyan I haven't started working on this yet; I'm working on adding configs to packaged datasets instead (#5213), because this would both allow you to solve your issue and is a frequently requested feature.

Adding arbitrary parameters to the YAML would be my next task, I think!

polinaeterna · 2022-11-30T17:07:51Z

@merveenoyan ignore my comment above, I'm switching to this task now :D

Jakeukalane · 2022-12-21T13:22:29Z

I want to be able to create folders in a model.

mariosasko · 2023-07-21T14:36:01Z

Addressed in #5331

merveenoyan added the enhancement New feature or request label Nov 7, 2022

polinaeterna mentioned this issue Nov 30, 2022

Add support for different configs with push_to_hub #5213

Closed

4 tasks

polinaeterna mentioned this issue Dec 2, 2022

Support for multiple configs in packaged modules via metadata yaml info #5331

Merged

2 tasks

mariosasko closed this as completed Jul 21, 2023
