Implement ability to define splits in metadata section of dataset card #5209
Comments
@merveenoyan Do you want different files to be splits or configurations? From what you specified in load_dataset("inria-soda/tabular-benchmark", "clf_cat", split="compass") you will generate the data only from or load_dataset("inria-soda/tabular-benchmark", "clf_cat_compass") # returns DatasetDict with a single "train" split or Anyway, I now have the impression that this is more a matter of automatically inferring configs from the repository structure than of providing parameters in the metadata YAML.
@polinaeterna I want the latter, where you can think of every CSV file as a config, like MNLI from GLUE.
@merveenoyan @lhoestq I see two solutions to this case.
you'll get "clf_cat_compass", "clf_cat_cat_covertype", ... "reg_cat_house_sales" configurations that would contain only files from corresponding directories.
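As a rough illustration of this first solution, here is a minimal sketch of how config names might be inferred from the repository layout. It assumes one reading of the naming above (directory name plus file stem, so clf_cat/compass.csv becomes "clf_cat_compass"); the function name infer_configs is hypothetical and is not part of the datasets library.

```python
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, List


def infer_configs(paths: List[str]) -> Dict[str, List[str]]:
    """Group CSV files into configs named "<directory>_<file stem>".

    Hypothetical helper for illustration only; the real inference
    logic in the library may differ.
    """
    configs = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        # Skip non-CSV files and files sitting at the repository root.
        if path.suffix != ".csv" or len(path.parts) < 2:
            continue
        configs[f"{path.parent.name}_{path.stem}"].append(raw)
    return dict(configs)


repo_files = ["README.md", "clf_cat/compass.csv", "reg_cat/house_sales.csv"]
print(infer_configs(repo_files))
```

With this layout, every CSV file ends up in its own config, which matches the "every CSV file as a config" use case requested above.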
---
...
dataset_info:
  ...
custom_configs_info:
  - config_name: reg_cat_house_sales
    data_files:
      - reg_cat/house_sales.csv
  - config_name: clf_cat_compass
    data_files:
      - clf_cat/compass.csv
...
---
+ Would be useful not only for tabular data and not only for
Overall, I would start from implementing the first solution since it's related to what I'm doing now and is super useful for
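For the second solution, a minimal sketch of how the proposed custom_configs_info field could be turned into a mapping from config name to data files. It assumes the README front matter has already been parsed into a Python dict; the function name configs_from_metadata and the duplicate-name check are assumptions for illustration, not the library's actual behavior.

```python
from typing import Dict, List

# Assumed shape of the parsed YAML front matter, mirroring the example above.
metadata = {
    "custom_configs_info": [
        {"config_name": "reg_cat_house_sales", "data_files": ["reg_cat/house_sales.csv"]},
        {"config_name": "clf_cat_compass", "data_files": ["clf_cat/compass.csv"]},
    ]
}


def configs_from_metadata(meta: dict) -> Dict[str, List[str]]:
    """Map each declared config name to its list of data files (hypothetical helper)."""
    configs: Dict[str, List[str]] = {}
    for entry in meta.get("custom_configs_info", []):
        name = entry["config_name"]
        if name in configs:
            raise ValueError(f"duplicate config name: {name}")
        files = entry["data_files"]
        # Accept a single path as well as a list of paths.
        configs[name] = [files] if isinstance(files, str) else list(files)
    return configs


print(configs_from_metadata(metadata)["clf_cat_compass"])
```

Keeping the mapping explicit in the metadata means the directory layout and the config names do not have to agree, at the cost of a few extra lines of YAML per config.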
Love it! Some other ideas for naming the "custom_configs_info" field: "configs", "parameters", "config_args", "configurations"
If we update the
Actually, I feel like the second solution includes the first use case you mentioned. If you implement the second solution, then users would just have to add a few lines of YAML and their directories would be considered configurations, no? Maybe there's no need to implement two different logics to do the same thing.
Is there any update on this? 🕵🏻
@merveenoyan I haven't started working on this yet; I'm working on adding configs to packaged datasets instead (#5213), because this would both allow you to solve your issue and is a frequently requested feature. Adding arbitrary parameters to YAML would be my next task, I think!
@merveenoyan Ignore my comment above, I'm switching to this task now :D
I want to be able to create folders in a model.
Addressed in #5331
Feature request
If you go here: https://huggingface.co/datasets/inria-soda/tabular-benchmark/tree/main you will see a bunch of folders that contain various CSV files. I'd like the dataset viewer to show these files instead of only one dataset like it currently does (and also for people to be able to load them as splits instead of loading through data_files). E.g. GLUE has various splits on the viewer, but it's overkill to ask people to implement a loading script, so it would be better to let them define these in the README file instead.
Also pinging @polinaeterna @lhoestq @adrinjalali