Support for multiple configs in packaged modules via metadata yaml info #5331

Conversation

@polinaeterna (Contributor) commented on Dec 2, 2022

Will solve #5209 and #5151 and many others.

Config parameters for packaged builders are parsed from the `configs` field in the README.md YAML (a separate top-level field, not part of `dataset_info`), example:

```yaml
---
dataset_info:
...
configs:
  - config_name: v1
    data_dir: v1
    drop_labels: true
  - config_name: v2
    data_dir: v2
    drop_labels: false
```

I tried to align packaged builders with custom configs parsed from metadata with script-based dataset builders as much as possible. Their builder classes are created dynamically (see `configure_builder_class()` in `load.py`) and have a `BUILDER_CONFIGS` attribute filled with `BuilderConfig` objects, in the same way as for datasets with a script.
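To make the mechanism concrete, here is a minimal sketch of the idea (hypothetical code, not the actual `configure_builder_class()` implementation; `FolderConfig` and `build_configs` are illustrative stand-ins):

```python
from dataclasses import dataclass
from datasets import BuilderConfig

@dataclass
class FolderConfig(BuilderConfig):
    # Stand-in for a packaged builder's config class
    # (the real folder-based builders define their own fields).
    drop_labels: bool = False

def build_configs(metadata_configs):
    # `metadata_configs` is the parsed `configs` YAML field, e.g.
    # [{"config_name": "v1", "data_dir": "v1", "drop_labels": True}, ...]
    configs = []
    for params in metadata_configs:
        params = dict(params)
        name = params.pop("config_name", "default")
        configs.append(FolderConfig(name=name, **params))
    return configs

# The dynamically created builder class then gets, roughly:
#     MyBuilder.BUILDER_CONFIGS = build_configs(parsed_yaml["configs"])
```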

load_dataset

  1. If there is a single config in the metadata and it doesn't have a name, the name becomes "default" (as we do for "dataset_info"), example:

```python
load_dataset("ds") == load_dataset("ds", "default")   # load with the params provided in metadata

load_dataset("ds", "random_name")  # ValueError: BuilderConfig 'random_name' not found. Available: ['default']
```

  2. If there is a single config in the metadata with a config_name provided, it becomes the default one (loaded when no config name is specified), example:

```python
load_dataset("ds") == load_dataset("ds", "custom")   # load with the params provided in metadata

load_dataset("ds", "random_name")  # ValueError: BuilderConfig 'random_name' not found. Available: ['custom']
```

  3. If there are several named configs in the metadata, a config name must be specified, example:

```python
load_dataset("ds", "v1")  # load with "v1" params

load_dataset("ds", "v2")  # load with "v2" params

load_dataset("ds")  # ValueError: BuilderConfig 'default' not found. Available: ['v1', 'v2']
```

Thanks to @lhoestq and this change, it's possible to add a "default" field in the YAML and set it to true, to make that config the default one (loaded when no config is specified):

```yaml
configs:
  - config_name: v1
    drop_labels: true
    default: true
  - config_name: v2
...
```

then `load_dataset("ds") == load_dataset("ds", "v1")`.
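Putting these rules together, the resolution logic amounts to something like the following (a hypothetical helper for illustration, not the datasets internals):

```python
def resolve_config_name(requested, available, default_config_name=None):
    # `available`: config names parsed from the metadata;
    # `default_config_name`: the config marked `default: true`, if any.
    if requested is not None:
        if requested not in available:
            raise ValueError(
                f"BuilderConfig '{requested}' not found. Available: {available}"
            )
        return requested
    if default_config_name is not None:
        return default_config_name      # e.g. "v1" when it has `default: true`
    if len(available) == 1:
        return available[0]             # single config: "default" or a custom name
    raise ValueError(f"BuilderConfig 'default' not found. Available: {available}")

resolve_config_name(None, ["v1", "v2"], default_config_name="v1")  # 'v1'
resolve_config_name("v2", ["v1", "v2"])                            # 'v2'
```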

dataset_name and cache

I decided that it's reasonable to add a `dataset_name` attribute to the `DatasetBuilder` class, which is equal to `name` for script datasets but reflects the real dataset name for packaged builders (the last part of the path/name from the Hub). This is mostly to reorganize the cache structure (I believe we can do this in the major release?): previously, custom configs for packaged builders were all stored in the same directory, and it was becoming a mess. In general it makes much more sense like this, from the datasets server perspective too, though it's a breaking change.

So the cache dir has the following structure: `<namespace>___<dataset_name>/<config_name>/<version>/<hash>/` (the `<namespace>___` prefix is only present for datasets under a Hub user or org), and arrow/parquet filenames are also `<dataset_name>-<split>.arrow`.
For example, `polinaeterna___audiofolder_two_configs_in_metadata/v1-5532fac9443ea252/0.0.0/6cbdd16f8688354c63b4e2a36e1585d05de285023ee6443ffd71c4182055c0fc/` for the `polinaeterna/audiofolder_two_configs_in_metadata` Hub dataset; its train arrow file is `audiofolder_two_configs_in_metadata-train.arrow`.
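For illustration, a hedged sketch of composing that layout (the helper name is made up; this is not the datasets cache code):

```python
from pathlib import Path

def expected_cache_dir(root, dataset_name, config_id, version, data_hash, namespace=None):
    # "<namespace>___<dataset_name>" for Hub datasets under a user or org.
    top = f"{namespace}___{dataset_name}" if namespace else dataset_name
    return Path(root).expanduser() / top / config_id / version / data_hash

print(expected_cache_dir(
    "~/.cache/huggingface/datasets",
    "audiofolder_two_configs_in_metadata",
    "v1-5532fac9443ea252",
    "0.0.0",
    "6cbdd16f8688354c63b4e2a36e1585d05de285023ee6443ffd71c4182055c0fc",
    namespace="polinaeterna",
))
```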

For script datasets it remains unchanged.

push_to_hub

To support custom configs with push_to_hub, the data is put under a directory named `<config_name>` if `config_name` is not "default", or under `data` if `config_name` is omitted or equals "default" (for backward compatibility). A `configs` entry is added to the README.md YAML, with `config_name` (optional) and `data_files` fields. For `data_files`, a `pattern` parameter is introduced to resolve data files correctly, see polinaeterna#1.

  • ds.push_to_hub("ds") --> one config ("default"), put under "data" directory, example
dataset_info:
...
configs:
  data_files:
  - split: train
    pattern: data/train-*
...
  • ds.push_to_hub("ds", "custom") --> put under "custom" directory, example
configs:
  config_name: custom
  data_files:
    - split: train
      path: custom/train-*
...
configs:
  - config_name: v1
    data_files:
      - split: train
        path: v1/train-*
...
  - config_name: v2
    data_files:
      - split: train
        path: v2/train-*
...
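As a usage sketch of the behavior described in this section (the repo id `user/ds` is a placeholder):

```python
from datasets import Dataset, load_dataset

ds_v1 = Dataset.from_dict({"text": ["a", "b"]})
ds_v2 = Dataset.from_dict({"text": ["c", "d"]})

ds_v1.push_to_hub("user/ds", "v1")  # files go under "v1/", config "v1" is written to README.md
ds_v2.push_to_hub("user/ds", "v2")  # files go under "v2/", config "v2" is added alongside

reloaded = load_dataset("user/ds", "v1")  # resolves v1/train-* via the config's data_files
```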

Thanks to @lhoestq and polinaeterna#1, when pushing to datasets created before this change, README.md is updated accordingly (a config for the old data is added along with the one being pushed).

"dataset_info" yaml field is updated accordingly (new configs are added).
This shouldn't break anything!

TODO in separate PRs:

  • docs
  • probably update the `test` CLI util (make `--save_info` not rewrite the `configs` field in the readme)

@HuggingFaceDocBuilderDev commented on Dec 2, 2022

The documentation is not available anymore as the PR was closed or merged.
@lhoestq (Member) left a comment

Approving again :)

Good job !

@polinaeterna merged commit f49a163 into huggingface:main on Jul 13, 2023
12 checks passed
@github-actions commented

Show benchmarks: automated new-vs-old timing report for PyArrow==8.0.0 and PyArrow==latest, covering benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, and benchmark_map_filter.
