Support for multiple configs in packaged modules via metadata yaml info #5331

Conversation

@polinaeterna (Contributor) commented on Dec 2, 2022

Will solve #5209 and #5151 and many others.

Config parameters for packaged builders are parsed from the `configs` field in the README.md YAML (a separate top-level field, not part of `dataset_info`), example:

```yaml
---
dataset_info:
...
configs:
  - config_name: v1
    data_dir: v1
    drop_labels: true
  - config_name: v2
    data_dir: v2
    drop_labels: false
```

I tried to align packaged builders with custom configs parsed from metadata with script-based dataset builders as much as possible. Their builder classes are created dynamically (see `configure_builder_class()` in `load.py`) and have a `BUILDER_CONFIGS` attribute filled with `BuilderConfig` objects, in the same way as for datasets with a script.
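To make the mechanism concrete, here is a minimal sketch of the idea (hypothetical code, not the actual `configure_builder_class()` implementation; `FolderConfig` and `build_configs` are illustrative stand-ins):

```python
from dataclasses import dataclass
from datasets import BuilderConfig

@dataclass
class FolderConfig(BuilderConfig):
    # Stand-in for a packaged builder's config class
    # (the real folder-based builders define their own fields).
    drop_labels: bool = False

def build_configs(metadata_configs):
    # `metadata_configs` is the parsed `configs` YAML field, e.g.
    # [{"config_name": "v1", "data_dir": "v1", "drop_labels": True}, ...]
    configs = []
    for params in metadata_configs:
        params = dict(params)
        name = params.pop("config_name", "default")
        configs.append(FolderConfig(name=name, **params))
    return configs

# The dynamically created builder class then gets, roughly:
#     MyBuilder.BUILDER_CONFIGS = build_configs(parsed_yaml["configs"])
```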

load_dataset

  1. If there is a single config in the metadata and it doesn't have a name, the name becomes "default" (as we do for "dataset_info"), example:

```python
load_dataset("ds") == load_dataset("ds", "default")   # load with the params provided in metadata

load_dataset("ds", "random_name")  # ValueError: BuilderConfig 'random_name' not found. Available: ['default']
```

  2. If there is a single config in the metadata with a config_name provided, it becomes the default one (loaded when no config name is specified), example:

```python
load_dataset("ds") == load_dataset("ds", "custom")   # load with the params provided in metadata

load_dataset("ds", "random_name")  # ValueError: BuilderConfig 'random_name' not found. Available: ['custom']
```

  3. If there are several named configs in the metadata, a config name must be specified, example:

```python
load_dataset("ds", "v1")  # load with "v1" params

load_dataset("ds", "v2")  # load with "v2" params

load_dataset("ds")  # ValueError: BuilderConfig 'default' not found. Available: ['v1', 'v2']
```

Thanks to @lhoestq and this change, it's possible to add a "default" field in the YAML and set it to true, to make that config the default one (loaded when no config is specified):

```yaml
configs:
  - config_name: v1
    drop_labels: true
    default: true
  - config_name: v2
...
```

then `load_dataset("ds") == load_dataset("ds", "v1")`.
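Putting these rules together, the resolution logic amounts to something like the following (a hypothetical helper for illustration, not the datasets internals):

```python
def resolve_config_name(requested, available, default_config_name=None):
    # `available`: config names parsed from the metadata;
    # `default_config_name`: the config marked `default: true`, if any.
    if requested is not None:
        if requested not in available:
            raise ValueError(
                f"BuilderConfig '{requested}' not found. Available: {available}"
            )
        return requested
    if default_config_name is not None:
        return default_config_name      # e.g. "v1" when it has `default: true`
    if len(available) == 1:
        return available[0]             # single config: "default" or a custom name
    raise ValueError(f"BuilderConfig 'default' not found. Available: {available}")

resolve_config_name(None, ["v1", "v2"], default_config_name="v1")  # 'v1'
resolve_config_name("v2", ["v1", "v2"])                            # 'v2'
```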

dataset_name and cache

I decided that it's reasonable to add a `dataset_name` attribute to the `DatasetBuilder` class, which is equal to `name` for script datasets but reflects the real dataset name for packaged builders (the last part of the path/name from the Hub). This is mostly to reorganize the cache structure (I believe we can do this in the major release?): previously, custom configs for packaged builders were all stored in the same directory, and it was becoming a mess. In general it makes much more sense like this, from the datasets server perspective too, though it's a breaking change.

So the cache dir has the following structure: `<namespace>___<dataset_name>/<config_name>/<version>/<hash>/` (the `<namespace>___` prefix is only present for datasets under a Hub user or org), and arrow/parquet filenames are also `<dataset_name>-<split>.arrow`.
For example, `polinaeterna___audiofolder_two_configs_in_metadata/v1-5532fac9443ea252/0.0.0/6cbdd16f8688354c63b4e2a36e1585d05de285023ee6443ffd71c4182055c0fc/` for the `polinaeterna/audiofolder_two_configs_in_metadata` Hub dataset; its train arrow file is `audiofolder_two_configs_in_metadata-train.arrow`.
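For illustration, a hedged sketch of composing that layout (the helper name is made up; this is not the datasets cache code):

```python
from pathlib import Path

def expected_cache_dir(root, dataset_name, config_id, version, data_hash, namespace=None):
    # "<namespace>___<dataset_name>" for Hub datasets under a user or org.
    top = f"{namespace}___{dataset_name}" if namespace else dataset_name
    return Path(root).expanduser() / top / config_id / version / data_hash

print(expected_cache_dir(
    "~/.cache/huggingface/datasets",
    "audiofolder_two_configs_in_metadata",
    "v1-5532fac9443ea252",
    "0.0.0",
    "6cbdd16f8688354c63b4e2a36e1585d05de285023ee6443ffd71c4182055c0fc",
    namespace="polinaeterna",
))
```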

For script datasets it remains unchanged.

push_to_hub

To support custom configs with push_to_hub, the data is put under a directory named `<config_name>` if `config_name` is not "default", or under `data` if `config_name` is omitted or equals "default" (for backward compatibility). A `configs` entry is added to the README.md YAML, with `config_name` (optional) and `data_files` fields. For `data_files`, a `pattern` parameter is introduced to resolve data files correctly, see polinaeterna#1.

  • ds.push_to_hub("ds") --> one config ("default"), put under "data" directory, example
dataset_info:
...
configs:
  data_files:
  - split: train
    pattern: data/train-*
...
  • ds.push_to_hub("ds", "custom") --> put under "custom" directory, example
configs:
  config_name: custom
  data_files:
    - split: train
      path: custom/train-*
...
configs:
  - config_name: v1
    data_files:
      - split: train
        path: v1/train-*
...
  - config_name: v2
    data_files:
      - split: train
        path: v2/train-*
...
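As a usage sketch of the behavior described in this section (the repo id `user/ds` is a placeholder):

```python
from datasets import Dataset, load_dataset

ds_v1 = Dataset.from_dict({"text": ["a", "b"]})
ds_v2 = Dataset.from_dict({"text": ["c", "d"]})

ds_v1.push_to_hub("user/ds", "v1")  # files go under "v1/", config "v1" is written to README.md
ds_v2.push_to_hub("user/ds", "v2")  # files go under "v2/", config "v2" is added alongside

reloaded = load_dataset("user/ds", "v1")  # resolves v1/train-* via the config's data_files
```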

Thanks to @lhoestq and polinaeterna#1, when pushing to datasets created before this change, README.md is updated accordingly (a config for the old data is added along with the one being pushed).

"dataset_info" yaml field is updated accordingly (new configs are added).
This shouldn't break anything!

TODO in separate PRs:

  • docs
  • probably update the `test` CLI util (make `--save_info` not rewrite the `configs` field in the readme)

@HuggingFaceDocBuilderDev commented on Dec 2, 2022

The documentation is not available anymore as the PR was closed or merged.
@lhoestq (Member) left a comment

Approving again :)

Good job !

@polinaeterna merged commit f49a163 into huggingface:main on Jul 13, 2023
12 checks passed
@github-actions commented

Show benchmarks: automated new-vs-old timing report for PyArrow==8.0.0 and PyArrow==latest, covering benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, and benchmark_map_filter.
