
[push_to_hub] Add data_files in yaml #1

Merged

Conversation

lhoestq (Collaborator) commented Apr 3, 2023

Introducing

builder_config:
  data_files:
  - split: train
    pattern: data/train-*

in the README.md when pushing a dataset.

I also updated sanitize_patterns to support this structure as input before passing it to DataFilesDict.from_*
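For illustration, here is a minimal sketch of that normalization step (the helper name is made up; the real logic lives in sanitize_patterns): the YAML list of split/pattern entries is turned into the split-to-patterns mapping that DataFilesDict.from_* consumes.

```python
# Hypothetical sketch of the normalization, not the actual sanitize_patterns code.
def normalize_yaml_data_files(data_files):
    # YAML form: [{"split": "train", "pattern": "data/train-*"}, ...]
    if isinstance(data_files, list) and all(isinstance(entry, dict) for entry in data_files):
        return {entry["split"]: [entry["pattern"]] for entry in data_files}
    raise ValueError(f"Unsupported data_files format: {data_files}")

print(normalize_yaml_data_files([{"split": "train", "pattern": "data/train-*"}]))
# {'train': ['data/train-*']}
```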

When pushing a new config to a dataset that was pushed using an old push_to_hub, the YAML is also created automatically for the old config.

Regarding default configs: a config is the default one if it's called "default" or if it has "default: true" in its YAML.
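As a rough sketch of that resolution rule (illustrative only, the helper name is made up):

```python
from typing import Optional

# Illustrative only: return the name of the default config, if any, following
# the rule above (config named "default", or an explicit `default: true` entry).
def get_default_config_name(metadata_configs: dict) -> Optional[str]:
    for config_name, params in metadata_configs.items():
        if config_name == "default" or params.get("default") is True:
            return config_name
    return None

print(get_default_config_name({"v1": {"default": True}, "v2": {}}))  # v1
print(get_default_config_name({"default": {}, "other": {}}))         # default
```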

If it looks good to you, I can add some tests.

HuggingFaceDocBuilderDev commented Apr 3, 2023

The documentation is not available anymore as the PR was closed or merged.

polinaeterna (Owner) left a comment:

Thank you a lot! I think it's good like this. I don't really like that we introduce yet another keyword, but considering that:
1. the YAML is created automatically with .push_to_hub, and I guess not a lot of users would use this format manually to specify custom splits
2. I haven't come up with any other ideas :D
it's fine for me.
I've left a few comments and suggestions, and it would be good to have some tests.

src/datasets/utils/metadata.py (resolved)
metadata_configs = MetadataConfigs()
# create the metadata configs if it was uploaded with push_to_hub before metadata configs existed
if not metadata_configs:
    _matched_paths = [p for p in repo_files if fnmatch(p, SPLIT_PATTERN_SHARDED.replace("{split}", "*"))]
polinaeterna (Owner):

Suggested change:
- _matched_paths = [p for p in repo_files if fnmatch(p, SPLIT_PATTERN_SHARDED.replace("{split}", "*"))]
+ _matched_paths = [p for p in repo_files if fnmatch(p, f'{SPLIT_PATTERN_SHARDED.replace("{split}", "*")[:-1]}parquet')]

also check that these are parquet files?
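For example, a small standalone check along those lines (using a placeholder value for SPLIT_PATTERN_SHARDED, not the actual constant, and expressing the parquet restriction with an extension check):

```python
from fnmatch import fnmatch

SPLIT_PATTERN_SHARDED = "data/{split}-*"  # placeholder value for illustration

repo_files = ["data/train-00000-of-00001.parquet", "data/train-legacy.csv", "README.md"]

# current behaviour: any file matching the sharded split pattern
matched = [p for p in repo_files if fnmatch(p, SPLIT_PATTERN_SHARDED.replace("{split}", "*"))]

# suggested behaviour: additionally require the .parquet extension
matched_parquet = [
    p
    for p in repo_files
    if fnmatch(p, SPLIT_PATTERN_SHARDED.replace("{split}", "*")) and p.endswith(".parquet")
]

print(matched)          # both data files match the bare pattern
print(matched_parquet)  # only the parquet shard remains
```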

else:
    dataset_metadata = DatasetMetadata()
    metadata_configs = MetadataConfigs()
# create the metadata configs if it was uploaded with push_to_hub before metadata configs existed
polinaeterna (Owner):

love it, thanks!

src/datasets/data_files.py (resolved)
github-actions bot commented Apr 4, 2023

(Benchmark results for PyArrow==8.0.0 and PyArrow==latest: benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter.)

github-actions bot commented Apr 4, 2023

(Benchmark results for PyArrow==8.0.0 and PyArrow==latest: benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter.)

github-actions bot commented Apr 4, 2023

(Benchmark results for PyArrow==8.0.0 and PyArrow==latest: benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter.)

lhoestq (Collaborator, Author) commented Apr 4, 2023

Took your comments into account and added some integration tests :)

polinaeterna (Owner) left a comment:

Thank you a lot for the tests!!
My main question is about the "default" parameter. Did I understand correctly that this is to support specifying which config is the default one by adding default: true to the YAML dict of a config's parameters? Like

builder_config:
  - config_name: v1
    default: true
    data_files: ...
  - config_name: v2
    data_files: ...

@@ -608,3 +619,117 @@ def test_push_streaming_dataset_dict_to_hub(self, temporary_repo):
        assert local_ds.column_names == hub_ds.column_names
        assert list(local_ds["train"].features.keys()) == list(hub_ds["train"].features.keys())
        assert local_ds["train"].features == hub_ds["train"].features

    def test_push_multiple_dataset_configs_to_hub(self, temporary_repo):
polinaeterna (Owner):

thank you 🥹🥹🥹

**{
    param: value
    for param, value in meta_config.items()
    if hasattr(builder_config_cls, param) and param != "default"
polinaeterna (Owner):

What does it mean that a config parameter is named "default"? How might that be possible? Do you mean we can now write something like

builder_config:
  - config_name: v1
    default: true
    ...

to make a custom config the default?

lhoestq (Collaborator, Author):

Yes exactly - forgot to mention it in the OP

param
for meta_config in metadata_configs.values()
for param in meta_config
if hasattr(builder_config_cls, param) and param != "default"
polinaeterna (Owner):

Suggested change:
- if hasattr(builder_config_cls, param) and param != "default"
+ if not hasattr(builder_config_cls, param) or param == "default"

If I understood it right, this should be a negation of what is in the return statement.

lhoestq (Collaborator, Author):

good catch, I think it should be

if not hasattr(builder_config_cls, param) and param != "default"

polinaeterna (Owner):

ah true
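To make the corrected condition concrete, here is a small self-contained illustration (using a dummy stand-in class, not the real BuilderConfig) of which YAML parameters it would collect, presumably so they can be flagged as unsupported:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DummyBuilderConfig:  # stand-in for a real BuilderConfig subclass
    name: str = "default"
    data_files: Optional[dict] = None

metadata_configs = {
    "v1": {"data_files": "data/v1-*", "default": True, "typo_param": 42},
}

unknown_params = [
    param
    for meta_config in metadata_configs.values()
    for param in meta_config
    if not hasattr(DummyBuilderConfig, param) and param != "default"
]
print(unknown_params)  # ['typo_param']: known attributes and "default" are skipped
```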

"*/config2/random-*",
)
with pytest.raises(ValueError): # no config
load_dataset_builder(ds_name, download_mode="force_redownload")
polinaeterna (Owner):

We can also check the content of the metadata in README.md; I can add it myself in my main PR when this one is merged, if you don't want to :D
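A rough sketch of such a check (hypothetical test helper; assumes huggingface_hub and pyyaml are available and uses the builder_config key introduced in this PR):

```python
import yaml
from huggingface_hub import hf_hub_download

def read_readme_metadata(repo_id: str) -> dict:
    # download the pushed README.md and parse the YAML block between the "---" markers
    readme_path = hf_hub_download(repo_id, "README.md", repo_type="dataset")
    with open(readme_path, encoding="utf-8") as f:
        front_matter = f.read().split("---")[1]
    return yaml.safe_load(front_matter)

# e.g. in the test:
# metadata = read_readme_metadata(ds_name)
# assert any(cfg["config_name"] == "config2" for cfg in metadata["builder_config"])
```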

github-actions bot commented Apr 5, 2023

(Benchmark results for PyArrow==8.0.0 and PyArrow==latest: benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter.)

lhoestq merged commit 0be02e2 into arbitrary-config-parameters-in-meta-yaml on Apr 6, 2023