
[push_to_hub] Add data_files in yaml #1

Merged

Conversation

lhoestq (Collaborator) commented Apr 3, 2023

Introducing

builder_config:
  data_files:
  - split: train
    pattern: data/train-*

in the README.md when pushing a dataset.

I also updated sanitize_patterns to support this structure as input before passing it to DataFilesDict.from_*
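For illustration, here is a minimal sketch of that normalization step (the helper name is made up; the real logic lives in sanitize_patterns): the YAML list of split/pattern entries is turned into the split-to-patterns mapping that DataFilesDict.from_* consumes.

```python
# Hypothetical sketch of the normalization, not the actual sanitize_patterns code.
def normalize_yaml_data_files(data_files):
    # YAML form: [{"split": "train", "pattern": "data/train-*"}, ...]
    if isinstance(data_files, list) and all(isinstance(entry, dict) for entry in data_files):
        return {entry["split"]: [entry["pattern"]] for entry in data_files}
    raise ValueError(f"Unsupported data_files format: {data_files}")

print(normalize_yaml_data_files([{"split": "train", "pattern": "data/train-*"}]))
# {'train': ['data/train-*']}
```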

When pushing a new config to a dataset that was pushed using an old push_to_hub, the YAML is also created automatically for the old config.

Regarding default configs: a config is the default one if it's called "default" or if it has "default: true" in its YAML.
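As a rough sketch of that resolution rule (illustrative only, the helper name is made up):

```python
from typing import Optional

# Illustrative only: return the name of the default config, if any, following
# the rule above (config named "default", or an explicit `default: true` entry).
def get_default_config_name(metadata_configs: dict) -> Optional[str]:
    for config_name, params in metadata_configs.items():
        if config_name == "default" or params.get("default") is True:
            return config_name
    return None

print(get_default_config_name({"v1": {"default": True}, "v2": {}}))  # v1
print(get_default_config_name({"default": {}, "other": {}}))         # default
```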

If it looks good to you, I can add some tests.

HuggingFaceDocBuilderDev commented Apr 3, 2023

The documentation is not available anymore as the PR was closed or merged.

polinaeterna (Owner) left a comment:

Thank you a lot! I think it's good like this. I don't really like that we introduce yet another keyword, but considering that:
1. the YAML is created automatically with .push_to_hub, and I guess not a lot of users would use this format manually to specify custom splits
2. I haven't come up with any other ideas :D
it's fine for me.
I've left a few comments and suggestions, and it would be good to have some tests.

src/datasets/utils/metadata.py (resolved)
metadata_configs = MetadataConfigs()
# create the metadata configs if it was uploaded with push_to_hub before metadata configs existed
if not metadata_configs:
    _matched_paths = [p for p in repo_files if fnmatch(p, SPLIT_PATTERN_SHARDED.replace("{split}", "*"))]
polinaeterna (Owner):

Suggested change:
- _matched_paths = [p for p in repo_files if fnmatch(p, SPLIT_PATTERN_SHARDED.replace("{split}", "*"))]
+ _matched_paths = [p for p in repo_files if fnmatch(p, f'{SPLIT_PATTERN_SHARDED.replace("{split}", "*")[:-1]}parquet')]

also check that these are parquet files?
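For example, a small standalone check along those lines (using a placeholder value for SPLIT_PATTERN_SHARDED, not the actual constant, and expressing the parquet restriction with an extension check):

```python
from fnmatch import fnmatch

SPLIT_PATTERN_SHARDED = "data/{split}-*"  # placeholder value for illustration

repo_files = ["data/train-00000-of-00001.parquet", "data/train-legacy.csv", "README.md"]

# current behaviour: any file matching the sharded split pattern
matched = [p for p in repo_files if fnmatch(p, SPLIT_PATTERN_SHARDED.replace("{split}", "*"))]

# suggested behaviour: additionally require the .parquet extension
matched_parquet = [
    p
    for p in repo_files
    if fnmatch(p, SPLIT_PATTERN_SHARDED.replace("{split}", "*")) and p.endswith(".parquet")
]

print(matched)          # both data files match the bare pattern
print(matched_parquet)  # only the parquet shard remains
```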

else:
    dataset_metadata = DatasetMetadata()
    metadata_configs = MetadataConfigs()
# create the metadata configs if it was uploaded with push_to_hub before metadata configs existed
polinaeterna (Owner):

love it, thanks!

src/datasets/data_files.py (resolved)
github-actions bot commented Apr 4, 2023

(Benchmark results for PyArrow==8.0.0 and PyArrow==latest: benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter.)

github-actions bot commented Apr 4, 2023

(Benchmark results for PyArrow==8.0.0 and PyArrow==latest: benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter.)

github-actions bot commented Apr 4, 2023

(Benchmark results for PyArrow==8.0.0 and PyArrow==latest: benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter.)

lhoestq (Collaborator, Author) commented Apr 4, 2023

Took your comments into account and added some integration tests :)

polinaeterna (Owner) left a comment:

Thank you a lot for the tests!!
My main question is about the "default" parameter. Did I understand correctly that this is to support specifying which config is the default one by adding default: true to the YAML dict of a config's parameters? Like

builder_config:
  - config_name: v1
    default: true
    data_files: ...
  - config_name: v2
    data_files: ...

@@ -608,3 +619,117 @@ def test_push_streaming_dataset_dict_to_hub(self, temporary_repo):
        assert local_ds.column_names == hub_ds.column_names
        assert list(local_ds["train"].features.keys()) == list(hub_ds["train"].features.keys())
        assert local_ds["train"].features == hub_ds["train"].features

    def test_push_multiple_dataset_configs_to_hub(self, temporary_repo):
polinaeterna (Owner):

thank you 🥹🥹🥹

**{
    param: value
    for param, value in meta_config.items()
    if hasattr(builder_config_cls, param) and param != "default"
polinaeterna (Owner):

What does it mean that a config parameter is named "default"? How might that be possible? Do you mean we can now write something like

builder_config:
  - config_name: v1
    default: true
    ...

to make a custom config the default?

lhoestq (Collaborator, Author):

Yes exactly - forgot to mention it in the OP

param
for meta_config in metadata_configs.values()
for param in meta_config
if hasattr(builder_config_cls, param) and param != "default"
polinaeterna (Owner):

Suggested change:
- if hasattr(builder_config_cls, param) and param != "default"
+ if not hasattr(builder_config_cls, param) or param == "default"

If I understood it right, this should be a negation of what is in the return statement.

lhoestq (Collaborator, Author):

good catch, I think it should be

if not hasattr(builder_config_cls, param) and param != "default"

polinaeterna (Owner):

ah true
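To make the corrected condition concrete, here is a small self-contained illustration (using a dummy stand-in class, not the real BuilderConfig) of which YAML parameters it would collect, presumably so they can be flagged as unsupported:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DummyBuilderConfig:  # stand-in for a real BuilderConfig subclass
    name: str = "default"
    data_files: Optional[dict] = None

metadata_configs = {
    "v1": {"data_files": "data/v1-*", "default": True, "typo_param": 42},
}

unknown_params = [
    param
    for meta_config in metadata_configs.values()
    for param in meta_config
    if not hasattr(DummyBuilderConfig, param) and param != "default"
]
print(unknown_params)  # ['typo_param']: known attributes and "default" are skipped
```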

"*/config2/random-*",
)
with pytest.raises(ValueError): # no config
load_dataset_builder(ds_name, download_mode="force_redownload")
polinaeterna (Owner):

We can also check the content of the metadata in README.md; I can add it myself in my main PR when this one is merged, if you don't want to :D
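A rough sketch of such a check (hypothetical test helper; assumes huggingface_hub and pyyaml are available and uses the builder_config key introduced in this PR):

```python
import yaml
from huggingface_hub import hf_hub_download

def read_readme_metadata(repo_id: str) -> dict:
    # download the pushed README.md and parse the YAML block between the "---" markers
    readme_path = hf_hub_download(repo_id, "README.md", repo_type="dataset")
    with open(readme_path, encoding="utf-8") as f:
        front_matter = f.read().split("---")[1]
    return yaml.safe_load(front_matter)

# e.g. in the test:
# metadata = read_readme_metadata(ds_name)
# assert any(cfg["config_name"] == "config2" for cfg in metadata["builder_config"])
```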

github-actions bot commented Apr 5, 2023

(Benchmark results for PyArrow==8.0.0 and PyArrow==latest: benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter.)

lhoestq merged commit 0be02e2 into arbitrary-config-parameters-in-meta-yaml on Apr 6, 2023