[push_to_hub] Add data_files in yaml #1
Conversation
The documentation is not available anymore as the PR was closed or merged.
Thank you a lot! I think it's good like this. I don't really like that we introduce yet another keyword, but considering that:
1. the YAML is created automatically with `.push_to_hub`, and I guess not many users would use this format manually to specify custom splits
2. I haven't come up with any other ideas :D

it's fine for me.
I've left a few comments and suggestions; it would also be good to have some tests.
src/datasets/dataset_dict.py
Outdated
metadata_configs = MetadataConfigs()
# create the metadata configs if it was uploaded with push_to_hub before metadata configs existed
if not metadata_configs:
    _matched_paths = [p for p in repo_files if fnmatch(p, SPLIT_PATTERN_SHARDED.replace("{split}", "*"))]
Suggested change:
- _matched_paths = [p for p in repo_files if fnmatch(p, SPLIT_PATTERN_SHARDED.replace("{split}", "*"))]
+ _matched_paths = [p for p in repo_files if fnmatch(p, f'{SPLIT_PATTERN_SHARDED.replace("{split}", "*")[:-1]}parquet')]
also check that these are parquet files?
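To illustrate the suggestion above: restricting the glob to the `.parquet` extension keeps non-parquet shards out of the match. This is a hypothetical sketch; the `SPLIT_PATTERN_SHARDED` value and the `repo_files` list below are illustrative stand-ins, not the library's actual constants.

```python
from fnmatch import fnmatch

# illustrative shard pattern ending in ".*" (assumption, mirrors the discussion)
SPLIT_PATTERN_SHARDED = "data/{split}-*-of-*.*"

repo_files = [
    "data/train-00000-of-00002.parquet",
    "data/train-00001-of-00002.parquet",
    "data/test-00000-of-00001.json",  # wrong extension, should be excluded
    "README.md",
]

# drop the trailing "*" and pin the extension to "parquet"
pattern = SPLIT_PATTERN_SHARDED.replace("{split}", "*")[:-1] + "parquet"
matched = [p for p in repo_files if fnmatch(p, pattern)]
```

With the original pattern, the `.json` shard would have matched as well; pinning the extension excludes it.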
else:
    dataset_metadata = DatasetMetadata()
    metadata_configs = MetadataConfigs()
    # create the metadata configs if it was uploaded with push_to_hub before metadata configs existed
love it, thanks!
Show benchmarks (PyArrow==8.0.0): benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, benchmark_map_filter.json
Took your comments into account and added some integration tests :)
thank you a lot for the tests!!
my main question is about the "default" parameter. Did I understand it correctly that it is there to support specifying which config is the default one, by adding `default: true` to the YAML dict of a config's parameters? Like:
builder_config:
- config_name: v1
  default: true
  data_files: ...
- config_name: v2
  data_files: ...
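A hypothetical sketch of how the default config could be resolved from the parsed YAML (the function name and the rule are assumptions based on this thread: a config flagged `default: true` wins, otherwise a config literally named "default"):

```python
def get_default_config_name(metadata_configs):
    """Return the name of the default config, or None if there is none."""
    # a config explicitly flagged with default: true in the YAML
    flagged = [name for name, params in metadata_configs.items() if params.get("default")]
    if flagged:
        return flagged[0]
    # fall back to a config literally named "default"
    if "default" in metadata_configs:
        return "default"
    return None

configs = {
    "v1": {"default": True, "data_files": "v1/*"},
    "v2": {"data_files": "v2/*"},
}
```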
@@ -608,3 +619,117 @@ def test_push_streaming_dataset_dict_to_hub(self, temporary_repo):
assert local_ds.column_names == hub_ds.column_names
assert list(local_ds["train"].features.keys()) == list(hub_ds["train"].features.keys())
assert local_ds["train"].features == hub_ds["train"].features

def test_push_multiple_dataset_configs_to_hub(self, temporary_repo):
thank you 🥹🥹🥹
**{
    param: value
    for param, value in meta_config.items()
    if hasattr(builder_config_cls, param) and param != "default"
what does it mean that a config parameter is named "default"? How might that be possible? Do you mean we can now write something like

builder_config:
- config_name: v1
  default: true
  ...

to make a custom config the default?
Yes exactly - forgot to mention it in the OP
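A minimal sketch of why `param != "default"` appears in the comprehension above: "default" only marks a config in the YAML, it is not a field of the builder config class, so it has to be filtered out before the remaining parameters are forwarded to the constructor. The `BuilderConfig` class here is a simplified stand-in for `datasets.BuilderConfig`, not the real one.

```python
from dataclasses import dataclass

@dataclass
class BuilderConfig:  # simplified stand-in for datasets.BuilderConfig
    config_name: str = None
    data_files: object = None

meta_config = {"config_name": "v1", "default": True, "data_files": "v1/*"}

# keep only params that are actual BuilderConfig fields, and drop the
# YAML-only "default" flag
kwargs = {
    param: value
    for param, value in meta_config.items()
    if hasattr(BuilderConfig, param) and param != "default"
}
config = BuilderConfig(**kwargs)
```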
src/datasets/utils/metadata.py
Outdated
param
for meta_config in metadata_configs.values()
for param in meta_config
if hasattr(builder_config_cls, param) and param != "default"
Suggested change:
- if hasattr(builder_config_cls, param) and param != "default"
+ if not hasattr(builder_config_cls, param) or param == "default"

If I understood it right, this should be the negation of what is in the return statement.
good catch, I think it should be
if not hasattr(builder_config_cls, param) and param != "default"
ah true
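The predicate the thread settles on can be checked in isolation. This is a sketch with a stand-in `BuilderConfig`: a parameter is "unknown" when the builder config class does not define it AND it is not the special "default" flag (which is valid in the YAML but is not a class field).

```python
from dataclasses import dataclass

@dataclass
class BuilderConfig:  # simplified stand-in for datasets.BuilderConfig
    config_name: str = None
    data_files: object = None

def unknown_params(meta_config):
    # the corrected condition from the thread:
    # flag params the config class doesn't define, except "default"
    return [
        param
        for param in meta_config
        if not hasattr(BuilderConfig, param) and param != "default"
    ]
```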
"*/config2/random-*", | ||
) | ||
with pytest.raises(ValueError): # no config | ||
load_dataset_builder(ds_name, download_mode="force_redownload") |
we can also check the content of the metadata in README.md; I can add it myself in my main PR when this one is merged, if you don't want to :D
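The suggested README check could look like the sketch below: pull out the YAML front matter between the first two `---` markers and assert on it. The helper name and the README content are hypothetical; a real test would download the file from the repo (e.g. with `huggingface_hub`) and could parse the block with a YAML library instead of string checks.

```python
def extract_front_matter(readme_text):
    # the YAML metadata sits between the first two "---" markers
    _, front_matter, _ = readme_text.split("---", 2)
    return front_matter.strip()

# hypothetical README content written by push_to_hub
readme = """---
configs:
- config_name: config1
  data_files:
  - split: train
    path: config1/train-*
---
# My dataset
"""

front_matter = extract_front_matter(readme)
```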
Introducing `data_files` in the YAML metadata of the README.md when pushing a dataset.
I also updated `sanitize_patterns` to support this structure as input before passing it to `DataFilesDict.from_*`.
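A hypothetical sketch of the kind of normalization `sanitize_patterns` could perform for the new structure (this is not the library's actual implementation; the shapes are assumptions based on this PR): a list of `{"split": ..., "path": ...}` entries is turned into the `{split: [patterns]}` mapping that `DataFilesDict.from_*` expects.

```python
def sanitize_patterns(patterns):
    # new YAML structure: a list of {"split": ..., "path": ...} dicts
    if isinstance(patterns, list) and all(isinstance(p, dict) for p in patterns):
        return {entry["split"]: [entry["path"]] for entry in patterns}
    # plain mapping of split -> pattern or list of patterns
    if isinstance(patterns, dict):
        return {split: p if isinstance(p, list) else [p] for split, p in patterns.items()}
    # a single pattern string defaults to the "train" split
    return {"train": [patterns]}

data_files = [
    {"split": "train", "path": "config1/train-*"},
    {"split": "test", "path": "config1/test-*"},
]
```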
When pushing a new config to a dataset that was pushed using an old `push_to_hub`, the YAML is also created automatically for the old config.

Regarding default configs: a config is the default one if it's called "default" or if it has `default: true` in the YAML.

If it looks good to you I can add some tests.