-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow deleting a subset/config from a no-script dataset #6820
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
This is ready for review, @huggingface/datasets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice ! Maybe add a simple test ?
src/datasets/hub.py
Outdated
operations = [] | ||
# data_files | ||
fs = HfFileSystem(endpoint=config.HF_ENDPOINT, token=token) | ||
builder = load_dataset_builder(repo_id, config_name, revision=revision, token=token) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe disallow builders from a script ? you can check if the builder module starts with "datasets." for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have just passed trust_remote_code=False. Does it seems OK to you?
src/datasets/hub.py
Outdated
fs = HfFileSystem(endpoint=config.HF_ENDPOINT, token=token) | ||
builder = load_dataset_builder(repo_id, config_name, revision=revision, token=token) | ||
for data_file in chain(*builder.config.data_files.values()): | ||
operations.append(CommitOperationDelete(path_in_repo=fs.resolve_path(data_file).path_in_repo)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) maybe add a sanity check to make sure it's a file in the same repo (in case someone defines a dataset with data_files urls from another repo)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
I am adding a test... |
@lhoestq I am getting an error in the test and I think it happens because the CI endpoint does not have the /preupload functionality:
|
@lhoestq, finally, I implemented the test with a mock of the call to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM !
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
TODO:
Close #6810.