Support for multiple configs in packaged modules via metadata yaml info #5331

Merged
223 commits
92d5925
load configs parameters from metadata
polinaeterna Dec 2, 2022
c921741
copy paste
polinaeterna Dec 2, 2022
18b50b1
add dynamical builder creation
polinaeterna Dec 8, 2022
7257c79
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Dec 12, 2022
c65ea41
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Dec 13, 2022
9a5a60d
merge configs kwargs and dataset info in hub without script module
polinaeterna Dec 13, 2022
e245901
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Jan 17, 2023
871d3ec
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Jan 23, 2023
30299bd
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Jan 24, 2023
302c94a
fix conflicts
polinaeterna Jan 24, 2023
e9f2416
fix loading from configs for packaged module
polinaeterna Jan 24, 2023
3bcd4e2
update push to hub to include configs and update info in metadata yam…
polinaeterna Jan 26, 2023
cdf400a
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Jan 26, 2023
be56a8b
add class to read and write configs from metadata
polinaeterna Jan 26, 2023
d0f07de
remove metadata configs field from config.py
polinaeterna Jan 27, 2023
fc72481
add configs to Dataset.push_to_hub
polinaeterna Jan 27, 2023
c38fc93
refactor get_module of local ds
polinaeterna Jan 27, 2023
a4dc72c
add test for push_to_hub with multiple configs
polinaeterna Jan 30, 2023
e567f64
fix parsing configs for hub datasets
polinaeterna Jan 30, 2023
971a717
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Jan 30, 2023
fa6af31
get readme from data_files in local packaged ds (probably not needed)
polinaeterna Jan 30, 2023
7c32c89
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Feb 1, 2023
6da0245
change cache dirs names to include dataset name
polinaeterna Feb 1, 2023
6d7c67c
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Feb 2, 2023
cefd1e3
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Feb 7, 2023
2384d5c
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Feb 10, 2023
7e88a45
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Feb 13, 2023
be17a56
set configs on instance instead of dynamically creating new builder c…
polinaeterna Feb 13, 2023
1ebb226
add test for loading with different configs from data_dir
polinaeterna Feb 13, 2023
825ba4a
modify config names tests, mlodify tests for one config local factory
polinaeterna Feb 13, 2023
04f7087
fix pickling and update metadata methods to convert to/from builders …
polinaeterna Feb 14, 2023
fc69e8f
resolve merge conflicts
polinaeterna Feb 14, 2023
6cf53d7
change relative imports to absolute
polinaeterna Feb 14, 2023
d8fd3c1
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Feb 16, 2023
f0008ba
remove circular import
polinaeterna Feb 16, 2023
9ce983c
remove unused import
polinaeterna Feb 16, 2023
79e2e25
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Feb 17, 2023
db3700b
update get config names to load configs from meta, refactor import of…
polinaeterna Feb 17, 2023
98618fb
merge
polinaeterna Feb 17, 2023
dea1e58
quote config_name in warning
polinaeterna Feb 19, 2023
60321b5
fix builder names in test_builder (non related to this PR)
polinaeterna Feb 19, 2023
0baee40
more tests for local and hub factories
polinaeterna Feb 19, 2023
6ddfd75
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Feb 19, 2023
19e6907
merge
polinaeterna Feb 19, 2023
6963269
fix style
polinaeterna Feb 19, 2023
b8fbc61
fix metric module import
polinaeterna Feb 20, 2023
a44b5e2
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Feb 21, 2023
6da811b
change builder name of parametrized builders, fix inspec_metric
polinaeterna Feb 21, 2023
ed0f241
fix default configs names in inspect.py tests, change parametrized bu…
polinaeterna Feb 22, 2023
c5e2d38
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Feb 22, 2023
afa7a30
fix docstrings in push_to_hub
polinaeterna Feb 23, 2023
5d0e8af
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Feb 23, 2023
a6bc7a2
merge main
polinaeterna Feb 23, 2023
3828cb7
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Feb 24, 2023
8fc2075
merge
polinaeterna Feb 24, 2023
a477ca0
get back parametrized builder name as an attr because it's used to se…
polinaeterna Feb 24, 2023
55394b4
add test for loading parametrized dataset with num_proc>1
polinaeterna Feb 24, 2023
4bac24f
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Feb 28, 2023
1faff2f
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Mar 1, 2023
a16f34d
fix writing 'data' dir for default config in push-to_hub
polinaeterna Mar 1, 2023
1e4d2ee
refactor metadata utils
polinaeterna Mar 1, 2023
69ba48b
add tests for metadata
polinaeterna Mar 1, 2023
033558d
fix loading tests
polinaeterna Mar 1, 2023
4362b9d
fix custom splits in push_to_hub (change data dir structure for custo…
polinaeterna Mar 3, 2023
a38158d
pass only existing params to builder configs
polinaeterna Mar 3, 2023
3a8c447
Merge branch 'main' into arbitrary-config-parameters-in-meta-yaml
polinaeterna Mar 3, 2023
d2214dc
fix test for push to hub with configs, test default config too
polinaeterna Mar 3, 2023
ab9af8e
fix dataset_json reading in get_module, add tests for local and packe…
polinaeterna Mar 3, 2023
b588688
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Mar 5, 2023
4d02fd1
update dataset_infos.json for Dataset.push_to_hub()
polinaeterna Mar 5, 2023
40ca52e
update dataset_infos.json for DatasetDict.push_to_hub
polinaeterna Mar 5, 2023
3868663
Merge branch 'arbitrary-config-parameters-in-meta-yaml' of github.com…
polinaeterna Mar 5, 2023
5afde77
add dataset_name attr to builder class to differentiate between packa…
polinaeterna Mar 5, 2023
65752cf
update info
polinaeterna Mar 5, 2023
5ccc47c
use builder.dataset_name everywhere in filenames and dirs, add it to …
polinaeterna Mar 6, 2023
74b34cc
fix config_id creation
polinaeterna Mar 6, 2023
c424919
use datasets.asdict when parsing configs from BuilderCOnfig objects i…
polinaeterna Mar 6, 2023
ef9f092
Merge branch 'main' into arbitrary-config-parameters-in-meta-yaml
polinaeterna Mar 13, 2023
cd9a43e
Apply suggestions from code review
polinaeterna Mar 13, 2023
8e556f0
Apply suggestions from code review
polinaeterna Mar 13, 2023
7f832be
rename according to review suggestions
polinaeterna Mar 13, 2023
f82a8f2
resolve data_files for all metadata configs in order to not passing c…
polinaeterna Mar 13, 2023
7420dbb
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Mar 14, 2023
2d70c34
get data files for metadata configs
polinaeterna Mar 14, 2023
55f4451
fix test for factories
polinaeterna Mar 14, 2023
58ceac9
pass 'default' config_name for packaged modules without config since …
polinaeterna Mar 15, 2023
952e0ae
move converting metadata to configs out of configuring function to fi…
polinaeterna Mar 15, 2023
cfa1cdd
fix config names in tests
polinaeterna Mar 15, 2023
fb66d2f
update hash of packaged builders with custom config
polinaeterna Mar 15, 2023
24aec4f
simplify update_hash
polinaeterna Mar 15, 2023
9eb8778
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Mar 16, 2023
f8fe5e8
Merge branch 'arbitrary-config-parameters-in-meta-yaml' of github.com…
polinaeterna Mar 16, 2023
b0711aa
fix dataset_info parsing
polinaeterna Mar 16, 2023
d4176c7
revert data_dir change since it breaks everything...
polinaeterna Mar 27, 2023
21f485c
add test for custom split names in custom configs dataset with .push_…
polinaeterna Mar 27, 2023
8e93a82
revert data_dir change again
polinaeterna Mar 27, 2023
736818a
rename METADATA_CONFIGS_FIELD from configs_kwargs to builder_config
polinaeterna Mar 27, 2023
6aca83c
Merge branch 'main' into arbitrary-config-parameters-in-meta-yaml
polinaeterna Mar 27, 2023
5bb8bc7
fix imports in tests
polinaeterna Mar 27, 2023
80afb43
simplify metadata loading, some rename
polinaeterna Mar 27, 2023
19a948b
update tests to reflect change of metadata configs field name
polinaeterna Mar 27, 2023
cf293e4
refactor data files recolving for metadata configs
polinaeterna Mar 28, 2023
d4c7701
update tests
polinaeterna Mar 28, 2023
9eb1f88
add tests for resolvinf data files in metadata configs
polinaeterna Mar 28, 2023
cf33c23
update hash for packaged modules with configs in load instead of buidler
polinaeterna Mar 28, 2023
4e70b9f
Merge branch 'main' into arbitrary-config-parameters-in-meta-yaml
polinaeterna Mar 28, 2023
92a76ff
revert moving finding patterns and resolving data files in a separate…
polinaeterna Mar 28, 2023
d2c8940
don't raise error in packaged factory
polinaeterna Mar 28, 2023
17c5a9c
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Apr 3, 2023
03e2b53
Apply suggestions from code review
polinaeterna Apr 3, 2023
4526889
fix renamed methods usages
polinaeterna Apr 3, 2023
f726436
extend sanitize_patterns for data_files from yaml
lhoestq Apr 3, 2023
9fed17d
disallow pushing metadata with a dict data_files
lhoestq Apr 3, 2023
e9e2f5a
update Dataset.push_to_hub
lhoestq Apr 3, 2023
6010162
update DatasetDict.push_to_hub
lhoestq Apr 3, 2023
83282b1
remove redundant code
lhoestq Apr 3, 2023
bbd6cc7
minor comment
lhoestq Apr 3, 2023
b1bda39
error for bad data_files, and warn for bad params
lhoestq Apr 4, 2023
abfd50d
add MetadataConfigs.get_default_config_name
lhoestq Apr 4, 2023
9671de1
error in sanitize_patterns on bad data_files
lhoestq Apr 4, 2023
3fcc6de
check default config name in get_dataset_builder_class
lhoestq Apr 4, 2023
ce1bd8f
fix creation of metadata config for old push_to_hub
lhoestq Apr 4, 2023
68f850b
test push_to_hub when no metadata configs
lhoestq Apr 4, 2023
4683cf8
more tests
lhoestq Apr 4, 2023
33aa94a
Merge branch 'arbitrary-config-parameters-in-meta-yaml' into add-data…
lhoestq Apr 4, 2023
2f40214
fix ignored_params check
lhoestq Apr 5, 2023
0be02e2
Merge pull request #1 from polinaeterna/add-data_files-in-yaml
lhoestq Apr 6, 2023
7e65054
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Apr 11, 2023
1cd73bd
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Apr 13, 2023
28144f6
remove configs support from PackagedDatasetModuleFactory
polinaeterna Apr 13, 2023
7ac4111
remove it from tests
polinaeterna Apr 13, 2023
a47f05f
fix missed parameter for reduce
polinaeterna Apr 13, 2023
3c75124
add integration tests for loading
polinaeterna Apr 13, 2023
d1f31de
Merge branch 'arbitrary-config-parameters-in-meta-yaml' of github.com…
polinaeterna Apr 13, 2023
00275d6
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Apr 24, 2023
d650381
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Apr 26, 2023
b369b47
fix regex for parquet filenames
polinaeterna Apr 26, 2023
5d9b664
fix metadata configs creation: put all splits in yaml, not only the l…
polinaeterna Apr 26, 2023
5462f44
fix tests for push_to_hub
polinaeterna Apr 26, 2023
a8e7d8a
roll back push_to_hub_without_meta pattern string
polinaeterna Apr 27, 2023
9fbc546
escape/replace some special characters in pattern in string_to_dict
polinaeterna Apr 27, 2023
a65d385
fix: escape '*' in string_to_dict again
polinaeterna Apr 27, 2023
c57acde
fix: pattern in tests for backward compatibility in push to hub
polinaeterna Apr 27, 2023
dd9733e
Merge branch 'main' into arbitrary-config-parameters-in-meta-yaml
polinaeterna Apr 27, 2023
f4ed8aa
join quentin's and mine tests (lot's of copypaste but more checks)
polinaeterna Apr 27, 2023
0559e49
remove '-' from sharded parquet pattern
polinaeterna Apr 28, 2023
1f4a1b1
separate DataFilesDict and MetadataConfigs
lhoestq Apr 28, 2023
b6dd4ad
update tests
lhoestq Apr 28, 2023
bfc81ed
fix missing "info"
lhoestq Apr 28, 2023
1267667
set default config when has only one dataset_info
lhoestq Apr 28, 2023
f8d080d
typing
lhoestq Apr 28, 2023
f7d1c3c
fix typing
lhoestq Apr 28, 2023
e8ec9c9
fix: pass config name to resolve_data_files_locally
polinaeterna May 1, 2023
8ae552f
fix: tests for local module without script
polinaeterna May 1, 2023
50bd4c5
Merge branch 'arbitrary-config-parameters-in-meta-yaml' into separate…
polinaeterna May 1, 2023
57887c0
update tests
lhoestq May 2, 2023
07f4745
fix: default config=None when creating a builder
lhoestq May 3, 2023
eda5e10
add tests
lhoestq May 3, 2023
b65e57f
Merge pull request #2 from polinaeterna/separate-metadataconfigs-and-…
lhoestq May 4, 2023
5a2b729
delete ds object to see if windows CI work
polinaeterna May 4, 2023
4ccdb9f
Merge branch 'arbitrary-config-parameters-in-meta-yaml' of github.com…
polinaeterna May 4, 2023
d2114c8
Merge branch 'main' into arbitrary-config-parameters-in-meta-yaml
polinaeterna May 4, 2023
369c766
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna May 10, 2023
cb52f64
cache backward compat
lhoestq May 10, 2023
91a3ed9
polina's comments
lhoestq May 11, 2023
01a57c7
Update src/datasets/builder.py
lhoestq May 12, 2023
5eb8a13
Merge pull request #3 from polinaeterna/cache-backward-compat
lhoestq May 12, 2023
aded146
fix test
lhoestq May 12, 2023
586d046
fix legacy cache path creation for local datasets
polinaeterna May 12, 2023
e493f5f
fix of fix of legacy cache path for local datasets
polinaeterna May 12, 2023
adbb466
fix dataset_name creation (make it not None only for packaged modules)
polinaeterna May 23, 2023
cdbbf5c
remove custom reduce and add check if dynamical builder class is pick…
polinaeterna May 25, 2023
586faa2
test if builder is pickable with 'pickle', not 'dill'
polinaeterna May 25, 2023
d39b457
Apply suggestions from code review
polinaeterna May 25, 2023
657bd0e
fix review changes
polinaeterna May 25, 2023
4ad8de3
get back custom reduce to enable pickle serialization
polinaeterna May 25, 2023
f107ef9
fix test for pickle: pickle instance, not class
polinaeterna May 29, 2023
29b7cb8
remove get_builder_config method from metadata class
polinaeterna May 29, 2023
58824f1
fix: pass config_kwargs as arguments to builder class
polinaeterna May 29, 2023
e0efb52
fix typo
polinaeterna May 29, 2023
76d3996
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna May 29, 2023
0cd2afd
get dataset_name in get_module()
polinaeterna May 29, 2023
bde1554
wrap yaml error message in metadata
polinaeterna May 29, 2023
edb3174
move glob->regex to a func, add ref to fsspec
polinaeterna May 29, 2023
85cc56a
implement DataFilesList additions
polinaeterna May 29, 2023
fea31bf
get back all data_files resolving logic to load.py
polinaeterna May 30, 2023
e153313
move inferring modules for all splits to a func
polinaeterna May 30, 2023
6b86cce
refactor data files resolving and creation of builder configs from me…
polinaeterna May 31, 2023
75fd712
Merge branch 'main' into arbitrary-config-parameters-in-meta-yaml
polinaeterna May 31, 2023
9989ebd
rename metadata configs field: builder_config -> configs
polinaeterna May 31, 2023
8f6dc46
make error message in sanitize_patterns yaml-agnostic
polinaeterna May 31, 2023
1a49ead
improve error message in yaml validation
polinaeterna May 31, 2023
b8e4153
fix yaml data files validation (raise error)
polinaeterna May 31, 2023
6e0910d
(temporarily) change metadata configs yaml field: configs -> builder_…
polinaeterna Jun 1, 2023
df6e55b
move test datasets to datasets-maintainters repo
polinaeterna Jun 1, 2023
335d4a1
fix yaml validation
polinaeterna Jun 1, 2023
e9109f7
change yaml format to only allow lists and have a required config_nam…
polinaeterna Jun 2, 2023
2a49293
rename yaml field: builder_configs -> configs
polinaeterna Jun 2, 2023
1e6a338
update datasets ids in tests
polinaeterna Jun 5, 2023
aa629e0
Update tests/test_load.py
polinaeterna Jun 8, 2023
6f8309b
Merge remote-tracking branch 'upstream/main' into arbitrary-config-pa…
lhoestq Jul 7, 2023
5b5061a
fix
lhoestq Jul 7, 2023
c775278
fix
lhoestq Jul 7, 2023
73e9aca
docs
lhoestq Jul 7, 2023
2fa59f3
skip windows test
lhoestq Jul 7, 2023
7711f2a
fix test
lhoestq Jul 7, 2023
cc7464d
rename to dataset_card_data
lhoestq Jul 7, 2023
e8562c4
remove glob_pattern_to_regex from string_to_dict, use it only where w…
polinaeterna Jul 7, 2023
38d260f
group dataset module fields (related to configs construction)
polinaeterna Jul 7, 2023
0ca672c
Merge branch 'huggingface:main' into arbitrary-config-parameters-in-m…
polinaeterna Jul 11, 2023
0dc02d8
more docs
lhoestq Jul 11, 2023
ab5d45b
rename builder configs dataclass and add docstring
polinaeterna Jul 11, 2023
93c49e6
Merge branch 'arbitrary-config-parameters-in-meta-yaml' of github.com…
polinaeterna Jul 11, 2023
e9c63b9
minor
lhoestq Jul 11, 2023
e3edb1c
don't instantiate BASE_FEATURE
lhoestq Jul 11, 2023
0aa2a58
update docs
polinaeterna Jul 11, 2023
2291a76
Merge branch 'main' into arbitrary-config-parameters-in-meta-yaml
polinaeterna Jul 11, 2023
83ab47a
remove imagefolder example as it's too specific
polinaeterna Jul 11, 2023
c0a5d90
raise if data files resolving raised in during metadata config resolving
polinaeterna Jul 12, 2023
493fba2
Merge branch 'main' into arbitrary-config-parameters-in-meta-yaml
polinaeterna Jul 13, 2023
b3a963b
fix ci style check: add codes to noqa
polinaeterna Jul 13, 2023
224d28a
Revert "fix ci style check: add codes to noqa"
polinaeterna Jul 13, 2023
d2b5b83
Merge branch 'main' into arbitrary-config-parameters-in-meta-yaml
polinaeterna Jul 13, 2023
4 changes: 2 additions & 2 deletions ADD_NEW_DATASET.md
@@ -4,5 +4,5 @@ Add datasets directly to the 🤗 Hugging Face Hub!

You can share your dataset on https://huggingface.co/datasets directly using your account, see the documentation:

* [Create a dataset and upload files](https://huggingface.co/docs/datasets/upload_dataset)
* [Advanced guide using dataset scripts](https://huggingface.co/docs/datasets/share)
* [Create a dataset and upload files on the website](https://huggingface.co/docs/datasets/upload_dataset)
* [Advanced guide using the CLI](https://huggingface.co/docs/datasets/share)
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -95,8 +95,8 @@ Note that if any files were formatted by `pre-commit` hooks during committing, y

You can share your dataset on https://huggingface.co/datasets directly using your account, see the documentation:

* [Create a dataset and upload files](https://huggingface.co/docs/datasets/upload_dataset)
* [Advanced guide using dataset scripts](https://huggingface.co/docs/datasets/share)
* [Create a dataset and upload files on the website](https://huggingface.co/docs/datasets/upload_dataset)
* [Advanced guide using the CLI](https://huggingface.co/docs/datasets/share)

## How to contribute to the dataset cards

4 changes: 2 additions & 2 deletions docs/source/_toctree.yml
@@ -88,12 +88,12 @@
- sections:
- local: share
title: Share
- local: dataset_script
title: Create a dataset loading script
- local: dataset_card
title: Create a dataset card
- local: repository_structure
title: Structure your repository
- local: dataset_script
title: Create a dataset loading script
title: "Dataset repository"
title: "How-to guides"
- sections:
32 changes: 19 additions & 13 deletions docs/source/about_dataset_load.mdx
@@ -9,22 +9,35 @@ Let's begin with a basic Explain Like I'm Five.
A dataset is a directory that contains:

- Some data files in generic formats (JSON, CSV, Parquet, text, etc.)
- An optional dataset script if it requires some code to read the data files. This is used to load files of all formats and structures.
- A dataset card named `README.md` that contains documentation about the dataset as well as a YAML header to define the dataset's tags and configurations
- An optional dataset script if it requires some code to read the data files. This is sometimes used to load files of specific formats and structures.

The [`load_dataset`] function fetches the requested dataset locally or from the Hugging Face Hub.
The Hub is a central repository where all the Hugging Face datasets and models are stored.

If the dataset only contains data files, then [`load_dataset`] automatically infers how to load the data files from their extensions (json, csv, parquet, txt, etc.).
Under the hood, 🤗 Datasets will use an appropriate [`DatasetBuilder`] based on the data file format. There is one builder per data file format in 🤗 Datasets:

* [`datasets.packaged_modules.text.Text`] for text
* [`datasets.packaged_modules.csv.Csv`] for CSV and TSV
* [`datasets.packaged_modules.json.Json`] for JSON and JSONL
* [`datasets.packaged_modules.parquet.Parquet`] for Parquet
* [`datasets.packaged_modules.arrow.Arrow`] for Arrow (streaming file format)
* [`datasets.packaged_modules.sql.Sql`] for SQL databases
* [`datasets.packaged_modules.imagefolder.ImageFolder`] for image folders
* [`datasets.packaged_modules.audiofolder.AudioFolder`] for audio folders
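
The extension-based inference described above can be pictured as a lookup from file suffix to builder. The sketch below is a simplified illustration only: the `EXTENSION_TO_BUILDER` table and the `infer_builder` helper are hypothetical names, not the library's implementation, and the real lookup covers more extensions as well as folders.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical, simplified table for illustration; the builder paths are the
# ones listed above, the extension keys are assumptions.
EXTENSION_TO_BUILDER = {
    ".txt": "datasets.packaged_modules.text.Text",
    ".csv": "datasets.packaged_modules.csv.Csv",
    ".tsv": "datasets.packaged_modules.csv.Csv",
    ".json": "datasets.packaged_modules.json.Json",
    ".jsonl": "datasets.packaged_modules.json.Json",
    ".parquet": "datasets.packaged_modules.parquet.Parquet",
    ".arrow": "datasets.packaged_modules.arrow.Arrow",
}


def infer_builder(data_file: str) -> Optional[str]:
    """Return the builder path matching the file extension, or None if unknown."""
    suffix = PurePosixPath(data_file).suffix.lower()
    return EXTENSION_TO_BUILDER.get(suffix)


print(infer_builder("data/train.csv"))
print(infer_builder("data/train.JSONL"))
```

Matching on the lowercased suffix keeps the lookup case-insensitive; an unknown extension returns `None` so the caller can decide how to report it.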

If the dataset has a dataset script, then it downloads and imports it from the Hugging Face Hub.
Code in the dataset script defines the dataset information (description, features, URL to the original files, etc.), and tells 🤗 Datasets how to generate and display examples from it.
Code in the dataset script defines a custom [`DatasetBuilder`] and the dataset information (description, features, URL to the original files, etc.), and tells 🤗 Datasets how to generate and display examples from it.

<Tip>

Read the [Share](./share) section to learn more about how to share a dataset. This section also provides a step-by-step guide on how to write your own dataset loading script!
Read the [Share](./upload_dataset) section to learn more about how to share a dataset. This section also provides a step-by-step guide on how to write your own dataset loading script!

</Tip>

The dataset script downloads the dataset files from the original URL, generates the dataset and caches it in an Arrow table on your drive. If you've downloaded the dataset before, then 🤗 Datasets will reload it from the cache to save you the trouble of downloading it again.
🤗 Datasets downloads the dataset files from the original URL, generates the dataset and caches it in an Arrow table on your drive.
If you've downloaded the dataset before, then 🤗 Datasets will reload it from the cache to save you the trouble of downloading it again.

Now that you have a high-level understanding about how datasets are built, let's take a closer look at the nuts and bolts of how all this works.

@@ -83,21 +96,14 @@ There are three main methods in [`DatasetBuilder`]:

The dataset is generated with a Python generator, which doesn't load all the data in memory. As a result, the generator can handle large datasets. However, before the generated samples are flushed to the dataset file on disk, they are stored in an `ArrowWriter` buffer. This means the generated samples are written in batches. If your dataset samples consume a lot of memory (images or videos), then make sure to specify a low value for the `DEFAULT_WRITER_BATCH_SIZE` attribute in [`DatasetBuilder`]. We recommend not exceeding a size of 200 MB.
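
The buffering behaviour described above can be sketched with plain Python. This is an illustration of the general pattern only, not the `ArrowWriter` code: `generate_examples`, `write_in_batches` and the `flush` callback are hypothetical names, and `writer_batch_size` stands in for the role played by `DEFAULT_WRITER_BATCH_SIZE`.

```python
from typing import Callable, Iterable, Iterator, List


def generate_examples(n: int) -> Iterator[dict]:
    """Yield examples one at a time so the full dataset never sits in memory."""
    for i in range(n):
        yield {"id": i, "text": f"example {i}"}


def write_in_batches(
    examples: Iterable[dict],
    writer_batch_size: int,
    flush: Callable[[List[dict]], None],
) -> int:
    """Buffer examples and flush every `writer_batch_size` items.

    Returns the number of flushes performed.
    """
    buffer: List[dict] = []
    num_flushes = 0
    for example in examples:
        buffer.append(example)
        if len(buffer) >= writer_batch_size:
            flush(buffer)
            num_flushes += 1
            buffer = []
    if buffer:  # flush the final, possibly smaller, batch
        flush(buffer)
        num_flushes += 1
    return num_flushes


# Record the size of each flushed batch instead of writing to disk.
batch_sizes: List[int] = []
num_flushes = write_in_batches(
    generate_examples(10), 4, lambda batch: batch_sizes.append(len(batch))
)
print(num_flushes, batch_sizes)
```

Only one batch is held in memory at a time, which is why a smaller batch size helps when individual samples are large.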

## Without loading scripts

As a user, you want to be able to quickly use a dataset. Implementing a dataset loading script can sometimes get in the way, or it may be a barrier for some people without a developer background. 🤗 Datasets removes this barrier by making it possible to load any dataset from the Hub without a dataset loading script. All a user has to do is upload the data files (see [upload_dataset_repo](#upload_dataset_repo) for a list of supported file formats) to a dataset repository on the Hub, and they will be able to load that dataset without having to create a loading script. This doesn't mean we are moving away from loading scripts because they still offer the most flexibility in controlling how a dataset is generated.

The loading script-free method uses the [huggingface_hub](https://github.com/huggingface/huggingface_hub) library to list the files in a dataset repository. You can also provide a path to a local directory instead of a repository name, in which case 🤗 Datasets will use [glob](https://docs.python.org/3/library/glob) instead. Depending on the format of the data files available, one of the data file builders will create your dataset for you. If you have a CSV file, the CSV builder will be used and if you have a Parquet file, the Parquet builder will be used. The drawback of this approach is it's not possible to simultaneously load a CSV and JSON file. You will need to load the two file types separately, and then concatenate them.

## Maintaining integrity

To ensure a dataset is complete, [`load_dataset`] will perform a series of tests on the downloaded files to make sure everything is there. This way, you don't encounter any surprises when your requested dataset doesn't get generated as expected. [`load_dataset`] verifies:

- The list of downloaded files.
- The number of bytes of the downloaded files.
- The SHA256 checksums of the downloaded files.
- The number of splits in the generated `DatasetDict`.
- The number of samples in each split of the generated `DatasetDict`.
- The list of downloaded files.
- The SHA256 checksums of the downloaded files (disabled by default).

If the dataset doesn't pass the verifications, it is likely that the original host of the dataset made some changes in the data files.
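
As a rough illustration of what a checksum verification involves, the sketch below hashes a file in chunks and compares the digest with an expected value. It uses only the standard library; `sha256_of_file` and `verify_checksum` are hypothetical helpers, not functions exposed by 🤗 Datasets.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: str, expected_sha256: str) -> bool:
    """Return True when the file's digest matches the recorded one."""
    return sha256_of_file(path) == expected_sha256


# Demo on a temporary file standing in for a downloaded data file.
content = b"col_a,col_b\n1,2\n"
expected = hashlib.sha256(content).hexdigest()

with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = os.path.join(tmp_dir, "train.csv")
    with open(data_path, "wb") as f:
        f.write(content)
    unchanged_ok = verify_checksum(data_path, expected)

    # Simulate the original host changing the file after the checksum was recorded.
    with open(data_path, "ab") as f:
        f.write(b"3,4\n")
    modified_ok = verify_checksum(data_path, expected)

print(unchanged_ok, modified_ok)
```

Reading in chunks keeps memory use flat regardless of file size, which matters for multi-gigabyte data files.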

2 changes: 2 additions & 0 deletions docs/source/dataset_card.mdx
@@ -25,4 +25,6 @@ Creating a dataset card is easy and can be done in just a few steps:

4. Once you're done, commit the changes to the `README.md` file and you'll see the completed dataset card on your repository.

YAML also allows you to customize the way your dataset is loaded by [defining splits and/or configurations](./repository_structure#define-your-splits-and-subsets-in-yaml) without the need to write any code.

Feel free to take a look at the [SNLI](https://huggingface.co/datasets/snli), [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail), and [Allociné](https://huggingface.co/datasets/allocine) dataset cards as examples to help you get started.
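
To make the YAML-based configuration mentioned above more concrete, here is a minimal sketch that writes a `README.md` with such a header. The commit history of this PR indicates a `configs` list whose entries carry a required `config_name`; the `data_files` layout with `split` and `path` keys, the file names, and the `write_dataset_card` helper are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Assumed header layout: a `configs` list where each entry has a `config_name`.
# The `data_files` entries and file names below are made up for this example.
README_TEMPLATE = """\
---
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/train.csv"
  - split: test
    path: "data/test.csv"
---

# My dataset

Documentation about the dataset goes here.
"""


def write_dataset_card(repo_dir: str) -> Path:
    """Write a README.md whose YAML header declares one configuration."""
    readme_path = Path(repo_dir) / "README.md"
    readme_path.write_text(README_TEMPLATE, encoding="utf-8")
    return readme_path


with tempfile.TemporaryDirectory() as tmp_dir:
    card_text = write_dataset_card(tmp_dir).read_text(encoding="utf-8")

# Show only the YAML header, i.e. the text between the two `---` markers.
print(card_text.split("---")[1])
```

Keeping the configuration in the card header means the splits can be changed by editing `README.md`, without touching any Python code.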
9 changes: 6 additions & 3 deletions docs/source/dataset_script.mdx
@@ -3,12 +3,15 @@

<Tip>

The dataset script is optional if your dataset is in one of the following formats: CSV, JSON, JSON lines, text or Parquet.
With those formats, you should be able to load your dataset automatically with [`~datasets.load_dataset`].
The dataset script is likely not needed if your dataset is in one of the following formats: CSV, JSON, JSON lines, text or Parquet.
With those formats, you should be able to load your dataset automatically with [`~datasets.load_dataset`],
as long as your dataset repository has a [required structure](./repository_structure).

</Tip>

Write a dataset script to load and share your own datasets. It is a Python file that defines the different configurations and splits of your dataset, as well as how to download and process the data.
Write a dataset script to load and share datasets that consist of data files in unsupported formats or require more complex data preparation.
This is a more advanced way to define a dataset than using [YAML metadata in the dataset card](./repository_structure#define-your-splits-in-yaml).
A dataset script is a Python file that defines the different configurations and splits of your dataset, as well as how to download and process the data.

The script can download data files from any website, or from the same dataset repository.

16 changes: 16 additions & 0 deletions docs/source/package_reference/loading_methods.mdx
@@ -53,30 +53,46 @@ load_dataset("csv", data_dir="path/to/data/dir", sep="\t")

[[autodoc]] datasets.packaged_modules.text.TextConfig

[[autodoc]] datasets.packaged_modules.text.Text

### CSV

[[autodoc]] datasets.packaged_modules.csv.CsvConfig

[[autodoc]] datasets.packaged_modules.csv.Csv

### JSON

[[autodoc]] datasets.packaged_modules.json.JsonConfig

[[autodoc]] datasets.packaged_modules.json.Json

### Parquet

[[autodoc]] datasets.packaged_modules.parquet.ParquetConfig

[[autodoc]] datasets.packaged_modules.parquet.Parquet

### Arrow

[[autodoc]] datasets.packaged_modules.arrow.ArrowConfig

[[autodoc]] datasets.packaged_modules.arrow.Arrow

### SQL

[[autodoc]] datasets.packaged_modules.sql.SqlConfig

[[autodoc]] datasets.packaged_modules.sql.Sql

### Images

[[autodoc]] datasets.packaged_modules.imagefolder.ImageFolderConfig

[[autodoc]] datasets.packaged_modules.imagefolder.ImageFolder

### Audio

[[autodoc]] datasets.packaged_modules.audiofolder.AudioFolderConfig

[[autodoc]] datasets.packaged_modules.audiofolder.AudioFolder