
Commit

Merge branch 'main' into overwrite-legacy-config-dataset-infos-json
polinaeterna authored Sep 25, 2023
2 parents 744ad74 + a1e1867 commit 28312c5
Showing 14 changed files with 44 additions and 45 deletions.
24 changes: 12 additions & 12 deletions docs/README.md
@@ -33,14 +33,14 @@ pip install git+https://github.com/huggingface/doc-builder
**NOTE**

You only need to generate the documentation to inspect it locally (if you're planning changes and want to
-check how they look like before committing for instance). You don't have to commit the built documentation.
+check how they look before committing for instance). You don't have to `git commit` the built documentation.

---

## Building the documentation

-Once you have setup the `doc-builder` and additional packages, you can generate the documentation by typing th
-following command:
+Once you have setup the `doc-builder` and additional packages, you can generate the documentation by typing
+the following command:

```bash
doc-builder build datasets docs/source/ --build_dir ~/tmp/test-build
@@ -67,7 +67,7 @@ the filename without the extension in the [`_toctree.yml`](https://github.com/hu

## Renaming section headers and moving sections

-It helps to keep the old links working when renaming section header and/or moving sections from one document to another. This is because the old links are likely to be used in Issues, Forums and Social media and it'd be make for a much more superior user experience if users reading those months later could still easily navigate to the originally intended information.
+It helps to keep the old links working when renaming the section header and/or moving sections from one document to another. This is because the old links are likely to be used in Issues, Forums and Social media and it'd make for a much more superior user experience if users reading those months later could still easily navigate to the originally intended information.

Therefore we simply keep a little map of moved sections at the end of the document where the original section was. The key is to preserve the original anchor.

@@ -105,7 +105,7 @@ Adding a new tutorial or section is done in two steps:
- Link that file in `./source/_toctree.yml` on the correct toc-tree.

Make sure to put your new file under the proper section. It's unlikely to go in the first section (*Get Started*), so
-depending on the intended targets (beginners, more advanced users or researchers) it should go in section two, three or
+depending on the intended targets (beginners, more advanced users or researchers) it should go into section two, three or
four.

### Adding a new model
@@ -151,8 +151,8 @@ not to be displayed in the documentation, you can do so by specifying which meth
- save_vocabulary
```

-If you just want to add a method that is not documented (for instance magic method like `__call__` are not documented
-byt default) you can put the list of methods to add in a list that contains `all`:
+If you just want to add a method that is not documented (for instance magic method like `__call__` is not documented
+by default) you can put the list of methods to add in a list that contains `all`:

```
## XXXTokenizer
@@ -190,7 +190,7 @@ description:
```

If the description is too long to fit in one line, another indentation is necessary before writing the description
-after th argument.
+after the argument.

Here's an example showcasing everything so far:

@@ -223,7 +223,7 @@ then its documentation should look like this:
```

Note that we always omit the "defaults to \`None\`" when None is the default for any argument. Also note that even
-if the first line describing your argument type and its default gets long, you can't break it on several lines. You can
+if the first line describing your argument type and its default gets long, you can't break it into several lines. You can
however write as many lines as you want in the indented description (see the example above with `input_ids`).

#### Writing a multi-line code block
@@ -248,14 +248,14 @@ The return block should be introduced with the `Returns:` prefix, followed by a
The first line should be the type of the return, followed by a line return. No need to indent further for the elements
building the return.

-Here's an example for a single value return:
+Here's an example of a single value return:

```
Returns:
`List[int]`: A list of integers in the range [0, 1] --- 1 for a special token, 0 for a sequence token.
```

-Here's an example for tuple return, comprising several objects:
+Here's an example of tuple return, comprising several objects:

```
Returns:
Expand All @@ -280,6 +280,6 @@ We have an automatic script running with the `make style` comment that will make
- the docstrings fully take advantage of the line width
- all code examples are formatted using black, like the code of the Transformers library

-This script may have some weird failures if you made a syntax mistake or if you uncover a bug. Therefore, it's
+This script may have some weird failures if you make a syntax mistake or if you uncover a bug. Therefore, it's
recommended to commit your changes before running `make style`, so you can revert the changes done by that script
easily.
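To make the docstring conventions these hunks describe concrete, here is a minimal sketch in that style; the function and argument names are invented for illustration and do not come from the diff:

```py
def truncate(self, text, max_length=None, add_special_tokens=True):
    """
    Truncates `text` so that it fits in `max_length` tokens.

    Args:
        text (`str`):
            The input text to truncate.
        max_length (`int`, *optional*):
            Maximum number of tokens to keep. Note that "defaults to `None`" is omitted because `None` is the
            default, and a long description simply continues on further indented lines like this one.
        add_special_tokens (`bool`, *optional*, defaults to `True`):
            Whether or not to count special tokens towards `max_length`.

    Returns:
        `str`: The truncated text.
    """
```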
6 changes: 3 additions & 3 deletions docs/source/create_dataset.mdx
@@ -37,15 +37,15 @@ Then this is how the folder-based builder generates an example:
Create the image dataset by specifying `imagefolder` in [`load_dataset`]:

```py
->>> from datasets import ImageFolder
+>>> from datasets import load_dataset

>>> dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")
```

An audio dataset is created in the same way, except you specify `audiofolder` in [`load_dataset`] instead:

```py
->>> from datasets import AudioFolder
+>>> from datasets import load_dataset

>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
```
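As a quick sanity check of what the corrected snippets above load, assuming `data_dir` contains one sub-folder per class, each example exposes the decoded file plus a label inferred from the folder name (a rough sketch; the exact columns depend on the folder layout and any metadata files):

```py
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")
>>> dataset["train"][0]["label"]   # inferred from the sub-folder name
0
>>> dataset["train"][0]["image"]   # decoded as a PIL image
<PIL.JpegImagePlugin.JpegImageFile ...>
```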
@@ -109,4 +109,4 @@ We didn't mention this in the tutorial, but you can also create a dataset with a
To learn more about how to write loading scripts, take a look at the <a href="https://huggingface.co/docs/datasets/main/en/image_dataset#loading-script"><span class="underline decoration-yellow-400 decoration-2 font-semibold">image loading script</span></a>, <a href="https://huggingface.co/docs/datasets/main/en/audio_dataset"><span class="underline decoration-pink-400 decoration-2 font-semibold">audio loading script</span></a>, and <a href="https://huggingface.co/docs/datasets/main/en/dataset_script"><span class="underline decoration-green-400 decoration-2 font-semibold">text loading script</span></a> guides.
-Now that you know how to create a dataset, consider sharing it on the Hub so the community can also benefit from your work! Go on to the next section to learn how to share your dataset.
+Now that you know how to create a dataset, consider sharing it on the Hub so the community can also benefit from your work! Go on to the next section to learn how to share your dataset.
2 changes: 1 addition & 1 deletion notebooks/README.md
@@ -19,7 +19,7 @@ limitations under the License.
You can find here a list of the official notebooks provided by Hugging Face.

Also, we would like to list here interesting content created by the community.
-If you wrote some notebook(s) leveraging 🤗 Datasets and would like be listed here, please open a
+If you wrote some notebook(s) leveraging 🤗 Datasets and would like it to be listed here, please open a
Pull Request so it can be included under the Community notebooks.

## Hugging Face's notebooks 🤗
6 changes: 1 addition & 5 deletions src/datasets/download/download_manager.py
@@ -228,14 +228,10 @@ def _iter_from_paths(cls, urlpaths: Union[str, List[str]]) -> Generator[str, Non
urlpaths = [urlpaths]
for urlpath in urlpaths:
if os.path.isfile(urlpath):
-if os.path.basename(urlpath).startswith((".", "__")):
-# skipping hidden files
-continue
yield urlpath
else:
for dirpath, dirnames, filenames in os.walk(urlpath):
-# skipping hidden directories; prune the search
-# [:] for the in-place list modification required by os.walk
+# in-place modification to prune the search
dirnames[:] = sorted([dirname for dirname in dirnames if not dirname.startswith((".", "__"))])
if os.path.basename(dirpath).startswith((".", "__")):
# skipping hidden directories
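The surviving comment above is terse, so as a reminder of why the slice assignment matters when pruning an `os.walk`: rebinding `dirnames` to a new list would not affect the traversal, only mutating the list `os.walk` holds does. A standalone sketch with a made-up path:

```py
import os

for dirpath, dirnames, filenames in os.walk("/path/to/some/dataset"):  # hypothetical path
    # Mutate the list in place so os.walk never descends into hidden directories;
    # rebinding `dirnames` to a fresh list would leave the traversal unchanged.
    dirnames[:] = sorted(d for d in dirnames if not d.startswith((".", "__")))
    for filename in filenames:
        print(os.path.join(dirpath, filename))
```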
2 changes: 0 additions & 2 deletions src/datasets/download/mock_download_manager.py
@@ -232,8 +232,6 @@ def iter_files(self, paths):
paths = [paths]
for path in paths:
if os.path.isfile(path):
-if os.path.basename(path).startswith((".", "__")):
-return
yield path
else:
for dirpath, dirnames, filenames in os.walk(path):
7 changes: 1 addition & 6 deletions src/datasets/download/streaming_download_manager.py
@@ -914,15 +914,10 @@ def _iter_from_urlpaths(
urlpaths = [urlpaths]
for urlpath in urlpaths:
if xisfile(urlpath, download_config=download_config):
-if xbasename(urlpath).startswith((".", "__")):
-# skipping hidden files
-continue
yield urlpath
elif xisdir(urlpath, download_config=download_config):
for dirpath, dirnames, filenames in xwalk(urlpath, download_config=download_config):
-# skipping hidden directories; prune the search
-# [:] for the in-place list modification required by os.walk
-# (only works for local paths as fsspec's walk doesn't support the in-place modification)
+# in-place modification to prune the search
dirnames[:] = sorted([dirname for dirname in dirnames if not dirname.startswith((".", "__"))])
if xbasename(dirpath).startswith((".", "__")):
# skipping hidden directories
3 changes: 0 additions & 3 deletions src/datasets/features/audio.py
@@ -17,9 +17,6 @@
from .features import FeatureType


-_ffmpeg_warned, _librosa_warned, _audioread_warned = False, False, False


@dataclass
class Audio:
"""Audio [`Feature`] to extract audio data from an audio file.
2 changes: 1 addition & 1 deletion src/datasets/fingerprint.py
@@ -466,7 +466,7 @@ def fingerprint_transform(

def _fingerprint(func):
if not inplace and not all(name in func.__code__.co_varnames for name in fingerprint_names):
raise ValueError("function {func} is missing parameters {fingerprint_names} in signature")
raise ValueError(f"function {func} is missing parameters {fingerprint_names} in signature")

if randomized_function: # randomized function have seed and generator parameters
if "seed" not in func.__code__.co_varnames:
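The only change in this hunk is the missing `f` prefix: without it, Python keeps the `{...}` placeholders literally instead of interpolating them. A quick illustration with made-up values:

```py
func_name, missing = "my_transform", ["new_fingerprint"]

print("function {func_name} is missing parameters {missing} in signature")
# function {func_name} is missing parameters {missing} in signature

print(f"function {func_name} is missing parameters {missing} in signature")
# function my_transform is missing parameters ['new_fingerprint'] in signature
```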
4 changes: 3 additions & 1 deletion src/datasets/inspect.py
@@ -358,7 +358,9 @@ def get_dataset_config_names(
**download_kwargs,
)
builder_cls = get_dataset_builder_class(dataset_module, dataset_name=os.path.basename(path))
-return list(builder_cls.builder_configs.keys()) or [dataset_module.builder_kwargs.get("config_name", "default")]
+return list(builder_cls.builder_configs.keys()) or [
+dataset_module.builder_kwargs.get("config_name", builder_cls.DEFAULT_CONFIG_NAME or "default")
+]


def get_dataset_config_info(
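For context, this is the public helper whose fallback changes here: a dataset that declares no explicit configs now reports its builder's `DEFAULT_CONFIG_NAME` before falling back to the generic `"default"`. A rough usage sketch (output abbreviated, requires access to the Hub):

```py
>>> from datasets import get_dataset_config_names
>>> get_dataset_config_names("glue")   # a dataset with several named configs
['cola', 'sst2', 'mrpc', ...]
```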
6 changes: 3 additions & 3 deletions src/datasets/iterable_dataset.py
@@ -134,7 +134,7 @@ def _convert_to_arrow(
iterator = iter(iterable)
for key, example in iterator:
iterator_batch = islice(iterator, batch_size - 1)
-key_examples_list = [(key, example)] + [(key, example) for key, example in iterator_batch]
+key_examples_list = [(key, example)] + list(iterator_batch)
if len(key_examples_list) < batch_size and drop_last_batch:
return
keys, examples = zip(*key_examples_list)
@@ -697,7 +697,7 @@ def _iter(self):
if self.batch_size is None or self.batch_size <= 0
else islice(iterator, self.batch_size - 1)
)
-key_examples_list = [(key, example)] + [(key, example) for key, example in iterator_batch]
+key_examples_list = [(key, example)] + list(iterator_batch)
keys, examples = zip(*key_examples_list)
if (
self.drop_last_batch
@@ -880,7 +880,7 @@ def _iter(self):
if self.batch_size is None or self.batch_size <= 0
else islice(iterator, self.batch_size - 1)
)
-key_examples_list = [(key, example)] + [(key, example) for key, example in iterator_batch]
+key_examples_list = [(key, example)] + list(iterator_batch)
keys, examples = zip(*key_examples_list)
batch = _examples_to_batch(examples)
batch = format_dict(batch) if format_dict else batch
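The rewritten lines in all three hunks rely on the same idiom: take the first `(key, example)` pair from the iterator, then let `islice` pull at most `batch_size - 1` further items from that same iterator. A minimal standalone sketch of the batching pattern (not the library implementation itself):

```py
from itertools import islice

def iter_batches(iterable, batch_size):
    """Yield lists of at most `batch_size` items without materializing the whole iterable."""
    iterator = iter(iterable)
    for first in iterator:
        # islice consumes at most batch_size - 1 additional items from the same iterator
        yield [first] + list(islice(iterator, batch_size - 1))

print(list(iter_batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```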
@@ -146,7 +146,7 @@ def analyze(files_or_archives, downloaded_files_or_dirs, split):
datasets.SplitGenerator(
name=split_name,
gen_kwargs={
"files": [(file, downloaded_file) for file, downloaded_file in zip(files, downloaded_files)]
"files": list(zip(files, downloaded_files))
+ [(None, dl_manager.iter_files(downloaded_dir)) for downloaded_dir in downloaded_dirs],
"metadata_files": metadata_files,
"split_name": split_name,
7 changes: 4 additions & 3 deletions src/datasets/table.py
@@ -2002,7 +2002,7 @@ def array_cast(array: pa.Array, pa_type: pa.DataType, allow_number_to_str=True):
pa_type.list_size,
)
elif pa.types.is_list(pa_type):
-offsets_arr = pa.array(range(len(array) + 1), pa.int32())
+offsets_arr = pa.array(np.arange(len(array) + 1) * array.type.list_size, pa.int32())
if array.null_count > 0:
if config.PYARROW_VERSION.major < 10:
warnings.warn(
@@ -2061,6 +2061,7 @@ def cast_array_to_feature(array: pa.Array, feature: "FeatureType", allow_number_
array = array.storage
if hasattr(feature, "cast_storage"):
return feature.cast_storage(array)

elif pa.types.is_struct(array.type):
# feature must be a dict or Sequence(subfeatures_dict)
if isinstance(feature, Sequence) and isinstance(feature.feature, dict):
@@ -2126,7 +2127,7 @@ def cast_array_to_feature(array: pa.Array, feature: "FeatureType", allow_number_
if feature.length * len(array) == len(array_values):
return pa.FixedSizeListArray.from_arrays(_c(array_values, feature.feature), feature.length)
else:
-offsets_arr = pa.array(range(len(array) + 1), pa.int32())
+offsets_arr = pa.array(np.arange(len(array) + 1) * array.type.list_size, pa.int32())
if array.null_count > 0:
if config.PYARROW_VERSION.major < 10:
warnings.warn(
@@ -2233,7 +2234,7 @@ def embed_array_storage(array: pa.Array, feature: "FeatureType"):
if feature.length * len(array) == len(array_values):
return pa.FixedSizeListArray.from_arrays(_e(array_values, feature.feature), feature.length)
else:
-offsets_arr = pa.array(range(len(array) + 1), pa.int32())
+offsets_arr = pa.array(np.arange(len(array) + 1) * array.type.list_size, pa.int32())
if array.null_count > 0:
if config.PYARROW_VERSION.major < 10:
warnings.warn(
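The three `offsets_arr` lines are the substance of this merge: when a `FixedSizeListArray` is cast to a variable-size list type, the offsets into its flat values buffer have to step by `list_size` rather than by 1. A small self-contained pyarrow sketch of the corrected construction (not the library code itself):

```py
import numpy as np
import pyarrow as pa

# A fixed-size list array: 3 rows of length 3 backed by a flat values buffer [0..8]
arr = pa.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], pa.list_(pa.int32(), 3))

# The old offsets (0, 1, 2, 3) would slice that flat buffer into length-1 lists;
# stepping by the fixed list size gives (0, 3, 6, 9) and preserves the rows.
offsets = pa.array(np.arange(len(arr) + 1) * arr.type.list_size, pa.int32())
as_variable_list = pa.ListArray.from_arrays(offsets, arr.values)
print(as_variable_list.to_pylist())  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```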
6 changes: 2 additions & 4 deletions tests/test_iterable_dataset.py
@@ -131,7 +131,7 @@ def test_convert_to_arrow(batch_size, drop_last_batch):
num_batches = (num_rows // batch_size) + 1 if num_rows % batch_size else num_rows // batch_size
subtables = list(
_convert_to_arrow(
-[(i, example) for i, example in enumerate(examples)],
+list(enumerate(examples)),
batch_size=batch_size,
drop_last_batch=drop_last_batch,
)
@@ -162,9 +162,7 @@ def test_batch_arrow_tables(tables, batch_size, drop_last_batch):
num_rows = len(full_table) if not drop_last_batch else len(full_table) // batch_size * batch_size
num_batches = (num_rows // batch_size) + 1 if num_rows % batch_size else num_rows // batch_size
subtables = list(
-_batch_arrow_tables(
-[(i, table) for i, table in enumerate(tables)], batch_size=batch_size, drop_last_batch=drop_last_batch
-)
+_batch_arrow_tables(list(enumerate(tables)), batch_size=batch_size, drop_last_batch=drop_last_batch)
)
assert len(subtables) == num_batches
if drop_last_batch:
12 changes: 12 additions & 0 deletions tests/test_table.py
@@ -1189,6 +1189,18 @@ def test_cast_array_to_features_sequence_classlabel():
assert cast_array_to_feature(arr, Sequence(ClassLabel(names=["foo", "bar"])))


+def test_cast_fixed_size_array_to_features_sequence():
+arr = pa.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], pa.list_(pa.int32(), 3))
+# Fixed size list
+casted_array = cast_array_to_feature(arr, Sequence(Value("int64"), length=3))
+assert casted_array.type == pa.list_(pa.int64(), 3)
+assert casted_array.to_pylist() == arr.to_pylist()
+# Variable size list
+casted_array = cast_array_to_feature(arr, Sequence(Value("int64")))
+assert casted_array.type == pa.list_(pa.int64())
+assert casted_array.to_pylist() == arr.to_pylist()


def test_cast_sliced_fixed_size_array_to_features():
arr = pa.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], pa.list_(pa.int32(), 3))
casted_array = cast_array_to_feature(arr[1:], Sequence(Value("int64"), length=3))