
Commit

Merge branch 'main' into overwrite-legacy-config-dataset-infos-json
polinaeterna authored Sep 25, 2023
2 parents 744ad74 + a1e1867 commit 28312c5
Showing 14 changed files with 44 additions and 45 deletions.
24 changes: 12 additions & 12 deletions docs/README.md
@@ -33,14 +33,14 @@ pip install git+https://github.com/huggingface/doc-builder
**NOTE**

You only need to generate the documentation to inspect it locally (if you're planning changes and want to
-check how they look like before committing for instance). You don't have to commit the built documentation.
+check how they look before committing for instance). You don't have to `git commit` the built documentation.

---

## Building the documentation

-Once you have setup the `doc-builder` and additional packages, you can generate the documentation by typing th
-following command:
+Once you have setup the `doc-builder` and additional packages, you can generate the documentation by typing
+the following command:

```bash
doc-builder build datasets docs/source/ --build_dir ~/tmp/test-build
@@ -67,7 +67,7 @@ the filename without the extension in the [`_toctree.yml`](https://github.com/hu

## Renaming section headers and moving sections

-It helps to keep the old links working when renaming section header and/or moving sections from one document to another. This is because the old links are likely to be used in Issues, Forums and Social media and it'd be make for a much more superior user experience if users reading those months later could still easily navigate to the originally intended information.
+It helps to keep the old links working when renaming the section header and/or moving sections from one document to another. This is because the old links are likely to be used in Issues, Forums and Social media and it'd make for a much more superior user experience if users reading those months later could still easily navigate to the originally intended information.

Therefore we simply keep a little map of moved sections at the end of the document where the original section was. The key is to preserve the original anchor.

@@ -105,7 +105,7 @@ Adding a new tutorial or section is done in two steps:
- Link that file in `./source/_toctree.yml` on the correct toc-tree.

Make sure to put your new file under the proper section. It's unlikely to go in the first section (*Get Started*), so
-depending on the intended targets (beginners, more advanced users or researchers) it should go in section two, three or
+depending on the intended targets (beginners, more advanced users or researchers) it should go into section two, three or
four.

### Adding a new model
@@ -151,8 +151,8 @@ not to be displayed in the documentation, you can do so by specifying which meth
- save_vocabulary
```

-If you just want to add a method that is not documented (for instance magic method like `__call__` are not documented
-byt default) you can put the list of methods to add in a list that contains `all`:
+If you just want to add a method that is not documented (for instance magic method like `__call__` is not documented
+by default) you can put the list of methods to add in a list that contains `all`:

```
## XXXTokenizer
@@ -190,7 +190,7 @@ description:
```

If the description is too long to fit in one line, another indentation is necessary before writing the description
-after th argument.
+after the argument.

Here's an example showcasing everything so far:

@@ -223,7 +223,7 @@ then its documentation should look like this:
```

Note that we always omit the "defaults to \`None\`" when None is the default for any argument. Also note that even
-if the first line describing your argument type and its default gets long, you can't break it on several lines. You can
+if the first line describing your argument type and its default gets long, you can't break it into several lines. You can
however write as many lines as you want in the indented description (see the example above with `input_ids`).

#### Writing a multi-line code block
@@ -248,14 +248,14 @@ The return block should be introduced with the `Returns:` prefix, followed by a
The first line should be the type of the return, followed by a line return. No need to indent further for the elements
building the return.

-Here's an example for a single value return:
+Here's an example of a single value return:

```
Returns:
`List[int]`: A list of integers in the range [0, 1] --- 1 for a special token, 0 for a sequence token.
```

-Here's an example for tuple return, comprising several objects:
+Here's an example of tuple return, comprising several objects:

```
Returns:
Expand All @@ -280,6 +280,6 @@ We have an automatic script running with the `make style` comment that will make
- the docstrings fully take advantage of the line width
- all code examples are formatted using black, like the code of the Transformers library

-This script may have some weird failures if you made a syntax mistake or if you uncover a bug. Therefore, it's
+This script may have some weird failures if you make a syntax mistake or if you uncover a bug. Therefore, it's
recommended to commit your changes before running `make style`, so you can revert the changes done by that script
easily.
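To make the docstring conventions these hunks describe concrete, here is a minimal sketch in that style; the function and argument names are invented for illustration and do not come from the diff:

```py
def truncate(self, text, max_length=None, add_special_tokens=True):
    """
    Truncates `text` so that it fits in `max_length` tokens.

    Args:
        text (`str`):
            The input text to truncate.
        max_length (`int`, *optional*):
            Maximum number of tokens to keep. Note that "defaults to `None`" is omitted because `None` is the
            default, and a long description simply continues on further indented lines like this one.
        add_special_tokens (`bool`, *optional*, defaults to `True`):
            Whether or not to count special tokens towards `max_length`.

    Returns:
        `str`: The truncated text.
    """
```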
6 changes: 3 additions & 3 deletions docs/source/create_dataset.mdx
@@ -37,15 +37,15 @@ Then this is how the folder-based builder generates an example:
Create the image dataset by specifying `imagefolder` in [`load_dataset`]:

```py
->>> from datasets import ImageFolder
+>>> from datasets import load_dataset

>>> dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")
```

An audio dataset is created in the same way, except you specify `audiofolder` in [`load_dataset`] instead:

```py
->>> from datasets import AudioFolder
+>>> from datasets import load_dataset

>>> dataset = load_dataset("audiofolder", data_dir="/path/to/folder")
```
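As a quick sanity check of what the corrected snippets above load, assuming `data_dir` contains one sub-folder per class, each example exposes the decoded file plus a label inferred from the folder name (a rough sketch; the exact columns depend on the folder layout and any metadata files):

```py
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/pokemon")
>>> dataset["train"][0]["label"]   # inferred from the sub-folder name
0
>>> dataset["train"][0]["image"]   # decoded as a PIL image
<PIL.JpegImagePlugin.JpegImageFile ...>
```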
@@ -109,4 +109,4 @@ We didn't mention this in the tutorial, but you can also create a dataset with a
To learn more about how to write loading scripts, take a look at the <a href="https://huggingface.co/docs/datasets/main/en/image_dataset#loading-script"><span class="underline decoration-yellow-400 decoration-2 font-semibold">image loading script</span></a>, <a href="https://huggingface.co/docs/datasets/main/en/audio_dataset"><span class="underline decoration-pink-400 decoration-2 font-semibold">audio loading script</span></a>, and <a href="https://huggingface.co/docs/datasets/main/en/dataset_script"><span class="underline decoration-green-400 decoration-2 font-semibold">text loading script</span></a> guides.
-Now that you know how to create a dataset, consider sharing it on the Hub so the community can also benefit from your work! Go on to the next section to learn how to share your dataset.
+Now that you know how to create a dataset, consider sharing it on the Hub so the community can also benefit from your work! Go on to the next section to learn how to share your dataset.
2 changes: 1 addition & 1 deletion notebooks/README.md
@@ -19,7 +19,7 @@ limitations under the License.
You can find here a list of the official notebooks provided by Hugging Face.

Also, we would like to list here interesting content created by the community.
-If you wrote some notebook(s) leveraging 🤗 Datasets and would like be listed here, please open a
+If you wrote some notebook(s) leveraging 🤗 Datasets and would like it to be listed here, please open a
Pull Request so it can be included under the Community notebooks.

## Hugging Face's notebooks 🤗
6 changes: 1 addition & 5 deletions src/datasets/download/download_manager.py
@@ -228,14 +228,10 @@ def _iter_from_paths(cls, urlpaths: Union[str, List[str]]) -> Generator[str, Non
urlpaths = [urlpaths]
for urlpath in urlpaths:
if os.path.isfile(urlpath):
-if os.path.basename(urlpath).startswith((".", "__")):
-# skipping hidden files
-continue
yield urlpath
else:
for dirpath, dirnames, filenames in os.walk(urlpath):
-# skipping hidden directories; prune the search
-# [:] for the in-place list modification required by os.walk
+# in-place modification to prune the search
dirnames[:] = sorted([dirname for dirname in dirnames if not dirname.startswith((".", "__"))])
if os.path.basename(dirpath).startswith((".", "__")):
# skipping hidden directories
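The surviving comment above is terse, so as a reminder of why the slice assignment matters when pruning an `os.walk`: rebinding `dirnames` to a new list would not affect the traversal, only mutating the list `os.walk` holds does. A standalone sketch with a made-up path:

```py
import os

for dirpath, dirnames, filenames in os.walk("/path/to/some/dataset"):  # hypothetical path
    # Mutate the list in place so os.walk never descends into hidden directories;
    # rebinding `dirnames` to a fresh list would leave the traversal unchanged.
    dirnames[:] = sorted(d for d in dirnames if not d.startswith((".", "__")))
    for filename in filenames:
        print(os.path.join(dirpath, filename))
```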
2 changes: 0 additions & 2 deletions src/datasets/download/mock_download_manager.py
@@ -232,8 +232,6 @@ def iter_files(self, paths):
paths = [paths]
for path in paths:
if os.path.isfile(path):
-if os.path.basename(path).startswith((".", "__")):
-return
yield path
else:
for dirpath, dirnames, filenames in os.walk(path):
7 changes: 1 addition & 6 deletions src/datasets/download/streaming_download_manager.py
@@ -914,15 +914,10 @@ def _iter_from_urlpaths(
urlpaths = [urlpaths]
for urlpath in urlpaths:
if xisfile(urlpath, download_config=download_config):
-if xbasename(urlpath).startswith((".", "__")):
-# skipping hidden files
-continue
yield urlpath
elif xisdir(urlpath, download_config=download_config):
for dirpath, dirnames, filenames in xwalk(urlpath, download_config=download_config):
-# skipping hidden directories; prune the search
-# [:] for the in-place list modification required by os.walk
-# (only works for local paths as fsspec's walk doesn't support the in-place modification)
+# in-place modification to prune the search
dirnames[:] = sorted([dirname for dirname in dirnames if not dirname.startswith((".", "__"))])
if xbasename(dirpath).startswith((".", "__")):
# skipping hidden directories
3 changes: 0 additions & 3 deletions src/datasets/features/audio.py
@@ -17,9 +17,6 @@
from .features import FeatureType


-_ffmpeg_warned, _librosa_warned, _audioread_warned = False, False, False


@dataclass
class Audio:
"""Audio [`Feature`] to extract audio data from an audio file.
2 changes: 1 addition & 1 deletion src/datasets/fingerprint.py
@@ -466,7 +466,7 @@ def fingerprint_transform(

def _fingerprint(func):
if not inplace and not all(name in func.__code__.co_varnames for name in fingerprint_names):
raise ValueError("function {func} is missing parameters {fingerprint_names} in signature")
raise ValueError(f"function {func} is missing parameters {fingerprint_names} in signature")

if randomized_function: # randomized function have seed and generator parameters
if "seed" not in func.__code__.co_varnames:
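The only change in this hunk is the missing `f` prefix: without it, Python keeps the `{...}` placeholders literally instead of interpolating them. A quick illustration with made-up values:

```py
func_name, missing = "my_transform", ["new_fingerprint"]

print("function {func_name} is missing parameters {missing} in signature")
# function {func_name} is missing parameters {missing} in signature

print(f"function {func_name} is missing parameters {missing} in signature")
# function my_transform is missing parameters ['new_fingerprint'] in signature
```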
4 changes: 3 additions & 1 deletion src/datasets/inspect.py
@@ -358,7 +358,9 @@ def get_dataset_config_names(
**download_kwargs,
)
builder_cls = get_dataset_builder_class(dataset_module, dataset_name=os.path.basename(path))
-return list(builder_cls.builder_configs.keys()) or [dataset_module.builder_kwargs.get("config_name", "default")]
+return list(builder_cls.builder_configs.keys()) or [
+dataset_module.builder_kwargs.get("config_name", builder_cls.DEFAULT_CONFIG_NAME or "default")
+]


def get_dataset_config_info(
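For context, this is the public helper whose fallback changes here: a dataset that declares no explicit configs now reports its builder's `DEFAULT_CONFIG_NAME` before falling back to the generic `"default"`. A rough usage sketch (output abbreviated, requires access to the Hub):

```py
>>> from datasets import get_dataset_config_names
>>> get_dataset_config_names("glue")   # a dataset with several named configs
['cola', 'sst2', 'mrpc', ...]
```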
6 changes: 3 additions & 3 deletions src/datasets/iterable_dataset.py
@@ -134,7 +134,7 @@ def _convert_to_arrow(
iterator = iter(iterable)
for key, example in iterator:
iterator_batch = islice(iterator, batch_size - 1)
-key_examples_list = [(key, example)] + [(key, example) for key, example in iterator_batch]
+key_examples_list = [(key, example)] + list(iterator_batch)
if len(key_examples_list) < batch_size and drop_last_batch:
return
keys, examples = zip(*key_examples_list)
@@ -697,7 +697,7 @@ def _iter(self):
if self.batch_size is None or self.batch_size <= 0
else islice(iterator, self.batch_size - 1)
)
-key_examples_list = [(key, example)] + [(key, example) for key, example in iterator_batch]
+key_examples_list = [(key, example)] + list(iterator_batch)
keys, examples = zip(*key_examples_list)
if (
self.drop_last_batch
@@ -880,7 +880,7 @@ def _iter(self):
if self.batch_size is None or self.batch_size <= 0
else islice(iterator, self.batch_size - 1)
)
-key_examples_list = [(key, example)] + [(key, example) for key, example in iterator_batch]
+key_examples_list = [(key, example)] + list(iterator_batch)
keys, examples = zip(*key_examples_list)
batch = _examples_to_batch(examples)
batch = format_dict(batch) if format_dict else batch
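The rewritten lines in all three hunks rely on the same idiom: take the first `(key, example)` pair from the iterator, then let `islice` pull at most `batch_size - 1` further items from that same iterator. A minimal standalone sketch of the batching pattern (not the library implementation itself):

```py
from itertools import islice

def iter_batches(iterable, batch_size):
    """Yield lists of at most `batch_size` items without materializing the whole iterable."""
    iterator = iter(iterable)
    for first in iterator:
        # islice consumes at most batch_size - 1 additional items from the same iterator
        yield [first] + list(islice(iterator, batch_size - 1))

print(list(iter_batches(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```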
@@ -146,7 +146,7 @@ def analyze(files_or_archives, downloaded_files_or_dirs, split):
datasets.SplitGenerator(
name=split_name,
gen_kwargs={
"files": [(file, downloaded_file) for file, downloaded_file in zip(files, downloaded_files)]
"files": list(zip(files, downloaded_files))
+ [(None, dl_manager.iter_files(downloaded_dir)) for downloaded_dir in downloaded_dirs],
"metadata_files": metadata_files,
"split_name": split_name,
7 changes: 4 additions & 3 deletions src/datasets/table.py
@@ -2002,7 +2002,7 @@ def array_cast(array: pa.Array, pa_type: pa.DataType, allow_number_to_str=True):
pa_type.list_size,
)
elif pa.types.is_list(pa_type):
-offsets_arr = pa.array(range(len(array) + 1), pa.int32())
+offsets_arr = pa.array(np.arange(len(array) + 1) * array.type.list_size, pa.int32())
if array.null_count > 0:
if config.PYARROW_VERSION.major < 10:
warnings.warn(
@@ -2061,6 +2061,7 @@ def cast_array_to_feature(array: pa.Array, feature: "FeatureType", allow_number_
array = array.storage
if hasattr(feature, "cast_storage"):
return feature.cast_storage(array)

elif pa.types.is_struct(array.type):
# feature must be a dict or Sequence(subfeatures_dict)
if isinstance(feature, Sequence) and isinstance(feature.feature, dict):
@@ -2126,7 +2127,7 @@ def cast_array_to_feature(array: pa.Array, feature: "FeatureType", allow_number_
if feature.length * len(array) == len(array_values):
return pa.FixedSizeListArray.from_arrays(_c(array_values, feature.feature), feature.length)
else:
-offsets_arr = pa.array(range(len(array) + 1), pa.int32())
+offsets_arr = pa.array(np.arange(len(array) + 1) * array.type.list_size, pa.int32())
if array.null_count > 0:
if config.PYARROW_VERSION.major < 10:
warnings.warn(
@@ -2233,7 +2234,7 @@ def embed_array_storage(array: pa.Array, feature: "FeatureType"):
if feature.length * len(array) == len(array_values):
return pa.FixedSizeListArray.from_arrays(_e(array_values, feature.feature), feature.length)
else:
-offsets_arr = pa.array(range(len(array) + 1), pa.int32())
+offsets_arr = pa.array(np.arange(len(array) + 1) * array.type.list_size, pa.int32())
if array.null_count > 0:
if config.PYARROW_VERSION.major < 10:
warnings.warn(
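The three `offsets_arr` lines are the substance of this merge: when a `FixedSizeListArray` is cast to a variable-size list type, the offsets into its flat values buffer have to step by `list_size` rather than by 1. A small self-contained pyarrow sketch of the corrected construction (not the library code itself):

```py
import numpy as np
import pyarrow as pa

# A fixed-size list array: 3 rows of length 3 backed by a flat values buffer [0..8]
arr = pa.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], pa.list_(pa.int32(), 3))

# The old offsets (0, 1, 2, 3) would slice that flat buffer into length-1 lists;
# stepping by the fixed list size gives (0, 3, 6, 9) and preserves the rows.
offsets = pa.array(np.arange(len(arr) + 1) * arr.type.list_size, pa.int32())
as_variable_list = pa.ListArray.from_arrays(offsets, arr.values)
print(as_variable_list.to_pylist())  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```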
6 changes: 2 additions & 4 deletions tests/test_iterable_dataset.py
@@ -131,7 +131,7 @@ def test_convert_to_arrow(batch_size, drop_last_batch):
num_batches = (num_rows // batch_size) + 1 if num_rows % batch_size else num_rows // batch_size
subtables = list(
_convert_to_arrow(
-[(i, example) for i, example in enumerate(examples)],
+list(enumerate(examples)),
batch_size=batch_size,
drop_last_batch=drop_last_batch,
)
@@ -162,9 +162,7 @@ def test_batch_arrow_tables(tables, batch_size, drop_last_batch):
num_rows = len(full_table) if not drop_last_batch else len(full_table) // batch_size * batch_size
num_batches = (num_rows // batch_size) + 1 if num_rows % batch_size else num_rows // batch_size
subtables = list(
-_batch_arrow_tables(
-[(i, table) for i, table in enumerate(tables)], batch_size=batch_size, drop_last_batch=drop_last_batch
-)
+_batch_arrow_tables(list(enumerate(tables)), batch_size=batch_size, drop_last_batch=drop_last_batch)
)
assert len(subtables) == num_batches
if drop_last_batch:
12 changes: 12 additions & 0 deletions tests/test_table.py
@@ -1189,6 +1189,18 @@ def test_cast_array_to_features_sequence_classlabel():
assert cast_array_to_feature(arr, Sequence(ClassLabel(names=["foo", "bar"])))


+def test_cast_fixed_size_array_to_features_sequence():
+arr = pa.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], pa.list_(pa.int32(), 3))
+# Fixed size list
+casted_array = cast_array_to_feature(arr, Sequence(Value("int64"), length=3))
+assert casted_array.type == pa.list_(pa.int64(), 3)
+assert casted_array.to_pylist() == arr.to_pylist()
+# Variable size list
+casted_array = cast_array_to_feature(arr, Sequence(Value("int64")))
+assert casted_array.type == pa.list_(pa.int64())
+assert casted_array.to_pylist() == arr.to_pylist()


def test_cast_sliced_fixed_size_array_to_features():
arr = pa.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], pa.list_(pa.int32(), 3))
casted_array = cast_array_to_feature(arr[1:], Sequence(Value("int64"), length=3))