Datasets created with push_to_hub can't be accessed in offline mode #3547

Closed
TevenLeScao opened this issue Jan 7, 2022 · 18 comments · Fixed by #6493
Labels: bug (Something isn't working)

Comments

@TevenLeScao
Contributor

Describe the bug

In offline mode, one can still access previously-cached datasets. This fails with datasets created with push_to_hub.

Steps to reproduce the bug

in Python:

import datasets
mpwiki = datasets.load_dataset("teven/matched_passages_wikidata")

in bash:

export HF_DATASETS_OFFLINE=1

in Python:

import datasets
mpwiki = datasets.load_dataset("teven/matched_passages_wikidata")
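
Equivalently, the failure can be reproduced in a single script, assuming the dataset was already cached during an earlier online run (the environment variable has to be set before datasets is imported):

import os

os.environ["HF_DATASETS_OFFLINE"] = "1"  # must be set before importing datasets

import datasets

# Raises ConnectionError even though the dataset is already in the local cache
mpwiki = datasets.load_dataset("teven/matched_passages_wikidata")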

Expected results

datasets should find the previously-cached dataset.

Actual results

ConnectionError: Couldn't reach the Hugging Face Hub for dataset 'teven/matched_passages_wikidata': Offline mode is enabled

Environment info

  • datasets version: 1.16.2.dev0
  • Platform: Linux-4.18.0-193.70.1.el8_2.x86_64-x86_64-with-glibc2.17
  • Python version: 3.8.10
  • PyArrow version: 3.0.0
TevenLeScao added the bug label on Jan 7, 2022
@lhoestq
Member

lhoestq commented Jan 10, 2022

Thanks for reporting. I think this can be fixed by improving the CachedDatasetModuleFactory and making it look into the parquet cache directory (datasets from push_to_hub are loaded with the parquet dataset builder). I'll look into it

@dangne

dangne commented Aug 17, 2022

Hi, I'm having the same issue. Is there any update on this?

@lhoestq
Member

lhoestq commented Aug 19, 2022

We haven't had a chance to fix this yet. If someone would like to give it a try, I'd be happy to give some guidance.

@JohnGiorgi
Contributor

JohnGiorgi commented Aug 19, 2022

@lhoestq Do you have an idea of what changes need to be made to CachedDatasetModuleFactory? I would be willing to take a crack at it. I am currently unable to train with datasets I have pushed with push_to_hub on a cluster whose compute nodes are not connected to the internet.

It looks like it might be this line:

importable_directory_path = os.path.join(dynamic_modules_path, "datasets", self.name.replace("/", "--"))

This wouldn't pick up the data saved under "datasets/allenai___parquet/*". Additionally, the datasets saved under "datasets/allenai___parquet/*" appear to have hashes in their directory names, e.g. "datasets/allenai___parquet/my_dataset-def9ee5552a1043e". These would not be detected by CachedDatasetModuleFactory, which currently only looks for subdirectories whose names are 64-character hashes:

hashes = (
[h for h in os.listdir(importable_directory_path) if len(h) == 64]
if os.path.isdir(importable_directory_path)
else None
)
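
For reference, here is a quick way to list what such a push_to_hub dataset actually leaves in the cache; the *___parquet path pattern is an assumption based on the layout described above:

import glob
import os

from datasets import config

# Directories left behind by cached parquet datasets, e.g.
# ~/.cache/huggingface/datasets/allenai___parquet/allenai--my_dataset-def9ee5552a1043e
print(glob.glob(os.path.join(config.HF_DATASETS_CACHE, "*___parquet", "*")))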

@lhoestq
Member

lhoestq commented Aug 24, 2022

importable_directory_path is used to find a dataset script that was previously downloaded and cached from the Hub.

However in your case there's no dataset script on the Hub, only parquet files. So the logic must be extended for this case.

In particular, I think you can add new logic for the case where hashes is None (i.e. when there is no dataset script associated with the dataset in the cache).

In this case you can check directly in the datasets cache for a directory named <namespace>___parquet and a subdirectory named <config_id>. The config_id must match {self.name.replace("/", "--")}-*.

In your case those two directories correspond to allenai___parquet and then allenai--my_dataset-def9ee5552a1043e.

Then you can find the most recent version of the dataset in subdirectories (e.g. sorting using the last modified time of the dataset_info.json file).

Finally, we will need to return the module that is used to load the dataset from the cache. It is the same module as the one that would normally have been used if you had an internet connection.

At that point you can ping me, because we will need to pass all this:

  • module_path = _PACKAGED_DATASETS_MODULES["parquet"][0]
  • hash: it corresponds to the name of the directory that contains the .arrow file, inside <namespace>___parquet/<config_id>
  • builder_kwargs = {"hash": hash, "repo_id": self.name, "config_id": config_id}
    and currently config_id is not a valid argument for a DatasetBuilder

I think in the future we want to change this caching logic completely, since I don't find it super easy to play with.
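
For what it's worth, a minimal sketch of that lookup could look like the following; the helper name, glob patterns, and the mtime-based sorting are illustrative assumptions, not the actual patch:

import glob
import os

from datasets import config
from datasets.packaged_modules import _PACKAGED_DATASETS_MODULES


def resolve_parquet_cache(repo_id: str):
    """Find the most recently cached copy of a dataset that was pushed with push_to_hub."""
    namespace, name = repo_id.split("/")
    # e.g. <cache>/allenai___parquet/allenai--my_dataset-def9ee5552a1043e
    config_dirs = glob.glob(
        os.path.join(config.HF_DATASETS_CACHE, f"{namespace}___parquet", f"{namespace}--{name}-*")
    )
    if not config_dirs:
        return None

    # Keep the most recently written config, using dataset_info.json's mtime as a proxy
    def last_modified(config_dir):
        infos = glob.glob(os.path.join(config_dir, "*", "*", "dataset_info.json"))
        return max((os.path.getmtime(p) for p in infos), default=0.0)

    config_id = os.path.basename(max(config_dirs, key=last_modified))
    module_path = _PACKAGED_DATASETS_MODULES["parquet"][0]  # generic parquet builder module
    builder_kwargs = {"repo_id": repo_id, "config_id": config_id}
    return module_path, builder_kwargs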

@zaccharieramzi

Hi! Is there a workaround for the time being?
Like passing data_dir or something like that?

I would like to use this diffusers example on my cluster, whose nodes are not connected to the internet. I have downloaded the dataset online from the login node.

@lhoestq
Member

lhoestq commented Sep 20, 2022

Hi! Yes, you can save your dataset locally with my_dataset.save_to_disk("path/to/local") and reload it later with load_from_disk("path/to/local").
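
For example, on a cluster setup this could look like the following (the paths here are placeholders):

from datasets import load_dataset, load_from_disk

# On the login node (with internet access): download once and save to shared storage
dataset = load_dataset("teven/matched_passages_wikidata")
dataset.save_to_disk("/shared/storage/matched_passages_wikidata")

# On the compute nodes (offline): reload from the saved copy
dataset = load_from_disk("/shared/storage/matched_passages_wikidata")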

(removing myself from assignees since I'm currently not working on this right now)

lhoestq removed their assignment on Sep 20, 2022
janEbert added a commit to OpenGPTX/lm-evaluation-harness that referenced this issue Mar 10, 2023
janEbert added a commit to OpenGPTX/lm-evaluation-harness that referenced this issue Apr 3, 2023
@thuzhf

thuzhf commented May 10, 2023

Still not fixed? ......

@ManuelFay
Contributor

Any idea @lhoestq who to tag to fix this? This is a very annoying bug, and it is becoming more and more common as the push_to_hub API gets used more.

@ManuelFay
Contributor

Perhaps @mariosasko? Thanks a lot for the great work on the lib!

@lhoestq
Member

lhoestq commented Oct 26, 2023

It should be easier to implement now that we improved the caching of datasets from push_to_hub: each dataset has its own directory in the cache.

The cache structure has been improved in #5331. Now the cache structure is "<namespace>___<dataset_name>/<config_name>/<version>/<hash>/", which contains the Arrow files "<dataset_name>-<split>.arrow" and "dataset_info.json".

The idea is to extend CachedDatasetModuleFactory to also check if this directory exists in the cache (in addition to the already existing cache check) and return the requested dataset module. The module name can be found in the JSON file in the builder_name field.
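
A rough sketch of that extra check against the post-#5331 layout; the function name, default config name, and path handling are assumptions for illustration:

import glob
import json
import os

from datasets import config


def cached_builder_name(repo_id: str, config_name: str = "default"):
    """Return the builder module name recorded in the cache for a dataset pushed with push_to_hub."""
    # Assumed post-#5331 layout: <cache>/<namespace>___<dataset_name>/<config_name>/<version>/<hash>/
    cache_dir = os.path.join(config.HF_DATASETS_CACHE, repo_id.replace("/", "___"), config_name)
    info_files = glob.glob(os.path.join(cache_dir, "*", "*", "dataset_info.json"))
    if not info_files:
        return None  # nothing cached for this dataset
    latest = max(info_files, key=os.path.getmtime)
    with open(latest, encoding="utf-8") as f:
        return json.load(f).get("builder_name")  # e.g. "parquet"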

@Nugine

Nugine commented Nov 29, 2023

Any progress?

@lhoestq
Member

lhoestq commented Nov 29, 2023

I started a PR to draft the logic to reload datasets from the cache if they were created with push_to_hub: #6459

Feel free to try it out

@fecet

fecet commented Jan 28, 2024

It seems that this does not support datasets with uppercase names.

@lhoestq
Member

lhoestq commented Jan 29, 2024

Which version of datasets are you using? This issue has been fixed in datasets 2.16.

@jaded0

jaded0 commented Feb 15, 2024

I can confirm that this problem is still happening with datasets 2.17.0, installed from pip.

@lhoestq
Member

lhoestq commented Feb 15, 2024

Can you share code or a dataset that reproduces the issue? It seems to work fine on my side.

@jaded0

jaded0 commented Feb 15, 2024

Yeah,

dataset = load_dataset("roneneldan/TinyStories")

I tried it with:

dataset = load_dataset("roneneldan/tinystories")

and it worked.

It seems that this does not support datasets with uppercase names.

@fecet was right, but if you just put the name in lowercase, it works.
