Datasets created with push_to_hub can't be accessed in offline mode #3547
Comments
Thanks for reporting. I think this can be fixed by improving the
Hi, I'm having the same issue. Is there any update on this?
We haven't had a chance to fix this yet. If someone would like to give it a try, I'd be happy to give some guidance.
@lhoestq Do you have an idea of what changes need to be made to It looks like it might be this line: Line 994 in 0c1d099
Which wouldn't pick up the stuff saved under Lines 995 to 999 in 0c1d099
However, in your case there's no dataset script on the Hub, only parquet files, so the logic must be extended for this case. In particular, I think you can add new logic in the case where In this case you can check directly in the datasets cache for a directory named In your case those two directories correspond to Then you can find the most recent version of the dataset in subdirectories (e.g. sorting using the last modified time of the Finally, we will need to return the module that is used to load the dataset from the cache. It is the same module as the one that would normally have been used if you had an internet connection. At that point you can ping me, because we will need to pass all this:
I think in the future we want to change this caching logic completely, since I don't find it super easy to play with.
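The "find the most recent version in subdirectories" step described above can be sketched with the standard library alone. This is only an illustration and not the code used in `datasets`: the function name `find_latest_cached_version`, its `cache_root` and `dataset_dir_name` parameters, and the idea that each cached version is a direct subdirectory are all assumptions made for the example.

```python
from pathlib import Path
from typing import Optional


def find_latest_cached_version(cache_root: str, dataset_dir_name: str) -> Optional[Path]:
    """Return the most recently modified subdirectory of a cached dataset.

    ASSUMPTION: every cached version lives in its own subdirectory of
    cache_root/dataset_dir_name. The real cache layout may differ.
    """
    dataset_dir = Path(cache_root) / dataset_dir_name
    if not dataset_dir.is_dir():
        return None  # nothing cached under this name

    # Keep only directories; stray files (locks, metadata) are ignored.
    candidates = [p for p in dataset_dir.iterdir() if p.is_dir()]
    if not candidates:
        return None

    # The newest version is the one with the largest modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Returning `None` on a cache miss leaves it to the caller to decide whether to raise the offline-mode error or fall back to something else.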
Hi! Is there a workaround for the time being? I would like to use this diffuser example on my cluster, whose nodes are not connected to the internet. I have downloaded the dataset online from the login node.
Hi! Yes, you can save your dataset locally with (removing myself from assignees since I'm not working on this right now)
Still not fixed? ......
Any idea, @lhoestq, who to tag to fix this? This is a very annoying bug, and it is becoming more and more common as the push_to_hub API gets used more.
Perhaps @mariosasko? Thanks a lot for the great work on the lib!
It should be easier to implement now that we improved the caching of datasets from The cache structure has been improved in #5331. Now the cache structure is The idea is to extend
Any progress?
I started a PR to draft the logic to reload datasets from the cache if they were created with push_to_hub: #6459. Feel free to try it out.
It seems that this does not support datasets with uppercase names.
Which version of
I can confirm that this problem is still happening with
Can you share some code or a dataset that reproduces the issue? It seems to work fine on my side.
Yeah: dataset = load_dataset("roneneldan/TinyStories") fails. I tried it with dataset = load_dataset("roneneldan/tinystories") and it worked.
@fecet was right, but if you just put the name in lowercase, it works.
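The two reports above suggest that the cached-name lookup is case-sensitive somewhere, although the thread does not confirm where. As a purely illustrative sketch of a case-insensitive lookup (not the library's code), the helper below scans a cache directory and compares names in lowercase. The function name `find_cache_dir_ignoring_case` and the `namespace___name` directory naming are assumptions made for the example.

```python
from pathlib import Path
from typing import Optional


def find_cache_dir_ignoring_case(cache_root: str, repo_id: str) -> Optional[Path]:
    """Find the cache directory for repo_id, ignoring letter case.

    ASSUMPTION: the cache directory name is the repo id with "/" replaced
    by "___". This naming is used here for illustration only.
    """
    wanted = repo_id.replace("/", "___").lower()
    root = Path(cache_root)
    if not root.is_dir():
        return None

    # Compare lowercased names so "TinyStories" matches "tinystories".
    for entry in sorted(root.iterdir()):
        if entry.is_dir() and entry.name.lower() == wanted:
            return entry
    return None
```

Until a fix lands, passing the lowercase repo id to load_dataset, as reported above, is the simpler workaround.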
Describe the bug
In offline mode, one can still access previously-cached datasets. This fails with datasets created with push_to_hub.
Steps to reproduce the bug
in Python:
in bash:
in Python:
Expected results
datasets should find the previously-cached dataset.
Actual results
ConnectionError: Couln't reach the Hugging Face Hub for dataset 'teven/matched_passages_wikidata': Offline mode is enabled
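A rough sketch of how the offline load can be attempted from a single script, assuming the `HF_DATASETS_OFFLINE` environment variable is what enables offline mode and that it is set before `datasets` is first imported. The helper names `enable_offline_mode` and `load_from_cache` are hypothetical, and this is not necessarily the exact sequence used in the report.

```python
import os


def enable_offline_mode() -> None:
    """Turn on offline mode for `datasets`.

    The flag is read when `datasets` is first imported, so this must run
    before that import. If `datasets` is already imported, it has no effect.
    """
    os.environ["HF_DATASETS_OFFLINE"] = "1"


def load_from_cache(repo_id: str):
    """Try to load a previously cached dataset without network access.

    Requires the `datasets` package. In the report above, this call is
    where the ConnectionError was raised.
    """
    enable_offline_mode()
    from datasets import load_dataset  # imported lazily, after the flag is set

    return load_dataset(repo_id)
```

Calling `load_from_cache("teven/matched_passages_wikidata")` after the dataset has been downloaded once in online mode would exercise the failing path described here.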
Environment info
datasets version: 1.16.2.dev0