Missing ZARR files on S3 #122

Open
JarrodBWong opened this issue Feb 18, 2021 · 8 comments

Comments

@JarrodBWong

Hello, we have been trying to access CMIP6 data based on the file locations listed in cmip6-pds/cmip6.csv and cmip6-pds/pangeo-cmip6.csv but have been running into issues recently with some of the directories being empty.

FileNotFoundError: cmip6-pds/CMIP/CCCma/CanESM5/historical/r1i1p1f1/Omon/thetao/gn/.zmetadata

Last year we were following the same access pattern successfully, but either the file paths on S3 or the CSV seem to have changed.

Is this directory currently being restructured or should we be using another one of the CSV files listed?
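
For anyone hitting the same `FileNotFoundError`, a rough way to tell which catalog entries are affected is to test for each store's `.zmetadata` object before opening it. The sketch below is only an illustration: `metadata_key`, `find_missing_stores`, and the `store-a` path are hypothetical, and the existence check is injected as a plain callable so the example runs without network access.

```python
from typing import Callable, Iterable, List


def metadata_key(zstore: str) -> str:
    """Return the path of the consolidated-metadata file for a Zarr store."""
    return zstore.rstrip("/") + "/.zmetadata"


def find_missing_stores(
    zstores: Iterable[str], exists: Callable[[str], bool]
) -> List[str]:
    """Return the stores whose .zmetadata file is not found by `exists`."""
    return [z for z in zstores if not exists(metadata_key(z))]


# Hypothetical stand-in for a real bucket listing (assumption, not real data).
present = {"cmip6-pds/CMIP/EXAMPLE/store-a/.zmetadata"}

candidates = [
    "cmip6-pds/CMIP/EXAMPLE/store-a/",
    "cmip6-pds/CMIP/CCCma/CanESM5/historical/r1i1p1f1/Omon/thetao/gn",
]

missing = find_missing_stores(candidates, present.__contains__)
print(missing)
```

Against the real bucket, `exists` could be something like `s3fs.S3FileSystem(anon=True).exists`, assuming `s3fs` is installed and the bucket permits anonymous reads.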

@rabernat
Member

Hi @LewisJarrod, thanks for bringing this to our attention. You can find up-to-date documentation for the CMIP6 data here: https://pangeo-data.github.io/pangeo-cmip6-cloud/

Some files have recently been renamed, but if you are going through the csv catalog, theoretically everything should be there.

@naomi-henderson may have some guidance.

@clittleaer

hi @rabernat,

just to clarify, we were actually using the file named: cmip6-zarr-consolidated-stores.csv

when we query the .csv/dataframe using either this .csv or the pangeo-cmip6.csv, it gives zstore paths that reference directories that aren't on S3.

this started happening maybe 3 weeks ago.

thanks in advance,
Chris
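
One quick diagnostic for this symptom is to tally the URL schemes in the catalog's `zstore` column. This is a minimal sketch, not the project's tooling: the sample rows, the `example-bucket` name, and the column subset are made up for illustration, and the standard-library `csv` module is used in place of a dataframe so the snippet stays self-contained.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlparse

# Hypothetical catalog excerpt; the real CSV has more columns and rows.
SAMPLE_CSV = """source_id,experiment_id,table_id,variable_id,zstore
CanESM5,historical,Omon,thetao,gs://example-bucket/CMIP/a/
CanESM5,historical,Omon,thetao,s3://cmip6-pds/CMIP/b/
CanESM5,historical,Amon,tas,s3://cmip6-pds/CMIP/c/
"""


def zstore_schemes(csv_text: str, column: str = "zstore") -> Counter:
    """Count how many catalog rows use each URL scheme (gs, s3, ...)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(urlparse(row[column]).scheme for row in reader)


schemes = zstore_schemes(SAMPLE_CSV)
print(dict(schemes))
```

Any scheme other than `s3` in the S3 catalog would point to rows that still reference another cloud.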

@naomi-henderson
Contributor

@clittleaer, Sorry about this - the AWS collection is a clone of the GC collection. I did a massive re-organization of the naming scheme for the zarr stores on GC. Unfortunately for the rest of us, @charlesbluca, who set up the process to clone from GC to AWS, has left us for a position at NVIDIA. Since ALL of the zarr stores need to be deleted and recopied, the cloning process to AWS is taking quite a while.

In addition, the CSV files need to be updated with the gs urls changed to s3 urls.
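
As a rough illustration of that URL change, a rewrite could look like the sketch below. It assumes both buckets share the same layout under the bucket root (the AWS collection being a clone of the GC one); the function name `gs_to_s3`, the GC bucket name `cmip6`, and the example path are assumptions for illustration, while `cmip6-pds` is the S3 bucket named in this thread.

```python
def gs_to_s3(url: str, gs_bucket: str, s3_bucket: str) -> str:
    """Rewrite a gs:// store URL to the matching s3:// URL.

    Assumes the object path below the bucket root is identical in both clouds.
    """
    prefix = f"gs://{gs_bucket}/"
    if not url.startswith(prefix):
        raise ValueError(f"not a gs:// URL in bucket {gs_bucket!r}: {url}")
    return f"s3://{s3_bucket}/" + url[len(prefix):]


# The GC bucket name and the store path here are assumptions.
converted = gs_to_s3(
    "gs://cmip6/CMIP/EXAMPLE/store/", gs_bucket="cmip6", s3_bucket="cmip6-pds"
)
print(converted)
```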

@charlesbluca and/or I will try to give an update of when to expect these changes to properly migrate to AWS.

@rabernat
Member

Thanks for the update Naomi!

@naomi-henderson
Contributor

@clittleaer, the GitHub Actions cloning scripts needed to be updated. Thanks again for opening this issue, there was indeed a problem. All is working now, but it may take a day or two to finish the whole process. I will try to remember to make a note here when everything is back to normal.

@clittleaer

that would be great @naomi-henderson p.s. thanks @rabernat and others for making this resource available!

@naomi-henderson
Contributor

We now generate the S3 CMIP6 catalog directly from crawling the S3 collection. This means that, even though the GC and S3 collections might be temporarily out of sync, the catalog for each only lists existing ZARR files. So, @clittleaer, even though the new restructured data has not been completely copied, the S3 catalog should now represent the current state. Please open an issue here if there are any more problems.
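
The idea of building a catalog from what is actually in the bucket can be sketched as follows. This is not the maintainers' crawler: it assumes a store counts as existing once its `.zmetadata` object is present, and the function name `catalog_from_keys` and the object keys are hypothetical.

```python
from typing import Iterable, List

METADATA_NAME = ".zmetadata"


def catalog_from_keys(keys: Iterable[str], bucket: str) -> List[str]:
    """Return sorted s3:// URLs of stores that have a .zmetadata object."""
    stores = {
        f"s3://{bucket}/{key[: -len(METADATA_NAME)]}"
        for key in keys
        if key.endswith("/" + METADATA_NAME)
    }
    return sorted(stores)


# Hypothetical bucket listing: store "b" is only partly copied.
listing = [
    "CMIP/a/.zmetadata",
    "CMIP/a/tas/0.0.0",
    "CMIP/b/tas/0.0.0",
]

catalog = catalog_from_keys(listing, bucket="cmip6-pds")
print(catalog)
```

Because the partly copied store has no `.zmetadata` yet, it is left out of the catalog until the copy finishes.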

@naomi-henderson
Contributor

Restructuring on S3 is now complete, @clittleaer, and the S3 and GCS buckets and their catalogs should have the same datasets. If you find any discrepancies or have suggestions, please open an issue here: pangeo-cmip6-cloud
