Problem running ds.to_zarr on a dask cluster #116
Thanks for bringing this up! I don't have much experience with @rabernat's workflow for authentication with a Dask cluster, but since the main error in the notebook occurs when trying to get tokens using
If that doesn't help, this issue might get more exposure if posted on Pangeo's Discourse forum or Gitter chat.
Thanks for this!! Your suggestion works, in the same way as the following from a cell further down the notebook:
but unfortunately this doesn't lead to the cluster doing the .to_zarr step. My thought was that @rabernat's suggestion here was to get a mapper that gives the cluster control, but I don't understand gcsfs well enough to know how this works (and I have not found the right documentation yet). Thanks again for your help!
Hi Jonny! 😊 The problem is that the dask workers don't share a filesystem with the notebook server, so this path - The solution is to use the syntax documented on the Pangeo Cloud site:

```python
with open('<your_token_file>.json') as token_file:
    token = json.load(token_file)
gcs = gcsfs.GCSFileSystem(token=token)
mapper = gcs.get_mapper(...)
```

That way you are loading the token data into memory as a dict and passing it around to the workers directly. Does this fix it?
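Why the dict form helps: a plain dict of token data is picklable, so dask can ship it to workers, whereas a file path only means something on the notebook server. A stdlib-only sketch (the token contents below are made up, not a real credential):

```python
import json
import os
import pickle
import tempfile

# Hypothetical token payload standing in for a downloaded GCP token file
token = {"type": "service_account", "project_id": "demo-project"}

path = os.path.join(tempfile.mkdtemp(), "token.json")
with open(path, "w") as f:
    json.dump(token, f)

# Load the file back into memory as a dict, as in the snippet above
with open(path) as f:
    loaded = json.load(f)

# A plain dict survives pickling, so dask can send it to remote workers
assert pickle.loads(pickle.dumps(loaded)) == token
```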
Hi Ryan,
|
Can you do anything at all with the cluster? Just compute something random?

```python
import dask.array as dsa

arr = dsa.ones((100,), chunks=(10,))
arr.compute()
```
Do you see any data show up in your bucket?
Yes, the zarrs show up when I avoid the memory issues by only sending a small subset of the data, or by using the largest server on https://us-central1-b.gcp.pangeo.io/
Generally, yes, using this setup I am successfully doing things on the cluster, but I am going to check this with this notebook once the server is up and running. I'll get back to you to confirm.
So for some reason, writing this data is not going through the cluster, but other operations are... very weird. Doesn't make sense. How are you creating
OK, apologies, I was forgetting the difference between a dask array and an xarray dataset that was just loaded lazily. Now I am loading the netcdf as a dask array by choosing a chunk size. A couple of new errors come up now. The notebook is here. It's getting a bit messy with non-dask and dask versions of the data, but hopefully it makes sense. First, a worker is killed when trying to load into memory a ~10 MB dask array from zarr that was created from a non-dask array (i.e. the creation of this zarr does not use the cluster). This error does not occur if you don't start a cluster. Second, when trying to write a ~10 MB dask array to zarr, it fails with "Failed to Serialize".
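The dask-array vs. lazily-loaded distinction mentioned above can be checked directly: without `chunks=`, an xarray variable wraps a plain numpy array and all work happens in the notebook process; with chunking it becomes a dask array whose tasks can run on the cluster. A small sketch (requires xarray, dask, and numpy):

```python
import numpy as np
import xarray as xr

ds_plain = xr.Dataset({'v': (('x', 'y'), np.zeros((100, 100)))})
# Without chunking, the variable's data is a plain numpy array
print(type(ds_plain['v'].data).__name__)      # ndarray

ds_chunked = ds_plain.chunk({'x': 50, 'y': 50})
# After chunking, the data is a dask array, so operations build a task graph
print(type(ds_chunked['v'].data).__module__)  # a dask.array module
```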
If you bypass dask and have enough memory to handle things locally (in your notebook server), then things work. The problem appears to be serializing the reference to the netcdf file and passing it to the workers:

```python
url = 'https://storage.googleapis.com/ldeo-glaciology/bedmachine/BedMachineAntarctica_2019-11-05_v01.nc'
with fsspec.open(url, mode='rb') as openfile:
    bm_dask = xr.open_dataset(openfile, chunks=3000)
```

Forget about writing for a moment. Can you do anything with
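One way to see why a dataset holding an open file handle can "fail to serialize": open file objects are not picklable, so dask cannot ship them to workers. A stdlib-only illustration:

```python
import pickle
import tempfile

# Keep a file handle open, analogous to the fsspec openfile above
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()
handle = open(tmp.name, 'rb')

# Pickling an open file handle raises TypeError in Python 3
try:
    pickle.dumps(handle)
    serializable = True
except TypeError:
    serializable = False
handle.close()

assert serializable is False
```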
This is just what I was trying. No, the same error comes up.
I'm going to spend 15 minutes digging into this. I'll update you. It definitely SHOULD work.
I really appreciate it! I have meetings for the rest of the day, so I may not be able to respond after 11.
I'll add that before the Dask Gateway transition, Dask workers on the OOI Pangeo had the same home directory as the notebook server, which is why there may be examples out there of @jkingslake's original workflow. I have not had time to get NFS working with the new Gateway configs, but I'm guessing it would be possible if it is something that is needed.
Shared home space is not the core issue here. That is resolved by my suggestion in #116 (comment). The problem is that the dataset, as currently loaded, is not serializable. This is either an xarray or fsspec bug.
So here is a workaround, which bypasses fsspec and uses the netcdf4 library's new and undocumented ability to open files over http:

```python
url = 'https://storage.googleapis.com/ldeo-glaciology/bedmachine/BedMachineAntarctica_2019-11-05_v01.nc#mode=bytes'
ds = xr.open_dataset(url, engine='netcdf4', chunks=3000)
ds
```

This works with distributed. I would play around with the chunk size and do some quick benchmarking before committing to 3000.
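For the "benchmark before committing to 3000" advice, a tiny hypothetical timing helper (the helper name is illustrative, not from the thread); wrap something like `ds.isel(...).mean().compute()` with it for a few candidate chunk sizes and compare:

```python
import time

def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - t0

# Cheap sanity check on an ordinary function
result, elapsed = time_call(sum, range(1000))
assert result == 499500 and elapsed >= 0.0
```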
Thanks @rabernat - however this does not work for me on OOI Pangeo (nor do similar attempts with other public netCDF files on GCS).
Is it a parsing issue? That is certainly the link to the file (tested in browser). |
It's probably a netcdf-python version issue. I'm on netCDF4 1.5.4. |
The current OOI image has 1.5.3. @porterdf you can install 1.5.4 temporarily to test this using
This also works:

```python
import gcsfs
import xarray as xr

gcs = gcsfs.GCSFileSystem()
url = 'gs://ldeo-glaciology/bedmachine/BedMachineAntarctica_2019-11-05_v01.nc'
openfile = gcs.open(url, mode='rb')
ds = xr.open_dataset(openfile, chunks=3000)
```
@porterdf, I bumped the OOI image to the same one that @rabernat is using. You can try it out on staging at the moment to test his suggestions here. If it works for you, I will merge it into prod so that it is available for you there. It probably has the new cartopy version you were asking about as well.
This works great on https://us-central1-b.gcp.pangeo.io/ Thanks all!
Thanks @rabernat and @tjcrone - now working as expected on OOI staging, with current netCDF4 (and cartopy too)! Let me know when changes are merged onto production. Fine to close this from my end.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.
@porterdf and I are trying to set up a simple netcdf-to-zarr workflow, similar to the many useful examples in #8
We are successfully loading data from a netcdf and writing it to zarr in a google bucket using this notebook in us-central1-b.gcp.pangeo.io.
But we need to get .to_zarr running on a dask cluster to avoid the memory issues we are running into. It appears from the examples in #8 that starting a cluster should simply lead to the .to_zarr operation running on the cluster, but I must be missing something, as this doesn't seem to be how it behaves.
Any help on what I assume is an embarrassingly simple question for this community would be much appreciated!