
Create a MultiZarr json file from netcdf files of unequal time length. #447

Closed

oloapinivad opened this issue Apr 9, 2024 · 2 comments

@oloapinivad

Hi there,

I am very new to kerchunk, but I am trying to create a JSON file with zarr starting from a series of netcdf files, which may have unequal time lengths (either 1 or 12 time steps). I do not want to replicate the data with zarr, for both storage and backward-compatibility reasons.

Below is a very basic example; I can attach the data, but I think it is clear what is being done here.

from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

filelist = ['file_201901.nc', 'file_2018.nc']

# Build one single-file reference set per netcdf file
singles = [SingleHdf5ToZarr(filepath, inline_threshold=0).translate() for filepath in sorted(filelist)]

# Combine the references, concatenating along time
mzz = MultiZarrToZarr(
    singles,
    concat_dims=["time"],
    identical_dims=["lat", "lon"],
)

mzz.translate()

This fails with:

ValueError: Found chunk size mismatch:
                        at prefix 2t in iteration 1 (file None)
                        new chunk: [12, 180, 360]
                        chunks so far: [1, 180, 360]

Browsing various issues in the repository (such as #430 (comment)), it seems that this is due to a known limitation of Zarr, which does not allow unequal chunk sizes; this goes beyond kerchunk.

However, I am wondering if there is a way to force the chunking when accessing the data, so that if I set, for example, chunks={"time": 1}, as can be done with xarray, I could still load the data (see the sketch below).
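
For illustration, this is the kind of access I have in mind (a sketch only; "combined.json" stands for the hypothetical output of mzz.translate()):

import xarray as xr

# Open the combined reference set through fsspec's reference filesystem
# and ask xarray/dask to re-chunk to one step along time.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": "combined.json"},
    },
    chunks={"time": 1},
)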

Thanks a lot for any hint!

@martindurant (Member)

No, you unfortunately cannot "subchunk" data that have chunk sizes > 1. The sole exception is completely uncompressed/unencoded data, which I assume is not your situation.

Explanation:
suppose you have a chunk of data in your original file of size 2 along time (plus some other dimensions). If we were to present this as chunk size 1 to zarr, then when accessing time=0 it would need to load the whole chunk, decompress it, and slice it; when loading time=1, it would have to load and decompress the very same chunk again.
This "load-and-slice" logic does not exist, and it would clearly be inefficient. It would be further complicated where logical chunks cross stored-chunk boundaries (original size 7, desired size 2). So we keep to a logical 1-1 mapping of chunks, and therefore remain limited by zarr's model.

@oloapinivad (Author)

OK, thanks a lot, that is very clear.

Therefore I will proceed by creating two different JSON files, one for the 12-step chunks and one for the 1-step chunks, and then merge them when opening (sketched below). In principle this could work!
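
Something along these lines (a sketch; the JSON file names are hypothetical):

import xarray as xr

def open_refs(json_path):
    # Open one kerchunk reference file via fsspec's reference filesystem
    return xr.open_dataset(
        "reference://",
        engine="zarr",
        backend_kwargs={
            "consolidated": False,
            "storage_options": {"fo": json_path},
        },
    )

# One reference file per chunking scheme, concatenated lazily along time
ds = xr.concat(
    [open_refs("yearly_chunks.json"), open_refs("monthly_chunks.json")],
    dim="time",
)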
