
Control chunksize of the underlying zarr data #406

Open
chc-incom opened this issue Jan 8, 2024 · 4 comments

@chc-incom

I would appreciate having more control over the way Kerchunk writes "refs", especially control over the chunking.

Context:
I previously used fsspec and kerchunk to store my data while continuously expanding my dataset.
My data always has the same dimensionality, of course, and even the same coordinates, except along one dimension: "release".

When using the class SingleHdf5ToZarr in Kerchunk, I have no control over the zarr group/store created.
I think this is because this part of the init is hardcoded and not mutable through any method:

        self.store = {}
        self._zroot = zarr.group(store=self.store, overwrite=True)

My data files all have the same coordinate sizes: datafile_shape=(1, n2, n3, n4)

When running SingleHdf5ToZarr(...).translate() on my old data, I get back data with some arbitrary chunksize (1, n_c2, n_c3, n_c4).

Now that I have updated some dependencies in my env, I get another arbitrary chunksize (1, n_c2', n_c3', n_c4').

Ideally I would just have had chunksize = datafile_shape here. But the fatal issue is that I can no longer combine new and old data with MultiZarrToZarr. When I try to combine my kerchunk metadata chunks, I get:

ValueError: Found chunk size mismatch:
                        at prefix [my variable name] in iteration 544 (file None)
                        new chunk: [1, 63, 200, 261]
                        chunks so far: [1, 42, 133, 174]
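
For completeness, the workflow that hits this looks roughly like the following (a sketch; the file names are made up):

from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# one reference set per data file; the chunking is whatever HDF5 happened to store
refs = [SingleHdf5ToZarr(f).translate() for f in ["old_release.h5", "new_release.h5"]]

# combining along "release" raises the chunk size mismatch above
combined = MultiZarrToZarr(refs, concat_dims=["release"]).translate()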

Problem in short:

  • I have one distributed dataset that is constantly expanding. Old data is no longer compatible with new data because my env has changed slightly.
  • The arbitrary chunksizes produced by Kerchunk deviate between old and new data, preventing me from combining my dataset properly.
  • Ideally I would like to choose the chunksize explicitly, like I can in xarray. Preferably my chunksizes would be the same as my data files' shapes.
@martindurant
Member

When running SingleHdf5ToZarr(...).translate() on my old data, I get back data with some arbitrary chunksize (1, n_c2, n_c3, n_c4).

The chunksize outputted by translate() is not arbitrary: these are the data buffer shapes that were actually stored in the original files by the HDF5 library. You may have had some control over that at file creation time (and apparently your software stack has changed how it guesses the size to use over time).
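
For example, with h5py the stored chunk shape can be pinned when the file is written (a sketch; the dataset name and shape are illustrative):

import h5py
import numpy as np

data = np.zeros((1, 63, 200, 261))
with h5py.File("datafile.h5", "w") as f:
    # explicit chunks= fixes the stored buffer shape, rather than letting
    # the HDF5 library guess one (which can change across library versions)
    f.create_dataset("my_variable", data=data, chunks=data.shape)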

Kerchunk in no way changes the actual data storage in the original files it references, so there is no way to change the chunking. Unfortunately, this means you cannot combine incompatible data, as you have found.

This could be solved by:

I would prefer my chunksizes to be the same as my data files' shapes

You mean exactly one chunk for each input file? That is probably something we could do fairly easily, effectively making hdf (or kerchunk itself) the codec for loading each file.

@chc-incom
Author

Thanks for your detailed response @martindurant !

I would be happy to contribute to the last proposal (chunksizes=datafiles_shape), if you could give me some pointers on what to look at and how to get started? @martindurant

In xarray's .to_zarr(...), you can control the chunksizes. Would it not be a valuable contribution to have the same functionality in Kerchunk, or is it for some reason not feasible?
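
For example (a sketch of what I mean; the variable and file names are illustrative):

import xarray as xr

ds = xr.open_dataset("datafile.h5")
# xarray lets me pin the on-disk chunking explicitly, per variable
ds.to_zarr("out.zarr", encoding={"my_variable": {"chunks": (1, 63, 200, 261)}})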

@martindurant
Member

martindurant commented Feb 9, 2024

I would be happy to contribute to the last proposal (chunksizes=datafiles_shape), if you could give me some pointers on what to look at and how to get started?

You would have a numcodecs implementation something like

import io
import h5py
from numcodecs.abc import Codec

class HDF5file(Codec):
    codec_id = "hdf5file"  # id used when registering/looking up the codec

    def __init__(self, path: str):
        self.path = path  # name of the dataset to read from each file

    def encode(self, buf):
        raise NotImplementedError("read-only codec")

    def decode(self, buf, out=None):
        # the buffer is the complete original HDF5 file, held in memory
        with h5py.File(io.BytesIO(buf), "r") as h:
            return h[self.path][...]

and this would be the "codec" for the whole array. Here, "path" is the name of the array to read from each HDF5 file (but the whole of the file will be pulled into memory first).
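
To sketch how that could plug in (the reference layout below is hypothetical, just to illustrate the idea): register the codec with numcodecs, then point each logical chunk at a whole HDF5 file, so "chunks" equals the per-file array shape:

import numcodecs

numcodecs.register_codec(HDF5file)  # make "hdf5file" resolvable by its id

# hypothetical reference set: one chunk per input file, decoded by the
# codec above, so the chunk shape equals each file's array shape
refs = {
    ".zgroup": '{"zarr_format": 2}',
    "data/.zarray": (
        '{"shape": [2, 63, 200, 261], "chunks": [1, 63, 200, 261], '
        '"dtype": "<f8", "compressor": {"id": "hdf5file", "path": "data"}, '
        '"filters": null, "fill_value": null, "order": "C", "zarr_format": 2}'
    ),
    "data/0.0.0.0": ["file_000.h5"],  # the whole file is the chunk payload
    "data/1.0.0.0": ["file_001.h5"],
}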

@martindurant
Member

Would it not be a valuable contribution to have the same functionality in Kerchunk, or is it for some reason not feasible?

It is not possible, because kerchunk works with the chunks as they are stored in the original files. The whole point is that you don't need to rewrite/copy the data. If you do have the option to do so, you may as well use normal zarr output.
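
That is, if rewriting is on the table, something like this works (a sketch; the store paths are illustrative):

import xarray as xr

# open through the kerchunk references, then materialise a real zarr store
ds = xr.open_dataset(
    "reference://", engine="zarr",
    backend_kwargs={"storage_options": {"fo": "combined.json"},
                    "consolidated": False},
)
ds.chunk({"release": 1}).to_zarr("rechunked.zarr", mode="w")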
