
Control chunksize of the underlying zarr data #406

Open
chc-incom opened this issue Jan 8, 2024 · 4 comments

@chc-incom

I would appreciate having more control over the way Kerchunk writes "refs", especially control over the chunking.

Context:
I previously used fsspec and kerchunk to store my data while continuously expanding my dataset.
My data always has the same dimensionality, of course, and even the same coordinates, except along one dimension: "release".

When using the class SingleHdf5ToZarr in Kerchunk, I have no control over the zarr group/store created.
I think this is because this part of the init is hardcoded and not mutable through any method:

        self.store = {}
        self._zroot = zarr.group(store=self.store, overwrite=True)

My data files all have the same coordinate sizes: datafile_shape=(1, n2, n3, n4)

When running SingleHdf5ToZarr(...).translate() on my old data, I get back data with some arbitrary chunksize (1, n_c2, n_c3, n_c4).

Now that I have updated some dependencies in my env, I get another arbitrary chunksize (1, n_c2', n_c3', n_c4').

Ideally I would just have had chunksize = datafile_shape here. But the fatal issue is that I can no longer combine new and old data with MultiZarrToZarr. When I try to combine my kerchunk metadata chunks, I get:

ValueError: Found chunk size mismatch:
                        at prefix [my variable name] in iteration 544 (file None)
                        new chunk: [1, 63, 200, 261]
                        chunks so far: [1, 42, 133, 174]
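
For completeness, the workflow that hits this looks roughly like the following (a sketch; the file names are made up):

from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# one reference set per data file; the chunking is whatever HDF5 happened to store
refs = [SingleHdf5ToZarr(f).translate() for f in ["old_release.h5", "new_release.h5"]]

# combining along "release" raises the chunk size mismatch above
combined = MultiZarrToZarr(refs, concat_dims=["release"]).translate()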

Problem in short:

  • I have one distributed dataset that is constantly expanding. Old data is no longer compatible with new data because my env has changed slightly.
  • The arbitrary chunksizes produced by Kerchunk deviate between old and new data, preventing me from combining my dataset properly.
  • Ideally I would like to choose the chunksize explicitly, like I can in xarray. Preferably my chunksizes would be the same as my data files' shapes.
@martindurant
Member

When running SingleHdf5ToZarr(...).translate() on my old data, I get back data with some arbitrary chunksize (1, n_c2, n_c3, n_c4).

The chunksize outputted by translate() is not arbitrary: these are the data buffer shapes that were actually stored in the original files by the HDF5 library. You may have had some control over that at file creation time (and apparently your software stack has changed how it guesses the size to use over time).
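
For example, with h5py the stored chunk shape can be pinned when the file is written (a sketch; the dataset name and shape are illustrative):

import h5py
import numpy as np

data = np.zeros((1, 63, 200, 261))
with h5py.File("datafile.h5", "w") as f:
    # explicit chunks= fixes the stored buffer shape, rather than letting
    # the HDF5 library guess one (which can change across library versions)
    f.create_dataset("my_variable", data=data, chunks=data.shape)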

Kerchunk in no way changes the actual data storage in the original files it references, so there is no way to change the chunking. Unfortunately, this means you cannot combine incompatible data, as you have found.

This could be solved by:

I would prefer my chunksizes to be the same as my data files' shapes

You mean exactly one chunk for each input file? That is probably something we could do fairly easily, effectively making hdf (or kerchunk itself) the codec for loading each file.

@chc-incom
Author

Thanks for your detailed response @martindurant !

I would be happy to contribute to the last proposal (chunksizes=datafiles_shape), if you could give me some pointers on what to look at and how to get started? @martindurant

In xarray's .to_zarr(...), you can control the chunksizes. Would it not be a valuable contribution to have the same functionality in Kerchunk, or is it for some reason not feasible?
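
For example (a sketch of what I mean; the variable and file names are illustrative):

import xarray as xr

ds = xr.open_dataset("datafile.h5")
# xarray lets me pin the on-disk chunking explicitly, per variable
ds.to_zarr("out.zarr", encoding={"my_variable": {"chunks": (1, 63, 200, 261)}})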

@martindurant
Member

martindurant commented Feb 9, 2024

I would be happy to contribute to the last proposal (chunksizes=datafiles_shape), if you could give me some pointers on what to look at and how to get started?

You would have a numcodecs implementation something like

import io
import h5py
from numcodecs.abc import Codec

class HDF5file(Codec):
    codec_id = "hdf5file"  # id used when registering/looking up the codec

    def __init__(self, path: str):
        self.path = path  # name of the dataset to read from each file

    def encode(self, buf):
        raise NotImplementedError("read-only codec")

    def decode(self, buf, out=None):
        # the buffer is the complete original HDF5 file, held in memory
        with h5py.File(io.BytesIO(buf), "r") as h:
            return h[self.path][...]

and this would be the "codec" for the whole array. Here, "path" is the name of the array to read from each HDF5 file (but the whole of the file will be pulled into memory first).
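
To sketch how that could plug in (the reference layout below is hypothetical, just to illustrate the idea): register the codec with numcodecs, then point each logical chunk at a whole HDF5 file, so "chunks" equals the per-file array shape:

import numcodecs

numcodecs.register_codec(HDF5file)  # make "hdf5file" resolvable by its id

# hypothetical reference set: one chunk per input file, decoded by the
# codec above, so the chunk shape equals each file's array shape
refs = {
    ".zgroup": '{"zarr_format": 2}',
    "data/.zarray": (
        '{"shape": [2, 63, 200, 261], "chunks": [1, 63, 200, 261], '
        '"dtype": "<f8", "compressor": {"id": "hdf5file", "path": "data"}, '
        '"filters": null, "fill_value": null, "order": "C", "zarr_format": 2}'
    ),
    "data/0.0.0.0": ["file_000.h5"],  # the whole file is the chunk payload
    "data/1.0.0.0": ["file_001.h5"],
}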

@martindurant
Member

Would it not be a valuable contribution to have the same functionality in Kerchunk, or is it for some reason not feasible?

It is not possible, because kerchunk works with the chunks as they are stored in the original files. The whole point is that you don't need to rewrite/copy the data. If you do have the option to do so, you may as well use normal zarr output.
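
That is, if rewriting is on the table, something like this works (a sketch; the store paths are illustrative):

import xarray as xr

# open through the kerchunk references, then materialise a real zarr store
ds = xr.open_dataset(
    "reference://", engine="zarr",
    backend_kwargs={"storage_options": {"fo": "combined.json"},
                    "consolidated": False},
)
ds.chunk({"release": 1}).to_zarr("rechunked.zarr", mode="w")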
