Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected behavior with MultiZarrToZarr with partial chunks. #400

Open
sharkinsspatial opened this issue Nov 28, 2023 · 1 comment
Open

Comments

@sharkinsspatial
Copy link

While experimenting with kerchunking some Icesat2 ATL08 data I noticed an issue where using MultiZarrToZarr with non-dimension coordinates that had partial chunks resulted in empty values for those variables in the ouput kerchunk index.

A minimal example

import xarray as xr
import numpy as np
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# Set up fake netCDF files with latitude as a non-dimension coordinate.
time_1_3 = xr.Dataset(
    {
        "delta_time": [12, 13, 14]
    },
    coords={
        'latitude': ('delta_time', [1, 2, 3]),
 })
time_15_17 = xr.Dataset(
    {
        "delta_time": [15, 16, 17]
    },
    coords={
        'latitude': ('delta_time', [4, 5, 6]),
})
time_18_22 = xr.Dataset(
    {
        "delta_time": [18, 19, 20, 21, 22]
    },
    coords={
        'latitude': ('delta_time', [7, 8, 9, 10, 11]),
 })
time_1_3

Screen Shot 2023-11-28 at 4 21 41 PM

# Create netCDFs with chunksize aligned to data size
chunksize = 3
encoding = {"latitude": {"chunksizes": (chunksize,)},"delta_time": {"chunksizes": (chunksize,)}}
time_1_3.to_netcdf('time_1_3.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])
time_15_17.to_netcdf('time_15_17.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])

def create_reference(files: list[str]):
    single_jsons = [SingleHdf5ToZarr(filepath, inline_threshold=0).translate() for filepath in files]
    mzz = MultiZarrToZarr(
        single_jsons,
        concat_dims=["delta_time"],
    )
    combined_test_json = mzz.translate()

    combined_test = xr.open_dataset(
        "reference://", engine="zarr",
        backend_kwargs={
            "storage_options": {
                "fo": combined_test_json,
                },
            "consolidated": False,
        }
    )
    return combined_test

# This works as expected
combined_test_3 = create_reference(["time_1_3.nc", "time_15_17.nc"])
combined_test_3.latitude.data

Screen Shot 2023-11-28 at 4 22 41 PM

# Create netCDFs with a larger chunksize resulting in partially filled chunks not aligned to data size
chunksize = 10
encoding = {"latitude": {"chunksizes": (chunksize,)},"delta_time": {"chunksizes": (chunksize,)}}
time_15_17.to_netcdf('time_15_17.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])
time_18_22.to_netcdf('time_18_22.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])

# This results in empty values for the resulting non-dimension coordinate variable.
combined_test_10 = create_reference(["time_15_17.nc", "time_18_22.nc"])
combined_test_10.latitude.data

Screen Shot 2023-11-28 at 4 25 03 PM

This seems potentially related to some of the discussion in #305 (as it is also describing the case of data not aligned with chunk size).

If latitude is promoted to a concat_dim the output is correct (with all of the latitude values included).

I may be misunderstanding the MultiZarrToZarr logic in this case where we have regularly sized, partially filled chunks. Is it possible to have a non-dimension variable concatenated in a linear fashion in this situation?

@martindurant
Copy link
Member

If latitude is promoted to a concat_dim the output is correct (with all of the latitude values included).

I think in this case, the coordinates get inlined as a single chunk in the output references, so the chunking of the original doesn't matter. The same could be done with any variable. This is not chunk inlining but whole array inlining to a single chunk, e.g., kerchunk.utils.inline_array . However, MultiZarrToZarr doesn't even bother accumulating values of non-coordinates, since they are not used for figuring out the concatenation chunk placement.

More generally, non-equal chunks are just not possible because of the limitations of zarr at least until ZEP003 is accepted and implemented ( zarr-developers/zarr-python#1483 ). If this were done, no magic inlining would be needed. Feel free to ping on the issue or the ZEP repo saying you need this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants