In-memory representation of chunks: array instead of a dict? #33

TomNicholas · 2024-03-14T15:29:50Z

Currently chunks are stored as a mapping of chunk keys to chunk entries, i.e. a manifest dict. The ManifestArray is just a thin wrapper over this; internally it's not really an array at all. Its main purpose is to do lazy concatenation, which is implemented by manipulating manifest dicts.
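To make the current design concrete, here is a minimal sketch of a manifest dict and of concatenation expressed as chunk-key manipulation. The key format (`"0.1"`), the `(path, offset, length)` entry layout, and the helper name `concat_along_axis` are assumptions made for illustration, not the library's actual code.

```python
# Hypothetical sketch: a manifest maps chunk keys like "0.1" to
# (path, offset, length) entries. Not VirtualiZarr's real implementation.
Manifest = dict[str, tuple[str, int, int]]


def concat_along_axis(
    a: Manifest, b: Manifest, axis: int, a_chunks_along_axis: int
) -> Manifest:
    """Concatenate two manifests by shifting b's chunk indices along `axis`."""
    result = dict(a)
    for key, entry in b.items():
        idx = [int(i) for i in key.split(".")]
        idx[axis] += a_chunks_along_axis  # move b's keys past a's chunk grid
        result[".".join(str(i) for i in idx)] = entry
    return result


first = {"0.0": ("a.nc", 100, 50), "0.1": ("a.nc", 150, 50)}
second = {"0.0": ("b.nc", 100, 50), "0.1": ("b.nc", 150, 50)}

combined = concat_along_axis(first, second, axis=0, a_chunks_along_axis=1)
print(sorted(combined))
```

No chunk data is read at any point; only the keys change, which is what makes the concatenation lazy.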

However, as was pointed out by @dcherian in fsspec/kerchunk#377 (comment), a lazily concatenated array is essentially just a chunked array. We could imagine an alternative design where ManifestArray works more like a dask array, which holds a grid of chunks, each of which is itself an array (i.e. numpy arrays).

It might make sense to re-implement ManifestArray to organise chunks in the same way dask does, perhaps storing some sort of ChunkReferenceArray object in place of numpy arrays. This would then handle concatenation at the chunk level, allow for indexing (as long as the indexers aligned with chunk boundaries), and possibly be implemented by vendoring code from inside dask.

In this design the ChunkManifest would be something that could be built from the ManifestArray when needed, not the fundamental object. This change could be done without changing the existing API of ManifestArray.


TomNicholas · 2024-03-14T23:10:23Z

This is important for #16 and null value handling in general (#32), as it would allow for a ManifestArray of arbitrary shape to be backed by a manifest with all chunks, some chunks, or no chunks at all (so returning fill_value everywhere).

rabernat · 2024-03-14T23:26:09Z

There are basically three pieces of data you need for a chunk reference. This is roughly what our data structure looks like in Arraylake:

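from dataclasses import dataclass

@dataclass
class ReferenceData:
    uri: str     # location of the file that holds the chunk
    offset: int  # byte offset of the chunk within that file
    length: int  # length of the chunk in bytes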

Storing these in separate Zarr arrays would offer major advantages in terms of compression. For the int data, we could use the delta codec, plus a lossless compressor like Zstd, to massively crush down the data. For the uris, the VlenUTF8 codec (plus lossless compression) would work great.

It's likely that we could store millions of references this way using < 1MB of storage.
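A rough way to see why delta encoding helps is sketched below using only the standard library. Here `zlib` stands in for Zstd and a hand-written difference stands in for a real delta codec; the chunk count and chunk size are made-up values, so the sizes it prints are illustrative only.

```python
import struct
import zlib

# Assumed example: 100,000 equally sized chunks laid out back to back.
n_chunks = 100_000
chunk_length = 4096
offsets = [i * chunk_length for i in range(n_chunks)]


def pack(values):
    """Serialize ints as little-endian signed 64-bit values."""
    return struct.pack(f"<{len(values)}q", *values)


# Delta encoding: keep the first value, then store successive differences.
deltas = [offsets[0]] + [b - a for a, b in zip(offsets, offsets[1:])]

raw_size = len(pack(offsets))
plain_compressed = len(zlib.compress(pack(offsets)))
delta_compressed = len(zlib.compress(pack(deltas)))

# Decoding is a running sum, so the transform is lossless.
decoded = []
total = 0
for d in deltas:
    total += d
    decoded.append(total)

print(raw_size, plain_compressed, delta_compressed)
```

With regularly spaced chunks the deltas are almost all identical, which is the kind of input a general-purpose compressor handles best.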

TomNicholas · 2024-03-14T23:37:56Z

Your ReferenceData class is the same as my ChunkEntry class, i.e. one entry in a ChunkManifest.

https://github.com/TomNicholas/VirtualiZarr/blob/1e9273864b0e74a52249cf34b6d2afb5049e9e76/virtualizarr/manifests/manifest.py#L17

> Storing these in separate Zarr arrays would offer major advantages in terms of compression. For the int data, we could use the delta codec, plus a lossless compressor like Zstd, to massively crush down the data. For the uris, the VlenUTF8 codec (plus lossless compression) would work great.
>
> It's likely that we could store millions of references this way using < 1MB of storage.

This seems like a cool idea but possibly orthogonal to the design issue I'm talking about above? Sounds like you're suggesting a particular on-disk storage format for the chunk references. In this issue I'm talking only about the in-memory representation of the chunked n-dimensional array + manifest. We can write that to disk in a number of ways (as kerchunk json, as kerchunk parquet, or as your triple-zarr-array suggestion here etc.).

EDIT: That's a great point about compression though - the URL and length fields are likely to exhibit very little variation over the whole array, and so compress down to almost nothing.

rabernat · 2024-03-14T23:44:44Z

Yeah I see what you mean. In my defense, the title of this issue does begin with the word "Store"! 😆

For the issue you are talking about (in memory representation of references), I think having an array indexed by chunk positions makes a lot of sense. The main downside would be in the case of very sparsely populated manifests (relative to the full array), in which case this would involve a lot of unused memory compared to the dict. I suppose you could opt to use an in-memory sparse array for that case.

rabernat · 2024-03-15T20:42:57Z

Getting super meta here...

What if we used an Xarray Dataset for the in-memory references, with three different variables (path, offset, length)? Then we could use Xarray to do the concatenation, broadcasting, etc. For example, this would really simplify concat_manifests.

There is something very satisfying about the idea of using Xarray itself to manage the manifests.

TomNicholas · 2024-03-15T21:15:08Z

Interesting... I'm not seeing how that would really work though. We need to use different xarray Variables at the top-level for different netCDF variables / zarr arrays, so we would need 3 variables (path, offset, length) per variable (lat, lon, temp, etc.).

Or are you suggesting using xarray twice, at two levels of abstraction? Once to hold the chunk grid in each ManifestArray, and once to hold all the ManifestArrays. That would be kind of wild.

Another idea that's along the same lines would be to store the manifest entries in a structured numpy array, i.e. using a structured dtype. But that would require xarray to be able to wrap structured numpy arrays, which I bet it can't right now.

EDIT: Actually xarray wouldn't have to directly wrap the structured array, it would wrap a ManifestArray that wraps a structured array... That could actually work...

Also the concat_manifests isn't really that complicated in my opinion. The concatenation and broadcasting of the chunk manifests is pretty easy to describe just as manipulation of the chunk keys, that's one of the things I like about this design. It works well as an abstraction, it's only a problem if we think it's really inefficient / can't represent some case we need.

TomNicholas · 2024-03-15T21:30:09Z

This structured array idea could work nicely... You have a ManifestArray that has the shape and dtype of the netcdf variable / zarr array it refers to, but it wraps a single structured array with shape corresponding to the shape of the chunk grid.

The structured array has 3 fields, for path, offset and length. Offset and length are ints, but for the path we could use the variable-length string dtype that it looks like was merged into numpy only about a month ago! EDIT: Ahh crap but it won't come out until numpy 2.0

All concatenation / broadcasting of the ManifestArray would just defer to numpy handling concatenation / broadcasting of the wrapped structured array (which should then be super efficient). The ManifestArray just has a similar job to now - it carts around the correct .zarray info and calculates the new overall shape.
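A minimal sketch of that idea, assuming a hypothetical field layout and using a fixed-width unicode field in place of the variable-length string dtype (the helper `make_grid` and the file names are invented for illustration):

```python
import numpy as np

# Assumed field layout; "U32" is a fixed-width stand-in for the
# variable-length string dtype mentioned above.
entry_dtype = np.dtype([("path", "U32"), ("offset", "u8"), ("length", "u8")])


def make_grid(path: str, n_chunks: int, chunk_bytes: int) -> np.ndarray:
    """Build a 1 x n_chunks grid of references into a single file."""
    grid = np.zeros((1, n_chunks), dtype=entry_dtype)
    grid["path"] = path
    grid["offset"] = np.arange(n_chunks, dtype=np.uint64) * chunk_bytes
    grid["length"] = chunk_bytes
    return grid


a = make_grid("a.nc", n_chunks=3, chunk_bytes=100)
b = make_grid("b.nc", n_chunks=3, chunk_bytes=100)

# Concatenating the wrapped arrays is plain numpy concatenation of the
# chunk grids; no chunk keys need to be rewritten.
combined = np.concatenate([a, b], axis=0)
print(combined.shape)
print(combined[1, 2])
```

The wrapper would still need to work out the new overall array shape from the chunk grid shape and the chunk size.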

Or you could go even more meta and have a ManifestArray wrap a dask array which wraps numpy structured arrays? Then you can do concat via tree-reduce on the references...

TomNicholas · 2024-03-15T21:57:34Z

One downside of all these in-memory array ideas compared to the dictionary chunk manifest we currently have is I don't know how the array would support missing chunks. The structured array wouldn't have anywhere you could put a NaN either.

EDIT: I guess a path containing only an empty string could be understood to represent a missing chunk?
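One way the empty-string convention could look, sketched with three plain numpy arrays. The grid contents and file names are hypothetical, and treating `""` as "missing" is the assumption being illustrated, not an established convention.

```python
import numpy as np

# Hypothetical 2 x 2 chunk grid in which chunk (1, 0) is missing.
paths = np.array([["a.nc", "a.nc"], ["", "b.nc"]])
offsets = np.array([[0, 100], [0, 0]], dtype=np.uint64)
lengths = np.array([[100, 100], [0, 100]], dtype=np.uint64)

# An empty path marks a chunk with no data behind it.
missing = paths == ""

# Rebuild a dict-style manifest containing only the chunks that exist.
manifest = {
    ".".join(map(str, idx)): (str(paths[idx]), int(offsets[idx]), int(lengths[idx]))
    for idx in np.ndindex(paths.shape)
    if not missing[idx]
}
print(manifest)
```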

rabernat · 2024-03-15T23:16:32Z

The current dictionary encoding is basically a poor man's sparse array. 😆

rabernat · 2024-03-15T23:18:29Z

> I guess a path containing only an empty string could be understood to represent a missing chunk?

Or we could have a separate bitmask array.

The nice thing about manifest as dataset is that you can attach all kinds of chunk-level statistics. For example, chunk min, max, sum, count, etc.

TomNicholas · 2024-03-15T23:25:06Z

> The current dictionary encoding is basically a poor man's sparse array. 😆

Haha yeah it kinda is 😅

> Or we could have a separate bitmask array.

Okay so there are potentially multiple solutions to the missing chunk issue. @jhamman pointed out today that the main use case of this library is arrays where you do have every chunk (because netCDF files don't just omit chunks), so I don't think we are really talking about very sparse arrays anyway. The main reason to be able to represent NaNs is for padding with them (#22).

> The nice thing about manifest as dataset is that you can attach all kinds of chunk-level statistics. For example, chunk min, max, sum, count, etc.

You could do that in a structured array too, just by having extra fields. Not sure what the use case of those chunk-level statistics is in the context of this library though.

I'm tempted to make a PR to try out the structured array idea, as I feel like that's the most memory-optimized data structure we can use to represent the manifest (without writing something ourselves in rust #23).

TomNicholas · 2024-05-09T17:58:25Z

See #104 (comment) for a simple experiment showing that using 3 (dense) numpy arrays we should be able to represent a manifest that points to 1 million chunks using only ~24MB in-memory.
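As a back-of-the-envelope check on the integer part of that estimate, the sketch below measures two dense arrays for 1 million chunks. The 64-bit unsigned dtypes are an assumption, and the path array is left out because its footprint depends on which string dtype is used.

```python
import numpy as np

n_chunks = 1_000_000

# Assumed dtypes: 64-bit unsigned ints for byte offsets and lengths.
offsets = np.zeros(n_chunks, dtype=np.uint64)
lengths = np.zeros(n_chunks, dtype=np.uint64)

int_bytes = offsets.nbytes + lengths.nbytes
print(f"offsets + lengths: {int_bytes / 1e6:.0f} MB")  # 16 MB
```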

Also note that apparently it isn't possible to put numpy 2.0's variable-length string dtype into a numpy structured array (see Jeremy's comment zarr-developers/zarr-specs#287 (comment)), which means I need to change how I had started implementing #39 to use 3 separate arrays (for path, offset, and length) instead.

dcherian · 2024-05-09T21:20:08Z

FWIW I think the 3 array approach will be more performant: https://numpy.org/doc/stable/user/basics.rec.html#introduction

For instance, the C-struct-like memory layout of structured arrays in numpy can lead to poor cache behavior in comparison.
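The layout difference can be seen directly from the strides. This is only a sketch with an assumed record dtype (the `"U32"` path field is a hypothetical stand-in), and it shows the memory layout rather than measuring performance.

```python
import numpy as np

n = 1000
entry_dtype = np.dtype([("path", "U32"), ("offset", "u8"), ("length", "u8")])
structured = np.zeros(n, dtype=entry_dtype)
separate_offsets = np.zeros(n, dtype=np.uint64)

# A field of a structured array is a strided view: consecutive offsets are
# separated by the size of a whole record.
field_view = structured["offset"]
print(field_view.strides, field_view.flags["C_CONTIGUOUS"])

# A standalone array packs its values back to back.
print(separate_offsets.strides, separate_offsets.flags["C_CONTIGUOUS"])
```

Scanning only the offsets in the structured case therefore touches memory belonging to every record, while the separate array is read contiguously.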

TomNicholas changed the title ~~Store an array of chunks instead of a dict?~~ In-memory representation of chunks: array instead of a dict? Mar 15, 2024

TomNicholas mentioned this issue Mar 15, 2024

Treatment of final smaller chunk in the zarr model #38

Open

TomNicholas mentioned this issue Mar 17, 2024

[WIP] Structured array for manifest #39

Closed

2 tasks

This was referenced Mar 27, 2024

HTML repr for ManifestArray? #59

Open

Manifest storage transformer zarr-developers/zarr-specs#287

Open

This was referenced May 3, 2024

Writing to parquet (following kerchunk format) #72

Closed

Performance roadmap #104

Open

TomNicholas added the performance label May 9, 2024

TomNicholas mentioned this issue May 10, 2024

Use 3 numpy arrays for manifest internally #107

Merged

TomNicholas mentioned this issue Jun 8, 2024

Aspirational use case: [C]Worthy mCDR OAE Atlas dataset #132

Open

19 tasks

TomNicholas closed this as completed in #107 Jun 18, 2024

DahnJ mentioned this issue Jun 28, 2024

Incrementally-populated Zarr Arrays zarr-developers/zarr-specs#300

Open

TomNicholas mentioned this issue Jul 12, 2024

Support for numpy<2? #184

Closed
