feat(memh5): zarr support #169

nritsche · 2021-04-01T18:23:43Z

TODO:

- _distributed_group_to_zarr
- unit tests
- test: convert some old HDF5 files to zarr
- run the daily processing pipeline with zarr
- We are creating <filename>.zarr.sync folders for zarr's synchronizer, but zarr doesn't delete them. Do we have to do that ourselves? See "InterProcessLock files left behind" (harlowja/fasteners#26).
- optimize striping for small (chunk) files: https://scicomp.aalto.fi/triton/usage/lustre/
- increase chunk size to ~16MB (compressed on disk)
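A minimal sketch of the cleanup the `.zarr.sync` item asks about, assuming the sync folder is simply the store path with `.sync` appended (as described above). The helper name `remove_sync_dir` is hypothetical and not part of caput or zarr.

```python
import shutil
from pathlib import Path


def remove_sync_dir(zarr_path):
    """Delete the `<zarr_path>.sync` directory if it exists.

    Returns True if a directory was removed, False otherwise.
    """
    # Assumption: the synchronizer folder sits next to the store and is
    # named by appending ".sync" to the store path.
    sync_dir = Path(str(zarr_path) + ".sync")
    if sync_dir.is_dir():
        shutil.rmtree(sync_dir)
        return True
    return False
```

In an MPI run this would presumably be called from a single rank, after a barrier, once every process has finished writing.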

Also see radiocosmology/draco#130
and chime-experiment/ch_pipeline#82

Closes https://github.com/chime-experiment/Pipeline/issues/33

Edit by @anjakefala
As of 2021-08-23, here is the discussed remaining work for this PR, along with any unchecked boxes above:

- remove references to chunkify()
- ignore compression and chunking options when saving to .h5 (log at DEBUG level)
- add zarr to the base chime environment on Cedar before we set the pipeline running with this new code
- make sure the docs can build and upload to readthedocs; one way is to make bitshuffle an optional dependency for building the docs, another is to build and upload the docs through github actions (see ch_util)
  - Update: ch_util actually pushes the docs to github pages, not readthedocs; see "ci(docs): move to github-pages" (chime-experiment/ch_util#10)
- clean up commits made by @anjakefala
- @jrs65 and @anjakefala review the full PR
- get a successful run with test_daily, then a successful run with the full pipeline
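The "ignore compression and chunking options when saving to .h5" item could look roughly like the sketch below. The function name, the option names in `IGNORED_FOR_HDF5`, and the `"hdf5"` format string are assumptions for illustration, not caput's actual API.

```python
import logging

logger = logging.getLogger(__name__)

# Assumed names of the options that this sketch drops for HDF5 output.
IGNORED_FOR_HDF5 = ("compression", "compression_opts", "chunks")


def filter_write_options(file_format, **options):
    """Return the write options, dropping compression/chunking for HDF5."""
    if file_format != "hdf5":
        return options

    kept = {}
    for key, value in options.items():
        if key in IGNORED_FOR_HDF5:
            # Only log when the caller actually asked for something.
            if value is not None:
                logger.debug("Ignoring %s=%r when saving to HDF5.", key, value)
            continue
        kept[key] = value
    return kept
```

Logging at DEBUG keeps existing HDF5 workflows quiet while still leaving a trace of the dropped options.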

caput/mpiarray.py

lgtm-com · 2021-05-07T23:27:50Z

This pull request introduces 1 alert when merging 695c556 into 2fc0149 - view on LGTM.com

new alerts:

1 for Variable defined multiple times

lgtm-com · 2021-05-12T19:15:25Z

This pull request introduces 1 alert when merging 7499ed7 into 69a7bfa - view on LGTM.com

new alerts:

1 for Variable defined multiple times

caput/pipeline.py

caput/tests/conftest.py

caput/tod.py

setup.py

caput/memh5.py

caput/pipeline.py

setup.py

caput/config.py

caput/fileformats.py

jrs65 · 2021-07-08T16:56:55Z

caput/tests/test_truncate.py

@@ -0,0 +1,77 @@
+import numpy as np


@tristpinsm have you had a look at these tests? Do they seem reasonable? I think I remember there were a bunch of edge cases you needed to deal with, I wonder if we should add specific tests for them.

Yeah, I guess it could be useful to test the behaviour of some edge cases, if only to be aware of their outcomes. These could include:

- precision set to 0/inf/NaN
- weight set to 0/inf/NaN
- value set to inf/NaN
- combinations of the above
- truncating negative numbers (not exactly an edge case)

More generally, a more robust test that I used when developing these functions was to generate an ensemble of random numbers, truncate them, and check that the distribution of truncation errors was what you expect. That might be cumbersome and difficult to implement in this context, though.

Another thing to note is there is a comment that claims one of the tests fails...

Let's add a few special cases in for this.
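A rough sketch of the ensemble-style check described above. `truncate_relative` is a pure-NumPy stand-in written for this example, not caput's truncation routine. It has no weights and rejects invalid precisions outright, both of which are assumptions of the sketch.

```python
import numpy as np


def truncate_relative(values, precision):
    """Round float64 values so the relative error is at most `precision`.

    Stand-in for illustration only; not caput's implementation.
    """
    if not (np.isfinite(precision) and 0 < precision < 1):
        raise ValueError("precision must be finite and in (0, 1)")

    values = np.atleast_1d(np.asarray(values, dtype=np.float64))
    out = values.copy()

    # Leave zeros, inf and NaN untouched.
    mask = np.isfinite(values) & (values != 0)

    # Decompose v = m * 2**e with 0.5 <= |m| < 1, then round the mantissa
    # to `bits` binary digits; the relative error is at most 2**-bits.
    mantissa, exponent = np.frexp(values[mask])
    bits = int(np.ceil(-np.log2(precision)))
    scale = 2.0**bits
    out[mask] = np.ldexp(np.round(mantissa * scale) / scale, exponent)
    return out


def max_truncation_error(n=10_000, precision=1e-3, seed=0):
    """Truncate a random ensemble and return the largest relative error."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n) * 1e3  # includes negative numbers
    truncated = truncate_relative(x, precision)
    return np.max(np.abs(truncated - x) / np.abs(x))
```

A test in this style would assert that `max_truncation_error()` stays within the requested precision, and would add one explicit case per item in the list above.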

caput/memh5.py

jrs65 · 2022-02-24T17:39:44Z

caput/memh5.py

                deep_group_copy(
                    self,
                    f,
                    convert_attribute_strings=convert_attribute_strings,
                    convert_dataset_strings=convert_dataset_strings,
+                    file_format=file_format,


I'm not sure why these need to get passed into deep_group_copy???

This seems to be very confusing. We should try to simplify.

The problem is with file_format; the compression opts have been removed.

caput/memh5.py

jrs65 · 2022-02-24T20:35:50Z

caput/mpiarray.py

@@ -712,12 +770,11 @@ def to_hdf5(

        Parameters
        ----------
-        filename : str, h5py.File or h5py.Group
+        f : str, h5py.File or h5py.Group
            File to write dataset into.
        dataset : string
            Name of dataset to write into. Should not exist.


Needs documentation for all parameters.

Oh wow, they are not documented anywhere.

Still need doc entry for chunks

caput/pipeline.py

jrs65 · 2022-02-24T20:43:42Z

caput/tests/test_selection_parallel.py

+@pytest.fixture
+def h5_file_select_parallel(datasets, h5_file):
+    """Prepare an HDF5 file for the select_parallel tests."""
+    if comm.rank == 0:
+        m1 = mpiarray.MPIArray.wrap(datasets[0], axis=0, comm=MPI.COMM_SELF)
+        m2 = mpiarray.MPIArray.wrap(datasets[1], axis=0, comm=MPI.COMM_SELF)
+        container = MemGroup(distributed=True, comm=MPI.COMM_SELF)
+        container.create_dataset("dset1", data=m1, distributed=True)
+        container.create_dataset("dset2", data=m2, distributed=True)
+        container.to_hdf5(h5_file)
+
+    comm.Barrier()
+
+    yield h5_file, datasets

-    fname = "tmp_test_memh5_select_parallel.h5"
+    comm.Barrier()

+    if comm.rank == 0:
+        rm_all_files(h5_file)
+
+
+@pytest.fixture
+def zarr_file_select_parallel(datasets, zarr_file):
+    """Prepare a Zarr file for the select_parallel tests."""
    if comm.rank == 0:
        m1 = mpiarray.MPIArray.wrap(datasets[0], axis=0, comm=MPI.COMM_SELF)
        m2 = mpiarray.MPIArray.wrap(datasets[1], axis=0, comm=MPI.COMM_SELF)
        container = MemGroup(distributed=True, comm=MPI.COMM_SELF)
        container.create_dataset("dset1", data=m1, distributed=True)
        container.create_dataset("dset2", data=m2, distributed=True)
-        container.to_hdf5(fname)
+        container.to_file(zarr_file, file_format=fileformats.Zarr)

    comm.Barrier()

-    yield fname, datasets
+    yield zarr_file, datasets

    comm.Barrier()

    # tear down
-
    if comm.rank == 0:
-        file_names = glob.glob(fname + "*")
-        for fname in file_names:
-            os.remove(fname)
+        rm_all_files(zarr_file)


It does seem like this could factor out the memh5 container creation

A nicety, but not very important.

I can do this when I fix the tests.
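One way the shared container creation might be factored out, using only the calls that already appear in the diff above. The helper name `make_select_parallel_container` is hypothetical, and the imports are done lazily here only so the sketch can be defined without MPI installed.

```python
def make_select_parallel_container(datasets):
    """Build the single-rank container shared by both fixtures."""
    # Imported lazily so the helper can be defined without MPI available.
    from mpi4py import MPI
    from caput import mpiarray
    from caput.memh5 import MemGroup

    container = MemGroup(distributed=True, comm=MPI.COMM_SELF)
    for name, data in zip(("dset1", "dset2"), datasets):
        wrapped = mpiarray.MPIArray.wrap(data, axis=0, comm=MPI.COMM_SELF)
        container.create_dataset(name, data=wrapped, distributed=True)
    return container
```

Each fixture would then, on rank 0, call `make_select_parallel_container(datasets).to_hdf5(h5_file)` or `.to_file(zarr_file, file_format=fileformats.Zarr)` and keep its own barrier and teardown.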

jrs65 · 2022-02-24T20:45:20Z

I added a bunch of comments; they're not all that actionable, and some are just notes to myself wondering how something is meant to work.

setup.py

jrs65 · 2022-02-25T22:40:12Z

caput/tests/test_truncate.py

@@ -0,0 +1,77 @@
+import numpy as np


Let's add a few special cases in for this.

tristpinsm · 2022-03-02T00:09:06Z

The last two commits I added (allowing non-contiguous arrays in truncate and an additional check on attributes) should probably be discussed

tristpinsm · 2022-03-04T20:42:48Z

caput/config.py

@@ -510,7 +510,7 @@ class Project:

            mode = file_format(default='zarr')
    """
-    options = ("hdf5", "zarr", None)


Without a None default option, the caput.fileformats.check_file_format guessing functionality is broken.

@tristpinsm is looking into this.
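To illustrate why the `None` default matters, here is a sketch of format guessing that only kicks in when no format was given. `guess_file_format`, the recognised extensions, and the `"hdf5"`/`"zarr"` strings are assumptions for this example; it is not the implementation of `caput.fileformats.check_file_format`.

```python
import os

KNOWN_FORMATS = ("hdf5", "zarr")


def guess_file_format(file_name, file_format=None):
    """Return the file format, guessing from the extension if not given."""
    if file_format is not None:
        if file_format not in KNOWN_FORMATS:
            raise ValueError(f"Unknown file format: {file_format!r}")
        # An explicit choice always wins over the extension.
        return file_format

    # Only guess when the caller left the format unset.
    ext = os.path.splitext(str(file_name).rstrip("/"))[1].lower()
    if ext == ".zarr":
        return "zarr"
    if ext in (".h5", ".hdf5"):
        return "hdf5"
    raise ValueError(f"Cannot guess file format of {file_name!r}")
```

If the config option can never be `None`, the guessing branch is unreachable and the extension is never consulted, which is the breakage described above.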

Add support for serialising data into Zarr format files. This enables compression for distributed containers.

Co-authored-by: Anja Kefala <anja.kefala@gmail.com>
Co-authored-by: Tristan Pinsonneault-Marotte <tristpinsm@gmail.com>
Co-authored-by: Richard Shaw <richard@phas.ubc.ca>

anjakefala · 2022-05-30T18:53:39Z

👏 👏 👏 👏

jrs65 · 2022-05-30T18:55:55Z

👏 👏 👏 👏

Slowly getting there. Now I'm fighting to merge mpiarray in!!

nritsche force-pushed the rn/zarr branch 2 times, most recently from a6ab52e to 8b6656c Compare April 13, 2021 23:02

anjakefala reviewed Apr 15, 2021

caput/mpiarray.py (outdated, resolved)

nritsche force-pushed the rn/zarr branch 2 times, most recently from 1a61bcc to 22a159d Compare April 19, 2021 20:05

nritsche force-pushed the rn/zarr branch 4 times, most recently from e6bcdcb to 977f32a Compare April 28, 2021 23:58

radiocosmology deleted a comment from lgtm-com bot Apr 29, 2021

nritsche force-pushed the rn/zarr branch 6 times, most recently from 191d37a to c03d3f8 Compare May 4, 2021 02:31

nritsche mentioned this pull request May 6, 2021

feat(task): zarr support radiocosmology/draco#130

Merged

nritsche force-pushed the rn/zarr branch 3 times, most recently from dcd6fa2 to ceef244 Compare May 13, 2021 17:54

nritsche force-pushed the rn/zarr branch from 2c96ef3 to a0a8904 Compare June 7, 2021 22:48

nritsche requested a review from jrs65 July 6, 2021 17:58

nritsche marked this pull request as ready for review July 9, 2021 20:33

anjakefala force-pushed the rn/zarr branch 5 times, most recently from 8c37c6e to 21f081b Compare August 20, 2021 21:50

anjakefala reviewed Sep 16, 2021

caput/pipeline.py (resolved)

anjakefala reviewed Sep 16, 2021

caput/tests/conftest.py (resolved)

anjakefala reviewed Sep 16, 2021

caput/tod.py (resolved)

anjakefala reviewed Sep 16, 2021

setup.py (resolved)

anjakefala reviewed Sep 17, 2021

caput/memh5.py (resolved)

anjakefala force-pushed the rn/zarr branch from b553d92 to 9bbc8c9 Compare September 20, 2021 18:09

anjakefala reviewed Sep 20, 2021

caput/pipeline.py (outdated, resolved)

anjakefala reviewed Sep 20, 2021

setup.py (outdated, resolved)

anjakefala force-pushed the rn/zarr branch from ea1fa37 to eb88d87 Compare October 4, 2021 17:42

jrs65 requested changes Feb 24, 2022

jrs65 reviewed Feb 25, 2022

tristpinsm reviewed Mar 4, 2022

anjakefala mentioned this pull request Mar 14, 2022

Merge of rn/zarr and mpiarray for testing #195

Closed

jrs65 force-pushed the rn/zarr branch from 939a926 to b8160e2 Compare May 30, 2022 16:56

nritsche and others added 7 commits May 30, 2022 11:29

refactor(memh5): create a single routine to handle all distributed writes

1da78b6

This significantly cleans up the handling of distributed file writes and condenses the logic into a single flow for all file types.

feat(truncate): add truncate for double values

d7ec065

Co-authored-by: Tristan Pinsonneault-Marotte <tristpinsm@gmail.com>

test(truncate): Add special cases.

47bac08

fix(profile): workaround missing metrics on macOS

05f3278

ci: update workflow and use bitshuffle binary wheels

8c8d023

doc: restrict sphinx version to workaround bug in v5.0

92fc4a2

jrs65 force-pushed the rn/zarr branch from 1403147 to 92fc4a2 Compare May 30, 2022 18:29

jrs65 self-requested a review May 30, 2022 18:34

jrs65 approved these changes May 30, 2022

jrs65 merged commit 75987f1 into master May 30, 2022

jrs65 deleted the rn/zarr branch May 30, 2022 18:34
