feat(memh5): zarr support #169
This pull request introduces 1 alert when merging 695c556 into 2fc0149 - view on LGTM.com
This pull request introduces 1 alert when merging 7499ed7 into 69a7bfa - view on LGTM.com
@@ -0,0 +1,77 @@
import numpy as np
@tristpinsm have you had a look at these tests? Do they seem reasonable? I remember there were a bunch of edge cases you needed to deal with; I wonder if we should add specific tests for them.
yeah I guess it could be useful to test the behaviour of some edge cases, if only to be aware of their outcomes. These could include:
- precision set to 0/inf/NaN
- weight set to 0/inf/NaN
- value set to inf/NaN
- combinations of the above
- not an edge case exactly but truncating negative numbers
More generally a more robust test that I was using when developing these functions was to generate an ensemble of random numbers, truncate them, and check that the distribution of truncation errors was what you expect. That might be cumbersome and difficult to implement in this context though.
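The ensemble check described above could be sketched roughly like this. `truncate_to_precision` is a hypothetical stand-in for the actual truncation routines in this PR, implemented here as simple rounding to a fixed precision so the sketch is self-contained:

```python
import numpy as np

def truncate_to_precision(values, prec):
    # Hypothetical stand-in for the real truncation routine: round each
    # value to the nearest multiple of `prec`.
    return np.round(values / prec) * prec

rng = np.random.default_rng(42)
values = rng.standard_normal(100_000)
prec = 1e-3

errors = truncate_to_precision(values, prec) - values

# Rounding errors should be bounded by prec / 2 and roughly uniform on
# [-prec/2, prec/2], so their standard deviation should be close to
# prec / sqrt(12).
max_err = np.abs(errors).max()
err_std = errors.std()
```

For the actual bit-truncation routines the expected error distribution would differ (it depends on the exponent of each value), but the same generate/truncate/check structure applies.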
Another thing to note: there is a comment claiming that one of the tests fails...
Let's add a few special cases in for this.
deep_group_copy(
    self,
    f,
    convert_attribute_strings=convert_attribute_strings,
    convert_dataset_strings=convert_dataset_strings,
    file_format=file_format,
I'm not sure why these need to get passed into deep_group_copy
???
This seems to be very confusing. We should try to simplify.
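For what it's worth, why keyword options like the string-conversion flags get threaded through a recursive group copy can be illustrated with a simplified sketch. This is not caput's actual implementation; plain dicts stand in for real HDF5/Zarr groups:

```python
def deep_group_copy_sketch(src, dest, convert_dataset_strings=False):
    # Every level of the hierarchy needs the conversion option when it
    # copies its own "datasets", which is why it is passed all the way
    # down the recursion rather than applied only at the top level.
    for name, item in src.items():
        if isinstance(item, dict):
            # Sub-group: recurse, threading the option through.
            dest[name] = {}
            deep_group_copy_sketch(item, dest[name], convert_dataset_strings)
        else:
            # Leaf "dataset": apply the conversion here.
            if convert_dataset_strings and isinstance(item, bytes):
                item = item.decode("utf-8")
            dest[name] = item

src = {"meta": {"name": b"dset"}, "data": [1, 2, 3]}
dest = {}
deep_group_copy_sketch(src, dest, convert_dataset_strings=True)
```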
The problem is with file_format; the compression opts have been removed.
@@ -712,12 +770,11 @@ def to_hdf5(

    Parameters
    ----------
-   filename : str, h5py.File or h5py.Group
+   f : str, h5py.File or h5py.Group
        File to write dataset into.
    dataset : string
        Name of dataset to write into. Should not exist.
Needs documentation for all parameters.
Oh wow, they are not documented anywhere.
Still need a doc entry for chunks.
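A sketch of what a complete numpydoc entry could look like, including the missing chunks entry. The `chunks` and `compression` descriptions here are illustrative; the real signature and defaults may differ:

```python
def to_hdf5(self, f, dataset, chunks=None, compression=None):
    """Write this dataset out to an HDF5 file.

    Parameters
    ----------
    f : str, h5py.File or h5py.Group
        File to write dataset into.
    dataset : str
        Name of dataset to write into. Should not exist.
    chunks : tuple of int, optional
        Chunk shape used when creating the on-disk dataset; if None,
        the library default chunking is used.
    compression : str, optional
        Name of the compression filter to apply, if any.
    """
```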
@pytest.fixture
def h5_file_select_parallel(datasets, h5_file):
    """Prepare an HDF5 file for the select_parallel tests."""
    if comm.rank == 0:
        m1 = mpiarray.MPIArray.wrap(datasets[0], axis=0, comm=MPI.COMM_SELF)
        m2 = mpiarray.MPIArray.wrap(datasets[1], axis=0, comm=MPI.COMM_SELF)
        container = MemGroup(distributed=True, comm=MPI.COMM_SELF)
        container.create_dataset("dset1", data=m1, distributed=True)
        container.create_dataset("dset2", data=m2, distributed=True)
        container.to_hdf5(h5_file)

    comm.Barrier()

    yield h5_file, datasets

    comm.Barrier()

    if comm.rank == 0:
        rm_all_files(h5_file)


@pytest.fixture
def zarr_file_select_parallel(datasets, zarr_file):
    """Prepare a Zarr file for the select_parallel tests."""
    if comm.rank == 0:
        m1 = mpiarray.MPIArray.wrap(datasets[0], axis=0, comm=MPI.COMM_SELF)
        m2 = mpiarray.MPIArray.wrap(datasets[1], axis=0, comm=MPI.COMM_SELF)
        container = MemGroup(distributed=True, comm=MPI.COMM_SELF)
        container.create_dataset("dset1", data=m1, distributed=True)
        container.create_dataset("dset2", data=m2, distributed=True)
        container.to_file(zarr_file, file_format=fileformats.Zarr)

    comm.Barrier()

    yield zarr_file, datasets

    comm.Barrier()

    # tear down
    if comm.rank == 0:
        rm_all_files(zarr_file)
It does seem like this could factor out the memh5 container creation
A nicety, but not very important.
I can do this when I fix the tests.
I added a bunch of comments; they're not all that actionable, some are just notes to myself wondering how something is meant to work.
@@ -0,0 +1,77 @@
import numpy as np
Let's add a few special cases in for this.
The last two commits I added (allowing non-contiguous arrays in truncate and an additional check on attributes) should probably be discussed
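On the non-contiguous point, a minimal sketch of the usual pattern (this is an illustration, not the actual commit): if the truncation kernel needs a contiguous buffer, copy non-contiguous input first instead of rejecting it.

```python
import numpy as np

def truncate_sketch(arr, prec):
    # Make a contiguous copy of strided input before handing it to a
    # kernel that requires a contiguous buffer.
    if not arr.flags["C_CONTIGUOUS"]:
        arr = np.ascontiguousarray(arr)
    # Stand-in truncation: round to the nearest multiple of `prec`.
    return np.round(arr / prec) * prec

x = np.arange(10.0)[::2]  # a strided, non-contiguous view
y = truncate_sketch(x, 0.5)
```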
caput/config.py (Outdated)

@@ -510,7 +510,7 @@ class Project:

    mode = file_format(default='zarr')
    """
    options = ("hdf5", "zarr", None)
Without a None default option, the caput.fileformats.check_file_format guessing functionality is broken.
@tristpinsm is looking into this.
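A sketch of why the None option matters. This is a hypothetical simplified version of the guessing logic, not caput's actual code:

```python
def check_file_format_sketch(filename, file_format=None):
    # Only when file_format is None do we fall back to guessing from the
    # file extension; if None is removed from the allowed config options,
    # this guessing path can never be reached.
    if file_format is not None:
        return file_format
    if filename.endswith((".h5", ".hdf5")):
        return "hdf5"
    if filename.endswith(".zarr"):
        return "zarr"
    raise ValueError(f"could not guess file format for {filename!r}")
```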
Add support for serialising data into Zarr format files.

This enables compression for distributed containers.

Co-authored-by: Anja Kefala <anja.kefala@gmail.com>
Co-authored-by: Tristan Pinsonneault-Marotte <tristpinsm@gmail.com>
Co-authored-by: Richard Shaw <richard@phas.ubc.ca>

…ites

This significantly cleans up the handling of distributed file writes and condenses the logic into a single flow for all file types.

Co-authored-by: Tristan Pinsonneault-Marotte <tristpinsm@gmail.com>
👏 👏 👏 👏
Slowly getting there. Now I'm fighting to merge mpiarray in!!
TODO:
- <filename>.zarr.sync folders are created for zarr's synchronizer, but they don't get deleted by zarr. Do we have to do that? (InterProcessLock files left behind: harlowja/fasteners#26.) Also see radiocosmology/draco#130 and chime-experiment/ch_pipeline#82.
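A sketch of the cleanup that would be needed, assuming the synchronizer directory sits next to the store as <store>.sync:

```python
import os
import shutil

def remove_zarr_sync_dir(store_path):
    # zarr's ProcessSynchronizer leaves a "<store>.sync" directory
    # behind; remove it ourselves after the write completes.
    sync_dir = store_path + ".sync"
    if os.path.isdir(sync_dir):
        shutil.rmtree(sync_dir)
```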
Closes https://github.com/chime-experiment/Pipeline/issues/33
Edit by @anjakefala

As of 2021-08-23, here is the discussed remaining work for this PR, along with any unchecked boxes above:
- chunkify()
- .h5 (log at DEBUG level)
- Add zarr to the base chime environment on Cedar, before we set the pipeline running with this new code
- Make bitshuffle an optional dependency for building the docs, or build the docs and upload them through readthedocs through github actions (see ch_util). Update: ch_util actually pushes the docs to github pages, not readthedocs: ci(docs): move to github-pages chime-experiment/ch_util#10
- Run test_daily and then get a successful run with the full pipeline.