
feat(memh5): implement .copy() method #193

Closed
wants to merge 29 commits into from

Conversation

leungcalvin
Contributor

Implements a .copy() method for memh5 objects that makes a deep copy of datasets. It currently does not support nested groups or non-numpy arrays as datasets.

@leungcalvin
Contributor Author

MemDiskGroup objects now have a copy() method that works recursively, along with corresponding unit tests, but it still assumes all datasets are np.arrays.

caput/memh5.py Outdated
self[key][:], order="A"
), # ...so let's ensure that deep copies are actually made.
)
if hasattr(self, "index_map"):
Contributor


1. You should never have to explicitly copy index maps, since they are just an abstraction over an internal set of groups and datasets.
2. MemGroups don't have index maps.

caput/memh5.py Outdated
@@ -1774,6 +1796,41 @@ def save(
**kwargs
)

def copy(self):
Contributor


This function probably shouldn't reimplement crawling through the tree, since that has already been implemented for the underlying storage (MemGroup.copy() or deep_group_copy() on an hdf5 group). You will just want to call that.
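The delegation suggested above could look roughly like the sketch below. It uses plain nested dicts as a stand-in for real memh5 groups, and `deep_group_copy`/`Group` here are illustrative toys, not the actual caput API:

```python
import copy as copy_mod

def deep_group_copy(src, dest):
    # Hypothetical stand-in for memh5's tree-crawling helper: recurse into
    # subgroups and deep-copy leaf datasets.
    for key, value in src.items():
        if isinstance(value, dict):      # subgroup: recurse
            dest[key] = {}
            deep_group_copy(value, dest[key])
        else:                            # dataset: deep-copy the data
            dest[key] = copy_mod.deepcopy(value)

class Group:
    """Toy group wrapping a nested dict (illustrative only)."""
    def __init__(self):
        self.tree = {}

    def copy(self):
        # Delegate crawling to the shared helper instead of reimplementing it.
        new = Group()
        deep_group_copy(self.tree, new.tree)
        return new
```

The point is that copy() stays a thin wrapper; all the recursion lives in one place.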

Contributor


To be fair, I mostly said no to doing that, as I don't want anything to unexpectedly break by changing the behaviour of deep_group_copy.

Contributor


Could you add a shared argument which is a list of paths to datasets which should remain shared? That way I can use it to entirely replace the version in draco.core.containers.ContainerBase
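As a rough illustration of the requested shared keyword, sketched over a flat dict of datasets rather than the real memh5 API (names here are hypothetical):

```python
import copy

def copy_datasets(src, shared=()):
    """Deep-copy a mapping of datasets, except for paths listed in
    `shared`, which keep references to the original data."""
    out = {}
    for path, data in src.items():
        # Shared paths are passed through by reference; everything else
        # gets an independent deep copy.
        out[path] = data if path in shared else copy.deepcopy(data)
    return out
```

A container-level copy(shared=["vis"]) would then let large datasets be reused without duplication.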

Contributor


I do kind of agree with @kiyo-masui here too. This seems a little confusing here. Either it should live in MemDiskGroup and blindly copy the tree (which should work fine), or it should live in BasicCont, where it can exploit the restricted structure of that class (i.e. non-recursive, index_maps, etc.). At the moment it does a bit of a hybrid: it recursively copies everything, but with special treatment for index_maps, which aren't even a concept in MemDiskGroup.

What you're doing kind of works, but as a bit of style advice: a method should generally only call functionality it knows is there. Here you've protected the calls with a conditional, but needing to do this is a good indicator that you should structure things differently:

  • Move the method down in the class hierarchy (i.e. into the subclass)
  • Move functionality into a hook method
  • Move the functionality you're calling up in the class hierarchy
  • Use an overload (you can call the base class method, and then do your own work)
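The overload option from the list above might look like this minimal sketch; BaseGroup and Container are hypothetical stand-ins for MemDiskGroup and BasicCont:

```python
class BaseGroup:
    """Stand-in for MemDiskGroup: copies only state it knows about."""
    def __init__(self, datasets=None):
        self.datasets = dict(datasets or {})

    def copy(self):
        return type(self)(self.datasets)

class Container(BaseGroup):
    """Stand-in for BasicCont, which adds index_map-like state."""
    def __init__(self, datasets=None, index_map=None):
        super().__init__(datasets)
        self.index_map = dict(index_map or {})

    def copy(self):
        # Overload: reuse the base behaviour, then handle subclass-only
        # state, rather than guarding base-class code with hasattr() checks.
        new = super().copy()
        new.index_map = dict(self.index_map)
        return new
```

Each class only touches attributes it defines, so no conditional probing is needed.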

caput/memh5.py Outdated
Could be generalized to MemGroups whose datasets are not all numpy arrays.
In particular, if the data structure is hierarchical (datasets more than one layer deep), this will complain and fail.
"""
for key in new.keys():
Contributor


Given that new isn't defined here, I don't see how this would work, so you could probably do with another test.

Contributor

@jrs65 jrs65 left a comment


Hey @leungcalvin, this is looking like a good start. Thanks for this. I've added a few review comments. Ping me here or on Slack if you have any questions about them.

caput/memh5.py Outdated
_copy(f[key], g_key)
else:
try:
g.create_dataset(
Contributor


I'm pretty sure this won't work correctly with distributed datasets:

  • It doesn't pass on the distributed keywords
  • It doesn't attempt to pass the array via an MPIArray which would be expected for creating a distributed dataset (this may or may not work)
  • This may also strip any chunking or compression params that were set on the source.

Contributor Author


Can you show me where to look to figure out how to write this?
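One part of the concern above (dropped chunking/compression parameters) can be sketched generically. FakeDataset is a toy stand-in; the real h5py/memh5 attribute names and create_dataset keywords should be checked against caput, and the distributed/MPIArray handling is a separate piece not shown here:

```python
class FakeDataset:
    """Minimal stand-in for an h5py/memh5 dataset (illustrative only)."""
    def __init__(self, data, chunks=None, compression=None):
        self.data = list(data)
        self.chunks = chunks
        self.compression = compression

    def __getitem__(self, sl):
        return self.data[sl]

def copy_dataset(dset, dest_group, name):
    # Forward creation parameters from the source dataset instead of
    # silently dropping them when recreating it in the destination group.
    dest_group[name] = FakeDataset(
        dset[:],
        chunks=dset.chunks,
        compression=dset.compression,
    )
```

The same pattern would apply to distribution keywords: read them off the source dataset and pass them through on creation.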


@leungcalvin
Contributor Author

Apologies for the last commit; I was careless with a git pull and ended up unknowingly having two implementations of copy floating around. I have refactored everything to work for MemDiskGroup, and I've implemented the shared keyword.

@leungcalvin
Contributor Author

Does either of you know the source of the readthedocs error? @jrs65 @kiyo-masui

@kiyo-masui
Contributor

I do not know where the error is from. However, readthedocs parses the docstrings, so it's possible they aren't formatted correctly (it expects a very particular formatting convention).

@ketiltrout
Member

It's apparently a Sphinx bug, sphinx-doc/sphinx#9817, which is fixed in Sphinx 4.3, but we've pinned Sphinx to version 3.4.2.

@leungcalvin
Contributor Author

leungcalvin commented Apr 7, 2022

@zkader and I would like to push this soon. Is the Sphinx bug not an issue for the main branch of caput? Is there some way to pull the docs from main?

@ljgray
Contributor

ljgray commented Mar 16, 2023

I'm going to close this, as this functionality was added in #236, based on feedback from this PR.

@ljgray ljgray closed this Mar 16, 2023
5 participants