memh5.from_group does not have the expected behavior #190

Closed
leungcalvin opened this issue Oct 17, 2021 · 9 comments

@leungcalvin
Contributor

This bug was discovered downstream, in baseband_analysis, whose main data container BBData inherits from BasicCont. I tried to make a new, independent instance of a BBData object that lives in memory. I did this using from_group, but it looks like the new object is, in fact, not a deep copy of the original (even though from_group calls something which is supposed to be deep_group_copy).

Modifying the datasets of the new object modifies the original. Is this the expected behavior? This is potentially dangerous for downstream analysis; tagging @jrs65 and @kiyo-masui because I'm so out of my depth here.

[screenshot]
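For concreteness, here is a minimal sketch of the reported behavior using a bare memh5.MemGroup instead of the downstream BBData container (an assumption: that the aliasing shows up at the memh5 level, which the rest of this thread confirms; the dataset name "vis" is made up):

import numpy as np
from caput import memh5

g = memh5.MemGroup()
g.create_dataset("vis", data=np.zeros(4))

# from_group is meant to produce an independent in-memory copy...
g2 = memh5.MemGroup.from_group(g)

g2["vis"][:] = 1.0   # mutate the "copy"
print(g["vis"][:])   # -> [1. 1. 1. 1.]; the original changed too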

@leungcalvin leungcalvin self-assigned this Oct 17, 2021
@kiyo-masui
Contributor

kiyo-masui commented Oct 17, 2021 via email

Took a look at the code and, believe it or not, there doesn't seem to be an official way to do what you want. That said, what you are doing is definitely wrong, and there are a couple of less-wrong things that might work. Try the following:

chime_bbdata2 = BBData()
memh5.deep_group_copy(chime_bbdata, chime_bbdata2)

or maybe

g = memh5.MemGroup.from_group(chime_bbdata)
chime_bbdata2 = BBData.from_group(g)

Both of these require importing memh5, which isn't ideal, since then you need to know memh5 exists. What should really happen is that an appropriate method or classmethod gets written for memh5.MemDiskGroup, or possibly memh5._BaseGroup, but that would require some thought.
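Purely as a sketch of the shape such a method might take (hypothetical: the name deep_copy is made up, and it would inherit whatever copy semantics deep_group_copy actually has):

from caput import memh5

def deep_copy(self):
    # Hypothetical convenience method on MemDiskGroup, not current caput API.
    # Make an empty container of the same type and copy everything into it.
    new = self.__class__()
    memh5.deep_group_copy(self, new)
    return new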

@jrs65
Contributor

jrs65 commented Oct 18, 2021

For CHIME cosmology we built this functionality in at a higher level of abstraction, in draco.core.containers.ContainerBase.copy, so we've never really noticed it was missing.

I think we could feasibly move the functionality down to a lower level (probably BasicCont) though if that would be helpful. It would require a little bit of refactoring, but provided the API remained the same I don't think we would care.

@leungcalvin
Contributor Author

leungcalvin commented Oct 18, 2021

Unfortunately, neither of @kiyo-masui's suggestions does the trick. The former is actually what's already being done in baseband-analysis here: https://github.com/CHIMEFRB/baseband-analysis/blob/312c2a473b0470c5728ecea85ca169cc1c4c1673/baseband_analysis/core/bbdata.py#L321

chime_bbdata2 = BBData()
memh5.deep_group_copy(chime_bbdata, chime_bbdata2)
[screenshot]

g = memh5.MemGroup.from_group(chime_bbdata)
chime_bbdata2 = BBData.from_group(g)

[screenshot]

@ginoble

ginoble commented Oct 18, 2021

Seems like memh5.deep_group_copy doesn't deep copy datasets. In particular, I think memh5.py#L2242 is shallow:

g2.create_dataset(
    key,
    shape=data.shape,
    dtype=data.dtype,
    data=data,
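    # when g2 is an in-memory group, data is stored by reference (no copy is made)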
    chunks=entry.chunks,
    compression=entry.compression,
    compression_opts=entry.compression_opts,
)

Is this intentional? If not, can we simply do something like g2.create_dataset(..., data=data.copy(order="A"), ...) to avoid API changes?
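Spelled out against the snippet above, the patched call would read (same arguments, with only data changed):

g2.create_dataset(
    key,
    shape=data.shape,
    dtype=data.dtype,
    data=data.copy(order="A"),  # force a real copy of the underlying array
    chunks=entry.chunks,
    compression=entry.compression,
    compression_opts=entry.compression_opts,
)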

@kiyo-masui
Contributor

That's baffling. Digging through many layers of git blame, this has been the same since I first wrote it 8 years ago... inexcusable. It's possible that the behavior of g.create_dataset() changed; I didn't dig through that blame. Anyway, that's obviously not how deep_group_copy() should behave, given its name. @ginoble's suggestion should work.

I would advocate also adding draco's .copy() to MemDiskGroup though. Seems like good functionality to have.

@leungcalvin
Contributor Author

leungcalvin commented Oct 18, 2021

I think the suggested fix is sufficient, in the sense that the other parts of deep_group_copy (namely copyattrs) make deep copies, as far as I can tell. Eight years... who knows, maybe numpy didn't do shallow copies back then. Does the suggested fix adversely affect any cosmology analysis? In any case, I will make a PR for caput implementing Gavin's one-liner, as follows. Could somebody merge it? @jrs65, would you mind checking that?

[screenshot of the proposed one-line change]

@jrs65
Contributor

jrs65 commented Oct 18, 2021

I'm not sure it's really a bug here, but it is at least poor naming. I think it's effectively a deep copy when doing hdf5 <-> memory, but not when doing memory -> memory (writing into an HDF5 file necessarily copies the bytes to disk, whereas an in-memory destination can just store a reference to the array).

Unfortunately, I'm going to veto Gavin's fix as is. Regardless of whether it's a bug or not, it's been in there way too long, and I imagine parts of our pipeline implicitly depend on the way it works; I'm really not keen on having to go and find them.

I would mostly recommend writing an actual .copy() method, as it's clear what it should do.

I would also be okay with Gavin's fix being hidden behind an option to the deep_group_copy routine (and obviously not the default). However, you need to be careful there: the object you'd be attempting to copy is often not a numpy array; it could be a distributed MPIArray or an h5py Dataset, and I think the former has different copy semantics while the latter doesn't have a copy method at all. It needs to work with all permutations, as that routine is the key part of some file reading and writing operations.
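For illustration, the kind of type dispatch this would require might look like the following (a hypothetical helper; the MPIArray branch in particular is only indicative, given the copy-semantics caveat above):

import numpy as np

def copy_dataset_data(data):
    # numpy ndarrays (and MPIArray, an ndarray subclass) expose .copy();
    # per the thread, MPIArray copy semantics would still need checking.
    if hasattr(data, "copy"):
        return data.copy()
    # h5py.Dataset has no copy method, but slicing reads it into memory.
    return np.asarray(data[:])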

@ljgray
Contributor

ljgray commented May 22, 2024

@ketiltrout
