Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

groups.unique does not sort the group #3364

Closed
lilyminium opened this issue Jul 19, 2021 · 20 comments · Fixed by #3368
Closed

groups.unique does not sort the group #3364

lilyminium opened this issue Jul 19, 2021 · 20 comments · Fixed by #3368

Comments

@lilyminium
Copy link
Member

Expected behavior

Related to #2977 (comment)

The documentation implies that atomgroup.unique will return a sorted atomgroup, so I expect it to. However, there is a confusing contradiction in the documentation.

    def unique(self):
        """An :class:`AtomGroup` containing sorted and unique
        :class:`Atoms<Atom>` only.

        If the :class:`AtomGroup` is unique, this is the group itself.

Actual behavior

If the atomgroup is already unique, as per docs, the original unsorted atomgroup is returned. @coredevs is this the desired behaviour? Should we change it or add a new sorted attribute?

The .unique attribute is primarily used in select_atoms and wrapping-related methods, i.e. bsphere, wrap, unwrap, translate, and rotate. I don't think being ordered is important in any of those -- but being sorted is probably important to select_atoms? Are any users relying on unique being sorted, or conversely not being sorted?

Code to reproduce the behavior

>>> import MDAnalysis as mda
>>> from MDAnalysis.tests.datafiles import PDB
>>> u = mda.Universe(PDB)
>>> ag = u.atoms[[2, 1, 0]]
>>> ag.unique.ix
array([2, 1, 0])

Current version of MDAnalysis

  • Which version are you using? (run python -c "import MDAnalysis as mda; print(mda.__version__)") 2.0.0-dev0
  • Which version of Python (python -V)?
  • Which operating system?
@orbeckst
Copy link
Member

The expectation for selections has been that they come back sorted. Right now that's a defect, at least for 1.x.

@lilyminium
Copy link
Member Author

The expectation for selections has been that they come back sorted.

Would you know the expectation for atomgroup.unique? There are two ways to fix this:

  • sort atomgroup.unique and that should fix select_atoms upstream
  • add a sorted() method or property that sorts the atomgroup without ensuring uniqueness; alter the selection methods to call atomgroup.sorted().unique. This has the advantage of adding more general functionality to groups too

@ceri-richmond
Copy link

Just to say I think you got the wrong @coredevs

@lilyminium
Copy link
Member Author

Oops, sorry! I did mean @MDAnalysis/coredevs.

@orbeckst
Copy link
Member

I might have prematurely slapped the "defect" tag on here; it would have been more appropriate on an issue directly referencing select_atoms() because current behavior of unsorted atomgroups clearly breaks stated behavior

Selections always return an AtomGroup with atoms sorted according to their index in the topology (this is to ensure that there are not any duplicates, which can happen with complicated selections).

Regarding your #3364 (comment)

Would you know the expectation for atomgroup.unique?

No, I am not sure. But I don't like the current behavior that will produce erratic results. It would be cleaner if unique just did the same thing as numpy.unique:

Find the unique elements of an array.

Returns the sorted unique elements of an array.

There might have been speed considerations in the current logic somewhere, so I can see that you might not want to sort unless you have to. I can also see that your proposal of decoupling sorting and uniqueness gives more flexibility. IIRC, sorted arrays can be much faster de-duplicated than unsorted ones, though, so that was probably why the two went hand-in-hand. On balance, however, I like re-affirming user expectations, namely that MDAnalysis unique works very similar to the numpy unique, especially as we tend to treat AtomGroups like "numpy arrays of atoms".

@jbarnoud
Copy link
Contributor

At which point of the selection process is unique called? Because if you make a compound selection, then you need to be careful when you sort. Also, selections are sorted because atoms are looked at in order, adding a sort would be unnecessary slow, wouldn't it?

I think the most important point is that if an atomgroup is already not redundant, then calling unique does return the same atomgroup. Sorting it breaks this.

@lilyminium
Copy link
Member Author

At which point of the selection process is unique called?

Somewhat repetitively, at the end of each logical operation selection class (and, or, all, not, global) and at the end of many if not all of the keyword classes. It should probably just be another method on Selection.

Because if you make a compound selection, then you need to be careful when you sort.

Yeah, the select_atoms method sums each selection in compounds so as long as we keep the sorting within the selection classes, order will be retained.

Also, selections are sorted because atoms are looked at in order, adding a sort would be unnecessary slow, wouldn't it?

Yeah, but in the order of the input atom group -- so in the example where the input is u.atoms[[2, 1, 0]], the selection class will inspect the group with the third atom first.

I think the most important point is that if an atomgroup is already not redundant, then calling unique does return the same atomgroup. Sorting it breaks this.

Just to be clear, in your opinion then u.atoms[[2, 1, 0]].unique should return u.atoms[[2, 1, 0]] instead of [0, 1, 2]?

@jbarnoud
Copy link
Contributor

Yeah, but in the order of the input atom group -- so in the example where the input is u.atoms[[2, 1, 0]], the selection class will inspect the group with the third atom first.

This is the behaviours I expect. But I may not be representative. My issue with sorting here, is that I can sort afterward, but I cannot go back to the weird order I may have spent time building.

I think the most important point is that if an atomgroup is already not redundant, then calling unique does return the same atomgroup. Sorting it breaks this.

Just to be clear, in your opinion then u.atoms[[2, 1, 0]].unique should return u.atoms[[2, 1, 0]] instead of [0, 1, 2]?

Indeed.

@lilyminium
Copy link
Member Author

I'm happy to have .unique not sort, but IMO select_atoms is really advertised as sorting the output:

We could add a keyword sorted=True?

@lilyminium
Copy link
Member Author

lilyminium commented Jul 25, 2021

Could we just get a brief consensus please, thumbs up 👍 for sorting and thumbs down 👎 for not sorting groups.unique? (pinging @MDAnalysis/coredevs too)

Pros for sorting

  • It follows np.unique
  • I think it was previous behaviour until 0.9.2, so
  • It is currently documented as sorting

Cons for sorting

  • unique != sorted in English. np.unique is not a direct parallel to group.unique; they do not have ndarray.unique.
  • If a group is already unique, this will not return the same group but a new one that is sorted
  • It is currently documented as returning any group that is unique

(Please feel free to edit this comment with more points)

@IAlibay
Copy link
Member

IAlibay commented Jul 25, 2021

So my understanding here is that if it's not unique it sorts, if it is unique it doesn't sort?

Personally I think that for 2.0 the behaviour should be aligned to the original intended behaviour. Then if we want to re-works this and do an API break then fine, but that should be handled in 3.0.

@lilyminium
Copy link
Member Author

So my understanding here is that if it's not unique it sorts, if it is unique it doesn't sort?

Yes. So one way or another we should make it consistent.

@IAlibay
Copy link
Member

IAlibay commented Jul 25, 2021

So for context, group.unique was implemented in #1922 by @zemanj - to avoid doing a copy operation where possible. If we tag on a sort we'll probably have a performance regression.

That being said, it looks like the target for performance improvement here was pack_into_box/wrap, which as far as I can tell don't actually call unique if isunique is True anyways?

That's not the case for unwrap, but surely we could just apply the same isunique strategy to it too?

Same thing for bsphere, transform, and rotate?

We'd have to change the behaviour of _unique_restore_mask, but that's not user facing so hopefully shouldn't be too much of a deal?

@orbeckst
Copy link
Member

orbeckst commented Jul 25, 2021

I am in favor of making unique behave consistently so that users are not surprised (EDIT: and making it behave as documented and similar to np.unique, i.e., sort).

Behind the scene we can special case for performance.

@orbeckst
Copy link
Member

orbeckst commented Jul 25, 2021

EDIT: I thought of ag.unique as a function that could have kwargs but it's actually an attribute so the following won't work with ag.unique:

Would it be useful to have a kwarg to unique like mode = {sort|nosort|fast} where "fast" is the current behavior that doesn't guarantee any particular ordering or identity, "sort" which always sorts, "nosort" (or a better name) which maintains the original order as much as possible and starts deleting duplicates after the first appearance? "sort" and "nosort" would always return a copy of the AG (for consistency) even if nothing was done.


I think we have to distinguish between internal uses where we want to optimize for speed (e.g. reduce copies) and external uses, where consistency by default (and avoiding surprises) is important.

@lilyminium
Copy link
Member Author

lilyminium commented Jul 25, 2021

How about asunique() for returning atomgroups that are unique but unsorted? Parallels with asarray(), ascontiguous() etc.

Edit: and I would be in favour of asunique() taking a sort argument but defaulting to not sorting

@IAlibay
Copy link
Member

IAlibay commented Jul 25, 2021

Might be me being naive, but do we actually use unique enough that lots of optmization is required here?

How about asunique() for returning atomgroups that are unique but unsorted? Parallels with asarray(), ascontiguous() etc.
Edit: and I would be in favour of asunique() taking a sort argument but defaulting to not sorting

👍🏽 For 2.0 at least, I'm in favour of not changing unique from being a property, so asunique sounds like a good compromise to me.

@orbeckst
Copy link
Member

orbeckst commented Jul 25, 2021 via email

@zemanj
Copy link
Member

zemanj commented Jul 25, 2021

Looks like by fixing the duplicate handling in #1922 I broke the behavior of *Group.unique, which was definitely unintended. The correct way would have been to make *Group.uinque not only check for uniqueness but also for order. As index sorting is done anyway for the uniqueness check in *Group.isunique, checking the sort order shouldn't add a lot of computational complexity there (only O(n), to be precise). That, in turn, would call for also having a cached *Group.issorted property, which could be leveraged by (and also set from) *Group.isuinque, and would also come in handy in AtomGroup.sort.
So much for what I should have done three years ago... my (very late) apologies for that!

What really surprises me, though, is the fact that it took so long for this behavior-breaking change to even being noticed.
In that respect, I don't have any strong opinion on whether we should change it back to the original behavior or just leave it as it is. Either way, the documented behavior of other methods (like atom selections, etc.) that was broken by the changed behavior of *Group.unique should be fixed.

I think the behavior of *Group.isunique should not be changed as I personally wouldn't expect it to return False in case of a group without duplicates that is not sorted. I'm pretty sure that if we changed the behavior of *Group.unique to only return the group itself if it is already unique and sorted, it wouldn't even be noticed by anyone - but yeah, we better play it nice here, whichever route we take.

@zemanj
Copy link
Member

zemanj commented Jul 25, 2021

To comment on @IAlibay's question about the necessity of optimizing: I guess it depends. If you process very large systems, avoiding full copies of atom groups will certainly have an impact. However, my guess is that the most relevant speedup is achieved by caching, and that will also be effective if the isunique check is omitted. To be sure, I guess we'd have to measure...

IAlibay pushed a commit that referenced this issue Aug 17, 2021
Fixes #3364 #2977 

## Work done in this PR
 - Add .asunique. This function (through ._asunique) now does all the grunt work in making unique groups and caching
 - Change .unique to always return a copy of the sorted unique group
 - Add .issorted
 - .unique is no longer cached. _cache['unique'] no longer exists
 - .sorted_unique and .unsorted_unique are now cached
 - .unsorted_unique and .sorted_unique can be the same group, which can be self
 - Added .cache_properties
 - uniqueness and sorted properties now propagate to children. They did not use to, the new function is called `_set_unique_caches_from`
 - Add ability to selectively sort `select_atoms` calls through `sorted` argument
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants