groups.unique does not sort the group #3364

lilyminium · 2021-07-19T14:21:46Z

Expected behavior

The documentation implies that atomgroup.unique will return a sorted atomgroup, so I expect it to. However, there is a confusing contradiction in the documentation.

    def unique(self):
        """An :class:`AtomGroup` containing sorted and unique
        :class:`Atoms<Atom>` only.

        If the :class:`AtomGroup` is unique, this is the group itself.

Actual behavior

If the atomgroup is already unique, as per docs, the original unsorted atomgroup is returned. @coredevs is this the desired behaviour? Should we change it or add a new sorted attribute?

The .unique attribute is primarily used in select_atoms and wrapping-related methods, i.e. bsphere, wrap, unwrap, translate, and rotate. I don't think being ordered is important in any of those -- but being sorted is probably important to select_atoms? Are any users relying on unique being sorted, or conversely not being sorted?

Code to reproduce the behavior

>>> import MDAnalysis as mda
>>> from MDAnalysis.tests.datafiles import PDB
>>> u = mda.Universe(PDB)
>>> ag = u.atoms[[2, 1, 0]]
>>> ag.unique.ix
array([2, 1, 0])

Current version of MDAnalysis

Which version are you using? (run python -c "import MDAnalysis as mda; print(mda.__version__)") 2.0.0-dev0
Which version of Python (python -V)?
Which operating system?

The text was updated successfully, but these errors were encountered:

orbeckst · 2021-07-19T16:29:23Z

The expectation for selections has been that they come back sorted. Right now that's a defect, at least for 1.x.

lilyminium · 2021-07-19T16:47:39Z

The expectation for selections has been that they come back sorted.

Would you know the expectation for atomgroup.unique? There are two ways to fix this:

sort atomgroup.unique and that should fix select_atoms upstream
add a sorted() method or property that sorts the atomgroup without ensuring uniqueness; alter the selection methods to call atomgroup.sorted().unique. This has the advantage of adding more general functionality to groups too

ceri-richmond · 2021-07-19T17:28:27Z

Just to say I think you got the wrong @coredevs

lilyminium · 2021-07-19T17:31:56Z

Oops, sorry! I did mean @MDAnalysis/coredevs.

orbeckst · 2021-07-19T17:45:16Z

I might have prematurely slapped the "defect" tag on here; it would have been more appropriate on an issue directly referencing select_atoms() because current behavior of unsorted atomgroups clearly breaks stated behavior

Selections always return an AtomGroup with atoms sorted according to their index in the topology (this is to ensure that there are not any duplicates, which can happen with complicated selections).

Regarding your #3364 (comment)

Would you know the expectation for atomgroup.unique?

No, I am not sure. But I don't like the current behavior that will produce erratic results. It would be cleaner if unique just did the same thing as numpy.unique:

Find the unique elements of an array.

Returns the sorted unique elements of an array.

There might have been speed considerations in the current logic somewhere, so I can see that you might not want to sort unless you have to. I can also see that your proposal of decoupling sorting and uniqueness gives more flexibility. IIRC, sorted arrays can be much faster de-duplicated than unsorted ones, though, so that was probably why the two went hand-in-hand. On balance, however, I like re-affirming user expectations, namely that MDAnalysis unique works very similar to the numpy unique, especially as we tend to treat AtomGroups like "numpy arrays of atoms".

jbarnoud · 2021-07-19T18:04:02Z

At which point of the selection process is unique called? Because if you make a compound selection, then you need to be careful when you sort. Also, selections are sorted because atoms are looked at in order, adding a sort would be unnecessary slow, wouldn't it?

I think the most important point is that if an atomgroup is already not redundant, then calling unique does return the same atomgroup. Sorting it breaks this.

lilyminium · 2021-07-19T18:17:04Z

At which point of the selection process is unique called?

Somewhat repetitively, at the end of each logical operation selection class (and, or, all, not, global) and at the end of many if not all of the keyword classes. It should probably just be another method on Selection.

Because if you make a compound selection, then you need to be careful when you sort.

Yeah, the select_atoms method sums each selection in compounds so as long as we keep the sorting within the selection classes, order will be retained.

Also, selections are sorted because atoms are looked at in order, adding a sort would be unnecessary slow, wouldn't it?

Yeah, but in the order of the input atom group -- so in the example where the input is u.atoms[[2, 1, 0]], the selection class will inspect the group with the third atom first.

I think the most important point is that if an atomgroup is already not redundant, then calling unique does return the same atomgroup. Sorting it breaks this.

Just to be clear, in your opinion then u.atoms[[2, 1, 0]].unique should return u.atoms[[2, 1, 0]] instead of [0, 1, 2]?

jbarnoud · 2021-07-19T18:21:02Z

Yeah, but in the order of the input atom group -- so in the example where the input is u.atoms[[2, 1, 0]], the selection class will inspect the group with the third atom first.

This is the behaviours I expect. But I may not be representative. My issue with sorting here, is that I can sort afterward, but I cannot go back to the weird order I may have spent time building.

I think the most important point is that if an atomgroup is already not redundant, then calling unique does return the same atomgroup. Sorting it breaks this.

Just to be clear, in your opinion then u.atoms[[2, 1, 0]].unique should return u.atoms[[2, 1, 0]] instead of [0, 1, 2]?

Indeed.

lilyminium · 2021-07-19T18:32:30Z

I'm happy to have .unique not sort, but IMO select_atoms is really advertised as sorting the output:

mdanalysis/package/MDAnalysis/analysis/align.py

Lines 376 to 380 in 8f9f402

    
               .. Warning:: The atom order for `mobile` and `reference` is *only* 
        
                  preserved when `select` is either "all" or ``None``. In any other case, 
        
                  a new selection will be made that will sort the resulting AtomGroup by 
        
                  index and therefore destroy the correspondence between the two groups. 
        
                  **It is safest not to mix ordered AtomGroups with selection strings.**

mdanalysis/package/MDAnalysis/analysis/align.py

Lines 478 to 480 in 8f9f402

    
           # keep the EXACT order in the input AtomGroups; select_atoms('all') 
        
           # orders them by index, which can lead to wrong results if the user 
        
           # has crafted mobile and reference to match atom by atom

mdanalysis/package/MDAnalysis/analysis/rms.py

Lines 382 to 383 in 8f9f402

    
                       :class:`~MDAnalysis.core.groups.AtomGroup` with defined atom order as 
        
                       described under :ref:`ordered-selections-label`).

mdanalysis/package/MDAnalysis/core/groups.py

Line 2705 in 8f9f402

Returns an :class:`AtomGroup` sorted according to their index in the
https://docs.mdanalysis.org/stable/documentation_pages/selections.html#ordered-selections

We could add a keyword sorted=True?

lilyminium · 2021-07-25T06:12:01Z

Could we just get a brief consensus please, thumbs up 👍 for sorting and thumbs down 👎 for not sorting groups.unique? (pinging @MDAnalysis/coredevs too)

Pros for sorting

It follows np.unique
I think it was previous behaviour until 0.9.2, so
It is currently documented as sorting

Cons for sorting

unique != sorted in English. np.unique is not a direct parallel to group.unique; they do not have ndarray.unique.
If a group is already unique, this will not return the same group but a new one that is sorted
It is currently documented as returning any group that is unique

(Please feel free to edit this comment with more points)

IAlibay · 2021-07-25T06:25:41Z

So my understanding here is that if it's not unique it sorts, if it is unique it doesn't sort?

Personally I think that for 2.0 the behaviour should be aligned to the original intended behaviour. Then if we want to re-works this and do an API break then fine, but that should be handled in 3.0.

lilyminium · 2021-07-25T06:27:05Z

So my understanding here is that if it's not unique it sorts, if it is unique it doesn't sort?

Yes. So one way or another we should make it consistent.

IAlibay · 2021-07-25T07:17:34Z

So for context, group.unique was implemented in #1922 by @zemanj - to avoid doing a copy operation where possible. If we tag on a sort we'll probably have a performance regression.

That being said, it looks like the target for performance improvement here was pack_into_box/wrap, which as far as I can tell don't actually call unique if isunique is True anyways?

That's not the case for unwrap, but surely we could just apply the same isunique strategy to it too?

Same thing for bsphere, transform, and rotate?

We'd have to change the behaviour of _unique_restore_mask, but that's not user facing so hopefully shouldn't be too much of a deal?

orbeckst · 2021-07-25T16:04:58Z

I am in favor of making unique behave consistently so that users are not surprised (EDIT: and making it behave as documented and similar to np.unique, i.e., sort).

Behind the scene we can special case for performance.

orbeckst · 2021-07-25T16:09:22Z

EDIT: I thought of ag.unique as a function that could have kwargs but it's actually an attribute so the following won't work with ag.unique:

Would it be useful to have a kwarg to unique like mode = {sort|nosort|fast} where "fast" is the current behavior that doesn't guarantee any particular ordering or identity, "sort" which always sorts, "nosort" (or a better name) which maintains the original order as much as possible and starts deleting duplicates after the first appearance? "sort" and "nosort" would always return a copy of the AG (for consistency) even if nothing was done.

I think we have to distinguish between internal uses where we want to optimize for speed (e.g. reduce copies) and external uses, where consistency by default (and avoiding surprises) is important.

lilyminium · 2021-07-25T16:28:43Z

How about asunique() for returning atomgroups that are unique but unsorted? Parallels with asarray(), ascontiguous() etc.

Edit: and I would be in favour of asunique() taking a sort argument but defaulting to not sorting

IAlibay · 2021-07-25T17:11:59Z

Might be me being naive, but do we actually use unique enough that lots of optmization is required here?

How about asunique() for returning atomgroups that are unique but unsorted? Parallels with asarray(), ascontiguous() etc.
Edit: and I would be in favour of asunique() taking a sort argument but defaulting to not sorting

👍🏽 For 2.0 at least, I'm in favour of not changing unique from being a property, so asunique sounds like a good compromise to me.

orbeckst · 2021-07-25T18:26:14Z

I like the suggestion of adding asunique() and make unique always sort.

…

Am 7/25/21 um 10:12 schrieb Irfan Alibay ***@***.***>: Might be me being naive, but do we actually use unique enough that lots of optmization is required here? How about asunique() for returning atomgroups that are unique but unsorted? Parallels with asarray(), ascontiguous() etc. Edit: and I would be in favour of asunique() taking a sort argument but defaulting to not sorting 👍🏽 For 2.0 at least, I'm in favour of not changing unique from being a property, so asunique sounds like a good compromise to me. — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

zemanj · 2021-07-25T20:51:42Z

Looks like by fixing the duplicate handling in #1922 I broke the behavior of *Group.unique, which was definitely unintended. The correct way would have been to make *Group.uinque not only check for uniqueness but also for order. As index sorting is done anyway for the uniqueness check in *Group.isunique, checking the sort order shouldn't add a lot of computational complexity there (only O(n), to be precise). That, in turn, would call for also having a cached *Group.issorted property, which could be leveraged by (and also set from) *Group.isuinque, and would also come in handy in AtomGroup.sort.
So much for what I should have done three years ago... my (very late) apologies for that!

What really surprises me, though, is the fact that it took so long for this behavior-breaking change to even being noticed.
In that respect, I don't have any strong opinion on whether we should change it back to the original behavior or just leave it as it is. Either way, the documented behavior of other methods (like atom selections, etc.) that was broken by the changed behavior of *Group.unique should be fixed.

I think the behavior of *Group.isunique should not be changed as I personally wouldn't expect it to return False in case of a group without duplicates that is not sorted. I'm pretty sure that if we changed the behavior of *Group.unique to only return the group itself if it is already unique and sorted, it wouldn't even be noticed by anyone - but yeah, we better play it nice here, whichever route we take.

zemanj · 2021-07-25T21:19:27Z

To comment on @IAlibay's question about the necessity of optimizing: I guess it depends. If you process very large systems, avoiding full copies of atom groups will certainly have an impact. However, my guess is that the most relevant speedup is achieved by caching, and that will also be effective if the isunique check is omitted. To be sure, I guess we'd have to measure...

Fixes #3364 #2977 ## Work done in this PR - Add .asunique. This function (through ._asunique) now does all the grunt work in making unique groups and caching - Change .unique to always return a copy of the sorted unique group - Add .issorted - .unique is no longer cached. _cache['unique'] no longer exists - .sorted_unique and .unsorted_unique are now cached - .unsorted_unique and .sorted_unique can be the same group, which can be self - Added .cache_properties - uniqueness and sorted properties now propagate to children. They did not use to, the new function is called `_set_unique_caches_from` - Add ability to selectively sort `select_atoms` calls through `sorted` argument

lilyminium added the Priority-High label Jul 19, 2021

lilyminium added this to the 2.0 milestone Jul 19, 2021

orbeckst mentioned this issue Jul 19, 2021

MDAnalysis.analysis.align.alignto won't reindex the atomgroup as is documented #2977

Closed

orbeckst added the defect label Jul 19, 2021

lilyminium mentioned this issue Jul 25, 2021

Add asunique, change unique, change atom selection behaviour #3368

Merged

4 tasks

IAlibay mentioned this issue Jul 30, 2021

Invalid PDB files produced from mdanalysis v1.1.1 DeepDriveMD/DeepDriveMD-pipeline#41

Closed

IAlibay closed this as completed in #3368 Aug 17, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

groups.unique does not sort the group #3364

groups.unique does not sort the group #3364

lilyminium commented Jul 19, 2021

orbeckst commented Jul 19, 2021

lilyminium commented Jul 19, 2021

ceri-richmond commented Jul 19, 2021

lilyminium commented Jul 19, 2021

orbeckst commented Jul 19, 2021

jbarnoud commented Jul 19, 2021

lilyminium commented Jul 19, 2021

jbarnoud commented Jul 19, 2021

lilyminium commented Jul 19, 2021

lilyminium commented Jul 25, 2021 •

edited

Loading

IAlibay commented Jul 25, 2021

lilyminium commented Jul 25, 2021

IAlibay commented Jul 25, 2021

orbeckst commented Jul 25, 2021 •

edited

Loading

orbeckst commented Jul 25, 2021 •

edited

Loading

lilyminium commented Jul 25, 2021 •

edited

Loading

IAlibay commented Jul 25, 2021

orbeckst commented Jul 25, 2021 via email

zemanj commented Jul 25, 2021 •

edited

Loading

zemanj commented Jul 25, 2021

groups.unique does not sort the group #3364

groups.unique does not sort the group #3364

Comments

lilyminium commented Jul 19, 2021

Expected behavior

Actual behavior

Code to reproduce the behavior

Current version of MDAnalysis

orbeckst commented Jul 19, 2021

lilyminium commented Jul 19, 2021

ceri-richmond commented Jul 19, 2021

lilyminium commented Jul 19, 2021

orbeckst commented Jul 19, 2021

jbarnoud commented Jul 19, 2021

lilyminium commented Jul 19, 2021

jbarnoud commented Jul 19, 2021

lilyminium commented Jul 19, 2021

lilyminium commented Jul 25, 2021 • edited Loading

Pros for sorting

Cons for sorting

IAlibay commented Jul 25, 2021

lilyminium commented Jul 25, 2021

IAlibay commented Jul 25, 2021

orbeckst commented Jul 25, 2021 • edited Loading

orbeckst commented Jul 25, 2021 • edited Loading

lilyminium commented Jul 25, 2021 • edited Loading

IAlibay commented Jul 25, 2021

orbeckst commented Jul 25, 2021 via email

zemanj commented Jul 25, 2021 • edited Loading

zemanj commented Jul 25, 2021

lilyminium commented Jul 25, 2021 •

edited

Loading

orbeckst commented Jul 25, 2021 •

edited

Loading

orbeckst commented Jul 25, 2021 •

edited

Loading

lilyminium commented Jul 25, 2021 •

edited

Loading

zemanj commented Jul 25, 2021 •

edited

Loading