
Use of compounds involves excessive sorting and fancy-indexing of AtomGroups and compound indices. #3000

Closed
mnmelo opened this issue Oct 20, 2020 · 2 comments · Fixed by #3005


mnmelo commented Oct 20, 2020

Hi,
I realized that there is potential for optimization of most AtomGroup methods that implement compounds. Currently, compounds are implemented by cleverly splitting into compound groups of the same size, to allow vectorization at least on a per-size basis. This was a great idea! (git blame tells me kudos go to @kain88-de and @richardjgowers).

Current improvable aspects:

  • By default, compound_indices are always being sorted, to be robust to cases where compounds in an AtomGroup are not contiguous. This is slow and unneeded for the most common use case of contiguous compounds. A contiguity check is cheap (this was already implemented for AtomGroup.center only, in the resolution of AtomGroup.center fails for compounds+wrapping #2984 and Fixes AtomGroup.center when using compounds + unwrapping #2992);
  • Most compound-using methods currently do their own splitting. This duplicates code and hurts readability; the splitting logic should be concentrated in a dedicated method;
  • Splitting masks for a given compound type need to be calculated only once per AtomGroup and could be cached. This would positively impact a number of methods, especially when iterating over frames (with due care to invalidate such caches if the user changes the AtomGroup's compound indices). Even within a single calculation this can already be impactful: when using AtomGroup.center with unwrap=True, indices are split once to do the unwrapping and then again to do the center calculation.
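To make the discussion concrete, here is a minimal NumPy sketch of the two ideas above: skipping the argsort when compound indices are already monotonic, and fancy-indexing atoms into one 2-D block per compound size so reductions vectorize per size class. The helper name `split_by_compound` is hypothetical, not MDAnalysis API, and this is a simplification of what the actual refactor would do.

```python
import numpy as np

def split_by_compound(compound_indices):
    """Group atoms by compound, one 2-D index block per compound size.

    compound_indices maps each atom to its compound (e.g. residue)
    index. Returns a list of (n_compounds, size) integer arrays so
    that per-compound reductions become a single vectorized call per
    size class. (Hypothetical helper; not MDAnalysis API.)
    """
    compound_indices = np.asarray(compound_indices)
    # Cheap monotonicity check: the argsort can be skipped when the
    # indices are already in non-decreasing order (the common case).
    if np.all(compound_indices[1:] >= compound_indices[:-1]):
        order = np.arange(compound_indices.size)
    else:
        order = np.argsort(compound_indices, kind="stable")
    sorted_ix = compound_indices[order]
    # Per-compound sizes via the run lengths of the sorted indices.
    _, counts = np.unique(sorted_ix, return_counts=True)
    offsets = np.concatenate(([0], np.cumsum(counts)))
    blocks = []
    for size in np.unique(counts):
        starts = offsets[:-1][counts == size]
        # Fancy-index the atom order into an (n_compounds, size) block.
        blocks.append(order[starts[:, None] + np.arange(size)])
    return blocks
```

With positions of shape (n_atoms, 3), a per-compound center then reduces to `positions[block].mean(axis=1)` for each returned block.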

I'd like to work on such refactoring, including some benchmarking for performance, but I need input so I don't set off on an effort that will either run against current code philosophy or is too large for anyone to review properly.

Currently needed pointers:

  • Should this be a project?
  • What is the recommended way of implementing caching?
  • This will enlarge the memory footprint of an AtomGroup due to caching. I expect roughly 1–2× the size of ._ix per cached compound-type splitting. Is this acceptable?
  • What test cases should be used for performance benchmarking? This will of course be more visible for large AtomGroups with many compounds, but I wouldn't want to harm the small-case performance. I thought about systems with a combination of the following characteristics: number of atoms (100, 10k, 100k), number of compounds (1, 10, 100), heterogeneity of compound sizes (all_same_size, all_different), contiguity (ordered, unordered);
  • What should be the scope of the enhancements? For instance, I stumbled upon this while trying to speed up AtomGroup.unwrap (Implements compact wrapping and nojump, and updates associated transforms. #2982), and there are unwrap-specific aspects that can be tweaked to take advantage of a centralized splitting solution. Should those be left out for an unwrap-specific effort?
  • Following from the previous point: what other aspects do you think could be optimized along with centralized splitting?
  • Who wants to join in on the fun? 😀
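Regarding the caching question above: independently of whichever mechanism MDAnalysis recommends, the pattern being proposed boils down to a per-AtomGroup store keyed by compound type, with explicit invalidation when compound indices change. A dict-based sketch (all names hypothetical):

```python
class CompoundSplitCache:
    """Sketch of per-AtomGroup caching of compound splitting masks.

    Maps a compound type ("residues", "molecules", ...) to masks
    computed once; must be invalidated whenever the group's compound
    indices change. Hypothetical; MDAnalysis's own caching utilities
    may dictate a different shape for the real implementation.
    """

    def __init__(self):
        self._cache = {}

    def get(self, compound, compute):
        # `compute` is a zero-argument callable producing the masks;
        # it only runs on a cache miss.
        if compound not in self._cache:
            self._cache[compound] = compute()
        return self._cache[compound]

    def invalidate(self):
        # Call when the user modifies the compound indices.
        self._cache.clear()
```

The memory-footprint question above then becomes: how many compound types per AtomGroup end up populated in such a cache at once.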

mnmelo commented Oct 20, 2020

@zemanj, I see from the discussion in #2376 that you already have a headstart in unwrap optimization. What do you think of how this refactor could impact it? Could it be a base from where to merge your optimization work?


mnmelo commented Oct 23, 2020

So, I went ahead and implemented most of what I put forth above in the draft PR #3005.

Besides cleaner code, performance is improved. I timed the AtomGroup.center method (best time of 1000 repeats) for several combinations of conditions; almost all cases become faster, some by more than twofold:

Legend:

  • n_atoms: total number of atoms in the system;
  • n_comp.: total number of compounds in the system;
  • heter.: (bool) whether compound sizes are heterogeneous (i.e., not all the same);
  • contig.: (bool) whether compound indices are monotonically increasing;
  • current: timing (in seconds) of current implementation in develop (beb5232);
  • speedup: ratio of timings between current and the optimized code, with caching disabled (higher is better);
  • speedup_cached: ratio of timings between current and the optimized code, with caching (higher is better).

The heter. and contig. flags were ignored for the combinations that don't make sense (like having a single compound and heter.=True or contig.=False).

n_atoms   n_comp.  heter.  contig.  current   speedup  speedup_cached
100       1        1       1        1.02e-04  1.36     3.21     
100       1        1       0        1.01e-04  1.22     3.05     
100       1        0       1        1.07e-04  1.24     3.43     
100       1        0       0        1.01e-04  1.37     3.18     
100       10       1       1        1.29e-04  1.63     4.46     
100       10       1       0        1.45e-04  1.79     4.97     
100       10       0       1        2.96e-04  1.48     3.32     
100       10       0       0        3.73e-04  1.50     2.76     
100       100      1       1        1.53e-04  1.86     4.68     
100       100      1       0        1.64e-04  1.91     5.76     
100       100      0       1        1.50e-04  1.76     4.67     
100       100      0       0        1.71e-04  1.87     5.97     
10000     1        1       1        7.90e-04  1.14     1.53     
10000     1        1       0        7.82e-04  1.13     1.49     
10000     1        0       1        7.84e-04  1.16     1.45     
10000     1        0       0        7.75e-04  1.12     1.44     
10000     10       1       1        8.38e-04  1.22     1.56     
10000     10       1       0        1.34e-03  1.38     2.56     
10000     10       0       1        1.36e-03  1.42     2.17     
10000     10       0       0        1.83e-03  1.31     2.71     
10000     100      1       1        1.08e-03  1.65     1.98     
10000     100      1       0        1.69e-03  1.51     2.94     
10000     100      0       1        3.43e-03  2.13     3.47     
10000     100      0       0        4.02e-03  1.75     3.94     
1000000   1        1       1        7.93e-02  1.31     1.46     
1000000   1        1       0        7.75e-02  1.18     1.43     
1000000   1        0       1        7.90e-02  1.25     1.40     
1000000   1        0       0        7.88e-02  1.24     1.39     
1000000   10       1       1        8.41e-02  1.33     1.50     
1000000   10       1       0        1.93e-01  1.12     1.77     
1000000   10       0       1        1.09e-01  1.51     2.18     
1000000   10       0       0        2.23e-01  0.98     1.56     
1000000   100      1       1        1.07e-01  1.71     1.98     
1000000   100      1       0        2.66e-01  1.24     2.01     
1000000   100      0       1        3.87e-01  2.66     7.47     
1000000   100      0       0        5.63e-01  1.53     4.10

Only in one case was there a slowdown, and the small 2% difference is likely not significant. All others were speedups, with a maximum uncached speedup of 2.66-fold and cached of 7.47-fold.
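For anyone wanting to reproduce these numbers, the "best time of N repeats" protocol can be sketched along these lines. The system setup mirrors one table row (10k atoms, 100 compounds), and the naive per-compound centering is a deliberately simple stand-in, not the actual develop implementation:

```python
import timeit

import numpy as np

def best_of(func, repeats=1000):
    # Best-of-N timing: timeit.repeat returns one total per run, and
    # taking the minimum suppresses scheduling noise.
    return min(timeit.repeat(func, repeat=repeats, number=1))

# Hypothetical setup mirroring one table row (10k atoms, 100 compounds).
rng = np.random.default_rng(42)
positions = rng.random((10_000, 3), dtype=np.float32)
compound_indices = np.repeat(np.arange(100), 100)

def center_per_compound():
    # Naive baseline: one boolean mask per compound (what the
    # vectorized splitting avoids).
    return np.array([positions[compound_indices == i].mean(axis=0)
                     for i in np.unique(compound_indices)])

t = best_of(center_per_compound, repeats=10)
```

The speedup columns above are then simply the ratio of two such `best_of` timings (develop vs. the PR branch, with and without caching).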

I'm continuing this discussion in the draft PR #3005. Pointers are still needed, especially on cache management.
