Higher Performance nuclinfo #3611

ALescoulie · 2022-04-07T21:49:27Z

Changes made in this Pull Request:

Add AnalysisBase derived classes for calculating different nucleic acid base pair distances.

I built a new class, NucPairDist, that calculates distances between atoms in nucleic acids. It has three subclasses: WCDist, MajorDist, and MinorDist. These are intended to improve on the performance of their equivalents in nuclinfo. They also support taking groups of base pairs and running the distance calculation over a trajectory.

TODO:

Docs

PR Checklist

Tests?
Docs?
CHANGELOG updated?
Issue raised/referenced?

pep8speaks · 2022-04-07T21:49:30Z

Hello @ALescoulie! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file package/MDAnalysis/analysis/nucleicacids.py:

Line 105:80: E501 line too long (86 > 79 characters)
Line 141:80: E501 line too long (86 > 79 characters)
Line 142:80: E501 line too long (84 > 79 characters)

Comment last updated at 2022-04-19 08:17:12 UTC

ALescoulie · 2022-04-07T22:04:49Z

Just realized that pandas is not included in MDAnalysis, I had been using tidy DataFrames to record my results

IAlibay · 2022-04-07T22:09:48Z

I'll let the other @MDAnalysis/coredevs speak here too, but I'm not particularly keen on having pandas as a core dependency (in fact my current goal is to remove more dependencies from MDAnalysis). I would prefer if numpy arrays could be used to store results where possible.

ALescoulie · 2022-04-07T22:15:01Z

@IAlibay I don't think switching to arrays will be too much trouble. I'm trying to store selection, distance, and time results in an organized way, so I defaulted to a DataFrame.

orbeckst · 2022-04-07T22:15:26Z

I second @IAlibay 's #3611 (comment) — keep it as a numpy array. I don't see pandas here to be essential.

If you really need to make data available then you could create to_df() that imports pandas only in the function and that can fail and tell you to install it. We have packages that are optional and then we follow that approach.
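The optional-dependency pattern described above could look roughly like the sketch below. The `to_df()` name comes from the comment; the `PairDistances` container, its fields, the column names, and the error message are hypothetical and are not part of MDAnalysis.

```python
import numpy as np


class PairDistances:
    """Hypothetical results holder; not an MDAnalysis class."""

    def __init__(self, times, distances):
        # times: shape (n_frames,), distances: shape (n_frames, n_pairs)
        self.times = np.asarray(times, dtype=float)
        self.distances = np.asarray(distances, dtype=float)

    def to_df(self):
        """Return the results as a tidy pandas.DataFrame (pandas optional)."""
        try:
            # Imported here so that only callers of to_df() need pandas.
            import pandas as pd
        except ImportError as err:
            raise ImportError(
                "to_df() requires pandas; install it to use this method"
            ) from err

        n_frames, n_pairs = self.distances.shape
        # One row per (frame, pair), matching the row-major order of ravel().
        return pd.DataFrame({
            "selection": np.tile(np.arange(n_pairs), n_frames),
            "time": np.repeat(self.times, n_pairs),
            "dist": self.distances.ravel(),
        })
```

With this layout the raw NumPy arrays stay the primary result, and the DataFrame is only built on request.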

orbeckst · 2022-04-07T22:16:42Z

Btw, pandas.DataFrame does not have great performance in my experience — you get convenience but not the speed of numpy arrays.

ALescoulie · 2022-04-07T22:18:37Z

@orbeckst I don't use it during iteration, I just build it from a dictionary in the _conclude step, but I could just return a names list and a numpy array of distances and times. Overall, is this what you were wanting for #3600?

orbeckst · 2022-04-07T22:22:56Z

I understand; my point was that when you want to process your results further (e.g., calculate statistics) then this will be faster when the data are already an array than if it were a DataFrame. For people comfortable with pandas, the df is probably more convenient and organized. MDAnalysis has been embracing numpy arrays as the foundation for everything so it makes sense to use them for results, too. But I don't see anything wrong with adding a convenience function that presents the raw data as a DataFrame (although other devs may have different opinions on the matter).

ALescoulie · 2022-04-07T22:55:08Z

@orbeckst and @IAlibay, given how my DataFrame is structured (selection, time, dist as columns with results in rows), how do you think the results are best structured into an array? I'm thinking a dictionary with selection number as keys.

orbeckst · 2022-04-07T23:45:03Z

I'd say look at some of the other standard analysis, e.g., RMSD.

Also @PicoCentauri and @joaomcteixeira might also have an opinion on how to store data in results.

PicoCentauri · 2022-04-08T07:29:36Z

I am also in favor of keeping things as numpy arrays. I don't see much benefit of a pandas data frame here. Additionally, the new classes will be part of the CLI, and the CLI's saving workflow is (currently) not able to handle data frames. The latter is maybe an issue of the CLI I have to fix though.

@ALescoulie I would also be very happy if you could test your modules from the command line (once you have a proper docstring)!

orbeckst

A few quick comments — I see it's in draft but this might already be helpful

don't use pandas.DataFrame as the container for the data, use our base.Results dict-like thing
use numpydoc formatting for documentation
eventually: update CHANGELOG and add yourself to AUTHORS

In general your idea looks good!

package/MDAnalysis/analysis/nucleicacids.py

orbeckst · 2022-04-08T21:49:49Z

package/MDAnalysis/analysis/nucleicacids.py

+    :Arguments:
+    *segid*
+        The name of the segment the base is on
+
+    *base*
+        The number of the base pair in the segment
+
+    .. rubric:: Example


We use numpydoc formatting for doc strings. See https://userguide.mdanalysis.org/stable/contributing_code.html#working-with-the-code-documentation

Ok, I'll start on the docs soon, now that I've fixed some of the functionality of the code.

package/MDAnalysis/analysis/nucleicacids.py

ojeda-e

One quick question @ALescoulie (and while this is a draft PR): is there any particular reason to use assert_almost_equal for your tests? I think MDAnalysis is using assert_allclose for float comparisons in tests. The numpy docs also suggest using assert_allclose instead.

ojeda-e · 2022-04-08T22:41:09Z

testsuite/MDAnalysisTests/analysis/test_nucleicacids.py

+from MDAnalysisTests.datafiles import RNA_PSF, RNA_PDB
+from numpy.testing import (
+    assert_almost_equal,
+    assert_allclose,


and you are importing it here already

I switched to assert_allclose; I was imitating an older set of tests in test_nuclinfo.py.
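For reference, a small sketch of how the two assertions differ. The values and tolerances below are made up for illustration and are not taken from the PR's test suite.

```python
import numpy as np
from numpy.testing import assert_allclose, assert_almost_equal

expected = np.array([4.3874, 4.1716])   # illustrative distances only
computed = expected + 1e-6               # small numerical noise

# assert_almost_equal checks abs(desired - actual) < 1.5 * 10**(-decimal),
# i.e. a fixed absolute tolerance set by the number of decimals.
assert_almost_equal(computed, expected, decimal=4)

# assert_allclose checks abs(actual - desired) <= atol + rtol * abs(desired),
# so the tolerance can scale with the magnitude of the expected values.
assert_allclose(computed, expected, rtol=1e-5, atol=0)
```

Because the tolerance in `assert_allclose` is stated explicitly as `rtol` and `atol`, the intended precision of a test is easier to read off than from a `decimal` count.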

ALescoulie · 2022-04-12T18:28:48Z

I updated my code based on the feedback. I apologize for the delay; my systems programming class is keeping me really busy. I still need to write documentation for the new module, but it retains the core functionality of nuclinfo.py while making measuring multiple groups and distance over time more efficient. Now I need to finish the docs.

ALescoulie · 2022-04-12T18:32:26Z

@orbeckst do you want me to also update the phase and torsion functions from nuclinfo.py?

orbeckst · 2022-04-13T00:54:03Z

testsuite/MDAnalysisTests/analysis/test_nucleicacids.py

+    sel = [(BaseSelect('RNAA', 1), BaseSelect('RNAA', 2)),
+           (BaseSelect('RNAA', 22), BaseSelect('RNAA', 23))]
+    WC = WCDist(u, sel)


Is the user supposed to build selections in this way? That looks quite cumbersome.

I think I'd prefer something where the user has to provide AtomGroups, e.g. for strand_1 and strand_2, and the residues are matched up with zip(strand_1.residues, strand_2.residues). We can then provide a helper function to build these groups, but the classes themselves don't have too much guessing baggage, except identifying the correct atoms for the distance.

I'm trying to get an idea of how you want this to work, so the user provides the strands, then the residue pairs are formed in a list of tuples. The classes would then take the residue selections, use the zipped list to find the specific residues, then grab the specific atoms. The super class would just accept a list of AtomGroups then merge them into a single AtomGroup. I think to do this I'd write a helper function to select the strands of DNA from a Universe. Is this what you are envisioning?

The question I have is: how do you select a specific strand of DNA?

I mostly used the named tuple for the sake of organizing inputs.

so the user provides the strands,

Yes, right now make it the user's responsibility to identify the WC pairs. strand_1.residues[i] will be considered base pairing with strand_2.residues[i].

This is similar to the existing function, which asks the user to identify a single base pair via resid and segid. However, for your module we should move to an object-oriented approach and just work with AtomGroups instead of resid/segid selections.

then the residue pairs are formed in a list of tuples.

Yes.

The classes would then take the residue selections,

They take strand_1.residues and strand_2.residues

use the zipped list to find the specific residues

Yes.

then grab the specific atoms.

This is where the class does some actual work in order to identify the specific atoms that are required for the W-C base pair. Depending on the residue type, a different atom is chosen.

The super class would just accept a list of AtomGroups then merge them into a single AtomGroup.

I don't think that it would merge them into a single group. Ultimately you want to be able to use calc_bonds() on two arrays of positions. Each strand should have N residues and in the end, for each strand you should end up with an AtomGroup with N atoms, where the atom is either an N3 or an N1, depending on the nucleobase.

I was thinking it would merge them into two atom groups which then are put into calc_bonds

hmacdope · 2022-04-15T03:35:28Z

@ALescoulie I love the typed code! It looks amazing. 👍 I was going to add that you need a bunch of tests for the errors you are raising. Have a look at the lines codecov is complaining about. :)

orbeckst · 2022-04-15T22:19:10Z

package/MDAnalysis/analysis/nucleicacids.py

+        dist = self._s1.positions - self._s2.positions
+        wc: np.ndarray = mdamath.pnorm(dist)


use calc_bonds() because it will properly take PBC into account
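To illustrate why the plain coordinate difference in the diff above can be wrong under periodic boundary conditions, here is a minimal NumPy sketch of the minimum-image convention for an orthorhombic box. It is a stand-in for illustration only: `minimum_image_distances` is a hypothetical helper and the coordinates are invented. In MDAnalysis itself, `MDAnalysis.lib.distances.calc_bonds` with its `box` argument is the function to use, and it also handles triclinic boxes.

```python
import numpy as np


def minimum_image_distances(pos1, pos2, box_lengths):
    """Distances pos1[i] <-> pos2[i] in an orthorhombic periodic box."""
    pos1 = np.asarray(pos1, dtype=float)
    pos2 = np.asarray(pos2, dtype=float)
    box = np.asarray(box_lengths, dtype=float)
    delta = pos1 - pos2
    # Shift each component by a whole number of box lengths so that it
    # ends up within [-L/2, L/2], i.e. points at the nearest image.
    delta -= box * np.round(delta / box)
    return np.linalg.norm(delta, axis=1)


# Two atoms near opposite faces of a 20 x 20 x 20 box (made-up numbers).
s1 = np.array([[1.0, 5.0, 5.0]])
s2 = np.array([[19.0, 5.0, 5.0]])
box = [20.0, 20.0, 20.0]

naive = np.linalg.norm(s1 - s2, axis=1)          # ignores periodicity
wrapped = minimum_image_distances(s1, s2, box)   # uses the nearest image
print(naive, wrapped)
```

In this toy case the naive distance is 18 while the nearest-image distance is 2, which is the kind of discrepancy the review comment is guarding against.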

orbeckst · 2022-04-15T22:23:30Z

You can reduce the scope of this PR and focus on the base class and just the WatsonCrick pair distance — the other functionality can then be added separately later. Just raise issues for the missing parts once the initial PR has been merged.

richardjgowers · 2022-04-16T10:29:55Z

package/MDAnalysis/analysis/nucleicacids.py

+            if universe.select_atoms(f" resid {s[0].resid} ").resnames[0] in ["DC", "DT", "U", "C", "T", "CYT", "THY", "URA"]:
+                a1, a2 = n3_name, n1_name
+            elif universe.select_atoms(f" resid {s[0].resid} ").resnames[0] in ["DG", "DA", "A", "G", "ADE", "GUA"]:
+                a1, a2 = n1_name, n3_name


Worth bringing these hardcoded names out into a kwarg so someone with different names can still use the class?

Ok I think a good way to do this would be a kwarg for each base. I can pretty easily set that up.
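One possible shape for such keyword arguments is sketched below in plain Python. The function name `wc_atom_names`, the parameter names, and the grouping of residue names into two tuples are hypothetical; the default residue names are the ones hardcoded in the diff above, and the `N1`/`N3` defaults follow the atom names mentioned earlier in the review.

```python
PYRIMIDINES = ("DC", "DT", "U", "C", "T", "CYT", "THY", "URA")
PURINES = ("DG", "DA", "A", "G", "ADE", "GUA")


def wc_atom_names(resname, n1_name="N1", n3_name="N3",
                  pyrimidines=PYRIMIDINES, purines=PURINES):
    """Return (atom on this base, atom on its partner) for a W-C pair.

    Hypothetical helper: residue-name lists and atom names are keyword
    arguments so that systems with other naming schemes can override them.
    """
    if resname in pyrimidines:
        return n3_name, n1_name
    if resname in purines:
        return n1_name, n3_name
    raise ValueError(f"{resname!r} is not a recognized nucleobase name")


# Default naming.
print(wc_atom_names("DC"))
# A system that uses a custom cytosine residue name ("MYC" is invented).
print(wc_atom_names("MYC", pyrimidines=PYRIMIDINES + ("MYC",)))
```

Raising on an unknown residue name, rather than silently picking an atom, keeps a naming mismatch from producing plausible-looking but wrong distances.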

richardjgowers

I think @orbeckst has already said this, but I'm much more in favour of writing analysis where the atom selecting happens outside of the analysis class/function... so

ag1 = u.select_atoms('foo')
ag2 = u.select_atoms('bar')

analysis(ag1, ag2)

rather than

analysis(u, 'foo', 'bar')

The former is much easier to debug if you're suspicious about what's happening inside the analysis class, and also caters for people who might have differently named systems.

ALescoulie · 2022-04-16T21:21:58Z

I think @orbeckst has already said this, but I'm much more in favour of writing analysis where the atom selecting happens outside of the analysis class/function... so

I agree. The one thing is needing to select different things depending on the base pair, for example selecting N1 or N3 depending on the base, so I think I'll do something where they select the base pairs and my class just selects the specific atoms.

richardjgowers

This is good to go as it stands. I think as a modernisation I'd like to convert this from passing in selection strings to passing in AtomGroups, but this can be a separate step so we can enjoy the benefits of this code as-is.

ALescoulie · 2022-04-17T18:41:08Z

@richardjgowers I'd like to make a few quick changes as well as add some docs then we can merge.

@richardjgowers

I let @richardjgowers shepherd the PR, sorry I don’t have the bandwidth today/tomorrow.

ALescoulie · 2022-04-17T22:05:34Z

@richardjgowers I implemented the suggestions from @orbeckst, removed selections from inside the Classes aside from selecting atoms and added keyword arguments for each base pair. I still need to add documentation but am working on that now.

ALescoulie · 2022-04-17T22:35:37Z

ok I wrote some docs. I made sure to keep the original citation from nuclinfo in since I essentially copied her system for solving the problem, just restructured it to a faster OOP approach.

ALescoulie · 2022-04-19T05:58:17Z

I got the docs working and moved the selection outside the classes

ALescoulie added 8 commits April 4, 2022 01:42

commit WCBase

a86508d

commit nucleicacids tests

5fc4db4

commit nucleicaid

51aa6bb

commit NucPairDist

3e8e0fc

finish WCDist

e1a17e2

commit wc_dist test

369d29c

commit major and minor dist

ac04d14

add tests

779557b

github-actions bot added the Component-Analysis label Apr 7, 2022

orbeckst linked an issue Apr 7, 2022 that may be closed by this pull request

nuclinfo.wc_pair extremely slow #3310

Closed

orbeckst previously requested changes Apr 8, 2022

ojeda-e reviewed Apr 8, 2022

ALescoulie added 2 commits April 12, 2022 11:19

convert results to Results object

f1dc889

commit tests

c8367bc

ALescoulie added 2 commits April 12, 2022 11:59

add nucleicacids.py to analysis imports

ff18452

remove relative imports

294c710

orbeckst reviewed Apr 13, 2022

lilyminium mentioned this pull request Apr 15, 2022

add typo issue #3639

Closed

orbeckst reviewed Apr 15, 2022

richardjgowers reviewed Apr 16, 2022

richardjgowers requested changes Apr 16, 2022

richardjgowers approved these changes Apr 17, 2022

ALescoulie marked this pull request as ready for review April 17, 2022 18:50

orbeckst assigned richardjgowers Apr 17, 2022

ALescoulie added 2 commits April 17, 2022 14:46

remove internal selection and simplify pr

cd193ed

add nuc acid name kwargs

d3050e0

update docs

b954c97

ALescoulie added 3 commits April 17, 2022 15:57

add analysis to tree

13d4924

fix docs

625d238

update AUTHORS and CHANGELOG

211a74f

ALescoulie requested a review from richardjgowers April 19, 2022 05:56

fix username in CHANGELOG

a16c79a

richardjgowers approved these changes Apr 19, 2022

Merge branch 'develop' into wc_pair_perf

374ba49

richardjgowers merged commit be4b6ee into MDAnalysis:develop Apr 19, 2022

ALescoulie mentioned this pull request Jun 17, 2022

Re-implement nuclinfo using AnalysisBase style subclasses #3720

Open

5 tasks

orbeckst mentioned this pull request Jun 30, 2022

change nucleicacids results dict #3744

Closed

IAlibay added the enhancement label Sep 25, 2023
