-
Notifications
You must be signed in to change notification settings - Fork 647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Higher Performance nuclinfo #3611
Conversation
Hello @ALescoulie! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2022-04-19 08:17:12 UTC |
Just realized that pandas is not included in MDAnalysis, I had been using tidy DataFrames to record my results |
I'll let the other @MDAnalysis/coredevs speak here too, but I'm not particularly keen on having pandas as a core dependency (in fact my current goal is to remove more dependencies from MDAnalysis). I would prefer if numpy arrays could be used to store results where possible. |
@IAlibay Don't think switching to arrays will be too much trouble, I'm trying to store selection, distance and time results in an organized way so I deflated to a DataFrame. |
I second @IAlibay 's #3611 (comment) — keep it as a numpy array. I don't see pandas here to be essential. If you really need to make data available then you could create |
Btw, pandas.DataFrame does not have great performance in my experience — you get convenience but not the speed of numpy arrays. |
I understand; my point was that when you want to process your results further (e.g., calculate statistics) then this will be faster when the data are already an array than if it were a DataFrame. For people comfortable with pandas, the df is probably more convenient and organized. MDAnalysis has been embracing numpy arrays as the foundation for everything so it makes sense to use them for results, too. But I don't see anything wrong with adding a convenience function that presents the raw data as a DataFrame (although other devs may have different opinions on the matter). |
I'd say look at some of the other standard analysis, e.g., RMSD. Also @PicoCentauri and @joaomcteixeira might also have an opinion on how to store data in |
I am also in favor of keeping things as @ALescoulie I would also be very happy if you could test your modules from the command (once you have a proper docstring)! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few quick comments — I see it's in draft but this might already be helpful
- don't use pandas.DataFrame the container for the data, use our base.Results dict-like thing
- use numpydoc formatting for documentation
- eventually: update CHANGELOG and add yourself to AUTHORS
In general your idea looks good!
:Arguments: | ||
*segid* | ||
The name of the segment the base is on | ||
|
||
*base* | ||
The number of the base pair in the segment | ||
|
||
.. rubric:: Example |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use numpydoc formatting for doc strings. See https://userguide.mdanalysis.org/stable/contributing_code.html#working-with-the-code-documentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll start on the docs soon now that I fixed the some of the functionality of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One quick question @ALescoulie (And while this is a draft PR): is there any particular reason to use assert_almost_equal
for your tests? I think MDAnalysis is using assert_allclose
for float comparisons in tests. The numpy docs also suggests to use assert_allclose
instead.
from MDAnalysisTests.datafiles import RNA_PSF, RNA_PDB | ||
from numpy.testing import ( | ||
assert_almost_equal, | ||
assert_allclose, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and you are importing it here already
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched to assert_all_close
, I was imitating an older set of tests in test_nucinfo.py
I updated my code based of the feedback, I apologize for the delay, my systems programing class is keeping me really busy. I still need to write documentation for new module, but it retains the core functionality of |
@orbeckst do you want me to also update the phase and torsion functions from |
sel = [(BaseSelect('RNAA', 1), BaseSelect('RNAA', 2)), | ||
(BaseSelect('RNAA', 22), BaseSelect('RNAA', 23))] | ||
WC = WCDist(u, sel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the user supposed to build selections in this way? That looks quite cumbersome.
I think I'd prefer something where the user has to provide AtomGroups, e.g. for strand_1 and and strand_2 and the residues are matched up with zip(strand_1.residues, strand_2.residues). We can then provide a helper function to build these groups but the classes themselves don't have to much guessing baggage, except identifying the correct atoms for the distance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to get an idea of how you want this to work, so the user provides the strands, then the residue pairs are formed in a list of tuples. The classes would then take the residue selections, use the zipped list to find the specific residues than grab the specific atoms. The super class would just accept a list of AtomGroups then merge them into a single AtomGroup. I think to do this I'd write a helper function to select the strands of DNA from a Universe. Is this what you are envisioning?
The question I have it are how do you select a specific strand of strand of DNA?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mostly used the named tuple for the sake of organizing inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the user provides the strands,
Yes, right now make it the user's responsibility to identify the WC pairs. strand_1.residues[i]
will be considered base pairing with strand_2.residues[i]
.
This is similar to the existing function, which asks the user to identify a single base pair via resid and segid. However, for your module we should move to an object-oriented approach and just work with AtomGroups instead of resid/segid selections.
then the residue pairs are formed in a list of tuples.
Yes.
The classes would then take the residue selections,
They take strand_1.residues
and strand_2.residues
use the zipped list to find the specific residues
Yes.
than grab the specific atoms.
This is where the class does some actual work in order to identify the specific atoms that are required for the W-C base pair. Depending on the residue type, a different atom is chosen.
The super class would just accept a list of AtomGroups then merge them into a single AtomGroup.
I don't think that it would merge them into a single group. Ultimately you want to be able to use calc_bonds()
on two arrays of positions. Each strand should have N residues and in the end, for each strand you should end up with an AtomGroup with N atoms, where the atom is either an N3 or and N1, depending on the nucleobase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking it would merge them into two atom groups which then are put into calc_bonds
@ALescoulie I love the typed code! It looks amazing. 👍 I was going to add that you need a bunch of tests for the errors you are raising. Have a look at the lines codecov is complaining about. :) |
dist = self._s1.positions - self._s2.positions | ||
wc: np.ndarray = mdamath.pnorm(dist) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use calc_bonds() because it will properly take PBC into account
You can reduce the scope of this PR and focus on the base class and just the WatsonCrick pair distance — the other functionality can then be added separately later. Just raise issues for the missing parts once the initial PR has been merged. |
if universe.select_atoms(f" resid {s[0].resid} ").resnames[0] in ["DC", "DT", "U", "C", "T", "CYT", "THY", "URA"]: | ||
a1, a2 = n3_name, n1_name | ||
elif universe.select_atoms(f" resid {s[0].resid} ").resnames[0] in ["DG", "DA", "A", "G", "ADE", "GUA"]: | ||
a1, a2 = n1_name, n3_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth bringing these hardcoded names out into a kwarg so someone with different names can still use the class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I think a good way to do this would be a kwarg for each base. I can pretty easily set that up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think @orbeckst has already said this, but I'm much more in favour of writing analysis where the atom selecting happens outside of the analysis class/function... so
ag1 = u.select_atoms('foo')
ag2 = u.select_atoms('bar')
analysis(ag1, ag2)
rather than
analysis(u, 'foo', 'bar')
The former is much easier to debug if you're suspicious about what's happening inside the analysis class, and also caters for people who might have differently named systems.
I agree, the one thing is needing to select different things depending on the base pair, for example selecting N1 or N3 depending on the base pair, so I think I'll do something where they select the base pairs and my just selects the specific atoms. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good to go as it stands. I think as a modernisation I'd like to convert this from passing in selection strings to passing in AtomGroups, but this can be a separate step so we can enjoy the benefits of this code as-is.
@richardjgowers I'd like to make a few quick changes as well as add some docs then we can merge. |
I let @richardjgowers shepherd the PR, sorry I don’t have the bandwidth today/tomorrow.
@richardjgowers I implemented the suggestions from @orbeckst, removed selections from inside the Classes aside from selecting atoms and added keyword arguments for each base pair. I still need to add documentation but am working on that now. |
ok I wrote some docs. I made sure to keep the original citation from nuclinfo in since I essentially copied her system for solving the problem, just restructured it to a faster OOP approach. |
I got the docs working and moved the selection outside the classes |
Fixes #3600, #3310
Changes made in this Pull Request:
AnalysisBase
derived classes for calculating different nucleic acid base pair distances.I built a new class
NucPairDist
that calculates distances between atoms in nucleic acid. It has 3 subclassesWCDist
,MajorDist
andMinorDist
. Those are indented to improve the performance of their equivalents in nucinfo. They also support taking groups of base pairs and running the distance over a trajectory.TODO:
PR Checklist