Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Higher Performance nuclinfo #3611

Merged
merged 20 commits into from
Apr 19, 2022
Merged

Conversation

ALescoulie
Copy link
Contributor

@ALescoulie ALescoulie commented Apr 7, 2022

Fixes #3600, #3310

Changes made in this Pull Request:

  • Add AnalysisBase derived classes for calculating different nucleic acid base pair distances.

I built a new class NucPairDist that calculates distances between atoms in nucleic acid. It has 3 subclasses WCDist, MajorDist and MinorDist. Those are indented to improve the performance of their equivalents in nucinfo. They also support taking groups of base pairs and running the distance over a trajectory.

TODO:

  • Docs

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@pep8speaks
Copy link

pep8speaks commented Apr 7, 2022

Hello @ALescoulie! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 105:80: E501 line too long (86 > 79 characters)
Line 141:80: E501 line too long (86 > 79 characters)
Line 142:80: E501 line too long (84 > 79 characters)

Comment last updated at 2022-04-19 08:17:12 UTC

@ALescoulie
Copy link
Contributor Author

Just realized that pandas is not included in MDAnalysis, I had been using tidy DataFrames to record my results

@IAlibay
Copy link
Member

IAlibay commented Apr 7, 2022

I'll let the other @MDAnalysis/coredevs speak here too, but I'm not particularly keen on having pandas as a core dependency (in fact my current goal is to remove more dependencies from MDAnalysis). I would prefer if numpy arrays could be used to store results where possible.

@ALescoulie
Copy link
Contributor Author

@IAlibay Don't think switching to arrays will be too much trouble, I'm trying to store selection, distance and time results in an organized way so I deflated to a DataFrame.

@orbeckst
Copy link
Member

orbeckst commented Apr 7, 2022

I second @IAlibay 's #3611 (comment) — keep it as a numpy array. I don't see pandas here to be essential.

If you really need to make data available then you could create to_df() that imports pandas only in the function and that can fail and tell you to install it. We have packages that are optional and then we follow that approach.

@orbeckst
Copy link
Member

orbeckst commented Apr 7, 2022

Btw, pandas.DataFrame does not have great performance in my experience — you get convenience but not the speed of numpy arrays.

@orbeckst orbeckst linked an issue Apr 7, 2022 that may be closed by this pull request
@ALescoulie
Copy link
Contributor Author

ALescoulie commented Apr 7, 2022

@orbeckst I don't use it during iteration, I just build it from a dictionary in the _conclude step, but I could just return a names list and a numpy array of distances and times. Overall is this like what you were wanting for #3600 .

@orbeckst
Copy link
Member

orbeckst commented Apr 7, 2022

I understand; my point was that when you want to process your results further (e.g., calculate statistics) then this will be faster when the data are already an array than if it were a DataFrame. For people comfortable with pandas, the df is probably more convenient and organized. MDAnalysis has been embracing numpy arrays as the foundation for everything so it makes sense to use them for results, too. But I don't see anything wrong with adding a convenience function that presents the raw data as a DataFrame (although other devs may have different opinions on the matter).

@ALescoulie
Copy link
Contributor Author

@orbeckst and @IAlibay looking at how my DataFrame (selection, time, dist as cols with results in rows) is structured, how do you think is best to structure the results into an array. I'm thinking a dictionary with selection number as keys.

@orbeckst
Copy link
Member

orbeckst commented Apr 7, 2022

I'd say look at some of the other standard analysis, e.g., RMSD.

Also @PicoCentauri and @joaomcteixeira might also have an opinion on how to store data in results.

@PicoCentauri
Copy link
Contributor

I am also in favor of keeping things as numpy arrays. I don't see much benefit of a pandas data frame here. Additionally, the new classes will be part of the CLI, and the CLI's saving workflow is (currently) not able to handle data frames. The latter is maybe an issue of the CLI I have to fix though.

@ALescoulie I would also be very happy if you could test your modules from the command (once you have a proper docstring)!

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few quick comments — I see it's in draft but this might already be helpful

  • don't use pandas.DataFrame the container for the data, use our base.Results dict-like thing
  • use numpydoc formatting for documentation
  • eventually: update CHANGELOG and add yourself to AUTHORS

In general your idea looks good!

package/MDAnalysis/analysis/nucleicacids.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/nucleicacids.py Outdated Show resolved Hide resolved
Comment on lines 51 to 58
:Arguments:
*segid*
The name of the segment the base is on

*base*
The number of the base pair in the segment

.. rubric:: Example
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll start on the docs soon now that I fixed the some of the functionality of the code.

package/MDAnalysis/analysis/nucleicacids.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/nucleicacids.py Outdated Show resolved Hide resolved
Copy link
Member

@ojeda-e ojeda-e left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One quick question @ALescoulie (And while this is a draft PR): is there any particular reason to use assert_almost_equal for your tests? I think MDAnalysis is using assert_allclose for float comparisons in tests. The numpy docs also suggests to use assert_allclose instead.

from MDAnalysisTests.datafiles import RNA_PSF, RNA_PDB
from numpy.testing import (
assert_almost_equal,
assert_allclose,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and you are importing it here already

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched to assert_all_close, I was imitating an older set of tests in test_nucinfo.py

@ALescoulie
Copy link
Contributor Author

ALescoulie commented Apr 12, 2022

I updated my code based of the feedback, I apologize for the delay, my systems programing class is keeping me really busy. I still need to write documentation for new module, but it retains the core functionality of nucinfo.py making measuring multiple groups and distance over time more efficient. Now I need to finish the docs.

@ALescoulie
Copy link
Contributor Author

@orbeckst do you want me to also update the phase and torsion functions from nucinfo.py

Comment on lines 37 to 39
sel = [(BaseSelect('RNAA', 1), BaseSelect('RNAA', 2)),
(BaseSelect('RNAA', 22), BaseSelect('RNAA', 23))]
WC = WCDist(u, sel)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the user supposed to build selections in this way? That looks quite cumbersome.

I think I'd prefer something where the user has to provide AtomGroups, e.g. for strand_1 and and strand_2 and the residues are matched up with zip(strand_1.residues, strand_2.residues). We can then provide a helper function to build these groups but the classes themselves don't have to much guessing baggage, except identifying the correct atoms for the distance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to get an idea of how you want this to work, so the user provides the strands, then the residue pairs are formed in a list of tuples. The classes would then take the residue selections, use the zipped list to find the specific residues than grab the specific atoms. The super class would just accept a list of AtomGroups then merge them into a single AtomGroup. I think to do this I'd write a helper function to select the strands of DNA from a Universe. Is this what you are envisioning?

The question I have it are how do you select a specific strand of strand of DNA?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly used the named tuple for the sake of organizing inputs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the user provides the strands,

Yes, right now make it the user's responsibility to identify the WC pairs. strand_1.residues[i] will be considered base pairing with strand_2.residues[i].

This is similar to the existing function, which asks the user to identify a single base pair via resid and segid. However, for your module we should move to an object-oriented approach and just work with AtomGroups instead of resid/segid selections.

then the residue pairs are formed in a list of tuples.

Yes.

The classes would then take the residue selections,

They take strand_1.residues and strand_2.residues

use the zipped list to find the specific residues

Yes.

than grab the specific atoms.

This is where the class does some actual work in order to identify the specific atoms that are required for the W-C base pair. Depending on the residue type, a different atom is chosen.

The super class would just accept a list of AtomGroups then merge them into a single AtomGroup.

I don't think that it would merge them into a single group. Ultimately you want to be able to use calc_bonds() on two arrays of positions. Each strand should have N residues and in the end, for each strand you should end up with an AtomGroup with N atoms, where the atom is either an N3 or and N1, depending on the nucleobase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking it would merge them into two atom groups which then are put into calc_bonds

@hmacdope
Copy link
Member

@ALescoulie I love the typed code! It looks amazing. 👍 I was going to add that you need a bunch of tests for the errors you are raising. Have a look at the lines codecov is complaining about. :)

@lilyminium lilyminium mentioned this pull request Apr 15, 2022
Comment on lines 112 to 113
dist = self._s1.positions - self._s2.positions
wc: np.ndarray = mdamath.pnorm(dist)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use calc_bonds() because it will properly take PBC into account

@orbeckst
Copy link
Member

You can reduce the scope of this PR and focus on the base class and just the WatsonCrick pair distance — the other functionality can then be added separately later. Just raise issues for the missing parts once the initial PR has been merged.

Comment on lines 134 to 137
if universe.select_atoms(f" resid {s[0].resid} ").resnames[0] in ["DC", "DT", "U", "C", "T", "CYT", "THY", "URA"]:
a1, a2 = n3_name, n1_name
elif universe.select_atoms(f" resid {s[0].resid} ").resnames[0] in ["DG", "DA", "A", "G", "ADE", "GUA"]:
a1, a2 = n1_name, n3_name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth bringing these hardcoded names out into a kwarg so someone with different names can still use the class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I think a good way to do this would be a kwarg for each base. I can pretty easily set that up.

Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @orbeckst has already said this, but I'm much more in favour of writing analysis where the atom selecting happens outside of the analysis class/function... so

ag1 = u.select_atoms('foo')
ag2 = u.select_atoms('bar')

analysis(ag1, ag2)

rather than

analysis(u, 'foo', 'bar')

The former is much easier to debug if you're suspicious about what's happening inside the analysis class, and also caters for people who might have differently named systems.

@ALescoulie
Copy link
Contributor Author

I think @orbeckst has already said this, but I'm much more in favour of writing analysis where the atom selecting happens outside of the analysis class/function... so

I agree, the one thing is needing to select different things depending on the base pair, for example selecting N1 or N3 depending on the base pair, so I think I'll do something where they select the base pairs and my just selects the specific atoms.

Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good to go as it stands. I think as a modernisation I'd like to convert this from passing in selection strings to passing in AtomGroups, but this can be a separate step so we can enjoy the benefits of this code as-is.

@ALescoulie
Copy link
Contributor Author

@richardjgowers I'd like to make a few quick changes as well as add some docs then we can merge.

@ALescoulie ALescoulie marked this pull request as ready for review April 17, 2022 18:50
@orbeckst orbeckst dismissed their stale review April 17, 2022 19:50

I let @richardjgowers shepherd the PR, sorry I don’t have the bandwidth today/tomorrow.

@ALescoulie
Copy link
Contributor Author

@richardjgowers I implemented the suggestions from @orbeckst, removed selections from inside the Classes aside from selecting atoms and added keyword arguments for each base pair. I still need to add documentation but am working on that now.

@ALescoulie
Copy link
Contributor Author

ok I wrote some docs. I made sure to keep the original citation from nuclinfo in since I essentially copied her system for solving the problem, just restructured it to a faster OOP approach.

@ALescoulie
Copy link
Contributor Author

I got the docs working and moved the selection outside the classes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nuclinfo module does not use AnalysisBase nuclinfo.wc_pair extremely slow
8 participants