Major overhaul and enhancement of cluster analysis code. #915

Merged (441 commits) on Sep 17, 2021


drroe
Contributor

@drroe drroe commented Sep 10, 2021

Version 6.0.0.

The clustering code has been pretty much entirely rewritten. It was getting to the point where adding new functionality to the old code was becoming prohibitively difficult.

Code Changes

The clustering code has been reorganized along more logical lines. Clustering consists of an Algorithm class (which drives the actual clustering), a Metric class (which provides the ability to determine centroids, calculate distances between frames/centroids, etc.), and various other helper classes (such as classes for each type of centroid). This generalization should hopefully make it easier to add new algorithms in the future, as well as new types of output and cluster-related calculations. The pairwise matrix functionality is now more rational as well: there is a MetricArray class containing all Metrics, which is responsible for handling all distance calculations (frame to frame and frame to centroid), centroid determination and centroid operations, and the cache for previously calculated pairwise distances. The pairwise cache is a separate DataSet, and is located either in memory or on disk (eliminating the need for the somewhat nonsensical DataSet_Cmatrix_NOMEM class). NetCDF disk cache is now the default.
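
As a rough illustration of the division of labor described above, the following C++ sketch shows an Algorithm driving clustering through a MetricArray that wraps a Metric and caches pairwise distances. Every name and signature here is hypothetical and chosen only for illustration; it is not the actual cpptraj API. The sketch holds a single metric rather than several, the in-memory std::map stands in for the real pairwise cache DataSet, and the leader-style algorithm is a toy, not one of the algorithms cpptraj provides.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is NOT the cpptraj API.
namespace sketch {

// A Metric knows how to measure the distance between two frames.
class Metric {
  public:
    virtual ~Metric() = default;
    virtual double FrameDist(std::size_t f1, std::size_t f2) const = 0;
};

// Example metric: absolute difference between values of a 1D data set.
class Metric1D : public Metric {
  public:
    explicit Metric1D(std::vector<double> vals) : vals_(std::move(vals)) {}
    double FrameDist(std::size_t f1, std::size_t f2) const override {
      return std::fabs(vals_.at(f1) - vals_.at(f2));
    }
  private:
    std::vector<double> vals_;
};

// Owns the metric and a cache of previously calculated pairwise distances.
class MetricArray {
  public:
    explicit MetricArray(std::unique_ptr<Metric> m) : metric_(std::move(m)) {}
    double FrameDist(std::size_t f1, std::size_t f2) {
      if (f1 == f2) return 0.0;
      // Distances are symmetric, so store each pair once as (low, high).
      std::pair<std::size_t, std::size_t> key(std::min(f1, f2), std::max(f1, f2));
      auto it = cache_.find(key);
      if (it != cache_.end()) return it->second;
      double d = metric_->FrameDist(f1, f2);
      ++nCalc_;
      cache_.emplace(key, d);
      return d;
    }
    // Number of distances actually computed (cache misses).
    std::size_t NumCalculated() const { return nCalc_; }
  private:
    std::unique_ptr<Metric> metric_;
    std::map<std::pair<std::size_t, std::size_t>, double> cache_;
    std::size_t nCalc_ = 0;
};

// An Algorithm drives the clustering and only sees distances via MetricArray.
class Algorithm {
  public:
    virtual ~Algorithm() = default;
    // Returns a cluster index for each frame.
    virtual std::vector<int> DoClustering(MetricArray& dist, std::size_t nframes) = 0;
};

// Toy algorithm: a frame joins the first cluster whose leader is within eps,
// otherwise it becomes the leader of a new cluster.
class LeaderAlgorithm : public Algorithm {
  public:
    explicit LeaderAlgorithm(double eps) : eps_(eps) { assert(eps >= 0.0); }
    std::vector<int> DoClustering(MetricArray& dist, std::size_t nframes) override {
      std::vector<int> assignment(nframes, -1);
      std::vector<std::size_t> leaders;
      for (std::size_t f = 0; f < nframes; ++f) {
        for (std::size_t c = 0; c < leaders.size(); ++c) {
          if (dist.FrameDist(leaders[c], f) <= eps_) {
            assignment[f] = static_cast<int>(c);
            break;
          }
        }
        if (assignment[f] < 0) {
          assignment[f] = static_cast<int>(leaders.size());
          leaders.push_back(f);
        }
      }
      return assignment;
    }
  private:
    double eps_;
};

} // namespace sketch
```

With 1D values {0.0, 0.1, 5.0, 5.2} and a cutoff of 0.5, this toy setup assigns the frames to clusters {0, 0, 1, 1} and computes four distances; asking for any of those pairs again is served from the cache. Swapping in a different Algorithm or Metric does not require touching the cache logic, which is the point of the separation.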

The new code is in its own namespace and subdirectory, which is probably a tactic I will use more often going forward. The encapsulation of the clustering functionality should make it easier in the future to use this as a backend for e.g. the Wavelet analysis code (which currently has its own DBscan implementation) or for doing something like clustering molecules within a Frame. I've taken pains not to break the interface with pytraj so that should still work, but @hainm and I should talk going forward about making the interface more robust (for example, pytraj currently expects a certain ordering to data sets generated by clustering).

Functionality Changes

All functionality is backwards-compatible except for the summary-by-parts output, where each cluster line now includes a best representative frame calculation for each part (#914). There are 2 major functionality changes:

  1. Any combination of COORDS and 1D sets can be clustered on, and custom weights can be set for each set. Previously it was either COORDS or a combination of 1D sets. This greatly expands the types of clustering possible. A new keyword, metricstats <file> can be used to gain more insight into how each metric is contributing to the total distance.
  2. The readinfo keyword can now be used to restart clustering with a different metric/algorithm if desired. Instead of providing an info file, the user can also just specify a cluster number vs time data set for readinfo. This works for hierarchical agglomerative, K-means, and DBscan.

Other issues that are addressed:

  • DBSCAN clustering should provide a better error message for mutually exclusive options #805
  • Should be able to cluster on existing pairwise distance matrices #769
  • Make cluster distance calc a separate part of ClusterList #356
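
For the readinfo change above, a cluster number vs time data set carries enough information to rebuild cluster membership before restarting. The C++ sketch below shows one way this could work. The function name is hypothetical and not part of cpptraj, and the convention that a negative entry marks a frame belonging to no cluster (noise) is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, NOT the cpptraj implementation.
// Entry i of cnumvtime is the cluster index of frame i.
// Assumption: a negative value means the frame is in no cluster (noise).
std::vector<std::vector<std::size_t>>
ClustersFromAssignment(const std::vector<int>& cnumvtime) {
  std::vector<std::vector<std::size_t>> clusters;
  for (std::size_t frame = 0; frame < cnumvtime.size(); ++frame) {
    int c = cnumvtime[frame];
    if (c < 0) continue;  // skip noise frames
    std::size_t idx = static_cast<std::size_t>(c);
    // Grow the cluster list on demand so indices may appear in any order.
    if (idx >= clusters.size()) clusters.resize(idx + 1);
    clusters[idx].push_back(frame);
  }
  return clusters;
}
```

The resulting frame lists could then be handed to a different metric or algorithm as the starting clusters, which is the kind of restart the readinfo keyword is described as enabling.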

The sieve restore options are no longer locked into specific algorithms and can be chosen by the user. Some potential segfaults were found and fixed (particularly in the DPeaks routine).

Tons of new tests have been added. Also, the manual entry for clustering has been entirely revamped, and missing info (like what SSR/SST is) has been added.

Since this is a gigantic change (and another major version increase) I'm going to label this as WIP for now and let it sit a few days before merging. This has been a long time coming - a few more days won't hurt.

@drroe
Contributor Author

drroe commented Sep 16, 2021

Added the ability to cluster on any combo of COORDS/1D sets. Added ability to specify weights when multiple metrics are involved. Added metricstats keyword for more insight when multiple metrics are involved.
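
To make the idea of weighted multi-metric clustering concrete, here is a small C++ sketch of how per-metric distances might be folded into one total distance, along with a per-metric breakdown of the sort a stats report could show. Both function names are hypothetical, and the weighted Euclidean combination is an assumption for illustration; the rule cpptraj actually applies, and what metricstats actually writes, may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical, NOT the cpptraj implementation.
// Assumption: total distance is a weighted Euclidean combination,
// sqrt( sum_i (w_i * d_i)^2 ), over the per-metric distances d_i.
double CombinedDistance(const std::vector<double>& dists,
                        const std::vector<double>& weights) {
  assert(dists.size() == weights.size());
  double sum2 = 0.0;
  for (std::size_t i = 0; i < dists.size(); ++i) {
    double wd = weights[i] * dists[i];
    sum2 += wd * wd;
  }
  return std::sqrt(sum2);
}

// Fraction of the squared total distance contributed by each metric.
// Returns all zeros when the total distance is zero.
std::vector<double> MetricContributions(const std::vector<double>& dists,
                                        const std::vector<double>& weights) {
  assert(dists.size() == weights.size());
  std::vector<double> frac(dists.size(), 0.0);
  double sum2 = 0.0;
  for (std::size_t i = 0; i < dists.size(); ++i) {
    double wd = weights[i] * dists[i];
    frac[i] = wd * wd;
    sum2 += frac[i];
  }
  if (sum2 > 0.0) {
    for (double& f : frac) f /= sum2;
  }
  return frac;
}
```

Setting a weight to zero removes that metric from the total entirely, and a breakdown like this shows at a glance when one metric (for example an RMSD in Angstroms next to a 1D set in degrees) dominates the others because of its scale.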

@lgtm-com

lgtm-com bot commented Sep 16, 2021

This pull request fixes 7 alerts when merging df77062 into 641bd2f - view on LGTM.com

fixed alerts:

  • 7 for FIXME comment

@lgtm-com

lgtm-com bot commented Sep 17, 2021

This pull request fixes 7 alerts when merging d3a291d into 641bd2f - view on LGTM.com

fixed alerts:

  • 7 for FIXME comment

@drroe
Contributor Author

drroe commented Sep 17, 2021

Failed Jenkins test is from Linux GNU OpenMP:

[2021-09-17T13:22:15.341Z] TEST: /scratch/local/jenkins/cpu/workspace/amber-github_cpptraj_PR-915@2/test/Test_Cluster_SymmRMSD
[2021-09-17T13:22:15.341Z] 
[2021-09-17T13:22:15.341Z]   CPPTRAJ: Clustering with symmetry-corrected RMSD metric (also 2D SRMSD)
[2021-09-17T13:22:16.349Z] terminate called after throwing an instance of 'std::bad_alloc'
[2021-09-17T13:22:16.349Z]   what():  std::bad_alloc
[2021-09-17T13:22:19.366Z] ../MasterTest.sh: line 495:  8991 Aborted                 (core dumped) $cpptraj_cmd >> $CPPTRAJ_OUTPUT

I can't seem to reproduce it. bad_alloc is usually an out-of-memory thing, but this test should require on the order of 4 MB. I'll try to re-run Jenkins and see if it happens again.

@drroe
Contributor Author

drroe commented Sep 17, 2021

run jenkins

Labels: API, enhancement, New keywords (new keywords for existing commands), Tests