Major overhaul and enhancement of cluster analysis code. #915
…tuff if no clusters found.
output format for metricstats.
Added the ability to cluster on any combo of COORDS/1D sets. Added ability to specify weights when multiple metrics are involved. Added |
This pull request fixes 7 alerts when merging df77062 into 641bd2f - view on LGTM.com fixed alerts:
This pull request fixes 7 alerts when merging d3a291d into 641bd2f - view on LGTM.com fixed alerts:
Failed Jenkins test is from Linux GNU OpenMP:
I can't seem to reproduce it.
run jenkins
Version 6.0.0.
The clustering code has been pretty much entirely rewritten. It was getting to the point where adding new functionality to the old code was becoming prohibitively difficult.
Code Changes
The clustering code has been reorganized along more logical lines. Clustering consists of an Algorithm class (which drives the actual clustering), a Metric class (which provides the ability to determine centroids, calculate distances between frames/centroids, etc.), and various other helper classes (such as classes for each type of centroid). This generalization should make it easier to add new algorithms in the future, as well as new types of output and cluster-related calculations. The pairwise matrix functionality is now more rational as well; there is a MetricArray class containing all Metrics, which is responsible for handling all distance calculations (frame to frame and frame to centroid), centroid determination and centroid operations, and the cache for previously calculated pairwise distances. The pairwise cache is a separate DataSet, and is located either in memory or on disk (eliminating the need for the somewhat nonsensical DataSet_Cmatrix_NOMEM class). NetCDF disk cache is now the default.
The new code is in its own namespace and subdirectory, which is probably a tactic I will use more often going forward. The encapsulation of the clustering functionality should make it easier in the future to use this as a backend for e.g. the Wavelet analysis code (which currently has its own DBscan implementation) or for doing something like clustering molecules within a Frame. I've taken pains not to break the interface with pytraj, so that should still work, but @hainm and I should talk about making the interface more robust going forward (for example, pytraj currently expects a certain ordering of the data sets generated by clustering).
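To make the division of labor described above concrete, here is a minimal sketch of a metric array that owns several metrics, combines their distances, and caches pairwise results. It is written in Python for brevity and is not cpptraj's C++ code: the class and method names (`ScalarMetric`, `frame_distance`, `contributions`), the per-metric weights, and the weighted Euclidean combination rule are all assumptions chosen for illustration.

```python
from math import sqrt


class ScalarMetric:
    """Hypothetical metric over a 1D data set: one value per frame."""

    def __init__(self, values):
        self.values = list(values)

    def frame_distance(self, i, j):
        # Distance between two frames is the absolute difference.
        return abs(self.values[i] - self.values[j])

    def centroid(self, frames):
        # Centroid of a 1D set is the mean over the member frames.
        return sum(self.values[f] for f in frames) / len(frames)

    def centroid_distance(self, i, centroid):
        return abs(self.values[i] - centroid)


class MetricArray:
    """Holds all metrics, their weights, and a pairwise distance cache."""

    def __init__(self, metrics, weights=None):
        self.metrics = list(metrics)
        if weights is None:
            weights = [1.0] * len(self.metrics)
        self.weights = list(weights)
        if len(self.weights) != len(self.metrics):
            raise ValueError("need exactly one weight per metric")
        # In-memory cache keyed by (i, j) with i < j (assumed layout).
        self.cache = {}

    def contributions(self, i, j):
        # Weighted squared distance contributed by each metric.
        return [w * m.frame_distance(i, j) ** 2
                for m, w in zip(self.metrics, self.weights)]

    def frame_distance(self, i, j):
        if i == j:
            return 0.0
        key = (i, j) if i < j else (j, i)
        if key not in self.cache:
            # Assumed combination rule: weighted Euclidean.
            self.cache[key] = sqrt(sum(self.contributions(*key)))
        return self.cache[key]
```

An algorithm written against `MetricArray.frame_distance` never needs to know how many metrics are involved or where the cache lives, which is the point of the encapsulation. The per-metric values from `contributions` are the kind of breakdown that shows how much each metric adds to the total distance.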
Functionality Changes
All functionality is backwards-compatible except for the summary-by-parts output, which now includes on each cluster line a best representative frame calculation for each part (#914). There are 2 major functionality changes:
metricstats <file> can be used to gain more insight into how each metric is contributing to the total distance.
The readinfo keyword can now be used to restart clustering with a different metric/algorithm if desired. The user can also just specify a cluster number vs time data set for readinfo instead of providing an info file. This works for hierarchical agglomerative, K-means, and DBscan.
Other issues that are addressed: DBSCAN clustering should provide a better error message for mutually exclusive options (#805); Should be able to cluster on existing pairwise distance matrices (#769); Make cluster distance calc a separate part of ClusterList (#356).
The sieve restore options are no longer locked into specific algorithms and can be chosen by the user. Some potential segfaults were found and fixed (particularly in the DPeaks routine).
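As an illustration of restarting from a cluster number vs time data set, the sketch below rebuilds cluster membership from a per-frame array of cluster numbers. It is a Python sketch, not cpptraj's implementation: the function name `clusters_from_assignments` is hypothetical, and the use of -1 to mark frames that belong to no cluster is an assumption.

```python
from collections import defaultdict

NOISE = -1  # assumed label for frames that belong to no cluster


def clusters_from_assignments(cnum_v_time):
    """Group frame indices by cluster number.

    cnum_v_time[frame] is the cluster number assigned to that frame.
    Returns (clusters, noise): a dict of cluster number -> frame list,
    ordered by cluster number, and a list of unassigned frames.
    """
    clusters = defaultdict(list)
    noise = []
    for frame, cnum in enumerate(cnum_v_time):
        if cnum == NOISE:
            noise.append(frame)
        else:
            clusters[cnum].append(frame)
    return dict(sorted(clusters.items())), noise


# Six frames: three in cluster 0, two in cluster 1, one unassigned.
clusters, noise = clusters_from_assignments([0, 0, 1, -1, 1, 0])
```

Once membership is recovered this way, centroids can be recomputed with whichever metric is chosen for the restart, which is why the starting assignments do not tie the user to the original metric or algorithm.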
Tons of new tests have been added. Also, the manual entry for clustering has been entirely revamped, and missing info (like what SSR/SST is) has been added.
Since this is a gigantic change (and another major version increase) I'm going to label this as WIP for now and let it sit a few days before merging. This has been a long time coming - a few more days won't hurt.