Major overhaul and enhancement of cluster analysis code. #915

drroe · 2021-09-10T20:35:50Z

Version 6.0.0.

The clustering code has been pretty much entirely rewritten. It was getting to the point where adding new functionality to the old code was becoming prohibitively difficult.

Code Changes

The clustering code has been reorganized along more logical lines. Clustering consists of an Algorithm class (which drives the actual clustering), a Metric class (which provides the ability to determine centroids, calculate distances between frames/centroids, etc.), and various other helper classes (such as classes for each type of centroid). This generalization should make it easier to add new algorithms in the future, as well as new types of output and cluster-related calculations. The pairwise matrix functionality is now more rational as well: a MetricArray class contains all Metrics and is responsible for handling all distance calculations (frame to frame and frame to centroid), centroid determination and centroid operations, and the cache for previously calculated pairwise distances. The pairwise cache is a separate DataSet, located either in memory or on disk (eliminating the need for the somewhat nonsensical DataSet_Cmatrix_NOMEM class). NetCDF disk cache is now the default.
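As a rough illustration of the split described above, here is a minimal Python sketch (cpptraj itself is C++). All names below are hypothetical stand-ins, not the actual cpptraj interfaces: a metric computes a distance between two frames, a MetricArray-like object owns the metrics plus the pairwise cache, and the algorithm only ever asks that object for distances. Combining metrics by plain summation is an assumption made for this sketch only.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical stand-in for a Metric: a function giving the distance between two frames.
Metric = Callable[[int, int], float]

class MetricArray:
    """Owns all metrics and a cache of previously calculated pairwise distances."""

    def __init__(self, metrics: Sequence[Metric]):
        self.metrics = list(metrics)
        self.cache: Dict[Tuple[int, int], float] = {}
        self.n_calcs = 0  # number of distances actually computed (cache misses)

    def frame_dist(self, i: int, j: int) -> float:
        key = (i, j) if i < j else (j, i)  # distances are symmetric
        if key not in self.cache:
            self.n_calcs += 1
            # Assumption for this sketch: metrics are combined by simple summation.
            self.cache[key] = sum(m(key[0], key[1]) for m in self.metrics)
        return self.cache[key]

def nearest_neighbor(marray: MetricArray, frame: int, n_frames: int) -> int:
    """Toy 'algorithm' step: find the frame closest to `frame`."""
    others = [f for f in range(n_frames) if f != frame]
    return min(others, key=lambda f: marray.frame_dist(frame, f))

# Usage: a single 1D data set as the only metric.
data = [0.0, 1.0, 1.5, 5.0]
marray = MetricArray([lambda i, j: abs(data[i] - data[j])])
nn_of_1 = nearest_neighbor(marray, 1, len(data))
nn_of_2 = nearest_neighbor(marray, 2, len(data))
```

The point of the design is that the second query reuses the cached distance between frames 1 and 2 instead of recomputing it, and the algorithm never needs to know which metrics are in play.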

The new code is in its own namespace and subdirectory, which is probably a tactic I will use more often going forward. The encapsulation of the clustering functionality should make it easier in the future to use this as a backend for e.g. the Wavelet analysis code (which currently has its own DBscan implementation) or for doing something like clustering molecules within a Frame. I've taken pains not to break the interface with pytraj so that should still work, but @hainm and I should talk going forward about making the interface more robust (for example, pytraj currently expects a certain ordering to data sets generated by clustering).

Functionality Changes

All functionality is backwards-compatible except for the summary-by-parts output, where each cluster line now also includes a best representative frame calculated for each part (#914). There are 2 major functionality changes:

Any combination of COORDS and 1D sets can be clustered on, and custom weights can be set for each set. Previously it was either COORDS or a combination of 1D sets. This greatly expands the types of clustering possible. A new keyword, metricstats <file>, can be used to gain more insight into how each metric is contributing to the total distance.
The readinfo keyword can now be used to restart clustering with a different metric/algorithm if desired. The user can also just specify a cluster number vs time data set for readinfo instead of providing an info file. This works for hierarchical agglomerative, K-means, and DBscan.

Other issues that are addressed:

DBSCAN clustering should provide a better error message for mutually exclusive options #805
Should be able to cluster on existing pairwise distance matrices #769
Make cluster distance calc a separate part of ClusterList #356
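To make the idea of weighted multi-metric clustering a bit more concrete, here is a small Python sketch of how per-metric distances might be folded into one total distance. The exact formulas cpptraj uses are not stated in this PR; the Euclidean and Manhattan forms below, and the function name `combined_distance`, are assumptions for illustration only.

```python
import math
from typing import Sequence

def combined_distance(dists: Sequence[float], weights: Sequence[float],
                      kind: str = "euclid") -> float:
    """Combine per-metric distances into one total distance.

    Assumed forms (for illustration only):
      euclid:    sqrt(sum_k w_k * d_k^2)
      manhattan: sum_k w_k * |d_k|
    """
    if len(dists) != len(weights):
        raise ValueError("need one weight per metric")
    if kind == "euclid":
        return math.sqrt(sum(w * d * d for d, w in zip(dists, weights)))
    if kind == "manhattan":
        return sum(w * abs(d) for d, w in zip(dists, weights))
    raise ValueError(f"unknown distance type: {kind}")

# Hypothetical example: a distance of 3.0 from a COORDS metric and a
# difference of 4.0 from a 1D set, weighted equally.
d_euclid = combined_distance([3.0, 4.0], [1.0, 1.0], "euclid")
d_manhattan = combined_distance([3.0, 4.0], [1.0, 1.0], "manhattan")
```

With equal weights the two metrics above give a total of 5.0 under the assumed Euclidean form and 7.0 under the Manhattan form, which is why any per-metric contribution statistics need to know which form is in use.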

The sieve restore options are no longer locked into specific algorithms and can be chosen by the user. Some potential segfaults were found and fixed (particularly in the DPeaks routine).

Tons of new tests have been added. Also, the manual entry for clustering has been entirely revamped, and missing info (like what SSR/SST is) has been added.

Since this is a gigantic change (and another major version increase) I'm going to label this as WIP for now and let it sit a few days before merging. This has been a long time coming - a few more days won't hurt.

drroe · 2021-09-16T16:52:54Z

Added the ability to cluster on any combo of COORDS/1D sets. Added ability to specify weights when multiple metrics are involved. Added metricstats keyword for more insight when multiple metrics are involved.
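For readers wondering what per-metric statistics could look like, below is a minimal Python sketch of accumulating the average, SD, min and max of one metric's fractional contribution to the total distance in a single pass. It uses Welford's online algorithm; the class name `OnlineStats`, the sample (n - 1) form of the SD, and the example values are assumptions of this sketch, not a description of the cpptraj implementation or output format.

```python
import math

class OnlineStats:
    """Running mean, SD, min and max via Welford's online algorithm."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def add(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    def sd(self) -> float:
        # Sample standard deviation (n - 1); an assumption of this sketch.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

# Hypothetical fractional contributions of one metric to the total
# distance, one value per frame pair.
stats = OnlineStats()
for frac in [0.2, 0.4, 0.6]:
    stats.add(frac)
```

An online accumulator like this avoids storing one value per frame pair, which matters because the number of pairs grows quadratically with the number of frames.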

lgtm-com · 2021-09-16T17:24:05Z

This pull request fixes 7 alerts when merging df77062 into 641bd2f - view on LGTM.com

fixed alerts:

7 for FIXME comment

lgtm-com · 2021-09-17T14:05:26Z

This pull request fixes 7 alerts when merging d3a291d into 641bd2f - view on LGTM.com

fixed alerts:

7 for FIXME comment

drroe · 2021-09-17T17:07:40Z

Failed Jenkins test is from Linux GNU OpenMP:

[2021-09-17T13:22:15.341Z] TEST: /scratch/local/jenkins/cpu/workspace/amber-github_cpptraj_PR-915@2/test/Test_Cluster_SymmRMSD
[2021-09-17T13:22:15.341Z] 
[2021-09-17T13:22:15.341Z]   CPPTRAJ: Clustering with symmetry-corrected RMSD metric (also 2D SRMSD)
[2021-09-17T13:22:16.349Z] terminate called after throwing an instance of 'std::bad_alloc'
[2021-09-17T13:22:16.349Z]   what():  std::bad_alloc
[2021-09-17T13:22:19.366Z] ../MasterTest.sh: line 495:  8991 Aborted                 (core dumped) $cpptraj_cmd >> $CPPTRAJ_OUTPUT

I can't seem to reproduce it. bad_alloc is usually an out-of-memory thing, but this test should require on the order of 4 MB. I'll try to re-run Jenkins and see if it happens again.

drroe · 2021-09-17T17:07:48Z

run jenkins

Daniel R. Roe and others added 30 commits February 19, 2019 14:04

DRR - Cpptraj: Use epsilon sieve restore for dpeaks. Dont do output stuff if no clusters found.

f3b491e

DRR - Cpptraj: Allow density peaks to restart

1556e25

DRR - Cpptraj: Some code cleanup. Update dependencies.

6f041a8

DRR - Cpptraj: More cleanup

14d4fb7

Merge branch 'master' into clusterrevamp

c7d42dc

Merge branch 'master' into clusterrevamp

8239be3

Merge branch 'master' into clusterrevamp

2015409

DRR - Cpptraj: Fix use of integer data set

d5fcad6

Merge branch 'master' into clusterrevamp

a4b24f1

Merge branch 'master' into clusterrevamp

17e4f6b

DRR - Cpptraj: Fix calls to PrepareTrajWrite.

1a957c5

Merge branch 'master' into clusterrevamp

6f50bb5

Merge branch 'master' into clusterrevamp

199b5a7

Merge branch 'master' into clusterrevamp

67e341a

Merge branch 'master' into clusterrevamp

115c02c

Merge branch 'master' into clusterrevamp

ef0ef08

Merge branch 'master' into clusterrevamp

afde897

Merge branch 'master' into clusterrevamp

6719c9e

Add a few missing depends.

1f35b31

Merge branch 'master' into clusterrevamp

5c55ee8

Merge branch 'master' into clusterrevamp-mastermerge

489dd84

Add init

5700629

Use forward declarations

16d1a6a

More forward declares

814e27b

Fixes for fwd declare

0a4fe86

Make non-static class; use fwd declares

7535677

Fwd declares

339b77a

Start updating dependencies

858320d

More and more fwd delcares

7f4ab84

Finish up initial round of forward declares

3d74148

drroe added 13 commits September 16, 2021 10:37

Adjust metric summation based on distance calculation type

71eadd3

Add test with manhattan distance

21cba23

Metric stats calc now takes manhattan/euclid into account

863b85e

Calculate average and SD of individual metric contributions. Change output format for metricstats.

9fc2758

Use OnlineVarT to calculate frac averages. Report frac SD as well.

4150dbe

Update for new format

750efe5

Update help text

9e40342

Update metricstats entry.

cc8f400

Update dependencies

a156451

Add some const

b077218

Add some feedback for frameSelect_ var in Info()

8f61cd7

Fix up some of the descriptions.

1f7ee8a

Merge branch 'master' into clusterrevamp

df77062

drroe added 4 commits September 17, 2021 08:49

Add min and max distance contribution to metricstats calc

56a6a85

Min and max contributions now reported

73d77d2

Add Min and Max description to metricstats help entry

62e346a

Improve help text.

d3a291d

drroe removed the Work in Progress label Sep 17, 2021

drroe merged commit 4917a32 into Amber-MD:master Sep 17, 2021

drroe deleted the clusterrevamp branch September 17, 2021 19:56

This was referenced Sep 19, 2021

DBSCAN clustering should provide a better error message for mutually exclusive options #805

Closed

Make cluster distance calc a separate part of ClusterList #356

Closed

This was referenced Oct 2, 2021

Determine representative frames for combined clustering #914

Closed

Add option to read in generic ASCII files as a pairwise distance matrix #554

Open

drroe mentioned this pull request Feb 11, 2022

Fix the cmake MPI build #945

Merged
