Major overhaul and enhancement of cluster analysis code. (#915)
* DRR - Cpptraj: Use epsilon sieve restore for dpeaks. Don't do output stuff if no clusters found.
* DRR - Cpptraj: Allow density peaks to restart
* DRR - Cpptraj: Some code cleanup. Update dependencies.
* DRR - Cpptraj: More cleanup
* DRR - Cpptraj: Fix use of integer data set
* DRR - Cpptraj: Fix calls to PrepareTrajWrite.
* Add a few missing depends.
* Add init
* Use forward declarations
* More forward declares
* Fixes for fwd declare
* Make non-static class; use fwd declares
* Fwd declares
* Start updating dependencies
* More and more fwd declares
* Finish up initial round of forward declares
* Finishing fixing up dependencies
* Assign to clusters noise instead of having a separate output routine.
* If no clusters found with dpeaks, exit early; prevents a segfault
* Protect against going too far into the tgtMask array
* Note but do not mark sievetoframe option in DBscan setup so that Control can pick it up. Add hidden nosieverestore option in order to directly compare against original dpeaks implementation
* Add build for cluster stuff
* Enable libcpptraj_cluster.a target
* Have findDepend depend on FindDepend.cpp
* Simplify directory prefixes to cut down on redundant dependencies
* Start adding DrawGraph
* Use new classes for drawgraph
* Add GraphType
* Start incorporating drawgraph
* Update dependencies, set debug level for cluster
* Add parsing of drawgraph options
* Ensure Cluster subdir is always made
* Start adding an assignrefs test
* Assign refs
* Add test comparison
* Add assign refs to cluster function
* Add to files and depends
* Add assignrefs options to Results_Coords
* Return type, usemass, mask
* Set defaults from metric if needed
* Move assignrefs routine to Results_Coords. Create separate function CalcResults, purpose of which is calculating things that belong in cluster nodes.
* assignrefs now in Results_Coords
* Fix GetOptions call, add CalcResults call.
* Create single unified setup routine instead of two separate ones
* Use the new setup interface for clustering
* CalcResults needs to come first since the results are used in Summary
* Enable assignrefs test
* Add 'nocoords' keyword to tests
* Provide feedback when no coordinates given for results
* Protect tests that require netcdf
* Start adding more in depth pairwise cache tests
* Add missing comma
* Add test for disk cache and no cache
* Add pairwise cache with sieve 5 test
* Add the remainder of the sieve pairwise cache tests
* Add some feedback about the pairwise cache
* Enable pairwise cache test
* Get rid of old test code
* Test singlerepfmt keyword
* Add cumulative best rep test
* Add centroid and cumulative_nosieve option tests for bestrep
* Add pwrecalc option test
* Add mismatch fatal option
* Add pwrecalc option
* Hide some debug info.
* Get rid of wrong sizeInBytes routine (does not account for different types). Use DataSize() for pairwise data set.
* Add est. disk usage in bytes calc
* Add pairwise cache in memory estimation, catch bad alloc error
* Fix output messages
* Better description of the pairwise cache in Info
* Put COORDS set for results in Results info routine
* Test loading previously written pairwise disk cache
* Have masks report number selected
* Get rid of old code
* Start adding help function
* Add cluster restart from info file test
* Enable cluster restart test
* Add more help args
* Clean up hieragglo help. Start working on sieve options help
* Fix up sieve options
* Add output and coords output args
* Add graph args.
* Add missing help keywords.
* Fix cluster code when no netcdf
* Make it so cluster tests do not need netcdf trajectory. Protect the netcdf pairwise test.
* Remove netcdf requirement
* Remove netcdf requirement. Ensure leftover files from old versions are cleaned
* Remove netcdf requirement
* Remove netcdf dependency
* Remove netcdf dependency
* Protect test when no netcdf
* Add missing include
* Fix openmp compile
* Return to previous analysis name for pytraj
* Ensure data set name is the last argument.
* To maintain compatibility with pytraj, ensure the cache is allocated after cluster number vs time set. Also ensure that a default data set name gets set.
* Create an AllocateSet routine that will allocate a DataSet but not add it to the DataSetList
* Only add the cache if it was actually allocated
* Improve code docs.
* Add more description to cluster help.
* Update the cluster manual entry.
* Add info about cluster pairwise matrix files; add cluster example using pairdist
* Major revision increase to 6.0.0. Complete overhaul of the clustering code.
* loadpairdist unnecessary here
* If starting with more clusters than target # clusters in kmeans, remove the low population clusters.
* Start trying to handle kmeans restart when target # clusters > current clusters
* Add kmeans restart test
* Handle case where kmeans restart target # clusters is greater than existing clusters
* Make messages clearer
* Add test for kmeans restart where target clusters > existing clusters. Not sure the seed choice algorithm is optimal but it appears to work ok.
* Add note.
* Compare info.dat
* Ensure existing clusters are removed for restart DBscan, will be recreated at the end.
* Add DBscan restart test
* Hide some debug info
* Add ability to read initial clusters from cnumvtime set
* Comment out debug statement
* Add readinfo cnvtset test
* Clean up savepairdist/loadpairdist entries. Add cnvtset keyword
* Start creating Node versions of bestrep functions
* Create Node versions of the bestrep functions
* Create separate print routine for debug; some cleanup
* Make print function take a node
* Best reps function for a node
* Add best rep frames output to summary by parts.
* Ensure assignrefs output comes last
* Add info on how assignrefs output is written to summary/summarysplit files. Fix up description of summary/summarysplit.
* Add blurb about SSR/SST
* Summary by parts now has best rep output
* Update CMakeLists.txt with most recent versions from Amber gitlab
* Add file that can be used to test cmake
* Ensure installdir is created
* targets_include_directories is missing in the cmake submodule
* Start trying to add cmake file for cluster subdir
* Rename library to cpptraj_cluster to be consistent with make. Add missing source. Make static, add pic flag if needed.
* libcpptraj_cluster needs netcdf includes
* Use breaks where appropriate instead of continue. Address some comments.
* Args were in the wrong order
* Resolve FIXMEs that should be TODOs
* Fix printf arg types
* Try to resolve ordering issue.
* Fix printf format
* Add arrays for metrics and centroids
* Add copy/assignment
* Start Metric_Scalar
* Start implementation of Metric_Scalar
* Start using Centroid_Num
* Start adding centroid calcs.
* Finish Metric_Scalar implementation
* Add Metric_Torsion
* Start adding init to MetricArray
* Add call to Init for allocated metric
* Parse euclid and manhattan args
* Add recognized keywords
* Add metric weights
* Remove old metric files
* Remove obsolete metric types
* Add Info routine for MetricArray
* Add setup for metrics
* Really remove obsolete metric types
* Ensure # of points covered by Metrics is the same
* Add tool to check for missing source files from make/cmake
* Start incorporating MetricArray
* Add destruct, assign and copy to CentroidArray
* Add missing vars to copy/assign
* Add new centroid calc
* Add CentroidArray
* Add CalculateCentroid
* Add the difference distance calcs
* Frame to centroids distance
* Convert to use CentroidArray/MetricArray
* Add empty()
* Update code comments, delete commented out code
* Fold PairwiseMatrix functionality into MetricArray
* Add keywords strings for pairwise
* Need to pass in two data set lists; one for sets to cluster, the other to add the cache to if needed.
* Centroid_Multi now unused (covered by CentroidArray). PairwiseMatrix folded into MetricArray.
* Add info about cache to Info() print
* Make coordsMetricType a class var
* Replace coords metric type var with function that returns first COORDS-related metric
* Use MetricArray
* Change name to better reflect what it is
* Add routines for accessing the pairwise cache from outside MetricArray
* Start switching to MetricArray
* Create a copy of MetricArray itself for OpenMP
* Add CentroidDist
* Use MetricArray
* Switch from PairwiseMatrix to MetricArray
* Convert to use MetricArray
* Convert from Metric*/PairwiseMatrix to MetricArray
* Use MetricArray in place of PairwiseMatrix
* Finish removing PairwiseMatrix
* Last changes to MetricArray
* Update main depends. Remove duplicate Info() call.
* Remove obsolete comment
* double* form of the distance routines is not needed.
* Update code comments
* Uncached distance calc can be private
* Remove unneeded fwd declare
* Update version for cmake
* Add test clustering on RMSD and data
* Add error message when no frames to cluster.
* Move the error message to Control
* Add calculation for determining how much each metric contributes to total distance
* Write out description for Info instead of just the legend again
* Add metricstats keyword
* Improve output message
* Add metricstats keyword
* Fix up help
* Update manual entry with ability to cluster on any combo of coords/1d sets. Add metricstats keyword.
* Enable coords/1d clustering test
* Add weights to distance contribution calc
* Add entry for wgt keyword
* Test the wgt keyword
* Have metricstats and wgt reference each other
* Put all coords results setup in the same place
* Redirect metric stats to a file
* Add metricstats test save
* Add metricstats save files
* Adjust metric summation based on distance calculation type
* Add test with manhattan distance
* Metric stats calc now takes manhattan/euclid into account
* Calculate average and SD of individual metric contributions. Change output format for metricstats.
* Use OnlineVarT to calculate frac averages. Report frac SD as well.
* Update for new format
* Update help text
* Update metricstats entry.
* Update dependencies
* Add some const
* Add some feedback for frameSelect_ var in Info()
* Fix up some of the descriptions.
* Add min and max distance contribution to metricstats calc
* Min and max contributions now reported
* Add Min and Max description to metricstats help entry
* Improve help text.

Co-authored-by: Daniel R. Roe <daniel.roe@nih.gov>