Major overhaul and enhancement of cluster analysis code. (#915)
* DRR - Cpptraj: Use epsilon sieve restore for dpeaks. Don't do output stuff if no clusters found.

* DRR - Cpptraj: Allow density peaks to restart

* DRR - Cpptraj: Some code cleanup. Update dependencies.

* DRR - Cpptraj: More cleanup

* DRR - Cpptraj: Fix use of integer data set

* DRR - Cpptraj: Fix calls to PrepareTrajWrite.

* Add a few missing depends.

* Add init

* Use forward declarations

* More forward declares

* Fixes for fwd declare

* Make non-static class; use fwd declares

* Fwd declares

* Start updating dependencies

* More and more fwd declares

* Finish up initial round of forward declares

* Finishing fixing up dependencies

* Assign to clusters noise instead of having a separate output routine.

* If no clusters found with dpeaks, exit early; prevents a segfault

* Protect against going too far into the tgtMask array

* Note but do not mark sievetoframe option in DBscan setup so that Control
can pick it up. Add hidden nosieverestore option in order to directly
compare against original dpeaks implementation

* Add build for cluster stuff

* Enable libcpptraj_cluster.a target

* Have findDepend depend on FindDepend.cpp

* Simplify directory prefixes to cut down on redundant dependencies

* Start adding DrawGraph

* Use new classes for drawgraph

* Add GraphType

* Start incorporating drawgraph

* Update dependencies, set debug level for cluster

* Add parsing of drawgraph options

* Ensure Cluster subdir is always made

* Start adding an assignrefs test

* Assign refs

* Add test comparison

* Add assign refs to cluster function

* Add to files and depends

* Add assignrefs options to Results_Coords

* Return type, usemass, mask

* Set defaults from metric if needed

* Move assignrefs routine to Results_Coords. Create separate function
CalcResults, purpose of which is calculating things that belong in
cluster nodes.

* assignrefs now in Results_Coords

* Fix GetOptions call, add CalcResults call.

* Create single unified setup routine instead of two separate ones

* Use the new setup interface for clustering

* CalcResults needs to come first since the results are used in Summary

* Enable assignrefs test

* Add 'nocoords' keyword to tests

* Provide feedback when no coordinates given for results

* Protect tests that require netcdf

* Start adding more in depth pairwise cache tests

* Add missing comma

* Add test for disk cache and no cache

* Add pairwise cache with sieve 5 test

* Add the remainder of the sieve pairwise cache tests

* Add some feedback about the pairwise cache

* Enable pairwise cache test

* Get rid of old test code

* Test singlerepfmt keyword

* Add cumulative best rep test

* Add centroid and cumulative_nosieve option tests for bestrep

* Add pwrecalc option test

* Add mismatch fatal option

* Add pwrecalc option

* Hide some debug info.

* Get rid of wrong sizeInBytes routine (does not account for different
types). Use DataSize() for pairwise data set.

* Add est. disk usage in bytes calc

* Add pairwise cache in memory estimation, catch bad alloc error

* Fix output messages

* Better description of the pairwise cache in Info

* Put COORDS set for results in Results info routine

* Test loading previously written pairwise disk cache

* Have masks report number selected

* Get rid of old code

* Start adding help function

* Add cluster restart from info file test

* Enable cluster restart test

* Add more help args

* Clean up hieragglo help. Start working on sieve options help

* Fix up sieve options

* Add output and coords output args

* Add graph args.

* Add missing help keywords.

* Fix cluster code when no netcdf

* Make it so cluster tests do not need netcdf trajectory. Protect the
netcdf pairwise test.

* Remove netcdf requirement

* Remove netcdf requirement. Ensure leftover files from old versions are
cleaned

* Remove netcdf requirement

* Remove netcdf dependency

* Remove netcdf dependency

* Protect test when no netcdf

* Add missing include

* Fix openmp compile

* Return to previous analysis name for pytraj

* Ensure data set name is the last argument.

* To maintain compatibility with pytraj, ensure the cache is allocated
after cluster number vs time set. Also ensure that a default data set
name gets set.

* Create an AllocateSet routine that will allocate a DataSet but not add
it to the DataSetList

* Only add the cache if it was actually allocated

* Improve code docs.

* Add more description to cluster help.

* Update the cluster manual entry.

* Add info about cluster pairwise matrix files; add cluster example using
pairdist

* Major revision increase to 6.0.0. Complete overhaul of the clustering code.

* loadpairdist unnecessary here

* If starting with more clusters than target # clusters in kmeans, remove
the low population clusters.

* Start trying to handle kmeans restart when target # clusters > current
clusters

* Add kmeans restart test

* Handle case where kmeans restart target # clusters is greater than existing
clusters

* Make messages clearer

* Add test for kmeans restart where target clusters > existing clusters.
Not sure the seed choice algorithm is optimal but it appears to work ok.

* Add note.

* Compare info.dat

* Ensure existing clusters are removed for restart DBscan, will be
recreated at the end.

* Add DBscan restart test

* Hide some debug info

* Add ability to read initial clusters from cnumvtime set

* Comment out debug statement

* Add readinfo cnvtset test

* Clean up savepairdist/loadpairdist entries. Add cnvtset keyword

* Start creating Node versions of bestrep functions

* Create Node versions of the bestrep functions

* Create separate print routine for debug; some cleanup

* Make print function take a node

* Best reps function for a node

* Add best rep frames output to summary by parts.

* Ensure assignrefs output comes last

* Add info on how assignrefs output is written to summary/summarysplit
files. Fix up description of summary/summarysplit.

* Add blurb about SSR/SST

* Summary by parts now has best rep output

* Update CMakeLists.txt with most recent versions from Amber gitlab

* Add file that can be used to test cmake

* Ensure installdir is created

* targets_include_directories is missing in the cmake submodule

* Start trying to add cmake file for cluster subdir

* Rename library to cpptraj_cluster to be consistent with make. Add
missing source. Make static, add pic flag if needed.

* libcpptraj_cluster needs netcdf includes

* Use breaks where appropriate instead of continue. Address some comments.

* Args were in the wrong order

* Resolve FIXMEs that should be TODOs

* Fix printf arg types

* Try to resolve ordering issue.

* Fix printf format

* Add arrays for metrics and centroids

* Add copy/assignment

* Start Metric_Scalar

* Start implementation of Metric_Scalar

* Start using Centroid_Num

* Start adding centroid calcs.

* Finish Metric_Scalar implementation

* Add Metric_Torsion

* Start adding init to MetricArray

* Add call to Init for allocated metric

* Parse euclid and manhattan args

* Add recognized keywords

* Add metric weights

* Remove old metric files

* Remove obsolete metric types

* Add Info routine for MetricArray

* Add setup for metrics

* Really remove obsolete metric types

* Ensure # of points covered by Metrics is the same

* Add tool to check for missing source files from make/cmake

* Start incorporating MetricArray

* Add destruct, assign and copy to CentroidArray

* Add missing vars to copy/assign

* Add new centroid calc

* Add CentroidArray

* Add CalculateCentroid

* Add the difference distance calcs

* Frame to centroids distance

* Convert to use CentroidArray/MetricArray

* Add empty()

* Update code comments, delete commented out code

* Fold PairwiseMatrix functionality into MetricArray

* Add keywords strings for pairwise

* Need to pass in two data set lists; one for sets to cluster, the other
to add the cache to if needed.

* Centroid_Multi now unused (covered by CentroidArray). PairwiseMatrix
folded into MetricArray.

* Add info about cache to Info() print

* Make coordsMetricType a class var

* Replace coords metric type var with function that returns first
COORDS-related metric

* Use MetricArray

* Change name to better reflect what it is

* Add routines for accessing the pairwise cache from outside MetricArray

* Start switching to MetricArray

* Create a copy of MetricArray itself for OpenMP

* Add CentroidDist

* Use MetricArray

* Switch from PairwiseMatrix to MetricArray

* Convert to use MetricArray

* Convert from Metric*/PairwiseMatrix to MetricArray

* Use MetricArray in place of PairwiseMatrix

* Finish removing PairwiseMatrix

* Last changes to MetricArray

* Update main depends. Remove duplicate Info() call.

* Remove obsolete comment

* double* form of the distance routines is not needed.

* Update code comments

* Uncached distance calc can be private

* Remove unneeded fwd declare

* Update version for cmake

* Add test clustering on RMSD and data

* Add error message when no frames to cluster.

* Move the error message to Control

* Add calculation for determining how much each metric contributes to
total distance

* Write out description for Info instead of just the legend again

* Add metricstats keyword

* Improve output message

* Add metricstats keyword

* Fix up help

* Update manual entry with ability to cluster on any combo of coords/1d
sets. Add metricstats keyword.

* Enable coords/1d clustering test

* Add weights to distance contribution calc

* Add entry for wgt keyword

* Test the wgt keyword

* Have metricstats and wgt reference each other

* Put all coords results setup in the same place

* Redirect metric stats to a file

* Add metricstats test save

* Add metricstats save files

* Adjust metric summation based on distance calculation type

* Add test with manhattan distance

* Metric stats calc now takes manhattan/euclid into account

* Calculate average and SD of individual metric contributions. Change
output format for metricstats.

* Use OnlineVarT to calculate frac averages. Report frac SD as well.

* Update for new format

* Update help text

* Update metricstats entry.

* Update dependencies

* Add some const

* Add some feedback for frameSelect_ var in Info()

* Fix up some of the descriptions.

* Add min and max distance contribution to metricstats calc

* Min and max contributions now reported

* Add Min and Max description to metricstats help entry

* Improve help text.

Co-authored-by: Daniel R. Roe <daniel.roe@nih.gov>
drroe and Daniel R. Roe authored Sep 17, 2021
1 parent 641bd2f commit 4917a32
Showing 164 changed files with 17,464 additions and 6,313 deletions.
12 changes: 7 additions & 5 deletions CMakeLists.txt
@@ -3,9 +3,9 @@ project(cpptraj NONE)

 #version number
 #---------------------------------------------------------------------------------------------------------------------------------------------------------------------
-set(cpptraj_MAJOR_VERSION 4)
-set(cpptraj_MINOR_VERSION 3)
-set(cpptraj_TWEAK_VERSION 4)
+set(cpptraj_MAJOR_VERSION 6)
+set(cpptraj_MINOR_VERSION 0)
+set(cpptraj_TWEAK_VERSION 0)

 set(cpptraj_VERSION "${cpptraj_MAJOR_VERSION}.${cpptraj_MINOR_VERSION}.${cpptraj_TWEAK_VERSION}")

@@ -49,10 +49,12 @@ if(NOT INSIDE_AMBER)
     set(BUNDLE_SIGNATURE CPTJ)
     include(Packaging)

-    # header installation option
+    # build options
     option(INSTALL_HEADERS "Copy headers to the include/cpptraj folder of the install directory. Useful for building with pytraj." FALSE)

     option(BUILD_PARALLEL_COMBINATIONS "If true, then combinations of all enabled parallelizations will be built, e.g. cpptraj.OMP.MPI and cpptraj.OMP.MPI.cuda" FALSE)
+
+    option(INSTALL_TESTS "Copy tests to the test/ folder of the install directory" FALSE)
 else()
     set(INSTALL_HEADERS FALSE)
     set(BUILD_PARALLEL_COMBINATIONS FALSE)
@@ -67,4 +69,4 @@ add_subdirectory(test)
 #--------------------------------------------------------------
 if(NOT INSIDE_AMBER)
     print_build_report()
-endif()
+endif()
30 changes: 30 additions & 0 deletions devtools/CheckForMissingSource.sh
@@ -0,0 +1,30 @@
+#!/bin/bash
+
+MAKE_SOURCES=`ls *files`
+if [ -z "$MAKE_SOURCES" ] ; then
+  MAKE_SOURCES=Makefile
+fi
+CMAKE_SOURCES=CMakeLists.txt
+
+echo "Make sources : $MAKE_SOURCES"
+echo "Cmake sources : $CMAKE_SOURCES"
+
+if [ ! -f "$MAKE_SOURCES" ] ; then
+  echo "Make sources not found."
+  exit 1
+fi
+
+if [ ! -f "$CMAKE_SOURCES" ] ; then
+  echo "Cmake sources not found."
+  exit 1
+fi
+
+SOURCES=`ls *.cpp *.c *.F90 2> /dev/null`
+
+for FILE1 in $MAKE_SOURCES $CMAKE_SOURCES ; do
+  for FILE2 in $SOURCES ; do
+    if [ -z "`grep $FILE2 $FILE1`" ] ; then
+      echo "$FILE2 appears to be missing from $FILE1"
+    fi
+  done
+done
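
Review note on devtools/CheckForMissingSource.sh: as committed, the `-f "$MAKE_SOURCES"` test only succeeds when exactly one `*files` list exists in the directory, and `grep $FILE2 $FILE1` treats the `.` in a file name as a regex wildcard. The following is a minimal sketch of one possible variant, not part of this commit. The function name `check_missing_sources`, its directory argument, and its exit codes (0 nothing missing, 1 something missing, 2 build list not found) are assumptions made for illustration. It keeps the original behavior of checking each source file against every build list separately.

```shell
#!/bin/sh
# Hypothetical variant of the missing-source check (not from the commit).
# Usage: check_missing_sources DIR
# The body runs in a subshell, so 'cd' and 'exit' do not affect the caller.
check_missing_sources() (
  cd "$1" || exit 2
  status=0

  # Collect every '*files' list; fall back to Makefile when none exists.
  set --
  for list in *files; do
    [ -f "$list" ] && set -- "$@" "$list"
  done
  [ $# -eq 0 ] && set -- Makefile
  set -- "$@" CMakeLists.txt

  # All build lists must be readable before anything is compared.
  for list in "$@"; do
    if [ ! -f "$list" ]; then
      echo "Build list not found: $list" >&2
      exit 2
    fi
  done

  for list in "$@"; do
    for src in *.cpp *.c *.F90; do
      [ -f "$src" ] || continue   # skip glob patterns that matched nothing
      # -F matches the name as a fixed string rather than a regex.
      if ! grep -qF -- "$src" "$list"; then
        echo "$src appears to be missing from $list"
        status=1
      fi
    done
  done
  exit "$status"
)
```

Called as `check_missing_sources .` from a source directory, it prints the same "appears to be missing from" lines as the committed script. A fixed-string match still accepts a name that occurs as part of a longer one (for example `Foo.cpp` inside `BarFoo.cpp`), so it remains a heuristic.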
24 changes: 24 additions & 0 deletions devtools/RunCmake.sh
@@ -0,0 +1,24 @@
+#!/bin/bash
+
+# Can be used to test the cmake install.
+# Assumes 'git submodule update --init --recursive' has been run in top dir.
+
+if [ -z "$CPPTRAJHOME" ] ; then
+  echo "CPPTRAJHOME must be set."
+  exit 1
+fi
+HOME=$CPPTRAJHOME
+
+installdir=$HOME/install
+if [ ! -d "$installdir" ] ; then
+  mkdir $installdir
+fi
+
+#export BUILD_FLAGS="-DOPENMP=TRUE"
+#export BUILD_FLAGS="-DMPI=TRUE ${BUILD_FLAGS}"
+#-DNetCDF_LIBRARIES_C=$HOME/lib/libnetcdf.so -DNetCDF_INCLUDES=$HOME/include
+COMPILER=gnu
+cmake .. $BUILD_FLAGS -DCOMPILER=${COMPILER^^} -DINSTALL_HEADERS=FALSE \
+      -DCMAKE_INSTALL_PREFIX=$installdir -DCMAKE_LIBRARY_PATH=$HOME/lib \
+      -DPRINT_PACKAGING_REPORT=TRUE
+