This repository has been archived by the owner on Jul 13, 2022. It is now read-only.

Releases: databrickslabs/automl-toolkit

v0.8.1

28 May 18:56
e9f5042

Version 0.8.1

Features

  • Added distributed Shapley calculation APIs:
com.databricks.labs.automl.exploration.analysis.shap.ShapleyPipeline
com.databricks.labs.automl.exploration.analysis.shap.ShapleyModel

These calculate SHAP values for each feature within a trained model's (or pipeline's) feature vector.
See the Documentation for details on the new API.
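A minimal usage sketch follows. The class name comes from this release, but the constructor arguments and the `calculate()` method are assumptions for illustration only; the real signatures are in the linked Documentation.

```scala
// Hypothetical sketch only: the argument names and the calculate() method are
// assumptions, not the documented API of ShapleyModel.
import com.databricks.labs.automl.exploration.analysis.shap.ShapleyModel
import org.apache.spark.ml.classification.RandomForestClassificationModel

val model: RandomForestClassificationModel = ???   // a previously trained SparkML model
val featureData = spark.table("training_features") // DataFrame holding the feature vector column

// Assumed entry point: distribute SHAP value estimation across the cluster and
// return per-feature attributions for each row of featureData.
val shapValues = ShapleyModel(model, featureData, featureCol = "features").calculate()
```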

  • Added Tree-based SparkML model and metrics visualizations and extractors:
com.databricks.labs.automl.exploration.analysis.trees.TreeModelVisualization
com.databricks.labs.automl.exploration.analysis.trees.TreePipelineVisualization

Update the group ID to reflect the updated Databricks Labs naming convention. See the project definitions in pom.xml or build.sbt.

See the Docs for API details.
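A minimal usage sketch of the tree visualization entry point. The class name is from this release; the `extractVisualization()` call is an assumption for illustration only, so consult the Docs for the real API.

```scala
// Hypothetical sketch only: the extractVisualization() call is an assumption,
// not the documented API of TreeModelVisualization.
import com.databricks.labs.automl.exploration.analysis.trees.TreeModelVisualization
import org.apache.spark.ml.classification.DecisionTreeClassificationModel

val treeModel: DecisionTreeClassificationModel = ??? // a trained tree-based SparkML model

// Assumed entry point: extract a renderable visualization of the tree
// structure and its split metrics for display in a notebook.
val visualization = TreeModelVisualization(treeModel).extractVisualization()
```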

v0.7.1

25 Mar 13:13

LABS NOTE

Sorry for the inconsistent push strategies in the past. We now have a solid public release strategy, so all releases from this point forward will be mergeable and will have a unified commit history.

Auto ML Toolkit Release Notes

Version 0.7.1

Features

  • Complete overhaul of train/test splitting and kFolding. Prior to this performance scaling improvement,
    the train and test data sets were calculated during each model's kFold stage, resulting in non-homogeneous
    comparisons between hyperparameters, as well as performance degradation from constantly having to re-split
    the data. Three new configuration parameters control the splitting behavior:

- "tunerDeltaCacheBackingDirectory"
  Defines a location on DBFS to which extremely large train and test split data sets are written.
  This is particularly recommended for use cases in which the data volume is so large that making even a
  few copies of the raw data would exceed budgeted cluster size allowances (recommended for data sets in the
  hundreds-of-GB to TB range).
- "tunerDeltaCacheBackingDirectoryRemovalFlag"
  A Boolean flag that determines, when splitCachingStrategy is set to 'delta', whether to delete the Delta
  data sets on DBFS after training is completed. By default, this is set to true (the directories are deleted
  and cleaned up). Set this flag to false if further evaluation or testing of the train/test splits is needed
  after the run completes, if investigation into the composition of the splits is desired, or if auditing of
  the training data is required by business rules.
  NOTE: path collisions are prevented by generating a UUID, which becomes part of the root bucket of the
  run's path on DBFS.
- "splitCachingStrategy" DEFAULT: 'persist'
  Options: 'cache', 'persist', or 'delta'
  - delta mode: performs a train/test split for each kFold specified in the job definition, writes the train
    and test data sets to DBFS in Delta format, and provides a reference to the Delta source for the training
    run.
    NOTE: this incurs overhead and is NOT recommended for data sets that can easily fit multiple copies into
    memory on the cluster.
  - persist mode: caches and persists the train and test kFold data sets to local disk. This is recommended
    for larger data sets for which fitting n copies of the data in memory would require an extremely large or
    expensive cluster. This is the default mode.
  - cache mode: uses standard caching (memory and disk) for the kFold train and test sets. This mode is only
    recommended if the data set is relatively small and k copies of it can comfortably reside in memory on
    the cluster.
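The three parameter names above are from this release. A minimal sketch of setting them, assuming the toolkit's usual pattern of passing a string-keyed override map (the ConfigurationGenerator call and the model family / prediction type values are assumptions, and the DBFS path is an example):

```scala
// The three keys below are the new parameters from this release; the
// ConfigurationGenerator call is an assumption about the toolkit's API.
import com.databricks.labs.automl.executor.config.ConfigurationGenerator

val overrides: Map[String, Any] = Map(
  "splitCachingStrategy" -> "delta",                                  // 'cache', 'persist' (default), or 'delta'
  "tunerDeltaCacheBackingDirectory" -> "dbfs:/tmp/automl/splitCache", // example DBFS path for the split data
  "tunerDeltaCacheBackingDirectoryRemovalFlag" -> true                // clean up the Delta splits after training
)

val config = ConfigurationGenerator.generateConfigFromMap("RandomForest", "classifier", overrides)
```

Delta mode trades write overhead for memory headroom, so it only pays off when the splits would not fit in cluster memory.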
  • Main config is now written and tracked via MLflow. Any pipeline trained as of 0.7.1 will provide the full
    config in JSON format in the MLflow Artifacts and next to your saved models path.

  • Run inference pipelines with only a runId. You no longer have to track and manage a LoggingConfig to pass
    into the inference pipeline. That constructor has been deprecated; use it only for legacy pipelines. Old
    training pipelines will not be able to run this way, but all future pipelines created as of 0.7.1 will be
    able to run with only the MLflow runId.
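A minimal sketch of the runId-only flow described above. The object and method names here are assumptions for illustration, not the documented entry point; the runId value is an example.

```scala
// Hypothetical sketch: the object and method names are assumptions for
// illustration. As of 0.7.1 only the MLflow runId is required -- no
// LoggingConfig needs to be tracked or passed in.
val runId = "1a2b3c4d5e6f"                    // example MLflow run id of a 0.7.1+ training run
val dataToScore = spark.table("new_records")  // DataFrame to run through the inference pipeline

// Assumed entry point that resolves the saved pipeline and its full config
// from the MLflow run alone, then scores the supplied DataFrame.
val predictions = PipelineModelInference.runInferenceFromRunId(runId, dataToScore)
```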

Bug Fixes / Improvements

  • The scoring metric now supports resolution of differently spelled metrics (upper case, camel case, etc.)
    and will resolve them to the standard naming conventions within SparkML for the Binary, Multiclass, and
    Regression evaluators.
  • Model training was receiving one additional fold beyond the configured count; this has been resolved.
  • Type casting from the Python API is now enabled for complex nested types in the config.
  • Minor changes to assertions to provide a better experience.
  • Minor internal function changes.

v0.7.0.1

07 Mar 20:17

Release Notes

Added optimizations to model tuners (Strategic Disk Caching)
Corrected a bad assertion for Outlier Filtering -- demoted it to a warning
Fixed pom.xml

v0.7.0

07 Mar 19:17
2_11-0.7.0

release_7.0