
Flatten model training parallelization and specifically control ones with n_jobs #542

Closed
thcrock opened this issue Dec 4, 2018 · 0 comments

thcrock commented Dec 4, 2018

The model training is grossly inefficient, particularly with multiprocessing. Because different classifiers have very lopsided training requirements (memory and time), it's easy to get unlucky in either direction: heavily underutilizing the machine, or running out of memory. The latter usually forces us to pull back the training grid considerably, which in turn means even more time spent underutilizing the machine.

A better route would be to:

  • Flatten out training so that it's not done split by split; all models are queued at once
  • Train all of the 'small' classifiers first, with high multiprocess parallelization
  • Train all of the 'large' classifiers last, with no multiprocess parallelization but with n_jobs set to -1

How do we define 'small' and 'large' classifiers? Potentially just by whether or not they have an n_jobs argument.
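
A minimal sketch of what that two-phase split could look like, using scikit-learn's get_params() to detect an n_jobs parameter; the model grid and the fit_one helper below are illustrative assumptions, not triage's actual API:

```python
from multiprocessing import Pool

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)


def is_large(clf):
    # 'Large' = exposes an n_jobs parameter (random forests, extra trees, ...);
    # everything else counts as 'small'.
    return "n_jobs" in clf.get_params()


def fit_one(clf):
    clf.fit(X, y)
    return type(clf).__name__


if __name__ == "__main__":
    grid = [
        DecisionTreeClassifier(),
        SVC(),
        RandomForestClassifier(n_estimators=500),
        ExtraTreesClassifier(n_estimators=500),
    ]

    small = [c for c in grid if not is_large(c)]
    large = [c for c in grid if is_large(c)]

    # Phase 1: small classifiers fan out across worker processes.
    with Pool() as pool:
        print(pool.map(fit_one, small))

    # Phase 2: large classifiers run one at a time, each using every core itself.
    for clf in large:
        clf.set_params(n_jobs=-1)
        print(fit_one(clf))
```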

@thcrock thcrock self-assigned this Dec 4, 2018
thcrock added a commit that referenced this issue Jan 28, 2019
In MulticoreExperiment, partition models to be trained/tested into two buckets: 'large' ones (random forests, extra trees) and 'small' ones (everything else). Train the small ones first, as they are trained now, and then train the large ones serially, maxing out n_jobs if it isn't already set. This is for RAM stability: trying to multiprocess classifiers like random forests can cause memory to spike and kill the experiment.
thcrock added a commit that referenced this issue Mar 13, 2019
Train/test tasks are now implemented in batches, based on which results people are likely to want first:

- Batch 1: short and important ones, like Decision Trees, Scaled Logistic Regressions, and baselines. These are parallelized if the Experiment subclass in use supports parallelization.
- Batch 2: big ones with n_jobs=-1. These run serially no matter what the Experiment subclass is, because the classifier is expected to parallelize itself and adding another layer of parallelism on top is likely to crash the Experiment.
- Batch 3: all others, parallelized like batch 1. These are ones that can be expected to take a while to complete (Gradient Boosting, forests without n_jobs=-1) and/or are less likely to be effective.