The model training is grossly inefficient, particularly with multiprocessing. Since different classifiers have very lopsided training requirements (memory and time), it's easy to get unlucky in either direction: heavily underutilizing the machine, or running out of memory. The latter usually forces us to pull back the training grid considerably, which leads to even more time underutilizing the machine.
A better route would be to:
- Flatten out training so that it's not done by split; all the models are queued at once.
- Train all of the 'small' classifiers first, with high multiprocess parallelization.
- Train all of the 'large' classifiers last, with no parallelization but `n_jobs` set to -1.
How do we define 'small' and 'large' classifiers? Potentially just by whether or not they have an `n_jobs` argument.
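As a minimal sketch of that heuristic (assuming scikit-learn-style estimators; `is_large` is a hypothetical helper, not existing code):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def is_large(estimator):
    """Treat any estimator that exposes an `n_jobs` parameter as
    'large', since it can parallelize its own training internally."""
    return 'n_jobs' in estimator.get_params()

print(is_large(RandomForestClassifier()))  # True  -> train serially with n_jobs=-1
print(is_large(DecisionTreeClassifier()))  # False -> train in the process pool
```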
In MulticoreExperiment, partition models to be trained/tested into two
buckets: 'large ones' (random forests, extra trees) and 'small ones' (everything else). Train the small ones as they are now first, and then train the large ones serially, maxing out n_jobs if it's not set. This is for RAM stability, as trying to parallelize classifiers like random forests can cause memory to spike and kill the experiment.
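For illustration, a rough standalone sketch of that partition-and-schedule strategy (the names `train_all`, `train_one`, and `LARGE_CLASSES` are hypothetical, not the actual MulticoreExperiment code):

```python
from multiprocessing import Pool

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Classes that can spike memory if wrapped in another layer of
# multiprocessing; these get trained serially instead.
LARGE_CLASSES = (RandomForestClassifier, ExtraTreesClassifier)

def train_one(args):
    model, X, y = args
    return model.fit(X, y)

def train_all(models, X, y, n_workers=4):
    small = [m for m in models if not isinstance(m, LARGE_CLASSES)]
    large = [m for m in models if isinstance(m, LARGE_CLASSES)]

    # Small models: cheap individually, so fan them out across processes.
    with Pool(n_workers) as pool:
        fitted = pool.map(train_one, [(m, X, y) for m in small])

    # Large models: run serially and let the estimator parallelize
    # internally, maxing out n_jobs if the grid didn't set it.
    for m in large:
        if m.get_params().get('n_jobs') is None:
            m.set_params(n_jobs=-1)
        fitted.append(m.fit(X, y))
    return fitted
```

Keeping the large bucket serial means only one memory-hungry fit is in flight at a time, while its internal `n_jobs=-1` still saturates the cores.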
Train/test tasks are now implemented in batches, based on what results
people should be interested in first:
- Batch 1: Short and important ones, like Decision Trees, Scaled Logistic Regressions,
and baselines. These will be parallelized if using an Experiment
subclass that supports parallelization
- Batch 2: Big ones with n_jobs=-1. This will be run serially no matter
what the Experiment subclass is, because the classifier is expected to
parallelize and adding on another layer is likely to crash the
Experiment.
- Batch 3: All others, parallelized similar to batch 1. These are ones
that might be expected to take a while to complete (Gradient Boosting,
forests without n_jobs=-1) and/or ones less likely to be effective.