
Flatten model training parallelization and specifically control ones with n_jobs #542

Closed
thcrock opened this issue Dec 4, 2018 · 0 comments

thcrock commented Dec 4, 2018

The model training is grossly inefficient, particularly with multiprocessing. Because different classifiers have very lopsided training requirements (memory and time), it's easy to get unlucky in either direction: heavily underutilizing the machine, or running out of memory. The latter usually forces us to pull back the training grid considerably, which in turn means even more time spent underutilizing the machine.

A better route would be to:

  • Flatten out training so that it's not done split by split; all models are queued at once
  • Train all of the 'small' classifiers first, with high multiprocess parallelization
  • Train all of the 'large' classifiers last, with no multiprocess parallelization but with n_jobs set to -1

How do we define 'small' and 'large' classifiers? Potentially just by whether or not they have an n_jobs argument.
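
A minimal sketch of what that two-phase split could look like, using scikit-learn's get_params() to detect an n_jobs parameter; the model grid and the fit_one helper below are illustrative assumptions, not triage's actual API:

```python
from multiprocessing import Pool

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)


def is_large(clf):
    # 'Large' = exposes an n_jobs parameter (random forests, extra trees, ...);
    # everything else counts as 'small'.
    return "n_jobs" in clf.get_params()


def fit_one(clf):
    clf.fit(X, y)
    return type(clf).__name__


if __name__ == "__main__":
    grid = [
        DecisionTreeClassifier(),
        SVC(),
        RandomForestClassifier(n_estimators=500),
        ExtraTreesClassifier(n_estimators=500),
    ]

    small = [c for c in grid if not is_large(c)]
    large = [c for c in grid if is_large(c)]

    # Phase 1: small classifiers fan out across worker processes.
    with Pool() as pool:
        print(pool.map(fit_one, small))

    # Phase 2: large classifiers run one at a time, each using every core itself.
    for clf in large:
        clf.set_params(n_jobs=-1)
        print(fit_one(clf))
```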

@thcrock thcrock self-assigned this Dec 4, 2018
thcrock added a commit that referenced this issue Jan 28, 2019
In MulticoreExperiment, partition models to be trained/tested into two buckets: 'large' ones (random forests, extra trees) and 'small' ones (everything else). Train the small ones first, as they are trained now, and then train the large ones serially, maxing out n_jobs if it isn't already set. This is for RAM stability: trying to multiprocess classifiers like random forests can cause memory to spike and kill the experiment.
thcrock added a commit that referenced this issue Mar 13, 2019
Train/test tasks are now implemented in batches, based on which results people are likely to want first:

- Batch 1: short and important ones, like Decision Trees, Scaled Logistic Regressions, and baselines. These are parallelized if the Experiment subclass in use supports parallelization.
- Batch 2: big ones with n_jobs=-1. These run serially no matter what the Experiment subclass is, because the classifier is expected to parallelize itself and adding another layer of parallelism on top is likely to crash the Experiment.
- Batch 3: all others, parallelized like batch 1. These are ones that can be expected to take a while to complete (Gradient Boosting, forests without n_jobs=-1) and/or are less likely to be effective.