
Train forests serially [Resolves #542] #581

Merged: 1 commit into master on Mar 13, 2019

Conversation

@thcrock (Contributor) commented Jan 28, 2019

In MulticoreExperiment, partition models to be trained/tested into two buckets: 'large ones' (random forests, extra trees) and 'small ones' (everything else). Train the small ones first, as they are now, and then train the large ones serially, maxing out n_jobs (setting it to -1) if it isn't already set. This is for RAM stability: trying to parallelize classifiers like random forests across processes can cause memory to spike and kill the experiment.

Under a dirtyduck test, this actually ended up taking 25% longer given the same parallelization. That makes sense: running forests under multiprocessing is quite fast if you don't run out of memory. However, the unstable memory usage of the old approach often led to killed experiments, which add a ton of extra human time and an extra experiment run (or two!), which is far worse. With the new method we can more reliably set n_processes to the number of cores on the machine.

before: (oldorder screenshot)

after: (neworder screenshot)
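
The gist of the approach, as a minimal sketch (the task layout and the partition_tasks / run_all / run_parallel names here are illustrative, not the actual MulticoreExperiment API):

LARGE_CLASSIFIERS = ("RandomForest", "ExtraTrees")

def partition_tasks(tasks):
    """Split train/test tasks into 'small' (safe to run across processes)
    and 'large' (run serially, letting the classifier use all cores)."""
    small, large = [], []
    for task in tasks:
        if any(name in task["class_path"] for name in LARGE_CLASSIFIERS):
            large.append(task)
        else:
            small.append(task)
    return small, large

def run_all(tasks, run_parallel, run_one):
    small, large = partition_tasks(tasks)
    run_parallel(small)  # existing multiprocessing pool, unchanged
    for task in large:
        # default n_jobs to -1 so the classifier itself uses every core
        task["train_kwargs"]["parameters"].setdefault("n_jobs", -1)
        run_one(task)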

@codecov-io commented Jan 28, 2019

Codecov Report

Merging #581 into master will increase coverage by 0.01%.
The diff coverage is 88.52%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master   #581      +/-   ##
========================================
+ Coverage   82.99%    83%   +0.01%     
========================================
  Files          87     87              
  Lines        5763   5779      +16     
========================================
+ Hits         4783   4797      +14     
- Misses        980    982       +2
Impacted Files Coverage Δ
src/triage/experiments/base.py 94.75% <100%> (-0.5%) ⬇️
src/triage/experiments/rq.py 87.23% <100%> (ø) ⬆️
src/triage/component/catwalk/model_trainers.py 94.61% <100%> (ø) ⬆️
src/triage/experiments/singlethreaded.py 100% <100%> (ø) ⬆️
src/triage/component/catwalk/__init__.py 92.59% <84.61%> (+1.96%) ⬆️
src/triage/experiments/multicore.py 75.53% <87.5%> (-0.03%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 041f568...4d71a61.

@ivanhigueram (Member) left a comment

Made some comments on class_path_implements_parallelization. I guess you can do your n_jobs argument trick if you are able to change that argument on the ScaledLogisticRegression in catwalk rather than using LogisticClassifier; the latter should never be used in our grids.

@@ -202,3 +203,37 @@ def save_db_objects(db_engine, db_objects):
)
f.seek(0)
postgres_copy.copy_from(f, type(db_objects[0]), db_engine, format="csv")


def class_path_implements_parallelization(class_path):
@ivanhigueram (Member) commented Jan 28, 2019

I was wondering if we can change this behavior. I think you can avoid the n_jobs problem by only targeting catwalk.estimators.classifiers.ScaledLogisticRegression instead of LogisticRegression (actually, we should strongly advise against using the latter).

If you remove the n_jobs argument from the SLR, I guess your trick will work.

A member commented:

What does @saleiro think about this?

A member commented:

🤔 ☝️

@saleiro (Member) commented Mar 6, 2019

I would refrain from hardcoding the large ones. We could check whether the user passed an 'n_jobs=-1' argument in the configuration parameters for the target class_path. That way this functionality would be equally compatible with xgboost or lightgbm, because they use the 'n_jobs' parameter too.
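
A rough sketch of that check, keyed off the configured parameters rather than the class path (the task structure is assumed from the diff excerpts in this thread, not a confirmed API):

def user_requested_internal_parallelism(task):
    """True if the user passed n_jobs=-1 for this class_path's parameters,
    which works for sklearn forests as well as xgboost/lightgbm."""
    parameters = task.get("train_kwargs", {}).get("parameters", {})
    return parameters.get("n_jobs") == -1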

import logging
import sys
from contextlib import contextmanager
import importlib
A member commented:

original ordering made sense 😛 🤷‍♂️


Returns: (bool) whether or not the classifier implements parallelization
"""
if 'RandomForest' in class_path or 'ExtraTrees' in class_path:
A member commented:

🤷‍♂️ or just:

Suggested change
if 'RandomForest' in class_path or 'ExtraTrees' in class_path:
return 'RandomForest' in class_path or 'ExtraTrees' in class_path
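
Applied to the whole helper, the suggestion would reduce it to something like this sketch (not the merged code):

def class_path_implements_parallelization(class_path):
    """Whether the classifier at this class path parallelizes internally."""
    return 'RandomForest' in class_path or 'ExtraTrees' in class_path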

src/triage/experiments/multicore.py: outdated review comment, resolved
@thcrock assigned saleiro and unassigned ivanhigueram on Feb 4, 2019
if "n_jobs" not in serial_task["train_kwargs"]["parameters"]:
logging.info("n_jobs not configured for serial task, "
"defaulting to -1 for entire machine")
serial_task["train_kwargs"]["parameters"]["n_jobs"] = -1
A member commented:

I'm guessing you don't feel it's important to create a copy of the dictionary rather than mutate it in place? (Just want to be sure it's called out.) Certainly I don't see a problem with this as it is, just noting the concern.
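
For reference, the non-mutating alternative being raised would look roughly like this (a sketch; with_default_n_jobs is a hypothetical helper, not code from this PR):

import copy

def with_default_n_jobs(serial_task, n_jobs=-1):
    # Deep-copy the task so the caller's dict is left untouched,
    # then fill in n_jobs only if it was not configured.
    task = copy.deepcopy(serial_task)
    task["train_kwargs"]["parameters"].setdefault("n_jobs", n_jobs)
    return task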

@saleiro (Member) left a comment

I want to suggest an optimization to this approach.
Train models in the following order:

  1. triage.component.catwalk.baselines
  2. scaled logistic regression, decision trees
  3. class paths with 'n_jobs=-1' passed by the user
  4. all the others (which include, for example, the super slow sklearn GradientBoostingClassifier and AdaBoost, or other learners not so relevant to DSaPP like k-nearest neighbors, so not so urgent...)

Train/test tasks are now implemented in batches, based on what results
people should be interested in first:

- Batch 1: Short and important ones, like Decision Trees, Scaled Logistic Regressions,
and baselines. These will be parallelized if using an Experiment
subclass that supports parallelization
- Batch 2: Big ones with n_jobs=-1. This will be run serially no matter
what the Experiment subclass is, because the classifier is expected to
parallelize and adding on another layer is likely to crash the
Experiment.
- Batch 3: All others, parallelized similarly to batch 1. These are ones
that might be expected to take a while to complete (Gradient Boosting,
forests without n_jobs=-1) and/or ones less likely to be effective.
)
return self.order_and_batch_tasks(train_test_tasks)

def order_and_batch_tasks(self, tasks):
A member commented:

Nice!
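
A minimal sketch of the three-batch ordering described above (the predicate helpers and the batch dict layout are illustrative, not the actual order_and_batch_tasks implementation):

def order_and_batch_tasks(tasks):
    def is_quick(task):
        class_path = task["class_path"]
        return any(name in class_path
                   for name in ("baselines", "DecisionTree", "ScaledLogisticRegression"))

    def parallelizes_internally(task):
        return task["train_kwargs"]["parameters"].get("n_jobs") == -1

    batch_one = [t for t in tasks if is_quick(t)]
    batch_two = [t for t in tasks if not is_quick(t) and parallelizes_internally(t)]
    batch_three = [t for t in tasks if not is_quick(t) and not parallelizes_internally(t)]
    return [
        {"tasks": batch_one, "parallelizable": True},    # short, important models first
        {"tasks": batch_two, "parallelizable": False},   # big n_jobs=-1 models, run serially
        {"tasks": batch_three, "parallelizable": True},  # everything else
    ]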

@saleiro merged commit 4c4f3ca into master on Mar 13, 2019
@saleiro deleted the big_classifiers_serially branch on March 13, 2019 at 21:19