Colocate Testing with Training [Resolves #560] #569

Merged: 6 commits from reorder_training into master, Jan 24, 2019
Conversation

@thcrock (Contributor) commented Jan 7, 2019

Converts the 'train task' into 'train/test task', so instead of training
a split all at once and then testing it all at once, each model is
trained and then immediately tested. Additionally, the tasks are
flattened instead of grouped by split.

A cache is added to the ModelStorageEngine so that the predictor can load the in-memory model if it's there. The ModelTrainTester manually uncaches the model once it's done testing. I think this should work, but it's worth some memory testing to make sure there are no terrible memory leaks. I didn't get time to test this part before vacation, so we should make sure to test it before merging.
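
For illustration only, the caching idea roughly takes this shape (class and method names here are hypothetical, not the actual catwalk API):

# Illustrative sketch only: the storage engine keeps each trained model
# in memory so the predictor can reuse it without re-reading the pickle,
# and the train/test loop evicts it once testing is finished.
class InMemoryModelCache:
    def __init__(self):
        self._models = {}

    def cache_model(self, model_id, model):
        self._models[model_id] = model

    def load_model(self, model_id):
        # Predictor path: prefer the in-memory copy; fall back to disk elsewhere.
        return self._models.get(model_id)

    def uncache_model(self, model_id):
        self._models.pop(model_id, None)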

@thcrock requested a review from saleiro, January 7, 2019 23:08
@jesteria (Member) left a comment:
Just some initial comments that might be helpful...

And just a thought -- not sure whether, the way this new cache is used, it would be possible to use an LRU policy or something similar for automatic (memory) management, for example.
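
(Purely for illustration, and assuming a hypothetical cache class rather than anything in catwalk, an LRU policy might look roughly like this:)

from collections import OrderedDict

# Hypothetical sketch of an LRU-bounded model cache: once more than
# `maxsize` models are held in memory, the least recently used one is
# evicted automatically instead of requiring an explicit uncache call.
class LRUModelCache:
    def __init__(self, maxsize=4):
        self.maxsize = maxsize
        self._models = OrderedDict()

    def put(self, model_id, model):
        self._models[model_id] = model
        self._models.move_to_end(model_id)
        if len(self._models) > self.maxsize:
            self._models.popitem(last=False)  # evict least recently used

    def get(self, model_id):
        model = self._models.get(model_id)
        if model is not None:
            self._models.move_to_end(model_id)  # mark as recently used
        return model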

store.matrix_type,
model_id
)
self.model_trainer.uncache_model(model_id)
Member:
It's not clear to me just by looking at this method where the model gets cached (I'm guessing it happens on each loop iteration?). Regardless, perhaps the method/block which caches and uses the model should be in a try, and this uncache expression in a finally? (For one, that might help protect against memory leaks.)
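
(A rough sketch of that shape, using names from this diff's context rather than exact code:)

# Hypothetical shape of the suggestion: testing sits inside the try, and
# the uncache is guaranteed to run even if prediction or evaluation raises.
model_id = self.model_trainer.process_train_task(**train_kwargs)
try:
    # ...generate predictions and evaluations for this model...
    pass
finally:
    self.model_trainer.uncache_model(model_id)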

Contributor (author):

It happens in model_trainer.process_train_task. We could certainly put a try/finally here.

Member:

and, (responding to other comment), it makes sense to not want to do automatic cache management.

it is a little ... unbalanced I guess ... that you uncache explicitly at this level, but the caching happens somewhere else. (for that matter, the uncache method is on the model trainer, but the caching happens in the storage.)

i wonder if it could be a little better encapsulated, by providing a simple context manager interface, (rather than just adding try/finally here).

so just for example in model_trainers.py:

@contextlib.contextmanager
def process_train_task(...):
    ...
    # rather than simply return model_id:
    try:
        yield model_id
    finally:
        self.uncache_model(model_id)

and then here, at least, you wouldn't have to add this knowledge about uncaching, and you could ensure it happens correctly by simply using the context manager:

with self.model_trainer.process_train_task(**train_kwargs) as model_id:
    ... # all that stuff

(furthermore, it kind of looked like the encapsulation would be even more complete by doing something similar but even lower -- on the storage class. but, just looking it over briefly, I'm not sure it would be as easy to provide a single, useful endpoint there. perhaps it would. but perhaps the above is sufficient.)
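
(And purely as a hypothetical sketch of that storage-level variant, assuming a simple cache dict on the storage engine:)

import contextlib

# Entirely hypothetical sketch of pushing the encapsulation down to the
# storage engine: the model lives in the in-memory cache only for the
# duration of the with-block, and is evicted even if testing raises.
class CachingModelStorage:
    def __init__(self):
        self._cache = {}

    @contextlib.contextmanager
    def cached(self, model_id, model):
        self._cache[model_id] = model
        try:
            yield model_id
        finally:
            self._cache.pop(model_id, None)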

Contributor (author):

Interesting! I'm going back and forth on whether the semantics of caching via a context manager like that make sense... and I think I like it. Maybe, in a similar vein, there could be a train_and_cache that has to be run as a context manager, and we could use that instead.

I definitely agree that having the caching and the uncaching on different layers is bad. I'll play around and get them onto the same layer with a context manager.
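
A rough sketch of that hypothetical train_and_cache (the two stub methods stand in for the real training and cache logic, which differ in catwalk):

import contextlib

# Hypothetical sketch: train_and_cache must be used as a context manager,
# so evicting the cached model can't be forgotten by the caller.
class ModelTrainer:
    def process_train_task(self, **train_kwargs):
        return 42  # stand-in for real training + caching; returns a model_id

    def uncache_model(self, model_id):
        pass  # stand-in for evicting the in-memory pickle

    @contextlib.contextmanager
    def train_and_cache(self, **train_kwargs):
        model_id = self.process_train_task(**train_kwargs)
        try:
            yield model_id
        finally:
            self.uncache_model(model_id)

# usage in the train/test loop:
#   with trainer.train_and_cache(**train_kwargs) as model_id:
#       ...test the model...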

@@ -151,4 +151,5 @@ def save(self, importance_records, model_id, as_of_date, method_name):
importance_score=float(importance_record["score"]),
)
db_objects.append(db_object)
print(len(db_objects))
Member:
stray print?

["profiling_stats"],
f"{int(time.time())}.profile"
)
cp.dump_stats(store.path)
Member:
this also appears to be mixed in from another branch/PR

@thcrock (Contributor) commented Jan 16, 2019

I'm not inclined to try to rely on automatic memory management here, because we know the exact scope in which the model pickle is needed.

@codecov-io commented Jan 18, 2019

Codecov Report

Merging #569 into master will decrease coverage by 0.59%.
The diff coverage is 94.17%.


@@            Coverage Diff            @@
##           master     #569     +/-   ##
=========================================
- Coverage   83.13%   82.53%   -0.6%     
=========================================
  Files          83       83             
  Lines        4755     4810     +55     
=========================================
+ Hits         3953     3970     +17     
- Misses        802      840     +38
Impacted Files Coverage Δ
src/triage/component/catwalk/utils.py 98.9% <100%> (+0.06%) ⬆️
src/triage/component/catwalk/predictors.py 96.39% <100%> (-0.19%) ⬇️
src/triage/component/catwalk/model_trainers.py 94.65% <100%> (+0.16%) ⬆️
src/triage/experiments/base.py 96.01% <100%> (+1.71%) ⬆️
src/triage/experiments/rq.py 86.36% <100%> (-0.88%) ⬇️
src/triage/experiments/multicore.py 74.41% <100%> (-1.93%) ⬇️
src/triage/experiments/singlethreaded.py 100% <100%> (ø) ⬆️
src/triage/component/catwalk/storage.py 92.4% <100%> (+0.55%) ⬆️
src/triage/component/catwalk/__init__.py 88.67% <88%> (-11.33%) ⬇️
src/triage/component/catwalk/model_testers.py 0% <0%> (-83.73%) ⬇️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d617d7...4797f9f. Read the comment docs.

@saleiro (Member) left a comment:

I'm happy with using this branch.
For an external reason I had to resume train/test for models that had no predictions/evaluations but had been successfully trained in a previous run (so the pickle existed); it flawlessly resumed testing for each of those models and then resumed training/testing for the models that hadn't been trained yet.

@thcrock (Contributor) commented Jan 24, 2019

I also tested this with dirtyduck. The memory usage was about 20% less, and time taken about 10% less. Nothing drastic, but at the very least the caching doesn't seem to be introducing a memory leak.

@thcrock merged commit b8b56f1 into master Jan 24, 2019
@thcrock deleted the reorder_training branch January 24, 2019 17:34