Colocate Testing with Training [Resolves #560] #569

Merged: 6 commits from reorder_training into master, Jan 24, 2019
Conversation

@thcrock (Contributor) commented Jan 7, 2019

Converts the 'train task' into 'train/test task', so instead of training
a split all at once and then testing it all at once, each model is
trained and then immediately tested. Additionally, the tasks are
flattened instead of grouped by split.

A cache is added to the ModelStorageEngine so that the predictor can load the in-memory model if it's there. The ModelTrainTester manually uncaches the model once it's done testing. I think this should work, but it's worth some memory testing to make sure there are no terrible memory leaks. I didn't get time to test this part before vacation, so we should make sure to test it before merging.
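
For illustration only, the caching idea roughly takes this shape (class and method names here are hypothetical, not the actual catwalk API):

# Illustrative sketch only: the storage engine keeps each trained model
# in memory so the predictor can reuse it without re-reading the pickle,
# and the train/test loop evicts it once testing is finished.
class InMemoryModelCache:
    def __init__(self):
        self._models = {}

    def cache_model(self, model_id, model):
        self._models[model_id] = model

    def load_model(self, model_id):
        # Predictor path: prefer the in-memory copy; fall back to disk elsewhere.
        return self._models.get(model_id)

    def uncache_model(self, model_id):
        self._models.pop(model_id, None)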

@thcrock requested a review from saleiro, January 7, 2019 23:08
@jesteria (Member) left a comment:
Just some initial comments that might be helpful...

And just a thought -- not sure whether, the way this new cache is used, it would be possible to use an LRU policy or something similar for automatic (memory) management, for example.
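
(Purely for illustration, and assuming a hypothetical cache class rather than anything in catwalk, an LRU policy might look roughly like this:)

from collections import OrderedDict

# Hypothetical sketch of an LRU-bounded model cache: once more than
# `maxsize` models are held in memory, the least recently used one is
# evicted automatically instead of requiring an explicit uncache call.
class LRUModelCache:
    def __init__(self, maxsize=4):
        self.maxsize = maxsize
        self._models = OrderedDict()

    def put(self, model_id, model):
        self._models[model_id] = model
        self._models.move_to_end(model_id)
        if len(self._models) > self.maxsize:
            self._models.popitem(last=False)  # evict least recently used

    def get(self, model_id):
        model = self._models.get(model_id)
        if model is not None:
            self._models.move_to_end(model_id)  # mark as recently used
        return model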

store.matrix_type,
model_id
)
self.model_trainer.uncache_model(model_id)
Member:
It's not clear to me just by looking at this method where the model gets cached (I'm guessing it happens on each loop iteration?). Regardless, perhaps the method/block which caches and uses the model should be in a try, and this uncache expression in a finally? (For one, that might help protect against memory leaks.)
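
(A rough sketch of that shape, using names from this diff's context rather than exact code:)

# Hypothetical shape of the suggestion: testing sits inside the try, and
# the uncache is guaranteed to run even if prediction or evaluation raises.
model_id = self.model_trainer.process_train_task(**train_kwargs)
try:
    # ...generate predictions and evaluations for this model...
    pass
finally:
    self.model_trainer.uncache_model(model_id)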

Contributor (author):

It happens in model_trainer.process_train_task. We could certainly put a try/finally here.

Member:

and, (responding to other comment), it makes sense to not want to do automatic cache management.

it is a little ... unbalanced I guess ... that you uncache explicitly at this level, but the caching happens somewhere else. (for that matter, the uncache method is on the model trainer, but the caching happens in the storage.)

i wonder if it could be a little better encapsulated, by providing a simple context manager interface, (rather than just adding try/finally here).

so just for example in model_trainers.py:

@contextlib.contextmanager
def process_train_task(...):
    ...
    # rather than simply return model_id:
    try:
        yield model_id
    finally:
        self.uncache_model(model_id)

and then here, at least, you wouldn't have to add this knowledge about uncaching, and you could ensure it happens correctly by simply using the context manager:

with self.model_trainer.process_train_task(**train_kwargs) as model_id:
    ... # all that stuff

(furthermore, it kind of looked like the encapsulation would be even more complete by doing something similar but even lower -- on the storage class. but, just looking it over briefly, I'm not sure it would be as easy to provide a single, useful endpoint there. perhaps it would. but perhaps the above is sufficient.)
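
(And purely as a hypothetical sketch of that storage-level variant, assuming a simple cache dict on the storage engine:)

import contextlib

# Entirely hypothetical sketch of pushing the encapsulation down to the
# storage engine: the model lives in the in-memory cache only for the
# duration of the with-block, and is evicted even if testing raises.
class CachingModelStorage:
    def __init__(self):
        self._cache = {}

    @contextlib.contextmanager
    def cached(self, model_id, model):
        self._cache[model_id] = model
        try:
            yield model_id
        finally:
            self._cache.pop(model_id, None)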

Contributor (author):

Interesting! I'm going back and forth on whether the semantics of caching via a context manager like that make sense... and I think I like it. Maybe, in a similar vein, there could be a train_and_cache that has to be run as a context manager, and we could use that instead.

I definitely agree that having the caching and the uncaching on different layers is bad. I'll play around and get them onto the same layer with a context manager.
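
A rough sketch of that hypothetical train_and_cache (the two stub methods stand in for the real training and cache logic, which differ in catwalk):

import contextlib

# Hypothetical sketch: train_and_cache must be used as a context manager,
# so evicting the cached model can't be forgotten by the caller.
class ModelTrainer:
    def process_train_task(self, **train_kwargs):
        return 42  # stand-in for real training + caching; returns a model_id

    def uncache_model(self, model_id):
        pass  # stand-in for evicting the in-memory pickle

    @contextlib.contextmanager
    def train_and_cache(self, **train_kwargs):
        model_id = self.process_train_task(**train_kwargs)
        try:
            yield model_id
        finally:
            self.uncache_model(model_id)

# usage in the train/test loop:
#   with trainer.train_and_cache(**train_kwargs) as model_id:
#       ...test the model...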

@@ -151,4 +151,5 @@ def save(self, importance_records, model_id, as_of_date, method_name):
importance_score=float(importance_record["score"]),
)
db_objects.append(db_object)
print(len(db_objects))
Member:
stray print?

["profiling_stats"],
f"{int(time.time())}.profile"
)
cp.dump_stats(store.path)
Member:
this also appears to be mixed in from another branch/PR

@thcrock (Contributor) commented Jan 16, 2019

I'm not inclined to try to rely on automatic memory management here, because we know the exact scope in which the model pickle is needed.

@codecov-io commented Jan 18, 2019

Codecov Report

Merging #569 into master will decrease coverage by 0.59%.
The diff coverage is 94.17%.


@@            Coverage Diff            @@
##           master     #569     +/-   ##
=========================================
- Coverage   83.13%   82.53%   -0.6%     
=========================================
  Files          83       83             
  Lines        4755     4810     +55     
=========================================
+ Hits         3953     3970     +17     
- Misses        802      840     +38
Impacted Files Coverage Δ
src/triage/component/catwalk/utils.py 98.9% <100%> (+0.06%) ⬆️
src/triage/component/catwalk/predictors.py 96.39% <100%> (-0.19%) ⬇️
src/triage/component/catwalk/model_trainers.py 94.65% <100%> (+0.16%) ⬆️
src/triage/experiments/base.py 96.01% <100%> (+1.71%) ⬆️
src/triage/experiments/rq.py 86.36% <100%> (-0.88%) ⬇️
src/triage/experiments/multicore.py 74.41% <100%> (-1.93%) ⬇️
src/triage/experiments/singlethreaded.py 100% <100%> (ø) ⬆️
src/triage/component/catwalk/storage.py 92.4% <100%> (+0.55%) ⬆️
src/triage/component/catwalk/__init__.py 88.67% <88%> (-11.33%) ⬇️
src/triage/component/catwalk/model_testers.py 0% <0%> (-83.73%) ⬇️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d617d7...4797f9f. Read the comment docs.

@saleiro (Member) left a comment:

I'm happy with using this branch.
For an external reason I had to resume train/test for models that had no predictions/evaluations but had been successfully trained in a previous run (so the pickle existed); it flawlessly resumed testing for each of those models and then resumed training/testing for the models that hadn't been trained yet.

@thcrock (Contributor) commented Jan 24, 2019

I also tested this with dirtyduck. The memory usage was about 20% less, and time taken about 10% less. Nothing drastic, but at the very least the caching doesn't seem to be introducing a memory leak.

@thcrock merged commit b8b56f1 into master Jan 24, 2019
@thcrock deleted the reorder_training branch January 24, 2019 17:34