Colocate Testing with Training [Resolves #560] #569
Conversation
Just some initial comments that might be helpful...

And just a thought -- given the way this new cache is used, I'm not sure whether it would be possible to use an LRU policy or something similar for automatic (memory) management, for example.
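To make the LRU suggestion concrete, here is a minimal sketch of what such automatic memory management could look like -- a bounded cache of unpickled models that evicts the least recently used entry. All names here are illustrative, not part of the triage codebase:

```python
from collections import OrderedDict


class LRUModelCache:
    """Hypothetical LRU cache for unpickled models.

    max_entries bounds how many models are held in memory; when the
    bound is exceeded, the least recently used entry is evicted.
    """

    def __init__(self, max_entries=4):
        self.max_entries = max_entries
        self._cache = OrderedDict()

    def get(self, model_id):
        model = self._cache.pop(model_id, None)
        if model is not None:
            # re-insert to mark this entry as most recently used
            self._cache[model_id] = model
        return model

    def put(self, model_id, model):
        self._cache.pop(model_id, None)
        self._cache[model_id] = model
        while len(self._cache) > self.max_entries:
            # popitem(last=False) removes the oldest (least recently used) entry
            self._cache.popitem(last=False)
```

With a policy like this, no explicit uncache call is needed; the trade-off is that eviction timing is no longer tied to the exact scope in which the pickle is used.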
    store.matrix_type,
    model_id
)
self.model_trainer.uncache_model(model_id)
It's not clear to me just by looking at this method where the model gets cached (I'm guessing it happens in each loop iteration?). Regardless, perhaps the method/block which caches and uses the model should be in a `try`, and this uncache expression in a `finally`? (For one, that might help protect against memory leaks.)
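The try/finally shape being suggested would look roughly like the sketch below. The `ModelTrainer` stand-in and the `train_and_test` helper are hypothetical names used only to demonstrate the pattern; in the real code the caching happens inside `model_trainer.process_train_task`:

```python
class ModelTrainer:
    """Minimal stand-in trainer, just enough to show the pattern."""

    def __init__(self):
        self.cache = {}

    def process_train_task(self):
        model_id = 1
        self.cache[model_id] = object()  # stands in for the cached model pickle
        return model_id

    def uncache_model(self, model_id):
        self.cache.pop(model_id, None)


def train_and_test(trainer, test_fn):
    model_id = trainer.process_train_task()
    try:
        test_fn(model_id)
    finally:
        # runs even if testing raises, so the cached model
        # is always released and cannot leak
        trainer.uncache_model(model_id)
```

Even if `test_fn` raises midway through testing, the `finally` clause guarantees the model is uncached.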
It happens in model_trainer.process_train_task. We could certainly put a try/finally here.
And, responding to the other comment: it makes sense not to want automatic cache management.

It is a little unbalanced, I guess, that you uncache explicitly at this level, but the caching happens somewhere else. (For that matter, the uncache method is on the model trainer, but the caching happens in the storage.)

I wonder if it could be better encapsulated by providing a simple context manager interface, rather than just adding try/finally here.
So, just for example, in model_trainers.py:
@contextlib.contextmanager
def process_train_task(...):
    ...
    # rather than simply return model_id:
    try:
        yield model_id
    finally:
        self.uncache_model(model_id)
And then here, at least, you wouldn't have to add this knowledge about uncaching, and you could ensure it happens correctly simply by using the context manager:
with self.model_trainer.process_train_task(**train_kwargs) as model_id:
    ...  # all that stuff
(Furthermore, it kind of looked like the encapsulation would be even more complete by doing something similar but even lower down -- on the storage class. But, just looking it over briefly, I'm not sure it would be as easy to provide a single, useful endpoint there. Perhaps so; but perhaps the above is sufficient.)
Interesting! I'm going back and forth on whether the semantics of the context manager caching like that make sense... and I think I like it. Maybe in a similar vein there could be a train_and_cache method that has to be run as a context manager, and we could use that instead.
I definitely agree that having the caching and the uncaching on different layers is bad. I'll play around and get them on the same layer with a context manager.
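A runnable sketch of what such a train_and_cache context manager could look like is below. The class and method names here (other than uncache_model and the contextmanager pattern quoted earlier in the thread) are hypothetical, not the actual triage API:

```python
import contextlib


class ModelTrainer:
    """Stand-in trainer; train_and_cache is a hypothetical method name."""

    def __init__(self):
        self.cache = {}

    def _train(self):
        model_id = 42
        self.cache[model_id] = object()  # stands in for the cached model
        return model_id

    def uncache_model(self, model_id):
        self.cache.pop(model_id, None)

    @contextlib.contextmanager
    def train_and_cache(self):
        model_id = self._train()
        try:
            # the model stays cached only for the duration of the with-block
            yield model_id
        finally:
            self.uncache_model(model_id)
```

The appeal of this shape is that caching and uncaching live on the same layer, and a caller cannot forget the uncache step: leaving the `with` block releases the model automatically, even on error.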
@@ -151,4 +151,5 @@ def save(self, importance_records, model_id, as_of_date, method_name):
        importance_score=float(importance_record["score"]),
    )
    db_objects.append(db_object)
    print(len(db_objects))
stray print?
src/triage/experiments/base.py (outdated)
    ["profiling_stats"],
    f"{int(time.time())}.profile"
)
cp.dump_stats(store.path)
This also appears to be mixed in from another branch/PR.
I'm not inclined to try to rely on automatic memory management here, because we know the exact scope in which the model pickle is needed.
Converts the 'train task' into 'train/test task', so instead of training a split all at once and then testing it all at once, each model is trained and then immediately tested. Additionally, the tasks are flattened instead of grouped by split
Codecov Report

@@           Coverage Diff            @@
##           master     #569    +/-   ##
=========================================
- Coverage   83.13%   82.53%    -0.6%
=========================================
  Files          83       83
  Lines        4755     4810     +55
=========================================
+ Hits         3953     3970     +17
- Misses        802      840     +38

Continue to review full report at Codecov.
I'm happy with using this branch.

Due to an external reason I had to resume train/test for models where there were no predictions/evaluations but the model had been successfully trained in a previous run, so the pickle existed. It flawlessly resumed testing for each of those models and then resumed training/testing for the models that hadn't been trained.

I also tested this with dirtyduck. Memory usage was about 20% lower and time taken about 10% lower. Nothing drastic, but at the very least the caching doesn't seem to be introducing a memory leak.
A cache is added to the ModelStorageEngine so that the predictor can load the in-memory model if it's there. The ModelTrainTester manually uncaches the model once it's done testing. I think this should work, but it's worth some memory testing to make sure there are no terrible memory leaks. I didn't get time to test this part before vacation, so we should make sure to test it before merging.
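For illustration, the storage-engine side of this design might be sketched as below: a load path that prefers the in-memory cache and falls back to the persisted pickle, plus an explicit uncache. This is a simplified stand-in, not the real triage ModelStorageEngine; a dict stands in for disk/S3 storage:

```python
import pickle


class ModelStorageEngine:
    """Illustrative sketch of a store whose load path prefers an
    in-memory cache over the persisted (pickled) copy."""

    def __init__(self):
        self._cache = {}      # model_id -> unpickled model object
        self._persisted = {}  # model_id -> pickled bytes (stands in for disk)

    def write(self, model, model_id, cache=True):
        self._persisted[model_id] = pickle.dumps(model)
        if cache:
            self._cache[model_id] = model

    def load(self, model_id):
        if model_id in self._cache:
            # cache hit: skip deserialization entirely
            return self._cache[model_id]
        return pickle.loads(self._persisted[model_id])

    def uncache(self, model_id):
        # drop only the in-memory copy; the persisted pickle remains loadable
        self._cache.pop(model_id, None)
```

The key property to memory-test is the one the last comment raises: after uncache, the only live reference to the model should be the persisted bytes, so memory does not grow across splits.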