Synchronize model checkpointing and servable model saving #29
Comments
@joein WDYT?
Previously we loaded from the best checkpoint after

At first sight I'd rather not save two versions of checkpoints simultaneously: it may be space-consuming and also makes a little mess. Maybe we can receive
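One way to avoid writing a second copy of the checkpoint is to receive the user's checkpoint callback and reuse its bookkeeping: after training, PyTorch Lightning's `ModelCheckpoint` exposes the path of the best checkpoint it wrote via `best_model_path`. A minimal stdlib-only sketch of that idea (the `FakeModelCheckpoint` stand-in and the `make_servable` helper are hypothetical, not Quaterion API):

```python
class FakeModelCheckpoint:
    """Stdlib stand-in for pl.callbacks.ModelCheckpoint, which records the
    path of the best checkpoint it has written in `best_model_path`."""

    def __init__(self):
        self.best_model_path = ""
        self.best_score = float("-inf")

    def on_validation_end(self, score, path):
        # Keep only the path of the best-scoring checkpoint seen so far.
        if score > self.best_score:
            self.best_score, self.best_model_path = score, path


def make_servable(checkpoint_callback):
    """Hypothetical helper: reuse the callback's bookkeeping instead of
    saving a second checkpoint version alongside it."""
    if not checkpoint_callback.best_model_path:
        raise ValueError("no checkpoint has been saved yet")
    return f"servable built from {checkpoint_callback.best_model_path}"


cb = FakeModelCheckpoint()
cb.on_validation_end(0.7, "epoch1.ckpt")
cb.on_validation_end(0.9, "epoch2.ckpt")
cb.on_validation_end(0.8, "epoch3.ckpt")
print(make_servable(cb))  # built from the best epoch, not the last one
```

The point of the sketch is that the servable is derived from the single set of files the callback already manages, so nothing is stored twice.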
Ok, so my proposal is as follows:
I think such a design will
WDYT?
I think that the addition of

I like the overall idea with a callback and a separate path for the servable; we can try to develop this idea further.
Good point. But they still have the option to pass
I guess we can inspect the attributes of the user-defined

Or, another alternative could be:
Pros:
Cons:
A user can break saving unintentionally, e.g. if they override
Yes, we can; also, we have a problem here with multiple
It does not seem right to me to save the best checkpoint when the user does not specify this via
Yes, maybe, but I think I'd like to avoid it unless we definitely can't survive without it.
We can, but

In fact, I'm for sensible defaults whenever possible.
Yes, I support the overall idea, and I also support sensible defaults. Let's give it some time.
After some discussion with @joein yesterday, the current proposal is as follows:
There is one more problem, I guess. To restore the full model state from a checkpoint we need to call

Actually, it will be

```python
save_servable(
    path: str,
    checkpoint_callback: ModelCheckpoint = None,
    strategy: SaveStrategy = SaveStrategy.AUTO,
    model_params: Optional[Dict] = None,
)
```

where

It's quite a straightforward approach, but also the point where it starts looking clumsy to me.

[UPDATE]:
[UPDATE]:
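The `SaveStrategy` referenced in the proposed signature is not defined anywhere in the thread; below is one possible stdlib sketch of what such an enum and its `AUTO` resolution could look like. Everything except the `AUTO` member name is an assumption, and `resolve_strategy` is a hypothetical helper:

```python
from enum import Enum


class SaveStrategy(Enum):
    """Sketch of the strategy argument from the proposed save_servable()
    signature; only AUTO appears in the thread, the rest are assumptions."""

    AUTO = "auto"        # decide based on whether a checkpoint callback exists
    BEST = "best"        # restore the best checkpoint before saving a servable
    CURRENT = "current"  # save the current in-memory state as-is


def resolve_strategy(strategy, has_checkpoint_callback):
    """Hypothetical AUTO resolution: prefer the best checkpoint when a
    ModelCheckpoint callback was passed, otherwise fall back to the
    current in-memory weights."""
    if strategy is SaveStrategy.AUTO:
        return SaveStrategy.BEST if has_checkpoint_callback else SaveStrategy.CURRENT
    return strategy


print(resolve_strategy(SaveStrategy.AUTO, True).value)
```

A design like this keeps the common call `save_servable(path)` simple while still letting advanced users force a specific behavior.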
We don't need it. It's fine with the following inside:

```python
def save_servable(...):
    # ...
    self.load_state_dict(torch.load(checkpoint_path)["state_dict"])
    # we restored the state from the given checkpoint, now save it as a servable
```
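The restore-before-save step relies on the fact that a Lightning checkpoint is a dictionary whose `"state_dict"` key holds the model weights. The round trip can be demonstrated with a stdlib-only stand-in (`TinyModel` is hypothetical, and `pickle` stands in for `torch.save`/`torch.load`):

```python
import os
import pickle
import tempfile


class TinyModel:
    """Stand-in for a TrainableModel; real code would use torch.nn.Module."""

    def __init__(self):
        self.weights = {"layer.weight": [0.0]}

    def state_dict(self):
        return dict(self.weights)

    def load_state_dict(self, state):
        self.weights = dict(state)


# A Lightning-style checkpoint: a dict whose "state_dict" key holds the weights.
checkpoint = {"epoch": 3, "state_dict": {"layer.weight": [0.42]}}
path = os.path.join(tempfile.mkdtemp(), "best.ckpt")
with open(path, "wb") as f:
    pickle.dump(checkpoint, f)  # stand-in for torch.save

model = TinyModel()
with open(path, "rb") as f:
    # Mirrors self.load_state_dict(torch.load(checkpoint_path)["state_dict"])
    model.load_state_dict(pickle.load(f)["state_dict"])

print(model.state_dict()["layer.weight"])
```

After the load, the in-memory model matches the best checkpoint, so saving it as a servable no longer risks exporting an overfitted final state.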
LGTM
After giving it some thought I decided that this functionality is not something our framework should be responsible for. |
Currently, `TrainableModel.save_servable()` is called by the user at the end of the training loop. This is problematic because we may end up saving an overfitted state of the model even if we are trying to monitor an evaluation metric with `pl.callbacks.ModelCheckpoint`. So we need to come up with a way to synchronize both.

Possible solution

- Use `ModelCheckpoint` inside `quaterion` for synchronization.
- Have `Quaterion.fit` automatically save servable checkpoints to the specified directory with a specified interval.
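The failure mode described above can be sketched without any framework code: the weights in memory at the end of training are the last epoch's, while the checkpoint callback tracks the best epoch, and the two can diverge. A stdlib-only sketch, where `Checkpointer`, `fit`, and the weight dicts are all hypothetical stand-ins:

```python
import copy


class Checkpointer:
    """Tracks the best weights seen so far, the way ModelCheckpoint tracks
    the best checkpoint on disk (hypothetical stdlib stand-in)."""

    def __init__(self):
        self.best_score = float("-inf")
        self.best_weights = None

    def update(self, weights, score):
        if score > self.best_score:
            self.best_score = score
            self.best_weights = copy.deepcopy(weights)


def fit(epoch_results, checkpointer):
    """epoch_results: list of (weights, val_metric) pairs, one per epoch."""
    weights = None
    for weights, score in epoch_results:
        checkpointer.update(weights, score)
    return weights  # final in-memory state, possibly overfitted


ckpt = Checkpointer()
# Validation metric peaks at epoch 2 and then degrades (overfitting).
history = [({"w": 0.1}, 0.70), ({"w": 0.5}, 0.92), ({"w": 0.9}, 0.85)]
final_weights = fit(history, ckpt)

naive_servable = final_weights    # saving at the end of the loop: last epoch
synced_servable = ckpt.best_weights  # synchronized with the monitor: best epoch
print(naive_servable, synced_servable)
```

Here the naive end-of-loop save exports the epoch-3 weights even though the monitored metric identified epoch 2 as best, which is exactly the mismatch the proposal aims to eliminate.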