Training Loss Evaluation #152
Conversation
Also, the configuration should be updated (saying that there is now this other evaluation type).
This reverts commit 7e5626d.
parsing of regex does not work correctly yet
@rgemulla All comments addressed in the latest commit.
@rgemulla Please wait with merging until I add sum_loss to tracing.
@rgemulla I've added sum_loss tracing to the trainers and to the training loss eval job. Two things:
…ions. parameters are grouped by regex expressions.
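The regex-based grouping mentioned in this commit could look roughly like the sketch below. This is an illustrative reconstruction, not the actual kge implementation: the function name, the `"default"` group, and the example parameter names are all made up for demonstration.

```python
import re
from typing import Dict, List


def group_parameters_by_regex(
    parameter_names: List[str],
    patterns: List[str],
) -> Dict[str, List[str]]:
    """Assign each parameter name to the first regex pattern it matches.

    Unmatched parameters fall into a hypothetical "default" group.
    """
    groups: Dict[str, List[str]] = {p: [] for p in patterns}
    groups["default"] = []
    for name in parameter_names:
        for pattern in patterns:
            if re.search(pattern, name):
                groups[pattern].append(name)
                break
        else:  # no pattern matched
            groups["default"].append(name)
    return groups


# Example with invented parameter names:
params = ["entity_embedder.weight", "relation_embedder.weight", "bias"]
print(group_parameters_by_regex(params, [r"embedder"]))
# {'embedder': ['entity_embedder.weight', 'relation_embedder.weight'], 'default': ['bias']}
```

Each regex group could then be mapped to its own optimizer parameter group (e.g., with a separate learning rate), which is what per-group `param_groups` in a PyTorch optimizer are designed for.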
@rgemulla Ok, num_examples was there but called size. I renamed it to num_examples because I think that makes more sense than the notion of an "epoch size". We do call it size at the batch level, where that of course makes sense. If you want, I can revert this.

Also, I'm not sure which keys to include in the trace produced by this eval job. For now I manually add avg_loss and num_examples from the internal training job. When adding everything, some keys are overwritten, e.g., the event key, the type key, the epoch number key, etc. This also causes an error during the trace event, e.g., because there is already a key for job_id coming from the already traced entry of the training job. We need to decide either which keys from the train job to include (avg cost too? avg penalty?) or which keys to exclude to prevent errors and overwritten keys. Ideally, all relevant keys from the train job would be available to design custom metrics.
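The key-collision problem described in this comment can be sketched as follows. All names here are illustrative assumptions (the reserved-key set, the function name, and the example entries are not from the kge codebase): the idea is to copy keys from the training job's trace entry into the eval job's entry only when they neither collide with reserved bookkeeping keys nor overwrite existing values.

```python
# Keys assumed to be owned by the eval job's own trace entry (illustrative set).
RESERVED_KEYS = {"event", "type", "epoch", "job_id", "job"}


def merge_train_trace(eval_entry: dict, train_entry: dict) -> dict:
    """Copy non-reserved, non-conflicting keys from the train trace entry."""
    merged = dict(eval_entry)
    for key, value in train_entry.items():
        if key not in RESERVED_KEYS and key not in merged:
            merged[key] = value
    return merged


eval_entry = {"event": "eval_completed", "job_id": "eval-1", "epoch": 3}
train_entry = {
    "event": "train_completed",  # would otherwise overwrite the eval event
    "job_id": "train-7",         # would otherwise raise/overwrite job_id
    "epoch": 3,
    "avg_loss": 0.42,
    "num_examples": 1000,
}
print(merge_train_trace(eval_entry, train_entry))
# {'event': 'eval_completed', 'job_id': 'eval-1', 'epoch': 3,
#  'avg_loss': 0.42, 'num_examples': 1000}
```

Whether to use an include-list (as eventually suggested) or an exclude-list like this is exactly the design decision raised above.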
@rgemulla Tracing done as suggested.
Other than this, looks good to me. Please add an entry into the CHANGELOG before merging.
kge/job/train.py (outdated)
@@ -288,9 +312,12 @@ def run_epoch(self) -> Dict[str, Any]:
     epoch=self.epoch,
     split=self.train_split,
     batches=len(self.loader),
-    size=self.num_examples,
     lr=[group["lr"] for group in self.optimizer.param_groups],
+    num_examples=self.num_examples,
Since we use "size" everywhere else (e.g., for the batch size), I suggest continuing to use "size" here.
@rgemulla Fixed!
Alright, please add the CHANGELOG and merge |
…ation' into training_loss_evaluation
This PR adds an evaluation job whose metric is simply the training loss. Currently, only computing the loss on the training split is supported. This is part of the following PR:
#99
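Conceptually, the metric this PR adds reduces to an average of the summed training loss over the number of training examples. The sketch below is a minimal illustration under that assumption; the function and the trace keys `sum_loss` and `num_examples` mirror names discussed in the conversation, but the function itself is hypothetical, not the kge API.

```python
def training_loss_metric(trace_entry: dict) -> float:
    """Average training loss from a trace entry containing the summed loss
    over the epoch and the number of training examples (assumed keys)."""
    return trace_entry["sum_loss"] / trace_entry["num_examples"]


# Example with made-up numbers:
print(training_loss_metric({"sum_loss": 420.0, "num_examples": 1000}))  # 0.42
```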