Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Training Loss Evaluation #152

Merged
merged 24 commits into from
Oct 14, 2020
Merged

Training Loss Evaluation #152

merged 24 commits into from
Oct 14, 2020

Conversation

rufex2001
Copy link
Member

This PR adds an evaluation job that is simply the training loss. Currently, support is only for computing the loss on the training split. This is part of the following PR:

#99

kge/job/train.py Show resolved Hide resolved
kge/job/train.py Outdated Show resolved Hide resolved
kge/job/train.py Outdated Show resolved Hide resolved
kge/job/train.py Outdated Show resolved Hide resolved
kge/job/train.py Outdated Show resolved Hide resolved
kge/job/train.py Outdated Show resolved Hide resolved
@rgemulla
Copy link
Member

rgemulla commented Oct 7, 2020

Also, the configuraiton should be updated (saying that there is now this other evaluation type).

@rufex2001
Copy link
Member Author

@rgemulla All comments addressed in latest commit

@rufex2001
Copy link
Member Author

@rgemulla Please wait until I add sum_loss to tracing for merging

@rufex2001
Copy link
Member Author

@rgemulla I've added sum loss tracing to trainers and the training loss eval job. Two things:

  1. Should we do this as simple as it is how? Or should we add this also at the batch trace level? This would mean trainers add sum_loss to their results object.
  2. The solution to preparing the train job when creating it as an eval job (eval.py line 172) calls a private method from outside. Not sure this is what you meant, but this is partly why I didn't do it like this directly, but I forgot to bring it up (until now).

…ions.

parameters are grouped by regex expressions.
kge/job/eval.py Outdated Show resolved Hide resolved
kge/job/eval.py Outdated Show resolved Hide resolved
kge/job/eval.py Outdated Show resolved Hide resolved
kge/job/eval.py Show resolved Hide resolved
kge/job/train.py Outdated Show resolved Hide resolved
@rufex2001
Copy link
Member Author

@rgemulla Ok, num_examples was there but called size. Renamed it to num_examples because I think that makes more sense than the notion of "epoch size". We do call it size at the batch level, which of course makes sense. If you want, I can revert this.

Also, not sure what keys to include in the trace produced by this eval job. I manually add avg_loss and num_examples from the internal training job for now. When adding everything, some things are overwritten, e.g. the event key, type key, epoch number key, etc. Also, this causes an error during the trace event, e.g. because there is already a key for job_id that comes from the already traced entry from the training job. We need to either decide which keys from the train job to include (avg cost too? avg penalty?), or which keys to exclude to prevent errors and overwritten keys. Ideally, it would be nice if all relevant keys from the train job were available to design custom metrics.

@rufex2001
Copy link
Member Author

@rgemulla Tracing done as suggested.

Copy link
Member

@rgemulla rgemulla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than this, looks good to me. Please add an entry into the CHANGELOG before merging.

kge/job/train.py Outdated
@@ -288,9 +312,12 @@ def run_epoch(self) -> Dict[str, Any]:
epoch=self.epoch,
split=self.train_split,
batches=len(self.loader),
size=self.num_examples,
lr=[group["lr"] for group in self.optimizer.param_groups],
num_examples=self.num_examples,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we use size everywhere else (e.g., for the batch size), I suggest to continue using "size" here.

@rufex2001
Copy link
Member Author

@rgemulla Fixed!

@rgemulla
Copy link
Member

Alright, please add the CHANGELOG and merge

@rufex2001 rufex2001 merged commit 54257fb into master Oct 14, 2020
@rufex2001 rufex2001 deleted the training_loss_evaluation branch October 14, 2020 12:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants