
feat: Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever training with sentence-transformers #3164

Merged: 4 commits into main on Sep 12, 2022

Conversation

@bglearning (Contributor) commented Sep 5, 2022

Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever training with sentence-transformers.

Related Issues

  • Closes #3136 (Add MultipleNegativesRankingLoss for EmbeddingRetriever training with sentence-transformers)

Proposed Changes:

Expose a train_loss option on EmbeddingRetriever.train. It is propagated to _SentenceTransformersEmbeddingEncoder.train and determines the loss used for training. It also determines the expected training data format: 'mnrl' needs only question and positive-document pairs, while 'margin_mse' additionally requires a negative document and a score.

So the API is:

from haystack.nodes import EmbeddingRetriever

embedding_retriever = EmbeddingRetriever(...)
# For 'mnrl', each example needs only a question and a positive document
training_data = [{"question": ..., "pos_doc": ...}, ...]
# The loss is specified when calling `train`. Current options: 'mnrl' (default) or 'margin_mse'
embedding_retriever.train(training_data=training_data, train_loss="mnrl")

How did you test it?

A Colab notebook with a sample run, plus a qualitative comparison against the baseline model.

Notes for the reviewer

Some aspects that may be up for discussion:

  • I exposed the loss option as a string.
    • Another option could be to accept sentence-transformers loss objects directly (we might even allow any loss from that library as long as the data format matches, but that would be a big jump). It could also tie us to keeping that interface, which may or may not be a problem. With a string, we retain the flexibility to swap out the internals later.
    • We could also define our own enums, or better yet, thin loss classes that wrap the sentence-transformers losses (see the sketch after this list). Probably not needed at this stage, though.
  • It is slightly tricky to surface the list of possible losses to users. The EmbeddingRetriever docstring seemed like the best place, so that's what I went with.
  • Are any unit or integration tests needed? Any other experiments?
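
As a rough illustration of the thin-wrapper idea in the second sub-bullet, here is a minimal sketch, assuming sentence-transformers is installed; the names _TrainingLoss and _LOSSES are hypothetical and not part of this PR:

from dataclasses import dataclass

from sentence_transformers import losses

# Hypothetical wrapper: pairs a sentence-transformers loss with the
# training-data fields it expects (names are illustrative only)
@dataclass
class _TrainingLoss:
    loss_class: type
    required_fields: tuple

_LOSSES = {
    "mnrl": _TrainingLoss(losses.MultipleNegativesRankingLoss, ("question", "pos_doc")),
    "margin_mse": _TrainingLoss(losses.MarginMSELoss, ("question", "pos_doc", "neg_doc", "score")),
}

train could then look up the string, validate the incoming examples against required_fields, and instantiate loss_class with the underlying SentenceTransformer model.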

Commit: Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever training with sentence-transformers
@bglearning force-pushed the 3136-multiple-negatives-ranking-loss branch from 86b32a1 to b72788d on September 6, 2022 12:52
@bglearning marked this pull request as ready for review on September 6, 2022 17:52
@bglearning requested review from a team as code owners on September 6, 2022 17:52
@bglearning requested review from bogdankostic and vblagoje and removed the request for a team on September 6, 2022 17:52
Review comments on haystack/nodes/retriever/_embedding_encoder.py (outdated, resolved)
Review comments on haystack/nodes/retriever/dense.py (outdated, resolved)
Commit: Apply documentation suggestions from code review (Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>)
@bglearning force-pushed the 3136-multiple-negatives-ranking-loss branch from 177cc24 to 82af1be on September 7, 2022 12:05
@bogdankostic (Contributor) left a comment

LGTM

@bglearning merged commit 21aedc6 into main on Sep 12, 2022
@bglearning deleted the 3136-multiple-negatives-ranking-loss branch on September 12, 2022 07:38
brandenchan pushed a commit that referenced this pull request on Sep 21, 2022:
feat: Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever training with sentence-transformers (#3164)

* Add option to use MultipleNegativesRankingLoss

Add option to use MultipleNegativesRankingLoss for EmbeddingRetriever
training with sentence-transformers

* Move out losses into separate retriever/_losses.py module

* Remove unused import in retriever/_losses.py

* Apply documentation suggestions from code review

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
@vblagoje (Member) commented
Hey @bglearning, I just realized that mnrl is now the default loss for GPL rather than margin_mse! Why make this change when the original paper used margin_mse?

@bglearning (Contributor, Author) commented Oct 28, 2022

Hi @vblagoje,

The change of default on EmbeddingRetriever is primarily motivated by the more straightforward use case of training the retriever directly, without pseudo-labeling (through GPL) or some other intermediate process to come up with the scores.

So the line of reasoning is (see the data-format sketch after this list):

  • The user has some data (just a query and a positive document) and wants to start training directly (the default case). They use mnrl, the "simpler" loss; margin_mse also needs a negative document and scores.
  • The user wants to run GPL (or some other method) to generate pseudo-labels before training the model. They set the loss to margin_mse.
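
For concreteness, a sketch of the two training-data formats this reasoning implies. The field names follow the PR description; the document texts and the score value are made up for illustration:

# 'mnrl' (default): only a query and a positive document per example;
# the other positives in a batch act as in-batch negatives
mnrl_data = [
    {"question": "What is Haystack?", "pos_doc": "Haystack is an open-source NLP framework."},
]

# 'margin_mse': additionally needs a hard negative and a margin score,
# e.g. pseudo-labels from a GPL-style pipeline; the 0.76 here is purely illustrative
margin_mse_data = [
    {
        "question": "What is Haystack?",
        "pos_doc": "Haystack is an open-source NLP framework.",
        "neg_doc": "A haystack is a large pile of hay.",
        "score": 0.76,
    },
]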

But yeah, I realize I missed updating the GPL tutorial to explicitly set the loss to margin_mse, which is definitely an issue.
I also should have highlighted the change of default more prominently, since it's an API change.

We could either update the GPL tutorial or change the default back. Either way seems okay to me.
