
Add MultipleNegativesRankingLoss for EmbeddingRetriever training with sentence-transformers #3136

Closed
bglearning opened this issue Sep 1, 2022 · 0 comments · Fixed by #3164

Comments


bglearning commented Sep 1, 2022

Currently, training EmbeddingRetriever with sentence-transformers uses the MarginMSE loss. We want to also support MultipleNegativesRankingLoss, which has a simpler data requirement: in particular, it doesn't require a "score" for the data pairs/tuples.

Related discussion: deepset-ai/haystack-tutorials#35

Background and Related work

#2388 added the ability to train EmbeddingRetriever (sentence-transformer variant) with Generative Pseudo Labeling (GPL). It uses MarginMSE loss with labels coming from a soft pseudo-labeling process. Example Colab notebook.

Input data is then of the format:

```
[
    {"question": …, "pos_doc": …, "neg_doc": …, "score": …},
    ...
]
```
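To make the role of the "score" field concrete, here is an illustrative sketch (not Haystack's actual implementation) of how MarginMSE uses it: the loss is the mean squared error between the student model's pos/neg score margin and the labeled margin. `score_fn` is a hypothetical stand-in for the model's query-document similarity.

```python
def margin_mse(examples, score_fn):
    """Mean squared error between the student's margin and the labeled margin.

    examples: dicts with "question", "pos_doc", "neg_doc", and "score"
    score_fn: callable (question, doc) -> similarity score
    """
    total = 0.0
    for ex in examples:
        # Margin the student model currently assigns to this triple
        student_margin = score_fn(ex["question"], ex["pos_doc"]) - score_fn(
            ex["question"], ex["neg_doc"]
        )
        # Compare against the pseudo-labeled margin ("score")
        total += (student_margin - ex["score"]) ** 2
    return total / len(examples)
```

This is exactly why the "score" field is mandatory for MarginMSE: without a labeled margin there is nothing to regress against.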

It works well. However, there can be cases where users want to move directly on to retriever training from their data, without the pseudo-labeling step or some other intermediate process to come up with the scores.

Supporting the use of MultipleNegativesRankingLoss (MNRL) would provide such an option.

Proposal

Provide an argument (maybe a string, or directly the loss class from sentence-transformers) to the EmbeddingRetriever#train method to allow selecting between the two losses. Data checks could be added to make sure the loss choice and the data format are compatible.
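As a rough sketch of what such a data check could look like (the loss names and required fields here are assumptions, not a final API):

```python
# Hypothetical mapping of loss name -> required fields per training example.
REQUIRED_FIELDS = {
    "margin_mse": {"question", "pos_doc", "neg_doc", "score"},
    "multiple_negatives_ranking": {"question", "pos_doc"},  # neg_doc optional
}


def check_training_data(examples, loss):
    """Raise ValueError if any example lacks the fields the chosen loss needs."""
    required = REQUIRED_FIELDS[loss]
    for i, ex in enumerate(examples):
        missing = required - ex.keys()
        if missing:
            raise ValueError(
                f"Example {i} is missing fields {sorted(missing)} for loss '{loss}'"
            )
```

A check like this would fail fast with a clear message, instead of surfacing a KeyError deep inside the training loop.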

For MNRL, the format would be as below (with neg_doc being optional).

```
[
    {"question": …, "pos_doc": …, "neg_doc": …},
    ...
]
```
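For intuition on why no score is needed: MNRL treats each query's positive doc as the target in a softmax over the batch, with the other queries' positives acting as in-batch negatives. A minimal sketch (hard negatives, which would simply be appended to the candidate list, are omitted for brevity; `score_fn` is again a hypothetical similarity function):

```python
import math


def mnrl(batch, score_fn):
    """Cross-entropy over in-batch candidates: each query should rank its own
    positive doc highest among all positives in the batch."""
    docs = [ex["pos_doc"] for ex in batch]
    loss = 0.0
    for i, ex in enumerate(batch):
        scores = [score_fn(ex["question"], d) for d in docs]
        # log-softmax of this query's own positive against all candidates
        log_softmax = scores[i] - math.log(sum(math.exp(s) for s in scores))
        loss -= log_softmax
    return loss / len(batch)
```

The supervision comes entirely from which (question, pos_doc) pairs belong together, so plain pairs suffice.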

Next up

I'll open a draft PR and start working on this. Finer implementation details can be worked out there. In the meantime, if there are any thoughts, please drop them here.
cc: @mkkuemmel @mathislucka @vblagoje
