Currently, training `EmbeddingRetriever` with sentence-transformers uses the `MarginMSE` loss. We want to also support `MultipleNegativesRankingLoss`, which has a simpler data requirement; in particular, it doesn't require a "score" for the data pairs/tuples.

Related discussion: deepset-ai/haystack-tutorials#35
### Background and related work
#2388 added the ability to train `EmbeddingRetriever` (sentence-transformers variant) with Generative Pseudo Labeling (GPL). It uses the `MarginMSE` loss with labels coming from a soft pseudo-labeling process. Example Colab notebook.

Input data is then of the format:
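The original format block didn't survive here; as a sketch, the GPL training data is a list of dicts where each entry pairs a query with a positive and a hard-negative passage plus a pseudo-label margin score (the field names below are assumptions based on the GPL workflow, not confirmed from this issue):

```python
# Hedged sketch of the MarginMSE/GPL training-data format.
# Field names ("question", "pos_doc", "neg_doc", "score") are assumed
# to match the output of the GPL pseudo-labeling step.
training_data = [
    {
        "question": "What is GPL?",
        "pos_doc": "Generative Pseudo Labeling adapts a dense retriever to a new domain ...",
        "neg_doc": "An unrelated or hard-negative passage mined for this query ...",
        # Pseudo-label: cross-encoder margin between pos_doc and neg_doc,
        # which MarginMSE regresses against.
        "score": 0.85,
    },
    # ... more (question, pos_doc, neg_doc, score) tuples
]
```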
It works well. However, there can be cases where users want to move directly on to retriever training from their data, without pseudo-labeling or some other intermediate process to come up with the scores. Supporting `MultipleNegativesRankingLoss` (MNRL) would provide such an option.
### Proposal
Provide an argument (maybe a string, or directly the loss class from `sentence-transformers`) to the `EmbeddingRetriever#train` method to allow selection between the two losses. Data checks could be added to make sure the loss choice and data format are compatible.

For MNRL, the format would be as below (with `neg_doc` also being optional).

### Next up
I'll open a draft PR and start working on this. Finer implementation details can be worked out there. In the meantime, if there are any thoughts, please drop them here.
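For reference, the MNRL format mentioned in the proposal could look like the sketch below: the same shape as the GPL data, minus the `"score"` field, with `"neg_doc"` optional. The field names and the `train_loss` parameter in the usage comment are assumptions for illustration, not a settled API:

```python
# Hedged sketch of the proposed MNRL training-data format: no "score"
# field is needed, and "neg_doc" (a hard negative) is optional because
# MNRL can fall back to in-batch negatives.
mnrl_training_data = [
    {
        "question": "What is MNRL?",
        "pos_doc": "MultipleNegativesRankingLoss treats the positives of other "
                   "examples in the batch as negatives for this query ...",
    },
    {
        "question": "Another query ...",
        "pos_doc": "Its relevant passage ...",
        "neg_doc": "An optional mined hard negative ...",
    },
]

# Hypothetical usage, assuming a string-valued loss selector is added:
# retriever.train(training_data=mnrl_training_data, train_loss="mnrl")
```

A string selector keeps the common case simple, while accepting the loss class directly would leave room for other `sentence-transformers` losses later.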
cc: @mkkuemmel @mathislucka @vblagoje