Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EmbeddingRetriever assumes the title of the document is in a field of meta called name #3258

Closed
ZanSara opened this issue Sep 21, 2022 · 5 comments · Fixed by #3368
Closed
Assignees
Labels
topic:retriever type:bug Something isn't working

Comments

@ZanSara
Copy link
Contributor

ZanSara commented Sep 21, 2022

Describe the bug

Expected behavior

  • The name of the field containing the title should be configurable.
  • We should be able to configure which meta fields are embedded through the embed_meta_fields parameter.

Additional context

@ZanSara ZanSara added type:bug Something isn't working good first issue Good for newcomers Contributions wanted! Looking for external contributions topic:retriever labels Sep 21, 2022
@anakin87
Copy link
Member

anakin87 commented Oct 4, 2022

Hey @ZanSara!

I think that this line is superfluous and indeed is probably wrong:

passages = [[d.meta["name"] if d.meta and "name" in d.meta else "", d.content] for d in docs]

In fact, before calling that method, in the embed_documents method (of the EmbeddingRetriever), _preprocess_documents is called and the meta fields indicated by embed_meta_fields are concatenated there.

def embed_documents(self, documents: List[Document]) -> np.ndarray:
"""
Create embeddings for a list of documents.
:param documents: List of documents to embed.
:return: Embeddings, one per input document, shape: (docs, embedding_dim)
"""
documents = self._preprocess_documents(documents)
return self.embedding_encoder.embed_documents(documents)
def _preprocess_documents(self, docs: List[Document]) -> List[Document]:
"""
Turns table documents into text documents by representing the table in csv format.
This allows us to use text embedding models for table retrieval.
It also concatenates specified meta data fields with the text representations.
:param docs: List of documents to linearize. If the document is not a table, it is returned as is.
:return: List of documents with meta data + linearized tables or original documents if they are not tables.
"""
linearized_docs = []
for doc in docs:
doc = deepcopy(doc)
if doc.content_type == "table":
if isinstance(doc.content, pd.DataFrame):
doc.content = doc.content.to_csv(index=False)
else:
raise HaystackError("Documents of type 'table' need to have a pd.DataFrame as content field")
meta_data_fields = [doc.meta[key] for key in self.embed_meta_fields if key in doc.meta and doc.meta[key]]
doc.content = "\n".join(meta_data_fields + [doc.content])
linearized_docs.append(doc)
return linearized_docs

So for now, if you have such a document:

{"content": "mycontent",
"meta": {"name": "myname"}}

and you try to embed it with an EmbeddingRetriever, by specifying embed_meta_fields=['name'],
under the hood these are the passages to encode (sum of _preprocess_documents and embed_documents).

[['myname', 'myname\ncontent']]

I verified that with some experiments.

Possible solution

We can simply remove that line from embed_documents if you agree that the meta fields have already been concatenated in _preprocess_documents.
Do we want to provide some default value for EmbeddingRetriever's embed_meta_fields attribute? Currently, it is an empty list.

@ZanSara
Copy link
Contributor Author

ZanSara commented Oct 11, 2022

Hey @anakin87, wow, thank you for digging into this one.

So, although I don't believe this duplication benefits anyone, there might be parts of the code deep into the retriever that either assume the name will be in the first position of the list, and others that assume the name is already concatenated. My recommendation would be to proceed, remove the line, and then thoroughly test the resulting code. By this, I mean that we should make sure all tutorials run on it. If you can, running all tests locally would be also fantastic.

Do you think you can do that? If it's not viable I'll do the tests myself, let me know. It will just take a bit longer as I have a lot to catch up after my holidays 😄

@julian-risch
Copy link
Member

Mayank is already working on this.

@julian-risch
Copy link
Member

@mayankjobanputra

@julian-risch julian-risch removed Contributions wanted! Looking for external contributions good first issue Good for newcomers labels Oct 11, 2022
@ZanSara
Copy link
Contributor Author

ZanSara commented Oct 11, 2022

@julian-risch thanks! I wasn't aware yet 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:retriever type:bug Something isn't working
Projects
None yet
4 participants