`EmbeddingRetriever` assumes the title of the document is in a field of `meta` called `name` #3258

ZanSara · 2022-09-21T15:12:05Z

Describe the bug

In embed_documents, EmbeddingRetriever assumes the title of the document is in a field of meta called name:

haystack/haystack/nodes/retriever/_embedding_encoder.py

Line 189 in 7e79a48

passages = [[d.meta["name"] if d.meta and "name" in d.meta else "", d.content] for d in docs] # type: ignore

Expected behavior

The name of the field containing the title should be configurable.
We should be able to configure which meta fields are embedded through the embed_meta_fields parameter.

Additional context

chore: add DenseRetriever abstraction #3252 (comment)


anakin87 · 2022-10-04T21:16:07Z

Hey @ZanSara!

I think that this line is superfluous and indeed is probably wrong:

haystack/haystack/nodes/retriever/_embedding_encoder.py

Line 214 in 6cb4e93

    passages = [[d.meta["name"] if d.meta and "name" in d.meta else "", d.content] for d in docs]

In fact, before calling that method, in the embed_documents method (of the EmbeddingRetriever), _preprocess_documents is called and the meta fields indicated by embed_meta_fields are concatenated there.

haystack/haystack/nodes/retriever/dense.py

Lines 1838 to 1868 in 6cb4e93

    def embed_documents(self, documents: List[Document]) -> np.ndarray:
        """
        Create embeddings for a list of documents.

        :param documents: List of documents to embed.
        :return: Embeddings, one per input document, shape: (docs, embedding_dim)
        """
        documents = self._preprocess_documents(documents)
        return self.embedding_encoder.embed_documents(documents)

    def _preprocess_documents(self, docs: List[Document]) -> List[Document]:
        """
        Turns table documents into text documents by representing the table in csv format.
        This allows us to use text embedding models for table retrieval.
        It also concatenates specified meta data fields with the text representations.

        :param docs: List of documents to linearize. If the document is not a table, it is returned as is.
        :return: List of documents with meta data + linearized tables or original documents if they are not tables.
        """
        linearized_docs = []
        for doc in docs:
            doc = deepcopy(doc)
            if doc.content_type == "table":
                if isinstance(doc.content, pd.DataFrame):
                    doc.content = doc.content.to_csv(index=False)
                else:
                    raise HaystackError("Documents of type 'table' need to have a pd.DataFrame as content field")
            meta_data_fields = [doc.meta[key] for key in self.embed_meta_fields if key in doc.meta and doc.meta[key]]
            doc.content = "\n".join(meta_data_fields + [doc.content])
            linearized_docs.append(doc)
        return linearized_docs

So for now, if you have such a document:

{"content": "mycontent",
"meta": {"name": "myname"}}

and you try to embed it with an EmbeddingRetriever, specifying embed_meta_fields=['name'],
under the hood these are the passages that get encoded (the combined effect of _preprocess_documents and embed_documents):

[['myname', 'myname\nmycontent']]

I verified that with some experiments.
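
For anyone who wants to see the duplication without installing Haystack, here is a minimal sketch that mimics the two steps. `Doc`, `preprocess` and `build_passages` are hypothetical stand-ins I wrote for illustration (they are not Haystack APIs); they only copy the concatenation logic from `_preprocess_documents` and the passage-building line from `_embedding_encoder.py`, and ignore table documents.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for Haystack's Document (only the fields used here)."""
    content: str
    meta: Dict[str, str] = field(default_factory=dict)


def preprocess(docs: List[Doc], embed_meta_fields: List[str]) -> List[Doc]:
    # Mirrors the meta concatenation in _preprocess_documents (text documents only).
    out = []
    for doc in docs:
        doc = deepcopy(doc)  # leave the caller's documents untouched
        meta_values = [doc.meta[k] for k in embed_meta_fields if k in doc.meta and doc.meta[k]]
        doc.content = "\n".join(meta_values + [doc.content])
        out.append(doc)
    return out


def build_passages(docs: List[Doc]) -> List[List[str]]:
    # Mirrors the encoder line that pairs meta["name"] with the content again.
    return [[d.meta["name"] if d.meta and "name" in d.meta else "", d.content] for d in docs]


docs = [Doc(content="mycontent", meta={"name": "myname"})]
passages = build_passages(preprocess(docs, embed_meta_fields=["name"]))
print(passages)
```

The name ends up both as the first element of the pair and as the first line of the second element, which is the duplication described above.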

Possible solution

We can simply remove that line from embed_documents if you agree that the meta fields have already been concatenated in _preprocess_documents.
Do we want to provide some default value for EmbeddingRetriever's embed_meta_fields attribute? Currently, it is an empty list.
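
To make the proposal concrete, here is a rough sketch of the behaviour I have in mind once the line is removed. It is not the actual patch: `concat_meta` and `build_passages` are hypothetical helpers operating on plain dicts, and I am assuming the encoder can work with one plain string per document. With this shape, embed_meta_fields alone decides which meta values are embedded, and the current default (an empty list) embeds the content only.

```python
from typing import Any, Dict, List

Doc = Dict[str, Any]  # hypothetical dict-based stand-in for a Document


def concat_meta(doc: Doc, embed_meta_fields: List[str]) -> str:
    # Same idea as _preprocess_documents: prepend the selected, non-empty meta values.
    meta = doc.get("meta") or {}
    values = [meta[key] for key in embed_meta_fields if meta.get(key)]
    return "\n".join(values + [doc["content"]])


def build_passages(docs: List[Doc], embed_meta_fields: List[str]) -> List[str]:
    # No hard-coded "name" lookup here: each passage is just the preprocessed content.
    return [concat_meta(doc, embed_meta_fields) for doc in docs]


doc = {"content": "mycontent", "meta": {"name": "myname"}}
with_name = build_passages([doc], embed_meta_fields=["name"])
content_only = build_passages([doc], embed_meta_fields=[])
print(with_name, content_only)
```

If we wanted to keep today's implicit behaviour for existing users, a default of ['name'] would be one option, but that is a design decision and not something the sketch settles.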

ZanSara · 2022-10-11T16:51:01Z

Hey @anakin87, wow, thank you for digging into this one.

So, although I don't believe this duplication benefits anyone, some parts of the code deep in the retriever might assume the name is in the first position of the list, while others might assume the name is already concatenated. My recommendation is to go ahead, remove the line, and then thoroughly test the resulting code: at a minimum, we should make sure all tutorials run on it. If you can, running all the tests locally would also be fantastic.

Do you think you can do that? If it's not viable, let me know and I'll do the tests myself. It will just take a bit longer, as I have a lot to catch up on after my holidays 😄

julian-risch · 2022-10-11T17:12:11Z

Mayank is already working on this.

julian-risch · 2022-10-11T17:13:58Z

@mayankjobanputra

ZanSara · 2022-10-11T17:23:43Z

@julian-risch thanks! I wasn't aware yet 👍

ZanSara added the type:bug, good first issue, Contributions wanted!, and topic:retriever labels Sep 21, 2022

julian-risch removed the Contributions wanted! and good first issue labels Oct 11, 2022

mayankjobanputra self-assigned this Oct 12, 2022

mayankjobanputra mentioned this issue Oct 12, 2022

bug: removed duplicated meta "name" field addition to content before embedding in update_embeddings workflow #3368

Merged

mayankjobanputra closed this as completed in #3368 Oct 25, 2022
