KeyError: 'index' in pipelines with combination of BM25Retriever, MultihopEmbeddingRetriever #3502

Closed
SAIVENKATARAJU opened this issue Oct 31, 2022 · 2 comments
Labels
topic:pipeline type:bug Something isn't working

SAIVENKATARAJU commented Oct 31, 2022

Describe the bug
I am getting the following KeyError, with this traceback:

File "./documents_store/pipelines.py", line 141, in <module>
    pipe = passage_retrieval.pipeline_for_search("creditcards", "query")
  File "./documents_store/pipelines.py", line 127, in pipeline_for_search
    pipe.add_node(component=emb_retriever, name="MultihopEmbeddingRetriever",
  File "/opt/python/virtual_envs/venv_py_392_passage_retrieval/lib/python3.9/site-packages/haystack/pipelines/base.py", line 402, in add_node
    component_definitions = get_component_definitions(pipeline_config=self.get_config())
  File "/opt/python/virtual_envs/venv_py_392_passage_retrieval/lib/python3.9/site-packages/haystack/pipelines/base.py", line 1815, in get_config
    self._add_component_to_definitions(
  File "/opt/python/virtual_envs/venv_py_392_passage_retrieval/lib/python3.9/site-packages/haystack/pipelines/base.py", line 1856, in _add_component_to_definitions
    self._add_component_to_definitions(sub_component, component_definitions, return_defaults)
  File "/opt/python/virtual_envs/venv_py_392_passage_retrieval/lib/python3.9/site-packages/haystack/pipelines/base.py", line 1851, in _add_component_to_definitions
    component_params: Dict[str, Any] = component.get_params(return_defaults)
  File "/opt/python/virtual_envs/venv_py_392_passage_retrieval/lib/python3.9/site-packages/haystack/nodes/base.py", line 104, in get_params
    if value != component_signature[key].default or return_defaults:
KeyError: 'index'

Error message
KeyError: 'index'

To Reproduce
I have the following custom code. The first file is eshandler.py:

class ESHandler(ElasticsearchDocumentStore):
    def __init__(self, *args, **kwargs):
        logger.info("connecting to elastic search")

        try:
            super(ESHandler, self).__init__(host=es_host, port=es_port, username=es_username, password=es_password,
                                            index=kwargs.get('index'), similarity='cosine', embedding_dim=768,
                                            )
        except ConnectionError as CE:
            logger.exception(f"Unable to connect to ES instance, make sure its up, exiting app {CE}")
            sys.exit(1)
        self.index = kwargs.get('index')

The second file, pipelines.py, is where the error occurs:

class Retriever:
    @staticmethod
    def retriever(es_connection):
        emb_retriever = MultihopEmbeddingRetriever(
            document_store=es_connection,
            embedding_model=model_name,
            num_iterations=1,
            model_format="sentence_transformers",
        )
        bmsretriever = BM25Retriever(
            document_store=es_connection,
        )
        return emb_retriever,bmsretriever


self.file_store = ESHandler(index=doc_index)
logger.info(f"total documents will search {self.file_store.get_document_count()}")
if self.file_store.index_exist(doc_index) and \
        self.file_store.docs_exit() and \
        self.file_store.embbeddings_updated(doc_index):
    # call class function
    emb_retriever,bms_retriever = Retriever.retriever(self.file_store)
    join_documents = JoinDocuments(
        join_mode="reciprocal_rank_fusion",
        top_k_join=10
    )

    pipe = Pipeline()
    pipe.add_node(component=bms_retriever, name="BM25Retriever",
                  inputs=["Query"])
    pipe.add_node(component=emb_retriever, name="MultihopEmbeddingRetriever",
                  inputs=["BM25Retriever.output_1"])
    pipe.add_node(component=join_documents, name="JoinDocuments",
                  inputs=["BM25Retriever", "MultihopEmbeddingRetriever"])

    return pipe

FAQ Check

System:

  • OS: Centos8
  • GPU/CPU: CPU
  • Haystack version (commit or version number): 1.6.0
  • DocumentStore: ES
  • Reader: NA
  • Retriever: Multihop, BM25

ZanSara (Contributor) commented Oct 31, 2022

Hey @SAIVENKATARAJU, thank you for this bug report!

The issue is indeed very tricky. What triggers it is the fact that ESHandler takes variadic arguments (arguments with one or two stars, like *args and **kwargs). By replacing:

class ESHandler(ElasticsearchDocumentStore):
    def __init__(self, *args, **kwargs):

with something like:

class ESHandler(ElasticsearchDocumentStore):
    def __init__(self, param1, param2, param3):

the problem will disappear.
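
For illustration, here is a minimal sketch of what an explicit signature could look like for the ESHandler above. To keep it runnable without Haystack or a live Elasticsearch instance, StandInDocumentStore is a hypothetical stand-in for ElasticsearchDocumentStore, and the ES_* constants are placeholders for the reporter's es_host, es_port, es_username and es_password. None of these names come from Haystack.

```python
import inspect

# Placeholder connection settings (assumptions; the original code defines
# es_host, es_port, es_username and es_password somewhere not shown).
ES_HOST = "localhost"
ES_PORT = 9200
ES_USERNAME = ""
ES_PASSWORD = ""


class StandInDocumentStore:
    """Hypothetical stand-in for ElasticsearchDocumentStore, not the real class."""

    def __init__(self, host="localhost", port=9200, username="", password="",
                 index="document", similarity="dot_product", embedding_dim=768):
        self.host = host
        self.port = port
        self.index = index
        self.similarity = similarity
        self.embedding_dim = embedding_dim


class ESHandler(StandInDocumentStore):
    # Every accepted parameter is named explicitly: no *args, no **kwargs.
    def __init__(self, index: str = "document"):
        super().__init__(host=ES_HOST, port=ES_PORT, username=ES_USERNAME,
                         password=ES_PASSWORD, index=index,
                         similarity="cosine", embedding_dim=768)


handler = ESHandler(index="creditcards")
# "index" is now a named parameter that introspection can find.
print(list(inspect.signature(ESHandler.__init__).parameters))
```

In the real class, the base would be ElasticsearchDocumentStore and the try/except around the connection would stay as in the original. The only change that matters for this error is that index appears by name in the __init__ signature.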

Some more context

This issue arises from the fact that Haystack is able to convert existing pipelines into their YAML definition with Pipeline.save_to_yaml. To do so, it needs to hook into the init process of every node and collect information about their init parameters and values.

The mechanism works well until variadic arguments such as *args and **kwargs are used. They prevent Haystack from working out which keys and values to set aside for later use in case save_to_yaml is called, which makes the whole system collapse with obscure errors like the one you observed above.
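
The failing lookup can be reproduced without Haystack. The sketch below is a simplified stand-in, not Haystack's code: Base, VariadicChild, ExplicitChild and non_default_params are hypothetical names, and only the signature[key].default comparison mirrors the line shown in the traceback above.

```python
import inspect


class Base:
    def __init__(self, index="document"):
        self.index = index


class VariadicChild(Base):
    # Signature exposes only "self", "args" and "kwargs".
    def __init__(self, *args, **kwargs):
        super().__init__(index=kwargs.get("index"))


class ExplicitChild(Base):
    # Signature exposes "index" by name.
    def __init__(self, index="document"):
        super().__init__(index=index)


def non_default_params(cls, recorded):
    """Keep the recorded init values that differ from the declared defaults."""
    signature = inspect.signature(cls.__init__).parameters
    return {
        key: value
        for key, value in recorded.items()
        if value != signature[key].default  # KeyError if key is not a named parameter
    }


# Values a caller passed at construction time, keyed by parameter name.
recorded = {"index": "creditcards"}

print(non_default_params(ExplicitChild, recorded))

try:
    non_default_params(VariadicChild, recorded)
except KeyError as err:
    print("KeyError:", err)
```

With the explicit signature, the recorded key can be matched against a declared default. With *args and **kwargs there is no parameter named index to look up, so the dictionary access raises.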

A similar issue was discussed in #2362.

masci (Contributor) commented Dec 30, 2022

I'm going ahead and closing this. @SAIVENKATARAJU, feel free to reopen if you have any follow-ups to @ZanSara's explanation.

masci closed this as completed Dec 30, 2022