Add `run_batch` method to all nodes and Pipeline to allow batch querying #2481
Conversation
Looks quite good to me! I just have a couple of things that we should discuss/change before merging. Here is a list (see my individual comments for details). Happy to have a call anytime.
- `batch_size` should maybe be an attribute of every node, and it should be possible to set it in the init method of each node (YAML pipeline definition)
- Maybe two of the cases (list of docs/queries) can actually be joined?
- Make the error message more extensive and print not only the expected type but also the given type
- Allow a different filter per query in `queries`?
- `truncation_warning` every time `predict` is executed
- `multiple_doc_lists` naming: more than one list? For example in `batch_prediction_single_query_multiple_doc_lists`
- Use a distilled, tiny model instead of roberta-base-squad in the test case
- The transformers upgrade would be better in a separate PR
- Check whether the tutorials are still working (QuestionGenerator could break, I think)
```python
if headers is None:
    headers = {}

single_query = False
```
If the handling of the edge case with a single query makes the implementation much more complex, we could discuss whether `queries` must be a list in all cases. The edge case would then be a list with a single item.
As discussed, we can address this question and decide what to do in a separate PR.
```python
# Query case 2: list of queries
elif isinstance(queries, list) and len(queries) > 0 and isinstance(queries[0], str):
    # Docs case 1: single list of Documents -> apply each query to all Documents
```
This case does essentially the same as "# Docs case 1: single list of Documents -> apply single query to all Documents". So if `queries` is a string, we could make it a list `[queries]` and then use the same code for both cases with the `for query in queries:` loop. Merging these two cases could simplify the code a little bit. What do you think? I like the readability of the current version. 👍
As discussed, we can address this question and decide what to do in a separate PR.
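The merge the reviewer suggests (wrapping a single query string in a list so both cases share one loop) could be sketched as follows; the `normalize_queries` helper is hypothetical and not code from this PR:

```python
from typing import List, Union

def normalize_queries(queries: Union[str, List[str]]) -> List[str]:
    """Wrap a single query string in a list so that both the single-query
    and list-of-queries cases can share the same `for query in queries:` loop."""
    if isinstance(queries, str):
        return [queries]
    if isinstance(queries, list) and all(isinstance(q, str) for q in queries):
        return queries
    raise ValueError(f"Expected str or List[str], got {type(queries).__name__}")
```

With this normalization the error message also naturally reports the given type, which addresses one of the other review points.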
```python
    all_terms_must_match: bool = False,
    scale_score: bool = True,
) -> Union[List[Document], List[List[Document]]]:
    raise NotImplementedError("DeepsetCloudDocumentStore currently does not support query_batch method.")
```
Could we maybe add this without much overhead before the next release? Is there an issue already? It would be a shame if we can't use all the batch processing in dc because of that. What we need for that is a `self.client.query_batch()` call here and then also a `query_batch()` implementation in `IndexClient`.
I would suggest to just call `query` here multiple times, as there is no batch_query endpoint in dc (yet). (In `query_batch()` in `IndexClient` we would need to call the `documents-query` endpoint multiple times.) Once dc has a batch_query endpoint, we can have a `query_batch()` implementation for the `IndexClient`. What do you think?
Yes, agreed. 👍
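The fallback agreed on above (looping over `query` until a real batch endpoint exists) could look roughly like this; `DummyDocumentStore` is an illustrative stand-in, not Haystack's actual class:

```python
from typing import Any, Dict, List, Optional

class DummyDocumentStore:
    """Illustrative stand-in for a document store without a native batch endpoint."""

    def query(self, query: str, filters: Optional[Dict[str, Any]] = None) -> List[str]:
        # A real implementation would hit the documents-query endpoint once per call.
        return [f"doc matching {query!r}"]

    def query_batch(
        self, queries: List[str], filters: Optional[Dict[str, Any]] = None
    ) -> List[List[str]]:
        # Fallback discussed in the thread: no batch endpoint yet,
        # so simply call query() once per query.
        return [self.query(q, filters=filters) for q in queries]
```

Once a batch endpoint exists, only the body of `query_batch` would need to change; callers keep the same interface.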
Looks great to me! 👍 Let's just not forget about `query_batch` in DCDocumentStore so that we can make use of these nice new batch processing features.
This PR facilitates batch processing with query Pipelines. This is done by adding a `run_batch` and a corresponding `predict_batch`/`retrieve_batch`/... method to each of the nodes. In order to achieve this, the `transformers` dependency has been upgraded to the newest version, because batch processing for transformers' question-answering pipeline is not supported in the version currently used.

As decided in #1239, the Pipeline's `run_batch` method can take a single query or a list of queries as its `queries` argument, and a single list of Documents or a list of lists of Documents as its `documents` argument. The behavior of the Pipeline depends on whether a single value or a list of values is provided. In general, the following applies: if `run_batch` is called on an indexing pipeline, the pipeline's `run` method will be called.

Breaking changes
The input and output of the `FARMReader`'s `predict_batch` method changed. (Therefore, the output of the `QuestionAnswerGenerationPipeline` also changed.)

Old way

New way

Closes #1239
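The single-vs-list dispatch described in the PR description can be sketched with a small, self-contained helper. The name `expand_batch_inputs` and the exact pairing rules below are an illustration of the behavior discussed in the review threads, not the PR's actual code:

```python
from typing import Any, List, Tuple, Union

def expand_batch_inputs(
    queries: Union[str, List[str]],
    documents: Union[List[Any], List[List[Any]]],
) -> List[Tuple[str, List[Any]]]:
    """Pair queries with document lists:
    - single query + list of doc lists -> the query is applied to each doc list
    - list of queries + single doc list -> each query is applied to the same docs
    - equal-length lists -> zipped one-to-one
    """
    query_list = [queries] if isinstance(queries, str) else queries
    # A list of lists is detected by inspecting the first element.
    doc_lists = documents if documents and isinstance(documents[0], list) else [documents]

    if len(query_list) == 1 and len(doc_lists) > 1:
        return [(query_list[0], docs) for docs in doc_lists]
    if len(doc_lists) == 1 and len(query_list) > 1:
        return [(q, doc_lists[0]) for q in query_list]
    return list(zip(query_list, doc_lists))
```

Each resulting `(query, docs)` pair could then be handed to a node's `predict_batch`/`retrieve_batch` method, whatever the caller originally passed in.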