
Indexing Pipeline with Document Classifier #1281

Closed · julian-risch opened this issue Jul 13, 2021 · 3 comments · Fixed by #1697
Labels: topic:pipeline, type:feature (New feature or request)

julian-risch (Member)
#1265 introduced a document classification node, FARMClassifier.
We should check whether this new node can be used in an indexing pipeline (mentioned here: https://github.com/deepset-ai/haystack/pull/1265/files#r668842533). In that case, the classifier could be used to enrich the metadata of documents at indexing time, for example with zero-shot classification models. The FARMClassifier expects a List of Documents as input and returns a List of Documents as output. I don't see a reason why it should not work.

We would need to test, for example, a Converter node followed by a FARMClassifier:

converter = TextConverter(remove_numeric_tables=True, valid_languages=["en"])
doc_txt = converter.convert(file_path="data/preprocessing_tutorial/classics.txt", meta=None)
...
classifier = FARMClassifier(model_name_or_path="deepset/bert-base-german-cased-sentiment-Germeval17")
...

Finally, we would need to check whether the meta field of the documents indexed in the document store contains the class labels.
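
For reference, a minimal sketch of that check, reusing the converter and classifier from the snippet above. It assumes the post-1.0 module layout, that the converter output is available as Document objects, that the classifier exposes a predict(documents) method returning the same Documents with the prediction written into meta, and that the label lands under a meta key such as "classification" (the exact key is an assumption):

from haystack.document_stores import InMemoryDocumentStore

# Reuse `converter` and `classifier` from the snippet above.
document_store = InMemoryDocumentStore()

# Convert the file and enrich the resulting documents with class labels.
docs = converter.convert(file_path="data/preprocessing_tutorial/classics.txt", meta=None)
classified_docs = classifier.predict(documents=docs)

# Index the classified documents and verify the labels made it into meta.
document_store.write_documents(classified_docs)
for doc in document_store.get_all_documents():
    print(doc.meta.get("classification"))  # assumed meta key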

tstadel (Member) commented on Nov 5, 2021

#1508 replaced FARMClassifier with TransformersDocumentClassifier, so it would look more like this:

converter = TextConverter(remove_numeric_tables=True, valid_languages=["en"])
doc_txt = converter.convert(file_path="data/preprocessing_tutorial/classics.txt", meta=None)
...
classifier = TransformersDocumentClassifier(model_name_or_path="deepset/bert-base-german-cased-sentiment-Germeval17")
...

It probably makes more sense to run the classifier after the PreProcessor, but we might want to keep the flexibility to use it right after the converters or after further preprocessing steps (e.g., document splitting). A sketch of such an indexing pipeline follows below.

There's a small hurdle to take: preprocessing usually works on dicts, while TransformersDocumentClassifier currently expects Document objects as input. Since there is no common conversion logic from dicts to Documents across the different DocumentStores, we might want to make TransformersDocumentClassifier work with dicts as well.
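
For illustration, a sketch of such an indexing pipeline with the classifier placed after the PreProcessor. It assumes the post-1.0 module layout (haystack.nodes, haystack.document_stores); the PreProcessor settings and the InMemoryDocumentStore are just example choices:

from haystack import Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import TextConverter, PreProcessor, TransformersDocumentClassifier

document_store = InMemoryDocumentStore()
converter = TextConverter(remove_numeric_tables=True, valid_languages=["en"])
preprocessor = PreProcessor(split_by="word", split_length=200, split_respect_sentence_boundary=True)
classifier = TransformersDocumentClassifier(model_name_or_path="deepset/bert-base-german-cased-sentiment-Germeval17")

# Indexing pipeline: convert -> preprocess -> classify -> write to the store
indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
indexing_pipeline.add_node(component=classifier, name="DocumentClassifier", inputs=["PreProcessor"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["DocumentClassifier"])

indexing_pipeline.run(file_paths=["data/preprocessing_tutorial/classics.txt"])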

tstadel (Member) commented on Nov 5, 2021

In order to support custom content fields, it's better to keep the "Documents only" approach, as only client code can tell us which field holds the content. This means we have to convert the dicts to Documents within the preprocessing logic.
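
A sketch of what that conversion step could look like; Document is taken from haystack.schema (post-1.0 layout), and content_field is a hypothetical parameter naming the dict key that holds the text, which only client code can provide:

from typing import Dict, List
from haystack.schema import Document

def dicts_to_documents(dicts: List[Dict], content_field: str = "text") -> List[Document]:
    # Everything except the content field goes into meta, so the classifier
    # (and later the query pipeline) sees it as document metadata.
    documents = []
    for d in dicts:
        meta = {k: v for k, v in d.items() if k != content_field}
        documents.append(Document(content=d[content_field], meta=meta))
    return documents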

tstadel (Member) commented on Nov 5, 2021

It would be better to introduce a new tutorial instead of extending the existing Tutorial8_Preprocessing. That way we can show, in a single tutorial, how classification results added dynamically at index time can be used at query time.
