
Indexing Pipeline with Document Classifier #1281

Closed · julian-risch opened this issue Jul 13, 2021 · 3 comments · Fixed by #1697
Labels: topic:pipeline, type:feature (New feature or request)

julian-risch (Member)
#1265 introduced a document classification node, FARMClassifier.
We should check whether this new node can be used in an indexing pipeline (mentioned here: https://github.com/deepset-ai/haystack/pull/1265/files#r668842533). In that case, the classifier could be used to enrich the metadata of documents at indexing time, for example with zero-shot classification models. The FARMClassifier expects a List of Documents as input and returns a List of Documents as output. I don't see a reason why it should not work.

We would need to test, for example, a Converter node followed by a FARMClassifier:

converter = TextConverter(remove_numeric_tables=True, valid_languages=["en"])
doc_txt = converter.convert(file_path="data/preprocessing_tutorial/classics.txt", meta=None)
...
classifier = FARMClassifier(model_name_or_path="deepset/bert-base-german-cased-sentiment-Germeval17")
...

Finally, we would need to check whether the meta field of the documents indexed in the document store contains the class labels.
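
For reference, a minimal sketch of that check, reusing the converter and classifier from the snippet above. It assumes the post-1.0 module layout, that the converter output is available as Document objects, that the classifier exposes a predict(documents) method returning the same Documents with the prediction written into meta, and that the label lands under a meta key such as "classification" (the exact key is an assumption):

from haystack.document_stores import InMemoryDocumentStore

# Reuse `converter` and `classifier` from the snippet above.
document_store = InMemoryDocumentStore()

# Convert the file and enrich the resulting documents with class labels.
docs = converter.convert(file_path="data/preprocessing_tutorial/classics.txt", meta=None)
classified_docs = classifier.predict(documents=docs)

# Index the classified documents and verify the labels made it into meta.
document_store.write_documents(classified_docs)
for doc in document_store.get_all_documents():
    print(doc.meta.get("classification"))  # assumed meta key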

tstadel (Member) commented on Nov 5, 2021

#1508 replaced FARMClassifier with TransformersDocumentClassifier, so it would look more like this:

converter = TextConverter(remove_numeric_tables=True, valid_languages=["en"])
doc_txt = converter.convert(file_path="data/preprocessing_tutorial/classics.txt", meta=None)
...
classifier = TransformersDocumentClassifier(model_name_or_path="deepset/bert-base-german-cased-sentiment-Germeval17")
...

It probably makes more sense to run the classifier after the PreProcessor, but we might want to keep the flexibility to use it right after the converters or after further preprocessing steps (e.g., document splitting). A sketch of such an indexing pipeline follows below.

There's a small hurdle to take: preprocessing usually works on dicts, while TransformersDocumentClassifier currently expects Document objects as input. Since there is no common conversion logic from dicts to Documents across the different DocumentStores, we might want to make TransformersDocumentClassifier work with dicts as well.
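
For illustration, a sketch of such an indexing pipeline with the classifier placed after the PreProcessor. It assumes the post-1.0 module layout (haystack.nodes, haystack.document_stores); the PreProcessor settings and the InMemoryDocumentStore are just example choices:

from haystack import Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import TextConverter, PreProcessor, TransformersDocumentClassifier

document_store = InMemoryDocumentStore()
converter = TextConverter(remove_numeric_tables=True, valid_languages=["en"])
preprocessor = PreProcessor(split_by="word", split_length=200, split_respect_sentence_boundary=True)
classifier = TransformersDocumentClassifier(model_name_or_path="deepset/bert-base-german-cased-sentiment-Germeval17")

# Indexing pipeline: convert -> preprocess -> classify -> write to the store
indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=converter, name="TextConverter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
indexing_pipeline.add_node(component=classifier, name="DocumentClassifier", inputs=["PreProcessor"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["DocumentClassifier"])

indexing_pipeline.run(file_paths=["data/preprocessing_tutorial/classics.txt"])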

tstadel (Member) commented on Nov 5, 2021

In order to support custom content fields, it's better to keep the "Documents only" approach, as only client code can tell us which field holds the content. This means we have to convert the dicts to Documents within the preprocessing logic.
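
A sketch of what that conversion step could look like; Document is taken from haystack.schema (post-1.0 layout), and content_field is a hypothetical parameter naming the dict key that holds the text, which only client code can provide:

from typing import Dict, List
from haystack.schema import Document

def dicts_to_documents(dicts: List[Dict], content_field: str = "text") -> List[Document]:
    # Everything except the content field goes into meta, so the classifier
    # (and later the query pipeline) sees it as document metadata.
    documents = []
    for d in dicts:
        meta = {k: v for k, v in d.items() if k != content_field}
        documents.append(Document(content=d[content_field], meta=meta))
    return documents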

tstadel (Member) commented on Nov 5, 2021

It would be better to introduce a new tutorial instead of extending the existing Tutorial8_Preprocessing. That way we can show, in a single tutorial, how classification results added dynamically at index time can be used at query time.
