Add indexing pipeline type #3461

vblagoje · 2022-10-24T12:53:07Z

Related Issues

fixes https://github.com/deepset-ai/haystack-private/issues/18

Proposed Changes:

Pipeline telemetry didn't account for indexing pipelines. We now detect them by checking whether the file_paths parameter is used in the run call.
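A minimal sketch of the detection idea, not the actual code in haystack/pipelines/base.py. The function name `build_pipeline_event` and its parameters are hypothetical, and it assumes a run counts as indexing whenever file_paths is passed at all (even as an empty list):

```python
import json
from hashlib import sha1
from typing import Any, Dict, List, Optional


def build_pipeline_event(
    config: Dict[str, Any],
    default_type: str,
    file_paths: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build a telemetry payload (hypothetical helper, for illustration only)."""
    # Assumption: the mere presence of file_paths marks an indexing run.
    is_indexing = file_paths is not None
    # Same fingerprint scheme as the diff: SHA-1 over the sorted JSON config.
    fingerprint = sha1(json.dumps(config, sort_keys=True).encode()).hexdigest()
    return {
        "fingerprint": fingerprint,
        "type": "Indexing" if is_indexing else default_type,
    }
```

The fingerprint depends only on the config, so the same pipeline reports the same fingerprint whether it is run for indexing or querying; only the "type" field differs.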

How did you test it?

Manual telemetry tests are being prepared if the general approach is approved.

Notes for the reviewer

It would be best to run a manual test with access to telemetry to verify that the actual pipeline type is detected properly.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added tests that demonstrate the correct behavior of the change
I've used the conventional commit convention for my PR title
I documented my code
I ran pre-commit hooks and fixed any issue

julian-risch

Looks good to me. I left one comment about some additional functionality that I would like to see, but this could become a separate issue and PR. If you decide not to include it here, I would suggest that you update the existing issue or add a new one.

julian-risch · 2022-10-24T13:27:04Z

haystack/pipelines/base.py

        fingerprint = sha1(json.dumps(self.get_config(), sort_keys=True).encode()).hexdigest()
        run_total = self.run.counter + self.run_batch.counter
        send_custom_event(
            "pipeline",
            payload={
                "fingerprint": fingerprint,
-                "type": self.get_type(),
+                "type": "Indexing" if is_indexing else self.get_type(),


It would be nice to get more info about the components used in the indexing pipeline. For inspiration, here is the tutorial on indexing: https://haystack.deepset.ai/tutorials/doc-class-index
It would be great if the "type" could list the components, e.g. TextConverter, PreProcessor, FileTypeClassifier, PDFToTextConverter, DocxToTextConverter, instead of just "Indexing". It would be okay to get this done in a follow-up PR.

What do you think? For now just add a separate issue for that to the backlog?

Perhaps. Let's get it working with the simplest possible solution first and then improve it.
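As a rough illustration of the follow-up idea discussed in this thread (it is not part of this PR), the "type" for an indexing pipeline could be derived from the component classes in the pipeline config. The helper name `indexing_type_from_config` is hypothetical, and the sketch assumes the config is a dict with a "components" list whose entries carry a "type" key holding the class name:

```python
from typing import Any, Dict, List


def indexing_type_from_config(config: Dict[str, Any]) -> str:
    """Join the distinct component class names of a config, in order of appearance."""
    seen: List[str] = []
    # Assumption: each component entry looks like {"name": ..., "type": "<ClassName>"}.
    for component in config.get("components", []):
        class_name = component.get("type")
        if class_name and class_name not in seen:
            seen.append(class_name)
    # Fall back to the plain label when no component classes are found.
    return ", ".join(seen) if seen else "Indexing"
```

For a config containing a TextConverter followed by a PreProcessor, this would report "TextConverter, PreProcessor" rather than the generic "Indexing" label.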

Add indexing pipeline type

bc5dfc2

vblagoje requested a review from a team as a code owner October 24, 2022 12:53

vblagoje requested review from ZanSara and julian-risch and removed request for a team and ZanSara October 24, 2022 12:53

julian-risch approved these changes Oct 24, 2022


ZanSara added the type:feature, topic:pipeline, topic:telemetry, and topic:indexing labels Oct 24, 2022

vblagoje merged commit 1b9586a into deepset-ai:main Oct 24, 2022

vblagoje deleted the indexing_type branch March 31, 2023 20:47
