-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add indexing pipeline type #3461
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I left one comment about some additional functionality that I would like to see but this could become a separate issue and PR. If you decide to not include it here, I would suggest that you update the existing issue or add a new one.
fingerprint = sha1(json.dumps(self.get_config(), sort_keys=True).encode()).hexdigest() | ||
run_total = self.run.counter + self.run_batch.counter | ||
send_custom_event( | ||
"pipeline", | ||
payload={ | ||
"fingerprint": fingerprint, | ||
"type": self.get_type(), | ||
"type": "Indexing" if is_indexing else self.get_type(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to get more info about the components used in the indexing pipeline. As inspiration here is the tutorial on indexing: https://haystack.deepset.ai/tutorials/doc-class-index
It would be great if the "type" could contain TextConverter, PreProcessor, FileTypeClassifier, PDFToTextConverter, DocxToTextConverter
instead. It would be okay to get this done in a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think? For now just add a separate issue for that to the backlog?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps, let's get it working with the simplest solution possible and then improve.
Related Issues
Proposed Changes:
We didn't account for indexing pipelines. Now we detect them based on
file_paths
parameter being used.How did you test it?
Manual telemetry tests are being prepared if the general approach is approved.
Notes for the reviewer
Best to involve a manual test with access to telemetry to verify the actual pipeline is being detected properly
Checklist