-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add options for handling duplicate documents (skip, fail, overwrite) #1088
Conversation
@tholor Please review this pull request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice progress... I think we're almost there :)
Left a few minor comments.
One bigger thing that I was wondering about: Speed of handle_duplicate_documents
. I don't believe a single get_documents_by_id() is that expensive, but we need to make sure that it doesn't become a bottleneck for indexing. We can check this in our next benchmarking and iterate on it in case it's really an issue.
:param documents: A list of Haystack Document objects. | ||
:param duplicate_documents: Handle duplicates document based on parameter options. | ||
Parameter options : ( 'skip','overwrite','fail') | ||
skip (default option): Ignore the duplicates documents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if we want to make skip
or overwrite
the default option here. Any thoughts @lalitpagaria @oryx1729?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my view 'overwrite' will be good. User will get warning along with without any changes his latest/updated docs can be written to doc store.
In skip options, he can't move forward until he delete old documents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume from the above comment, overwrite will be the default option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think overwrite could be a good default value. I assume it'd overwrite the meta fields if they're different from the original?
In the haystack, there are still improvements that can be made for speed optimization and architecture improvements. Once the duplicate functionality criteria is approved then we will look into optimization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome. I think this is ready to be merged. Thanks for addressing the comments so fast 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Proposed changes:
Status (please check what you already did):