Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: support long texts for labels in ElasticsearchDocumentStore #3346

Merged
merged 16 commits into from
Nov 2, 2022
Merged

fix: support long texts for labels in ElasticsearchDocumentStore #3346

merged 16 commits into from
Nov 2, 2022

Conversation

anakin87
Copy link
Member

@anakin87 anakin87 commented Oct 7, 2022

Related Issues

Proposed Changes:

As deeply explained in #2621, in label index, I changed field types of answer and document from flattened to nested, aligning ElasticsearchDocumentStore with OpensearchDocumentStore. This way, the text in these fields is tokenized correctly and is no longer considered one immense term.

How did you test it?

Added new related test

Checklist

@anakin87 anakin87 requested a review from a team as a code owner October 7, 2022 15:28
@anakin87 anakin87 requested review from vblagoje and removed request for a team October 7, 2022 15:28
@anakin87 anakin87 marked this pull request as draft October 7, 2022 16:56
@anakin87 anakin87 marked this pull request as ready for review October 7, 2022 16:56
@masci masci changed the title bug: support long texts for labels in ElasticsearchDocumentStore fix: support long texts for labels in ElasticsearchDocumentStore Oct 11, 2022
@anakin87
Copy link
Member Author

@vblagoje I just merged main into this branch.
Any thoughts about this PR?

@vblagoje
Copy link
Member

@anakin87, my apologies - I somehow missed these notifications. I am not an expert in ElasticSearch, let me ask who can understand this PR much better than I do.

@ZanSara
Copy link
Contributor

ZanSara commented Oct 27, 2022

@anakin87 just to make sure: does this represent a schema break? For example: if you write some documents in ES with the code in main, then switch to this branch and try to read them (or vice-versa) does it works or it throws exceptions?

I'm asking because if it throws exceptions this should be marked as a breaking change and we should warn users about this detail when migrating (and potentially offer some instructions on how to access their documents again)

@anakin87
Copy link
Member Author

@ZanSara great question!

Digging into this, I remembered that the method _create_label_index is used only during index creation
(who would have thought? 😃 ).
And once the index is created, as it is expected, the schema doesn't change.

I performed some tests:

  • write labels using the main and read them using this branch
    ✔️
  • write labels using the main and write more labels using this branch
    same bug with long text (as the schema doesn't change)
  • write labels using this branch and read them using the main
    ✔️
  • write labels using this branch and write more labels using the main
    ✔️ no bug, since the schema doesn't change

So it doesn't seem like a breaking change.
It must be said that if the user keeps using an index with the old schema, the bug won't disappear.

@anakin87 anakin87 marked this pull request as draft October 28, 2022 15:17
@anakin87 anakin87 marked this pull request as ready for review October 28, 2022 15:17
@ZanSara
Copy link
Contributor

ZanSara commented Oct 28, 2022

@anakin87 great! I think it's good to merge then 👍

Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@anakin87
Copy link
Member Author

Thank you.
But now suddenly a test has started to fail. 😃
I'll try to understand why...

@ZanSara
Copy link
Contributor

ZanSara commented Oct 28, 2022

The stacktrace of that test is quite funny I have to say 😂

@anakin87
Copy link
Member Author

@ZanSara finally ready to go!

@ZanSara ZanSara merged commit 4b0894f into deepset-ai:main Nov 2, 2022
@anakin87 anakin87 deleted the support_long_texts_ElasticsearchDocumentStore branch November 2, 2022 10:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

write_labels crashes for ElasticsearchDocumentStore for labels with long context
3 participants