Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed the Search Field mapping in ElasticSearch DocumentStore #2080

Merged
merged 4 commits into from
Jan 31, 2022

Conversation

SjSnowball
Copy link
Contributor

@SjSnowball SjSnowball commented Jan 27, 2022

Added the fix for the #2055 issue.

Proposed changes:
Once the search fields are set in the Elasticsearchdocument store initializing, those fields should be mapped as "text" fields while indexing.
Added the code which will take the search fields and add the type as "text". This matches the checks added in "_create_document_index" and makes the search fields full-text searchable.

Status (please check what you already did):

  • First draft (up for discussions & feedback)
  • Final code
  • Added tests
  • Updated documentation

@SjSnowball SjSnowball changed the title Fixed the Search Field data type in ElasticSearch DocumentStore Fixed the Search Field mapping in ElasticSearch DocumentStore Jan 27, 2022
@ZanSara ZanSara requested a review from tstadel January 27, 2022 16:32
Comment on lines 308 to 310
if self.synonyms:
mapping["mappings"]["properties"][self.content_field] = {"type": "text", "analyzer": "synonym"}
mapping["settings"]["analysis"]["analyzer"]["synonym"] = {"tokenizer": "whitespace",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also set the search_fields types to text if synonyms have been provided.

Copy link
Contributor Author

@SjSnowball SjSnowball Jan 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean to add the synonym analyzer for the search field as well if the synonyms are passed @tstadel? Have added the changes still

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, functionally that's ok. However now we're passing always the same dictionary as field mapping. This can be confusing if later on someone tries to change the mapping for one field only but ends up with changing all self.search_fields mappings.

So here's what I recommend:

  • get rid of the extra handling for self.content_field in
    self.content_field: {"type": "text"},
    and
    mapping["mappings"]["properties"][self.content_field] = {"type": "text", "analyzer": "synonym"}
    This is per default included in self.search_fields, anyway.
  • instead of line 314 loop over self.search_fields and add the mapping with analyzer
  • add an else block after synonyms handling: set the mapping without analyzer for all self.search_fields
  • and there's another thing missing I totally forgot: we have to do the same for OpenSearchDocumentStore, so copy anything from Line 280 till the end of the else block of the last bullet point and replace
    mapping = {
    "mappings": {
    "properties": {
    self.name_field: {"type": "keyword"},
    self.content_field: {"type": "text"},
    },
    "dynamic_templates": [
    {
    "strings": {
    "path_match": "*",
    "match_mapping_type": "string",
    "mapping": {"type": "keyword"}}}
    ],
    },
    "settings": {
    "analysis": {
    "analyzer": {
    "default": {
    "type": self.analyzer,
    }
    }
    }
    }
    }

Copy link
Member

@tstadel tstadel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me :-)
Please add the missing synonyms support and we are good to merge.
And of course tests need to pass :-)

@tstadel
Copy link
Member

tstadel commented Jan 27, 2022

@SjSnowball, the tests failing seems not to be a code issue.

It seems like code changes do not take effect after a cache hit in CI. I had the same behavior in PR #2013. I pseudo changed setup.py by inserting a blank to work around the cache and it worked again.

@ZanSara, could that be connected to the new setup mechanism?

Comment on lines 308 to 310
if self.synonyms:
mapping["mappings"]["properties"][self.content_field] = {"type": "text", "analyzer": "synonym"}
mapping["settings"]["analysis"]["analyzer"]["synonym"] = {"tokenizer": "whitespace",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, functionally that's ok. However now we're passing always the same dictionary as field mapping. This can be confusing if later on someone tries to change the mapping for one field only but ends up with changing all self.search_fields mappings.

So here's what I recommend:

  • get rid of the extra handling for self.content_field in
    self.content_field: {"type": "text"},
    and
    mapping["mappings"]["properties"][self.content_field] = {"type": "text", "analyzer": "synonym"}
    This is per default included in self.search_fields, anyway.
  • instead of line 314 loop over self.search_fields and add the mapping with analyzer
  • add an else block after synonyms handling: set the mapping without analyzer for all self.search_fields
  • and there's another thing missing I totally forgot: we have to do the same for OpenSearchDocumentStore, so copy anything from Line 280 till the end of the else block of the last bullet point and replace
    mapping = {
    "mappings": {
    "properties": {
    self.name_field: {"type": "keyword"},
    self.content_field: {"type": "text"},
    },
    "dynamic_templates": [
    {
    "strings": {
    "path_match": "*",
    "match_mapping_type": "string",
    "mapping": {"type": "keyword"}}}
    ],
    },
    "settings": {
    "analysis": {
    "analyzer": {
    "default": {
    "type": self.analyzer,
    }
    }
    }
    }
    }

@SjSnowball
Copy link
Contributor Author

@tstadel You are right, mappings looks confusing.

get rid of the extra handling for self.content_field in...This is per default included in self.search_fields, anyway.

If you check the test snippet I have added, I used different fields in the search_fields and content_fields.
Content_field is not covered in the search_field list.
So, I worry, if we remove the content_field mapping, then it will index as a "keyword" type if that field is not covered in the search_field.

document_store = 
ElasticsearchDocumentStore( index="haystack_search_field_mapping",search_fields=["content", "sub_content"],content_field= "title")

2. instead of line 314 loop over self.search_fields and add the mapping with analyzer
3. add an else block after synonyms handling: set the mapping without analyzer for all self.search_fields

Yes, I have added the search_field mapping as you mentioned.
Along with that, I have added an if statement for content_fields. If the content_field is not present in the search_field then assign the mapping content_field as well.
This will make sure content_field and search_field are "text" types and are available for full-text search.
(I did the changes as per my understanding. Let me know if I am missing something.)

and there's another thing missing I totally forgot: we have to do the same for OpenSearchDocumentStore, so copy anything from Line 280 till the end of the else block of the last bullet point and replace

I replaced the else block. But I could see the changes are removing the content_field mapping. These changes will cause the same issue which I mentioned above right

@tstadel
Copy link
Member

tstadel commented Jan 29, 2022

@SjSnowball,

yes you're right with the content_field. Even if it unnecessarily increases the index size if it's not used as search_field, it should be searchable in general. That's a reasonable convention as you might want to include it into search_fields. And that should be possible without recreating the index, as content_field is naturally the field that dense retrieval happens on.
However keep in mind, at query-time we filter only for the search_fields. So in your example, the "title" field would not be searched by query(): https://github.com/SjSnowball/haystack/blob/7776ccd6b411333e47312e23ce9cc21d01e9ad36/haystack/document_stores/elasticsearch.py#L880

So sorry for my confusion: you can revert the deletion of line 284 and line 314. That's cleaner than the check for its inclusion in searchfields. Though you might want to add a comment that content_field should always be searchable even if it currently isn't part of search_fields.

Please also copy over the synonyms block to OpenSearchDocumentStore.
Thank's a lot for your contribution, btw!

@SjSnowball
Copy link
Contributor Author

@tstadel Understood.
Reverted the changes and added the synonyms in OpensearchDocumentstore

Copy link
Member

@tstadel tstadel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tstadel tstadel merged commit 7d769d8 into deepset-ai:master Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants