Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed the Search Field mapping in ElasticSearch DocumentStore #2080

Merged
merged 4 commits into from
Jan 31, 2022
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 9 additions & 0 deletions haystack/document_stores/elasticsearch.py
Original file line number Diff line number Diff line change
Expand Up @@ -301,6 +301,15 @@ def _create_document_index(self, index_name: str, headers: Optional[Dict[str, st
}
}
}
if self.search_fields:
if self.synonyms:
search_fields_mapping = {"type": "text", "analyzer": "synonym"}
else:
search_fields_mapping = {"type": "text"}

for field in self.search_fields:
mapping["mappings"]["properties"].update({field: search_fields_mapping})

if self.synonyms:
mapping["mappings"]["properties"][self.content_field] = {"type": "text", "analyzer": "synonym"}
mapping["settings"]["analysis"]["analyzer"]["synonym"] = {"tokenizer": "whitespace",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also set the search_fields types to text if synonyms have been provided.

Copy link
Contributor Author

@SjSnowball SjSnowball Jan 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean to add the synonym analyzer for the search field as well if the synonyms are passed @tstadel? Have added the changes still

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, functionally that's ok. However now we're passing always the same dictionary as field mapping. This can be confusing if later on someone tries to change the mapping for one field only but ends up with changing all self.search_fields mappings.

So here's what I recommend:

  • get rid of the extra handling for self.content_field in
    self.content_field: {"type": "text"},
    and
    mapping["mappings"]["properties"][self.content_field] = {"type": "text", "analyzer": "synonym"}
    This is per default included in self.search_fields, anyway.
  • instead of line 314 loop over self.search_fields and add the mapping with analyzer
  • add an else block after synonyms handling: set the mapping without analyzer for all self.search_fields
  • and there's another thing missing I totally forgot: we have to do the same for OpenSearchDocumentStore, so copy anything from Line 280 till the end of the else block of the last bullet point and replace
    mapping = {
    "mappings": {
    "properties": {
    self.name_field: {"type": "keyword"},
    self.content_field: {"type": "text"},
    },
    "dynamic_templates": [
    {
    "strings": {
    "path_match": "*",
    "match_mapping_type": "string",
    "mapping": {"type": "keyword"}}}
    ],
    },
    "settings": {
    "analysis": {
    "analyzer": {
    "default": {
    "type": self.analyzer,
    }
    }
    }
    }
    }

Expand Down
28 changes: 27 additions & 1 deletion test/test_document_store.py
Original file line number Diff line number Diff line change
Expand Up @@ -1190,4 +1190,30 @@ def test_DeepsetCloudDocumentStore_query_by_embedding(deepset_cloud_document_sto
)

emb_docs = deepset_cloud_document_store.query_by_embedding(query_emb)
assert len(emb_docs) == 0
assert len(emb_docs) == 0


@pytest.mark.elasticsearch
def test_elasticsearch_search_field_mapping():

client = Elasticsearch()
client.indices.delete(index='haystack_search_field_mapping', ignore=[404])

index_data = [
{"title": "Green tea components",
"meta": {"content": "The green tea plant contains a range of healthy compounds that make it into the final drink","sub_content":"Drink tip"},"id": "1"},
{"title": "Green tea catechin",
"meta": {"content": "Green tea contains a catechin called epigallocatechin-3-gallate (EGCG).","sub_content":"Ingredients tip"}, "id": "2"},
{"title": "Minerals in Green tea",
"meta": {"content": "Green tea also has small amounts of minerals that can benefit your health.","sub_content":"Minerals tip"}, "id": "3"},
{"title": "Green tea Benefits",
"meta": {"content": "Green tea does more than just keep you alert, it may also help boost brain function.","sub_content":"Health tip"},"id": "4"}
]

document_store = ElasticsearchDocumentStore(index="haystack_search_field_mapping",search_fields=["content", "sub_content"],content_field= "title")
document_store.write_documents(index_data)

indexed_settings = client.indices.get_mapping(index="haystack_search_field_mapping")

assert indexed_settings["haystack_search_field_mapping"]["mappings"]["properties"]["content"]["type"] == 'text'
assert indexed_settings["haystack_search_field_mapping"]["mappings"]["properties"]["sub_content"]["type"] == 'text'