From 7b38f8d169a163499b8c4b91be159a28d1878dfb Mon Sep 17 00:00:00 2001
From: "mergify[bot]" <37929162+mergify[bot]@users.noreply.github.com>
Date: Wed, 18 Oct 2023 10:33:23 +0200
Subject: [PATCH] [8.8] [DOCS] Adds section about tokens to ELSER conceptual (backport #2568) (#2572)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-authored-by: István Zoltán Szabó
---
 docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc | 23 ++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc b/docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc
index faa5aabbe..c3e24bff6 100644
--- a/docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc
+++ b/docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc
@@ -20,13 +20,28 @@ meaning and user intent, rather than exact keyword matches.
 ELSER is an out-of-domain model which means it does not require fine-tuning on
 your own data, making it adaptable for various use cases out of the box.
 
+
+[discrete]
+[[elser-tokens]]
+== Tokens - not synonyms
+
 ELSER expands the indexed and searched passages into collections of terms that
 are learned to co-occur frequently within a diverse set of training data. The
 terms that the text is expanded into by the model _are not_ synonyms for the
-search terms; they are learned associations. These expanded terms are weighted
-as some of them are more significant than others. Then the {es}
-{ref}/rank-features.html[rank features field type] is used to store the terms
-and weights at index time, and to search against later.
+search terms; they are learned associations capturing relevance. These expanded
+terms are weighted as some of them are more significant than others. Then the
+{es} {ref}/rank-features.html[rank features] field type is used to store the
+terms and weights at index time, and to search against later.
+
+This approach provides a more understandable search experience compared to
+vector embeddings. However, attempting to directly interpret the tokens and
+weights can be misleading, as the expansion essentially results in a vector in a
+very high-dimensional space. Consequently, certain tokens, especially those with
+low weight, contain information that is intertwined with other low-weight tokens
+in the representation. In this regard, they function similarly to a dense vector
+representation, making it challenging to separate their individual
+contributions. This complexity can potentially lead to misinterpretations if not
+carefully considered during analysis.
 
 [discrete]