-
Notifications
You must be signed in to change notification settings - Fork 154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extended Configuration for ElasticSearch Analyzer #17
Conversation
Given that changes to the search settings are a rare necessity, I recommend using the possibilities of DI Configuration instead of building a new type of configuration |
The main reason is adapting the search for different languages. With more presence of Magento at such huge markets as Japan, China, India, etc. this is really "must have" feature.
First, this is not a new configuration but the existing one extended in a backward compatible way. Second, DI configuration may be used in 2 way:
Third (and this one is my personal opinion), DI configuration in Magento currently abused and configure too many aspects. In fact, DI should be used to set dependencies but not to configure their behavior. Magento has a pretty flexible configuration framework that allows introducing of new config types easily and this should be considered as best practice. We should have more small config files targeted to particular feature instead of huge 500-1000 lines di.xml files |
|
||
Each direct child of `tokenizer` element may contain element which represent configuration parameter available for specified tokenizer type. See [ElasticSearch parameters XML representation](#ElasticSearch-parameters-XML-representation) section for more information. | ||
|
||
```xml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I understood the intention was to make configuration similar to what we already have. Current configuration in the module is different from configuration that we have for other modules. Have you thought about unifying it? Since we don't have a lot of configuration, maybe we can introduce new format, deprecate old one and create fallback for backwards compatibility?
I need to think more about alternative format, but here is what I have so far
<config>
<languages>
<language name="default">
<tokenizer name="default" type="standard"/>
<token_filter name="default" type="lowercase"/>
<char_filter name="default" type="html_strip"/>
</language>
<language name="jp_JP">
<tokenizer name="my_tokenizer"
type="kuromoji_tokenizer"
mode="extended"
discard_punctuation="false"
user_dictionary="userdict_ja.txt"/>
<token_filer name="my_token_filter"
type="standard"
max_token_length="5"/>
<char_filer name="my_char_filter"
type="pattern_replace"
pattern="(\\d+)-(?=\\d)"
replacement="$1_"/>
</language>
</languages>
</config>
Main benefits
- Name of the language is value of an attribute instead of name of the node, more consistent with other Magento configuration. Are there other benefits in proposed configuration except consistency with existing module configuration?
- Configuration per language defined in one place
- Nodes that unlikely will have their own attributes made attributes
Don't like how defaults being set in my example, need to think more about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main intention was to make configuration similar to existing exactly in esconfig.xml
regardless it differs from main approach in other config files. I also think that supporting two types of config will create too many difficulties in maintenance and overcomplicate implementation.
I suppose in future we will need BiC in this module to eliminate support of ES 2 and full support of new features. I think that time in future would be ideal for reviewing the whole config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would favor introducing configuration in the format that follows general and Magento best practices because amount of configuration we going to add is pretty big. Currently we have small amount of configuration and it should be easy to support fallback to it. Having free form configuration that is not following conventions of other configuration in Magento is bad for developer experience. Would be interested in other people opinion on this.
| ------------- |-------------| | ||
| `{key: true}` | `<key xsi:type="boolean">true</key>` | | ||
|
||
#### Least |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
| ------------- |-------------| | ||
| `{key: ['v1', 'v2']}` | `<key xsi:type="list"><item>v1</item><item>v2</item></key>` | | ||
|
||
Item node may declare `ref` attribute that should be unique inside list. `ref` attribute should be used to provide a possibility to override list element by Magento configuration merging mechanism. If module B want to add element to list in ElasticSearch config declared by module A then `ref` attribute should be used as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should ref
be name
or it's different from how we use name
in other configuration files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may renamed to name
for consistency.
|
||
### ElasticSearch parameters XML representation | ||
|
||
JSON configuration described at ElasticSearch configuration may be converted to XML configuration to specify tokenizer and filter parameters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where this can be specified? Could you expand on this a little or point me to where it's described in case I'm missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure that understand the question. XML config in the proposal does not declare strict schema. Instead of that proposed straightforward rules how to represent ElasticSerach config in XML
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think there could be a benefit in making this configuration strict? Will be easier for developer to write. It's also better in case Elastic change configuration format. Could you point me to JSON configuration we trying to represent in XML?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. We will not cover all possible ES config options with specific elements so will have to introduce general elements like standard_filter
, custom_filter
, filter_param
, etc. This will make validation of XML more strong but in cost of extra verbosity and worse readability
|
||
#### Objects | ||
|
||
Objects or maps are key structure to declare ElasticSearch configuration. When converting JSON to XML keys became element names and values are presented as node values. An element may declare a type of value with `xsi:type` attribute but it may be autodetected by the parser. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we keep configuration in the flexible format, can we remove xsi:type
or there are cases when it's needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is optional. If omitted type is autodetected
Per @vkublytskyi proposal became obsolete. |
This is a proposal of adding new elements to
esconfig.xml
files to have more control on ElasticSearch analyzer settings for indexes.Here is example of proposed changes:
This proposal is part of work on magento/magento2-jp#18 and has implemented prototype at magento/magento2-l10n#1 implemented by @HirokazuNishi from Veriteworks in collaboration wit Magento Community Engineering Team.
Extended Configuration for ElasticSearch Analyzer
Context
During work on Japanese localization community project we descovered requirement to modify default index settings created by
Magento\Elasticsearch\Model\Adapter\Index\BuilderInterface::build
.First of all, this is caused by the unique nature of the Japanese writing system. Usage of standard analyzer and tokenizer do not provide sufficient search accuracy. To provide valid search results Japanese (kuromoji) Analysis Elasticsearch plugin should be used which provides set of tokenizers and token filters correctly processing Japanese texts.
Existing Elasticsearch configuration in Magento 2 allows setting only default stemmer name and list of stop words. It introduces two top-level XML elements
stemmer
andstopwords_file
. Each element includes elements with valid locale code as a name anddefault
element which holds configuration that should be used if locale-specific configuration is not defined.There may be much more use-cases when SI integrators would like to change default index settings to adjust search results.
To allow complex changes of default Elasticsearch configuration in Magento 2, such us required by Japanese localization project, at the current moment we have 2 options:
\Magento\Elasticsearch\Model\Adapter\Index\Config\EsConfigInterface
andMagento\Elasticsearch\Model\Adapter\Index\BuilderInterface
.Plugins are powerful mechanism but they push developers to explore the internals of pluginized module that violates the Open-Closed principle. Usage of plugins would require a same code from project-to-project.
Decision
Magento 2 should provide a possibility to configure analyzer for indexes. To not limit all the power of ElasticSearch without exposing all complexity of configuration should include:
The extended schema should be fully backward compatible and consistent with the current implementation.
General Concepts
As all required configuration options are independent they should be expressed as top-level XML elements.
As all required configuration options have a strong dependency on a language and to be consistent with existing configuration elements each introduced elements should contain
default
element and elements with names corresponding to locale names.As particular tokenizer and filters may require complex types for configuration XML schema should allow to do that and converting of available options described at ElasticSearch Documentation in JSON to XML elements should follow simple straight-forward rules.
Tokenizer Configuration
A tokenizer is configured by
tokenizer
element which must includedefault
element as a container for common configuration and may include one or more elements with valid locale codes as a name that are containers for locale-specific configuration.Each direct child of
tokenizer
element must containtype
element with the name of tokenizer to be used.Each direct child of
tokenizer
element may contain element which represent configuration parameter available for specified tokenizer type. See ElasticSearch parameters XML representation section for more information.Token Filters Configuration
Token filters are configured by
token_filters
element which must includedefault
element as a container for common configuration and may include one or more elements with valid locale codes as a name that are containers for locale-specific configuration.Each direct child of
token_filters
holds configuration of filter:<lowercase/>
)type
child element which contains the name of customized token and optional parameters. See ElasticSearch parameters XML representation section for more information.Char Filters Configuration
Char filters configuration is similar to token filters but declared by
char_filters
elementElasticSearch parameters XML representation
JSON configuration described at ElasticSearch configuration may be converted to XML configuration to specify tokenizer and filter parameters.
Objects
Objects or maps are key structure to declare ElasticSearch configuration. When converting JSON to XML keys became element names and values are presented as node values. An element may declare a type of value with
xsi:type
attribute but it may be autodetected by the parser.{key: {}}
<key xsi:type="map"><!-- ... --></key>
Strings
String values are converted to text node:
{key: "string value"}
<key xsi:type="string">string value</key>
Numbers
Numeric values are converted to text node. This type utilize both integer and float numbers depend on provided decimal part:
{key: 42}
<key xsi:type="number">42</key>
Booleans
Boolean values are converted to text node and may contain "true" or "false":
{key: true}
<key xsi:type="boolean">true</key>
Least
Array values represented as series of
item
nodes:{key: ['v1', 'v2']}
<key xsi:type="list"><item>v1</item><item>v2</item></key>
Item node may declare
ref
attribute that should be unique inside list.ref
attribute should be used to provide a possibility to override list element by Magento configuration merging mechanism. If module B want to add element to list in ElasticSearch config declared by module A thenref
attribute should be used as well.Disabling Configuration Element
As with proposed changes configuration may be complex and conflict with some requirements it is necessary to provide a mechanism of disabling some parts of configuration during the Magento configuration merge process.
To achieve this goal any element may declare
disabled
boolean attribute:Deprecation of Magento\Elasticsearch\Model\Adapter\Index\Config\EsConfigInterface
Current
@api
interfaceMagento\Elasticsearch\Model\Adapter\Index\Config\EsConfigInterface
violates Interface Segregation principle. It extending is impossible as that would be backward incompatible.As a solution
Magento\Elasticsearch\Model\Adapter\Index\Config\EsConfigInterface
should be deprecated and instead of it new set of interfaces should be introduced. Each new interface should be responsible for a single aspect of ElasticSearch configuration.Class
Magento\Elasticsearch\Model\Adapter\Index\Config\EsConfig
will implement all of these interfaces to simplify backward compatible implementation.Prototype
Prototype of proposed changes is implemented in magento/magento2-l10n#1 by Hirokazu Nishi from Veriteworks in collaboration wit Magento Community Engineering Team.
Status
Proposed
Consequences
Proposed changes to ElasticSearch XML configuration give a possibility to fully configure ElastoicSearch analyzer for indexes. It currently not supported multiples tokenizers. The current implementation also assumes usage of unified stemmer and stop words list declared in CSV files. Fixing these limitations is out of the scope of this proposal and should be addressed if real issue reported.
The main drawback of this proposal is an introduction of XML schema that not really match XML philosophy (not strictly defined elements structure). This is done by intention to be consistent with existing configuration and to provide straightforward, not verbose conversion of configuration in JSON described at ElasticSearch documentation to XML format expected by Magento.