Custom _all fields #4520
Comments
On Different analyzer chains: On Proposed syntax:
I would consider it a shorthand form but would like to retain the complete verbose form.
@roytmana stemming some words and not others is pretty meaningless - you have to choose at query time whether you want to query the stemmed form or the unstemmed form. At that stage it is better to have them in two different fields. Putting tokens from multiple analysis chains into one field results in a mess - it really doesn't work well. Second, for field-level index-time boosting: I don't recommend using it for a single field. You lose precision in field norms and you have to reindex if you want to change it. It is much better to use query-time boosting on a field instead. For the custom
It is not meaningless. Yes, you have the freedom to choose an analyzer at query time, but you do not have to. As I said, the latest ES versions handle AND queries for tokens at the same position gracefully, which removes the issue of not being able to use the same (stem + no-stem) analyzer at query time. In some cases it will create a mess and in some it will not. In the case I outlined above it works better for me than trying to combine several flavors of an _all-like field (stemmed and unstemmed), and it is the only way to have an _all-like field combining stemmed and unstemmed input. That is very important for cases where stemming certain contributing fields can screw up data (like stemming people's names). When you have hundreds of fields contributing to an _all-like field, I would like to have as much control as possible over how it is put together (boosts, position gaps and analyzer chain). It would be up to me to make sure it is not a mess in the end. I would not want ES to protect me from getting burned by denying me such functionality. Not to mention that there could be many people who use it already, and removing it would break their code. I do not dispute that an _all-like field-level analyzer chain without per-contributing-field chains is the most common use case, but why not use a shorthand default config: the absence of analyzers on a contributing field definition (which will be the case when using your shorthand version) would indicate that the _all-like field's analyzers should be used.
I'm wondering whether it may actually be a better idea to stem your family names at indexing time. For example, let's imagine that one of the family names is Y, which is also a common name whose stem is X. I assume that you would apply stemming at query time so a query on
From that perspective, yes (ideally I would want to give the stemmed form a slight negative boost). I did not test it enough with real data, as I had to switch from the all-like field back to the _all field because field-based boosts were not supported. But here is another scenario: I am most interested in real words, not people's names. I am searching on "turn" but also getting Turner because names were stemmed. Also, in the case of synonyms it is not as obvious. I guess it is never perfect for all scenarios.
@clintongormley if we use the copy-to syntax, it would be great if we could copy multifields recursively into other multifields. For example, I may have a my_all field which includes 100 fields, and I want a stemmed version of it and a shingled one. Being able to create my_all_stemmed by copying my_all would be a huge benefit.
@roytmana I don't think that would work with how ES uses stream parsing. We would have to hold on to a bunch of information to support this, plus would have to handle circular dependencies. Sounds more complex than we want to make this. Instead, you'll just be able to specify:
(yes I realise you'll have to do it on all 100 fields, but I think the advantages of being explicit outweigh the complications of recursion here)
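The snippet that originally followed "you'll just be able to specify:" is not preserved on this page. As a rough sketch of the explicit approach described in that comment, and assuming the `copy_to` array form from the proposal, a mapping could be generated so that every source field lists both destinations itself, with no copying between destinations. The helper `build_mapping`, the `field_N` names, the `string` type and the analyzer choices are illustrative assumptions; `my_all` and `my_all_stemmed` are the names used in the discussion.

```python
import json


def build_mapping(source_fields, destinations):
    """Build a mapping dict where every source field copies to every destination.

    `destinations` maps a destination field name to the analyzer it should use.
    This is a hypothetical helper, not part of Elasticsearch.
    """
    properties = {}
    for name in source_fields:
        # Each source field names all destinations explicitly (no recursion).
        properties[name] = {"type": "string", "copy_to": list(destinations)}
    for name, analyzer in destinations.items():
        # Destination fields carry their own analysis settings and no copy_to.
        properties[name] = {"type": "string", "analyzer": analyzer}
    return {"properties": properties}


sources = [f"field_{i}" for i in range(100)]
mapping = build_mapping(sources, {"my_all": "standard", "my_all_stemmed": "english"})

# Print one source field to show the shape of the generated JSON.
print(json.dumps(mapping["properties"]["field_0"], indent=2))
```

Generating the mapping this way keeps the repetition out of hand-written JSON while the resulting mapping stays fully explicit.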
@clintongormley fair enough, it's not too hard. What about reversing it:
That makes it easy to maintain everything in one place; the big disadvantage is the need to use full property names. I still have some concerns about using just the copy-to form: I would like to be able to inherit boosts from contributing fields if no boost is specified in the copy-to statement, and I would like to be able to specify a position gap offset for each contributing field, even if you decide not to support different analysis pipelines. Will copy_to support both strings (field names to copy to) and objects with a field name and options such as boost and gap offset, and anything else we may need in the future? The string form would be a shorthand for the default copy logic.
We did consider As far as Re inheriting boosts... hmmm, I suppose we could do that. However, I repeat, using field-level index-time boosting is a bad idea, with the exception of when you use a custom
There shouldn't be anything other than boost. All we're doing is taking the value from one place and indexing it as a different field, which has all the settings you need. The only exception is the per-field boost.
Thanks for the explanation @clintongormley. I still feel that providing flexibility in how all-like fields are put together (multiple pipelines) would be very valuable, but it is your call of course :-) Will the traditional field-scoped concept of multifield remain (say, for a not-analyzed version of a field, with no copying from multiple sources involved), or will we have to declare them separately and then use copy_to? Is this slotted for the near future, 0.9.x or 1.0.x? I just want to plan better, as I have a rather big mapping file to rework. Thankfully it is all defined in JavaScript code and generates itself, including proper naming (full name) of multifields where both all-like and field-scoped multifields are needed, but it is still a fair amount of work.
Multi-fields will remain, although I'd like to see their syntax improved as per #4521
It won't be in 0.90 but hoping to get it in for 1.0
@clintongormley many thanks!
Currently, boosting on `copy_to` is misleading and does not work as originally specified in elastic#4520. Instead of boosting just the terms from the origin field, it boosts the whole destination field. If two fields copy_to a third field, one with a boost of 2 and another with a boost of 3, all the terms in the third field end up with a boost of 6. This was not the intention. The alternative: to store the boost in a payload for every term, results in poor performance and inflexibility. Instead, users should either (1) query the common field AND the field that requires boosting, or (2) the multi_match query will soon be able to perform term-centric cross-field matching that will allow per-field boosting at query time (coming in 1.1).
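One possible reading of option (1) above is a `bool` query that requires a match on the common field and adds a boosted optional match on the origin field. The sketch below builds such a request body as a Python dict; the helper `boosted_query` and the field names `my_all_field` and `title` are hypothetical, and the must/should split is an assumption about what "query the common field AND the field that requires boosting" means.

```python
import json


def boosted_query(text, common_field, boosted_field, boost):
    """Build a search body: required match on the combined field,
    optional boosted match on the field that should count for more."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {common_field: text}}],
                "should": [
                    {"match": {boosted_field: {"query": text, "boost": boost}}}
                ],
            }
        }
    }


body = boosted_query("quick brown fox", "my_all_field", "title", 2)
print(json.dumps(body, indent=2))
```

Because the boost lives in the query rather than in the index, it can be changed without reindexing.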
In the quest for a cleaner way of setting up custom `_all` fields, there are two questions that need to be answered:
`_all` field (like we do with the `_all` field), and can we only pay the query-time price of per-field boosting if it is used?

Different analyzer chains
I can't think of a good use case where it makes sense to combine the output from different analyzer chains into a single field. The field can only ever be searched via a single analyzer; multiple analyzers can produce tokens which interfere with each other (and so produce wrong results); and the term frequencies for overlapping tokens will be badly messed up. Also, a clean token stream should never have offsets move "backwards".
So I think we can discount multiple analyzers outputting to a single field.
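To make the term-frequency concern concrete, here is a toy sketch that feeds the same text through an unstemmed chain and a stemmed chain and merges both outputs into one field. The `toy_stem` function is a crude suffix stripper standing in for a real stemmer (an assumption for illustration only), and the `Counter` stands in for the per-term frequencies of the combined field.

```python
from collections import Counter


def unstemmed(text):
    """Lowercase and split on whitespace: the 'no-stem' chain."""
    return text.lower().split()


def toy_stem(token):
    """Crude suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def stemmed(text):
    """The 'stem' chain: same tokens, each reduced by toy_stem."""
    return [toy_stem(token) for token in unstemmed(text)]


text = "Turner turns turning turn"

# Separate fields keep clean statistics for each chain.
tf_unstemmed = Counter(unstemmed(text))
tf_stemmed = Counter(stemmed(text))

# A single field fed by both chains mixes the two token streams.
tf_combined = tf_unstemmed + tf_stemmed

print(tf_unstemmed["turn"], tf_stemmed["turn"], tf_combined["turn"])
```

In the combined field the frequency of `turn` matches neither chain on its own, and the name "Turner" also contributes to it, which is the kind of interference described above.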
Per-field boosting
When combining multiple fields into a single field, you lose the effect of field norms (ie title is shorter and thus more important than body). Field-level boosting at index time is the only way to maintain this distinction.
The `_all` field takes field-level boosts into account by storing any boost that is not 1.0 as a payload with each term. Retrieving these payloads has an impact on query performance, but the `_all` field has an optimization called "auto_boost" which allows you to only pay the price of payloads if any included field has a boost other than 1.0.

I think field-level boosts should be supported with custom `_all` fields too.

Proposed syntax
Given that we're not going to support separate analyzer chains, the current way of implementing custom `_all` fields with multi-fields is verbose and misleading, as it implies that each source field can apply its own analyzer.

Instead, we suggest the following:
The `copy_to` parameter can also support an array of fieldnames:

Per-field boosting could be specified in two ways:

- With the caret `^` syntax: `"copy_to": "my_all_field^2"`
- As an object: `"copy_to": { "field": "my_all_field", "boost": 2 }`
The destination custom `_all` field can be defined in the mapping:

If it is not defined in the mapping, then it should be added using dynamic mapping (or fail if dynamic mapping is disabled).