Prepend the type name to the index_name automatically #5851
Comments
I actually wasn't aware this was happening either, and this change is important in order to preserve relevance quality. In that respect, I don't think losing field-data sharing is a disadvantage. Just as splitting the two could lead to more memory usage, it could lead to less memory usage if, once they are split, only one of them is used for sorting / faceting. Also, does this mean aggregations were broken in such scenarios?
Thanks for creating this issue, @clintongormley! I'd like to just point out the particular problems this behavior has caused us...

We're not using Elasticsearch as a canonical datastore, but rather as a search engine for data that's canonically stored in a SQL or NoSQL database, so our ES mappings are usually just mechanical translations of our other database schemas. Likewise, our indexing pipeline uses mechanical transformers to denormalize our SQL records into JSON documents for Elasticsearch. As part of that process, our ES mappings often look something like this (with all the irrelevant fields omitted for brevity):

{
  "user" : {
    "_id" : { "type" : "string", "path" : "id" },
    "properties" : {
      "id" : { "type" : "string", "index" : "not_analyzed" },
      "address" : { "type" : "nested", "properties" : {
        "id" : { "type" : "integer" }
      }}
    }
  }
}

...where there are multiple different ID fields, some of which are integers or longs, and others are strings. On the query side, we always refer to these with fully qualified field names (in this case, "id" and "address.id"), so it can be disorienting when our queries or facets against an "id" end up matching against an "address.id". More importantly, though, we've experienced lots of hair-pulling trying to debug shard failures at query-time, caused by requests that sort hits by ID:
I probably spent 30 hours investigating that bug and implementing workarounds, eventually asking this question on the mailing list a few months ago at the height of my frustration: "NumberFormatException when sorting by numeric document ID".

Now that we understand the bug, we've implemented workarounds by mandating that all our other SQL and NoSQL schemas conform to the rule: no two fields (in any column of any table) may have the same name if they have different low-level types. That way, if we use mechanical tools to generate ES mappings from SQL schemas, the automatically-generated field names will be conflict-free. But it makes our schemas a lot more awkward than they ought to be.

This is a big big deal for us, and it's the only issue blocking us from adopting Elasticsearch as a canonical datastore. Anyhow, thanks again for opening this issue. I really appreciate it!
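The naming rule described in this comment can be checked mechanically before a generated mapping is sent to Elasticsearch. The following is a minimal sketch, not part of Elasticsearch or of the commenter's tooling: the helper names `leaf_fields` and `short_name_conflicts` are hypothetical, and the check only looks at the `type` of each leaf field in one type's `properties`.

```python
from collections import defaultdict


def leaf_fields(properties, prefix=""):
    """Yield (full_path, type) for every leaf field under a `properties` dict."""
    for name, spec in properties.items():
        path = prefix + name
        if "properties" in spec:
            # Object or nested field: descend into its sub-fields.
            yield from leaf_fields(spec["properties"], path + ".")
        else:
            yield path, spec.get("type")


def short_name_conflicts(properties):
    """Return {short_name: {full_path: type}} for short names mapped to several types."""
    by_name = defaultdict(dict)
    for path, field_type in leaf_fields(properties):
        short_name = path.rsplit(".", 1)[-1]
        by_name[short_name][path] = field_type
    return {
        name: paths
        for name, paths in by_name.items()
        if len(set(paths.values())) > 1
    }


# The "user" mapping from the comment above, reduced to its properties.
user_properties = {
    "id": {"type": "string", "index": "not_analyzed"},
    "address": {"type": "nested", "properties": {"id": {"type": "integer"}}},
}

conflicts = short_name_conflicts(user_properties)
print(conflicts)
```

For the mapping above, the check reports that the short name "id" is used by both "id" (string) and "address.id" (integer), which is exactly the kind of pair the commenter's rule forbids.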
@clintongormley It's been a few months since this issue was created, and I wanted to check back in to see if you or any of the other core members could chime in. This continues to be a big issue for us (and, I'd venture to say, for anyone else using Elasticsearch like a database). Whenever I discuss Elasticsearch with other engineers, this is always their #2 concern (second only to consistency guarantees).
I think this is a very important point. Several of the bugs we have come from the fact that different mappings can go to the same index field name: downgraded index options, the wrong analyzer picked at search time when the type is not specified, parsing issues when loading field data, etc. So I don't think we have many choices other than either storing the mappings at the index level or using different field names for each type.
@jpountz what do you mean by:
Are you saying you would be in favour of prepending the type name to the inverted index name? Agreed that this would solve a number of issues, but what negative effects could it have? Would (eg) aggregating on the |
@clintongormley slower maybe, but (finally) accurate
Either that, or enforcing that field mappings that have the same index name are exactly the same (that's what I meant by storing mappings on the index level). I guess the first option would have negative effects around compression (more sparse postings lists and terms dictionaries), memory usage and speed, while the 2nd option would remove flexibility since fields would need to have the same type, index options, doc values options, etc. across types.

Having different index names per type for search is ok I think, it's mostly a matter of building a disjunction query on the queried types. For everything that works on field data however (aggs, field data ranges, etc.) this would be more tricky since we would need to wrap several field data instances into a single one, which would be very costly since field data expects values to be unique and sorted. Or we would need to treat each type as a filtered index and query them like we query indices today, but this would mean ${num_types} queries per index as opposed to one today.

So maybe just enforcing that field mappings are the same on all types would be more practical. We already have the limitation today anyway (cf. field data loading issues with numeric fields, or positional queries if one of the types has |
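To illustrate the "disjunction query on the queried types" mentioned in this comment, here is a minimal sketch of how a single-field term query could be expanded if each type had its own index name. It is an illustration only, not how Elasticsearch behaves: the `<type>.<field>` naming scheme, the function `per_type_term_query`, and the type names `type_a` and `type_b` are all assumptions. The output uses the standard `bool` / `should` / `term` query DSL shapes.

```python
def per_type_term_query(field, value, types):
    """Expand a term query on `field` into one clause per type.

    Assumes a hypothetical scheme where each type's field is indexed
    under "<type>.<field>".
    """
    clauses = [{"term": {f"{t}.{field}": value}} for t in types]
    if len(clauses) == 1:
        # A single type needs no disjunction.
        return clauses[0]
    # Match documents that satisfy the term in at least one type's field.
    return {"bool": {"should": clauses, "minimum_should_match": 1}}


query = per_type_term_query("foo", "bar", ["type_a", "type_b"])
print(query)
```

The search side stays manageable because the expansion is purely a query rewrite; the harder part raised in the comment, merging field data from several per-type fields for aggregations and sorting, is not addressed by a rewrite like this.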
@jpountz good explanation. I think that defaulting to a separate inverted index per field/type adds a lot of overhead, especially when there is a simple workaround: use a different field name.
I'm in favour of this. Currently it is a hidden gotcha which can surprise users. It'd be much better to let them know up front at mapping create/update time. That, at least, would be a fairly simple change. Of course, there are some things which ARE type level, eg whether the |
Hm. Is that the root cause of my issue here? #7963
@nickminutello yes, it is.
I want to raise our objection to this proposed solution (as I also did on thread #8614). Enforcing the same mapping for fields with the same name under different types breaks the concept of a type as a "table". This is the way types have been perceived and used, by us and by many others. We, for instance, are counting on the type separation, and have a lot of cases of fields with different mappings under different types. This breaking change will be a huge issue for us with enormous implications. And I'm sure many others are in the same situation. I think you need to keep the type separation as it was always presented by you and understood by most of your users (just yesterday at the Elasticsearch meetup here Boaz referenced types as being "equivalent to DB tables").
@rore The problem is that this design currently doesn't work, and there are two main options: having types map their fields to a unique field name in the Lucene index (eg. by prepending the name of the type), or enforcing that types have consistent mappings. The first option would make postings lists sparser (which means slower and less compressed) and cross-type search slower (a single-field term query on two types would need to internally become a disjunction on two fields), while the 2nd option would only require that fields that have the same name have consistent mappings. While I agree that there could be some exceptions, a field called |
@jpountz Let's take a not too unreasonable scenario: an index that holds data that is stored in another system. You have a type per table, say t1, t2 and t3. Each type ("table") has a field called externalID that denotes an id on another system. For t1, the externalID is an int. For t2, it's a string. For t3, it's a compound index so it's an object with 2 fields.
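To make this scenario concrete, here is a minimal sketch of what a "same name, same mapping" rule would report for those three types. It is a simplified illustration, not Elasticsearch's actual validation: `find_mapping_conflicts` is a hypothetical helper, it compares only top-level field definitions by plain equality, and the sub-field names `part_a` and `part_b` of the t3 object are invented for the example.

```python
def find_mapping_conflicts(mappings):
    """Return (field, first_type, conflicting_type) for inconsistent definitions.

    `mappings` maps each type name to its top-level `properties` dict.
    The first definition seen for a field name is treated as the reference.
    """
    first_seen = {}  # field name -> (type name, field definition)
    conflicts = []
    for type_name, properties in mappings.items():
        for field, spec in properties.items():
            if field not in first_seen:
                first_seen[field] = (type_name, spec)
            elif first_seen[field][1] != spec:
                conflicts.append((field, first_seen[field][0], type_name))
    return conflicts


mappings = {
    "t1": {"externalID": {"type": "integer"}},
    "t2": {"externalID": {"type": "string"}},
    "t3": {
        "externalID": {
            "properties": {
                "part_a": {"type": "string"},   # hypothetical sub-field
                "part_b": {"type": "integer"},  # hypothetical sub-field
            }
        }
    },
}

conflicts = find_mapping_conflicts(mappings)
print(conflicts)
```

Under such a rule, both t2 and t3 would be rejected against t1's definition of externalID, which is why the scenario requires either renaming the fields or splitting the types across indices.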
We usually only enforce new rules on newly-created indices, so none of your current data is going to be lost and you will have time to migrate. For your use-case, you could either partition your data into several indices based on the system that your |
This was only one example. We have a lot of indexes, a lot of types and a lot of places where this happens or might happen. Dealing with all the possible ways it could impact us is a complex issue. And again, on a general level, it breaks the concept of a type and why you should use it. I understand your considerations but I think you're choosing the path that will cause more harm than good, at least for existing users.
Closing in favour of #8870
Fields in different types that share the same path end up being indexed into the same inverted index. This can be a surprising gotcha for new users, who expect that types are as separate as tables in a traditional database.
One possibility, suggested in #4081, is to prepend the type name to the index_name (the name of the inverted index) automatically.
This can be done manually now, by specifying a different index_name per field, but I'm wondering if it should become the default. What would be the disadvantages of doing this? I can think of these:
foo field across all types in an index, instead you'd have to query *.foo
The upsides are:
I'm guessing that most of the time, fields with the same name in different types usually represent the same "thing", and so would be mapped in the same way, which leads me to think that we should leave things as they are, and allow users to use the index_name to configure this manually where needed.
Any thoughts?