Prepend the type name to the index_name automatically #5851

Closed
clintongormley opened this issue Apr 17, 2014 · 16 comments

@clintongormley

Fields in different types that share the same path end up being indexed into the same inverted index. This can be a surprising gotcha for new users, who expect that types are as separate as tables in a traditional database.

One possibility, suggested in #4081, is to prepend the type name to the index_name (the name of the inverted index) automatically.

This can be done manually now, by specifying a different index_name per field, but I'm wondering if it should become the default. What would be the disadvantages of doing this? I can think of these:

  • field data would not be shared between types, leading to more memory usage
  • you wouldn't be able to query the foo field across all types in an index, instead you'd have to query *.foo

The upsides are:

  • different mappings would not clash
  • different field data data types would not clash
  • term frequencies would be preserved per type
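To make the trade-off above concrete, here is a toy Python model of an inverted index keyed either by the bare field name or by a type-prefixed name. It is only a sketch: `ToyIndex` and its `prefix_with_type` flag are invented for illustration and are not Elasticsearch or Lucene code.

```python
from collections import defaultdict


class ToyIndex:
    """Toy model: one postings map per index_name (hypothetical, not ES code)."""

    def __init__(self, prefix_with_type=False):
        self.prefix_with_type = prefix_with_type
        # index_name -> term -> set of (type, doc_id)
        self.postings = defaultdict(lambda: defaultdict(set))

    def index_name(self, doc_type, field):
        # Either share one inverted index per field name, or split it per type.
        return f"{doc_type}.{field}" if self.prefix_with_type else field

    def add(self, doc_type, doc_id, field, term):
        self.postings[self.index_name(doc_type, field)][term].add((doc_type, doc_id))

    def doc_freq(self, doc_type, field, term):
        """Number of documents containing the term, as seen from this type's field."""
        return len(self.postings[self.index_name(doc_type, field)][term])


shared = ToyIndex(prefix_with_type=False)
split = ToyIndex(prefix_with_type=True)
for idx in (shared, split):
    idx.add("user", 1, "name", "smith")
    idx.add("company", 1, "name", "smith")
    idx.add("company", 2, "name", "smith")

print(shared.doc_freq("user", "name", "smith"))  # 3: counts both types
print(split.doc_freq("user", "name", "smith"))   # 1: counts only "user"
```

In the shared model the document frequency seen by `user.name` includes `company` documents, which is the term-frequency effect listed above; in the split model each type keeps its own statistics at the cost of one more postings map per type.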

I'm guessing that most of the time, fields with the same name in different types usually represent the same "thing", and so would be mapped in the same way, which leads me to think that we should leave things as they are, and allow users to use the index_name to configure this manually where needed.

Any thoughts?

@synhershko
Contributor

I wasn't aware this was happening either, and this change is important for preserving relevance quality.

In that respect, I don't think losing field-data sharing is a disadvantage. Just as splitting the two could lead to more memory usage, it could also lead to less if, once they are split, only one of them is used for sorting / faceting.

Also does this mean aggregations were broken in such scenarios?

@benjismith

Thanks for creating this issue, @clintongormley!

I'd like to just point out the particular problems this behavior has caused us...

We're not using Elasticsearch as a canonical datastore, but rather as a search engine for data that's canonically stored in a SQL or NoSQL database, so our ES mappings are usually just mechanical translations of our other database schemas. Likewise, our indexing pipeline uses mechanical transformers to denormalize our SQL records into JSON documents for Elasticsearch.

As part of that process, our ES mappings often look something like this (with all the irrelevant fields omitted for brevity):

{
  "user" : {
    "_id" : { "type" : "string", "path" : "id" },
    "properties" : {
      "id" : { "type" : "string", "index" : "not_analyzed" },
      "address" : { "type" : "nested", "properties" : {
        "id" : { "type" : "integer" }
      }}
    }
  }
}

...where there are multiple different ID fields, some of which are integers or longs, and others are strings.

On the query side, we always refer to these with fully qualified field names (in this case, "id" and "address.id"), so it can be disorienting when our queries or facets against an "id" end up matching against an "address.id".

More importantly, though, we've experienced lots of hair-pulling trying to debug shard failures at query-time, caused by requests that sort hits by ID:

java.lang.NumberFormatException:
Invalid shift value (64) in prefixCoded bytes (is encoded value really an INT?)]

I probably spent 30 hours investigating that bug and implementing workarounds, eventually asking this question on the mailing list a few months ago at the height of my frustration:

NumberFormatException when sorting by numeric document ID

Now that we understand the bug, we've implemented workarounds by mandating that all our other SQL and NoSQL schemas conform to the rule: no two fields (in any column of any table) may have the same name if they have different low-level types. That way, if we use mechanical tools to generate ES mappings from SQL schemas, the automatically-generated field names will be conflict-free. But it makes our schemas a lot more awkward than they ought to be.
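As an illustration only (this is not the tooling described above), a rule like this can be checked mechanically against generated mappings. The sketch below assumes mappings shaped like the example earlier in this comment; the helper names `field_types` and `find_conflicts` are hypothetical.

```python
from collections import defaultdict


def field_types(properties, found=None):
    """Collect {field name: set of declared types} from a 'properties' dict."""
    if found is None:
        found = defaultdict(set)
    for name, spec in properties.items():
        found[name].add(spec.get("type", "object"))
        # Object and nested fields carry their own 'properties'; recurse into them.
        if "properties" in spec:
            field_types(spec["properties"], found)
    return found


def find_conflicts(mappings):
    """Return {field name: sorted types} for names declared with more than one type."""
    found = defaultdict(set)
    for type_mapping in mappings.values():
        field_types(type_mapping.get("properties", {}), found)
    return {name: sorted(types) for name, types in found.items() if len(types) > 1}


mappings = {
    "user": {
        "properties": {
            "id": {"type": "string", "index": "not_analyzed"},
            "address": {"type": "nested", "properties": {"id": {"type": "integer"}}},
        }
    }
}
print(find_conflicts(mappings))  # {'id': ['integer', 'string']}
```

The check deliberately compares fields by their short name rather than their full path, since that is the level at which the clash described here shows up.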

This is a big big deal for us, and it's the only issue blocking us from adopting Elasticsearch as a canonical datastore.

Anyhow, thanks again for opening this issue. I really appreciate it!

@benjismith

@clintongormley It's been a few months since this issue was created, and I wanted to check back in to see if you or any of the other core members could chime in. This continues to be a big issue for us (and, I'd venture to say, for anyone else using Elasticsearch like a database).

Whenever I discuss Elasticsearch with other engineers, this is always their #2 concern (second only to consistency guarantees).

@jpountz added the v2.0.0 label Jun 6, 2014
@jpountz
Contributor

jpountz commented Jun 6, 2014

> different mappings would not clash

I think this is a very important point. There are several bugs that we have that are due to the fact that different mappings can go to the same index field name: downgraded index options, wrong analyzer picked at search time when the type is not specified, parsing issues when loading field data, etc. So I don't think we have many choices but either storing the mappings on the index level or using different field names for each type.

@clintongormley
Author

@jpountz what do you mean by:

> but either storing the mappings on the index level or using different field names for each type.

Are you saying you would be in favour of prepending the type name to the inverted index name? Agreed that this would solve a number of issues, but what negative effects could it have?

Would (eg) aggregating on the foo field across multiple types be much slower than it is today, because it has to access many more inverted indices?

@synhershko
Contributor

@clintongormley slower maybe, but (finally) accurate

@jpountz
Contributor

jpountz commented Jun 6, 2014

> Are you saying you would be in favour of prepending the type name to the inverted index name?

Either that, or enforcing that field mappings that have the same index name are exactly the same (that's what I meant by storing mappings on the index level). I guess the first option would have negative effects around compression (more sparse postings lists and terms dictionaries), memory usage and speed while the 2nd option would remove flexibility since fields would need to have the same type, index options, doc values options, etc. across types.

Having different index names per type for search is ok I think, it's mostly a matter of building a disjunction query on the queried types. For everything that works on field data however (aggs, field data ranges, etc.) this would be more tricky since we would need to wrap several field data instances into a single one, which would be very costly since field data expects values to be unique and sorted. Or we would need to treat each type as a filtered index and query them like we query indices today, but this would mean ${num_types} queries per index as opposed to one today.
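A rough sketch of what "building a disjunction query on the queried types" could look like from the outside, written as a Python helper that produces a query body. The `<type>.<field>` naming scheme and the function name `cross_type_term_query` are assumptions made for illustration; only the `bool` / `should` / `term` query structure is existing query DSL.

```python
import json


def cross_type_term_query(field, value, types):
    """Build a bool/should (OR) query with one term clause per type.

    Assumes a hypothetical '<type>.<field>' scheme for per-type index names.
    """
    clauses = [{"term": {f"{t}.{field}": value}} for t in types]
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


body = cross_type_term_query("foo", "bar", ["t1", "t2"])
print(json.dumps(body, indent=2))
```

A single-field term query thus becomes one clause per queried type, which is where the extra search cost would come from.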

So maybe just enforcing that field mappings are the same on all types would be more practical. We already have the limitation today anyway (cf. field data loading issues with numeric fields, or positional queries if one of the types has DOCS_ONLY as its index options). This makes the update mappings API look weird however: either you would need to refuse any mapping update as soon as a field is defined on two types so that they cannot diverge, or updating the mapping on one type would also update the mapping on other types, which feels wrong as well. Maybe the mapping create/update operations could be made index-wide (not per-type anymore) and then each type would not store mappings but only the list of fields that it contains?

@clintongormley
Author

@jpountz good explanation. I think that defaulting to a separate inverted index per field/type adds a lot of overhead, especially when there is a simple workaround: use a different field name.

> So maybe just enforcing that field mappings are the same on all types would be more practical.

I'm in favour of this. Currently it is a hidden gotcha which can surprise users. It'd be much better to let them know up front at mapping create/update time. That, at least, would be a fairly simple change.
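A minimal sketch of such an up-front check, assuming mappings are held as plain dictionaries of top-level fields. The function `check_mapping_update` and the exception `MappingConflictError` are hypothetical names, not part of Elasticsearch.

```python
class MappingConflictError(ValueError):
    """Raised when a field would be mapped differently in two types (hypothetical)."""


def check_mapping_update(existing, new_type, new_properties):
    """Raise if new_properties maps a field differently from another type.

    existing: {type name: {field name: field mapping dict}} (top-level fields only).
    """
    for other_type, props in existing.items():
        if other_type == new_type:
            continue
        for field, spec in new_properties.items():
            if field in props and props[field] != spec:
                raise MappingConflictError(
                    f"field [{field}] in type [{new_type}] "
                    f"conflicts with type [{other_type}]"
                )


existing = {"t1": {"age": {"type": "integer"}}}

# Same mapping in another type: accepted silently.
check_mapping_update(existing, "t2", {"age": {"type": "integer"}})

# Different mapping for the same field name: rejected at update time.
try:
    check_mapping_update(existing, "t2", {"age": {"type": "string"}})
except MappingConflictError as err:
    print(err)
```

The point of the sketch is the timing: the conflict is reported when the mapping is created or updated, rather than surfacing later as a query-time failure.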

Of course, there are some things which ARE type level, eg whether the _all field is enabled is per-type, not per-index. So I think that changing the mappings to be index-level instead of type-level would overcomplicate things as well. Better to leave it as it is, which works well for the majority of cases, but to let the user know when they try to do something which is not supported.

@clintongormley self-assigned this Jul 8, 2014
@nickminutello

> I'm in favour of this. Currently it is a hidden gotcha which can surprise users.
> It'd be much better to let them know up front at mapping create/update time

Hm. Is that the root cause of my issue here? #7963

@imotov
Contributor

imotov commented Oct 10, 2014

@nickminutello yes, it is.

@rore

rore commented Nov 25, 2014

I want to raise our objection to this proposed solution (as I did also on thread #8614 ).

Enforcing the same mapping for fields with the same name under different types breaks the concept of a type as a "table". This is the way types have been perceived and used, by us and by many others.

We, for instance, are counting on the type separation, and have a lot of cases of fields with different mapping under different types. This breaking change will be a huge issue for us with enormous implications. And I'm sure many others are in the same situation.

I think you need to keep the type separation as it was always presented by you and understood by most of your users (just yesterday at the Elasticsearch meetup here Boaz referenced types as being "equivalent to DB tables").

@jpountz
Contributor

jpountz commented Nov 25, 2014

@rore The problem is that this design currently doesn't work, and there are two main options: having types map their fields to a unique field name in the lucene index (eg. by prepending the name of the type), or enforcing that types have consistent mappings. The first option would make postings lists sparser (which means slower and less compressed) and cross-type search slower (a single field term query on two types would need to internally become a disjunction on two fields), while the 2nd option would only require that fields that have the same name have consistent mappings. While I agree that there could be some exceptions, a field called name would usually be a string and a field called age a number? I only mentioned efficiency issues here, but handling inconsistent mappings across types correctly would also increase complexity quite significantly (again because there are decisions that need to be made at an index level).

@rore

rore commented Nov 25, 2014

@jpountz Let's take a not too unreasonable scenario. An index that holds data that is stored in another system. You have a type per table, say t1, t2 and t3. Each type ("table") has a field called externalID that denotes an id on another system. For t1, the externalID is an int. For t2, it's a string. For t3, it's a compound index so it's an object with 2 fields.
Now what you're saying is that ES will no longer allow me to do this, all my current data is gone kaput, and if I want to hold this data I need to either change the names of my fields or prefix them with the type myself.

@jpountz
Contributor

jpountz commented Nov 25, 2014

> Now what you're saying is that ES will no longer allow me to do this, all my current data is gone kaput, and if I want to hold this data I need to either change the names of my fields or prefix them with the type myself.

We usually only enforce new rules on newly-created indices, so none of your current data will be lost and you will have time to migrate.

For your use-case, you could either partition your data into several indices based on the system that your externalID field points to, or keep using a single index but make sure to use different field names depending on the system? Or if you only care about storing these data (no searching, no aggs, no sorting, only retrieval), you could just put them under an object field with dynamic configured to false so that elasticsearch does not even generate mappings for them.
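For the last option, a mapping along the following lines might do, shown here as a Python dictionary that serializes to the JSON mapping body. This is a sketch under assumptions: the wrapper field name `external` is hypothetical, and the type name `t1` is borrowed from the scenario above.

```python
import json

# Hypothetical mapping for type "t1": the object field "external" is declared
# with dynamic set to false, so sub-fields sent under it are kept in _source
# but get no mappings of their own (no search, aggs or sorting on them).
mapping = {
    "t1": {
        "properties": {
            "external": {"type": "object", "dynamic": False}
        }
    }
}

# A document would then carry the id under that object, for example:
doc = {"external": {"externalID": 42}}

print(json.dumps(mapping, indent=2))
```

Because no mapping is generated for `externalID`, its type can differ freely between documents and types without clashing.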

@rore

rore commented Nov 25, 2014

This was only one example. We have a lot of indexes, a lot of types and a lot of places where this happens or might happen. Dealing with all the possible ways it could impact us is a complex issue.

And again, on a general level, it breaks the concept of a type and why you should use it. I understand your considerations but I think you're choosing the path that will cause more damage than good, at least for existing users.

@clintongormley
Author

Closing in favour of #8870

7 participants