Validate tsdb's routing_path #79384

nik9000 · 2021-10-18T17:47:31Z

routing_path's job is to put the same time series on the same node. It
does that by extracting a String directly from the xcontent, hashing it,
and feeding the hash into the shard selection algorithm. I'm happy with
it! But it won't work properly if routing_path matches non-dimension
fields or if it matches non-keyword dimensions. This prevents us from
mapping any fields that aren't keyword dimensions that "line up" with
routing_path.

Let's talk about why routing_path can't do it's job if it matches any
non-dimension fields. Well, imagine routing_path is [dim, foo] and
only dim is a dimension. It'll still hash foo's values into the
routing key. So, say you get documents like:

{"dim": "a", "foo": "1"}
{"dim": "a", "foo": "1"}
{"dim": "a", "foo": "2"}

The third document could be routed to a different shard than the first
two! Which would be a disaster because it'd cut the time series
identified by "dim":"a" into pieces!

Now let's talk about when routing_path matches a non-keyword
dimension. Imagine routing_path is [kwd, int] and we send these
documents:

{"kwd": "a", "int": "1"}
{"kwd": "a", "int": "01"}

Both of these documents belong to the time series identified by
"dim":"a","int":1 but the routing_path code just reads strings so
it'll route them into separate shards. Also bad! Also forbidden by this
change.

elasticmachine · 2021-10-18T17:47:35Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

`routing_path`'s job is to put the same time series on the same node. It does that by extracting a String directly from the xcontent, hashing it, and feeding the hash into the shard selection algorithm. I'm happy with it! But it won't work properly if `routing_path` matches non-dimension fields or if it matches non-keyword dimensions. This prevents us from mapping any fields that aren't keyword dimensions that "line up" with `routing_path`. Let's talk about why `routing_path` can't do it's job if it matches any non-dimension fields. Well, imagine `routing_path` is `[dim, foo]` and only `dim` is a dimension. It'll *still* hash `foo`'s values into the routing key. So, say you get documents like: ``` {"dim": "a", "foo": "1"} {"dim": "a", "foo": "1"} {"dim": "a", "foo": "2"} ``` The third document could be routed to a different shard than the first two! Which would be a disaster because it'd cut the time series identified by `"dim":"a"` into pieces! Now let's talk about when `routing_path` matches a non-keyword dimension. Imagine `routing_path` is `[kwd, int]` and we send these documents: ``` {"kwd": "a", "int": "1"} {"kwd": "a", "int": "01"} ``` Both of these documents belong to the time series identified by `"dim":"a","int":1` but the `routing_path` code just reads strings so it'll route them into separate shards. Also bad! Also forbidden by this change.

imotov

LGTM in general, added one suggestion.

server/src/main/java/org/elasticsearch/index/mapper/Mapper.java

This slightly moves the routing calculation out of a difficult to reason about `switch` statement and into real OO method implementation. Its a tiny tiny change but it makes me feel much better about it.

This reverts commit 9a040d4.

* upstream/master: Validate tsdb's routing_path (elastic#79384) Adjust the BWC version for the return200ForClusterHealthTimeout field (elastic#79436) API for adding and removing indices from a data stream (elastic#79279) Exposing the ability to log deprecated settings at non-critical level (elastic#79107) Convert operator privilege license object to LicensedFeature (elastic#79407) Mute SnapshotBasedIndexRecoveryIT testSeqNoBasedRecoveryIsUsedAfterPrimaryFailOver (elastic#79456) Create cache files with CREATE_NEW & SPARSE options (elastic#79371) Revert "[ML] Use a new annotations index for future annotations (elastic#79151)" [ML] Use a new annotations index for future annotations (elastic#79151) [ML] Removing legacy code from ML/transform auditor (elastic#79434) Fix rate agg with custom `_doc_count` (elastic#79346) Optimize SLM Policy Queries (elastic#79341) Fix execution of exists query within nested queries on field with doc_values disabled (elastic#78841) Stricter UpdateSettingsRequest parsing on the REST layer (elastic#79227) Do not release snapshot file download permit during recovery retries (elastic#79409) Preserve request headers in a mixed version cluster (elastic#79412) Adjust versions after elastic#79044 backport to 7.x (elastic#79424) Mute BulkByScrollUsesAllScrollDocumentsAfterConflictsIntegTests (elastic#79429) Fail on SSPL licensed x-pack sources (elastic#79348) # Conflicts: # server/src/test/java/org/elasticsearch/index/TimeSeriesModeTests.java

nik9000 added >non-issue v8.0.0 :StorageEngine/TSDB You know, for Metrics labels Oct 18, 2021

nik9000 requested review from imotov and romseygeek October 18, 2021 17:47

elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Oct 18, 2021

nik9000 mentioned this pull request Oct 18, 2021

Add better support for metric data types (TSDB) #74660

Closed

nik9000 added 2 commits October 18, 2021 13:54

Merge branch 'master' into index_routing_from_source_validate_dims

66e5ec6

imotov approved these changes Oct 18, 2021

View reviewed changes

server/src/main/java/org/elasticsearch/index/mapper/Mapper.java Outdated Show resolved Hide resolved

nik9000 added 5 commits October 18, 2021 16:37

Move routing calculation

9a040d4

This slightly moves the routing calculation out of a difficult to reason about `switch` statement and into real OO method implementation. Its a tiny tiny change but it makes me feel much better about it.

Revert "Move routing calculation"

95873ea

This reverts commit 9a040d4.

privacy

c1423a5

Handle warnings

2f3d5fd

Merge branch 'master' into index_routing_from_source_validate_dims

bb332e6

nik9000 merged commit 7a32158 into elastic:master Oct 19, 2021

jakelandis added v8.0.0-beta1 and removed v8.0.0 labels Oct 27, 2021

wchaparro assigned nik9000 and unassigned nik9000 Dec 16, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Validate tsdb's routing_path #79384

Validate tsdb's routing_path #79384

nik9000 commented Oct 18, 2021

elasticmachine commented Oct 18, 2021

imotov left a comment

Validate tsdb's routing_path #79384

Validate tsdb's routing_path #79384

Conversation

nik9000 commented Oct 18, 2021

elasticmachine commented Oct 18, 2021

imotov left a comment

Choose a reason for hiding this comment