Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate tsdb's routing_path #79384

Merged

Conversation

nik9000
Copy link
Member

@nik9000 nik9000 commented Oct 18, 2021

routing_path's job is to put the same time series on the same node. It
does that by extracting a String directly from the xcontent, hashing it,
and feeding the hash into the shard selection algorithm. I'm happy with
it! But it won't work properly if routing_path matches non-dimension
fields or if it matches non-keyword dimensions. This prevents us from
mapping any fields that aren't keyword dimensions that "line up" with
routing_path.

Let's talk about why routing_path can't do it's job if it matches any
non-dimension fields. Well, imagine routing_path is [dim, foo] and
only dim is a dimension. It'll still hash foo's values into the
routing key. So, say you get documents like:

{"dim": "a", "foo": "1"}
{"dim": "a", "foo": "1"}
{"dim": "a", "foo": "2"}

The third document could be routed to a different shard than the first
two! Which would be a disaster because it'd cut the time series
identified by "dim":"a" into pieces!

Now let's talk about when routing_path matches a non-keyword
dimension. Imagine routing_path is [kwd, int] and we send these
documents:

{"kwd": "a", "int": "1"}
{"kwd": "a", "int": "01"}

Both of these documents belong to the time series identified by
"dim":"a","int":1 but the routing_path code just reads strings so
it'll route them into separate shards. Also bad! Also forbidden by this
change.

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Oct 18, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

`routing_path`'s job is to put the same time series on the same node. It
does that by extracting a String directly from the xcontent, hashing it,
and feeding the hash into the shard selection algorithm. I'm happy with
it! But it won't work properly if `routing_path` matches non-dimension
fields or if it matches non-keyword dimensions. This prevents us from
mapping any fields that aren't keyword dimensions that "line up" with
`routing_path`.

Let's talk about why `routing_path` can't do it's job if it matches any
non-dimension fields. Well, imagine `routing_path` is `[dim, foo]` and
only `dim` is a dimension. It'll *still* hash `foo`'s values into the
routing key. So, say you get documents like:
```
{"dim": "a", "foo": "1"}
{"dim": "a", "foo": "1"}
{"dim": "a", "foo": "2"}
```

The third document could be routed to a different shard than the first
two! Which would be a disaster because it'd cut the time series
identified by `"dim":"a"` into pieces!

Now let's talk about when `routing_path` matches a non-keyword
dimension. Imagine `routing_path` is `[kwd, int]` and we send these
documents:
```
{"kwd": "a", "int": "1"}
{"kwd": "a", "int": "01"}
```

Both of these documents belong to the time series identified by
`"dim":"a","int":1` but the `routing_path` code just reads strings so
it'll route them into separate shards. Also bad! Also forbidden by this
change.
Copy link
Contributor

@imotov imotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general, added one suggestion.

This slightly moves the routing calculation out of a difficult to reason
about `switch` statement and into real OO method implementation. Its a
tiny tiny change but it makes me feel much better about it.
@nik9000 nik9000 merged commit 7a32158 into elastic:master Oct 19, 2021
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Oct 19, 2021
* upstream/master:
  Validate tsdb's routing_path (elastic#79384)
  Adjust the BWC version for the return200ForClusterHealthTimeout field (elastic#79436)
  API for adding and removing indices from a data stream (elastic#79279)
  Exposing the ability to log deprecated settings at non-critical level (elastic#79107)
  Convert operator privilege license object to LicensedFeature (elastic#79407)
  Mute SnapshotBasedIndexRecoveryIT testSeqNoBasedRecoveryIsUsedAfterPrimaryFailOver (elastic#79456)
  Create cache files with CREATE_NEW & SPARSE options (elastic#79371)
  Revert "[ML] Use a new annotations index for future annotations (elastic#79151)"
  [ML] Use a new annotations index for future annotations (elastic#79151)
  [ML] Removing legacy code from ML/transform auditor (elastic#79434)
  Fix rate agg with custom `_doc_count` (elastic#79346)
  Optimize SLM Policy Queries (elastic#79341)
  Fix execution of exists query within nested queries on field with doc_values disabled (elastic#78841)
  Stricter UpdateSettingsRequest parsing on the REST layer (elastic#79227)
  Do not release snapshot file download permit during recovery retries (elastic#79409)
  Preserve request headers in a mixed version cluster (elastic#79412)
  Adjust versions after elastic#79044 backport to 7.x (elastic#79424)
  Mute BulkByScrollUsesAllScrollDocumentsAfterConflictsIntegTests (elastic#79429)
  Fail on SSPL licensed x-pack sources (elastic#79348)

# Conflicts:
#	server/src/test/java/org/elasticsearch/index/TimeSeriesModeTests.java
@wchaparro wchaparro assigned nik9000 and unassigned nik9000 Dec 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>non-issue :StorageEngine/TSDB You know, for Metrics Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants