Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cross-cluster search: preserve cluster alias in shard failures #32608

Merged
merged 3 commits into from
Aug 6, 2018

Conversation

javanna
Copy link
Member

@javanna javanna commented Aug 3, 2018

When some remote clusters return shard failures as part of a cross-cluster search request, the cluster alias currently gets lost. As a result, if the shard failures are all caused by the same error, and against indices belonging to different clusters, but with the same index name, only one failure gets returned as part of the search response, meaning that failures are grouped by index name, ignoring the cluster alias.

With this commit we make sure that ShardSearchFailure returns the cluster alias as part of the index name. Also, we set the fully qualfied index name when creating a QueryShardException. That way shard failures are grouped by cluster:index. Such fixes should cover at least most of the cases where either 1) the shard target is set but we don't have the index in the cause (we were previously reading it only from the cause that did not have the cluster alias) 2) the shard target is missing but if the cause is a QueryShardException the cluster alias does not get lost.

We also prevent NPE in case the failure cause is not set and test such scenario.

When some remote clusters return shard failures as part of a cross-cluster search request, the cluster alias currently gets lost. As a result, if the shard failures are all caused by the same error, and against indices belonging to different clusters, but with the same index name, only one failure gets returned as part of the search response, meaning that failures are grouped by index name, ignoring the cluster alias.

With this commit we make sure that `ShardSearchFailure` returns the cluster alias as part of the index name. Also, we set the fully qualfied index name when creating a `QueryShardException`. That way shard failures are grouped by cluster:index. Such fixes should cover at least most of the cases where either 1) the shard target is set but we don't have the index in the cause (we were previously reading it only from the cause that does not have the cluster alas) 2) the shard target is missing but if the cause is a `QueryShardException` the clusterAlias does not get lost.

We also prevent NPE in case the failure cause is not set and test such scenario.
@javanna javanna added >bug review :Search/Search Search-related issues that do not fall into other categories v7.0.0 v6.4.1 labels Aug 3, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@javanna
Copy link
Member Author

javanna commented Aug 6, 2018

thanks @jimczi !

@javanna javanna merged commit 826399f into elastic:master Aug 6, 2018
javanna added a commit that referenced this pull request Aug 6, 2018
When some remote clusters return shard failures as part of a cross-cluster search request, the cluster alias currently gets lost. As a result, if the shard failures are all caused by the same error, and against indices belonging to different clusters, but with the same index name, only one failure gets returned as part of the search response, meaning that failures are grouped by index name, ignoring the cluster alias.

With this commit we make sure that `ShardSearchFailure` returns the cluster alias as part of the index name. Also, we set the fully qualfied index name when creating a `QueryShardException`. That way shard failures are grouped by cluster:index. Such fixes should cover at least most of the cases where either 1) the shard target is set but we don't have the index in the cause (we were previously reading it only from the cause that did not have the cluster alias) 2) the shard target is missing but if the cause is a `QueryShardException` the cluster alias does not get lost.

We also prevent NPE in case the failure cause is not set and test such scenario.
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Aug 6, 2018
…pe-detection-with-leading-whitespace

* elastic/master: (34 commits)
  Cross-cluster search: preserve cluster alias in shard failures (elastic#32608)
  Handle AlreadyClosedException when bumping primary term
  [TEST] Allow to run in FIPS JVM (elastic#32607)
  [Test] Add ckb to the list of unsupported languages (elastic#32611)
  SCRIPTING: Move Aggregation Scripts to their own context (elastic#32068)
  Painless: Use LocalMethod Map For Lookup at Runtime (elastic#32599)
  [TEST] Enhance failure message when bulk updates have failures
  [ML] Add ML result classes to protocol library (elastic#32587)
  Suppress LicensingDocumentationIT.testPutLicense in release builds (elastic#32613)
  [Rollup] Update wire version check after backport
  Suppress Wildfly test in FIPS JVMs (elastic#32543)
  [Rollup] Improve ID scheme for rollup documents (elastic#32558)
  ingest: doc: move Dot Expander Processor doc to correct position (elastic#31743)
  [ML] Add some ML config classes to protocol library (elastic#32502)
  [TEST]Split transport verification mode none tests (elastic#32488)
  Core: Move helper date formatters over to java time (elastic#32504)
  [Rollup] Remove builders from DateHistogramGroupConfig (elastic#32555)
  [TEST} unmutes SearchAsyncActionTests and adds debugging info
  [ML] Add Detector config classes to protocol library (elastic#32495)
  [Rollup] Remove builders from MetricConfig (elastic#32536)
  ...
@javanna
Copy link
Member Author

javanna commented Aug 6, 2018

pending backport to 6.4

dnhatn added a commit that referenced this pull request Aug 6, 2018
* 6.x:
  [Kerberos] Use canonical host name (#32588)
  Cross-cluster search: preserve cluster alias in shard failures (#32608)
  [TEST] Allow to run in FIPS JVM (#32607)
  Handle AlreadyClosedException when bumping primary term
  [Test] Add ckb to the list of unsupported languages (#32611)
  SCRIPTING: Move Aggregation Scripts to their own context (#32068) (#32629)
  [TEST] Enhance failure message when bulk updates have failures
  [ML] Add ML result classes to protocol library (#32587)
  Suppress LicensingDocumentationIT.testPutLicense in release builds (#32613)
  [Rollup] Improve ID scheme for rollup documents (#32558)
  Mutes failing SQL string function tests due to #32589
  Suppress Wildfly test in FIPS JVMs (#32543)
  Add cluster UUID to Cluster Stats API response (#32206)
  [ML] Add some ML config classes to protocol library (#32502)
  [TEST]Split transport verification mode none tests (#32488)
  [Rollup] Remove builders from DateHistogramGroupConfig (#32555)
  [ML] Add Detector config classes to protocol library (#32495)
  [Rollup] Remove builders from MetricConfig (#32536)
  Fix race between replica reset and primary promotion (#32442)
  HLRC: Move commercial clients from XPackClient (#32596)
  Security: move User to protocol project (#32367)
  Minor fix for javadoc (applicable for java 11). (#32573)
  Painless: Move Some Lookup Logic to PainlessLookup (#32565)
  Core: Minor size reduction for AbstractComponent (#32509)
  INGEST: Enable default pipelines (#32286) (#32591)
  TEST: Avoid merges in testSeqNoAndCheckpoints
  [Rollup] Remove builders from HistoGroupConfig (#32533)
  fixed elements in array of produced terms (#32519)
  Mutes ReindexFailureTests.searchFailure dues to #28053
  Mutes LicensingDocumentationIT due to #32580
  Remove the SATA controller from OpenSUSE box
  [ML] Rename JobProvider to JobResultsProvider (#32551)
dnhatn added a commit that referenced this pull request Aug 6, 2018
* master:
  Cross-cluster search: preserve cluster alias in shard failures (#32608)
  Handle AlreadyClosedException when bumping primary term
  [TEST] Allow to run in FIPS JVM (#32607)
  [Test] Add ckb to the list of unsupported languages (#32611)
  SCRIPTING: Move Aggregation Scripts to their own context (#32068)
  Painless: Use LocalMethod Map For Lookup at Runtime (#32599)
  [TEST] Enhance failure message when bulk updates have failures
  [ML] Add ML result classes to protocol library (#32587)
  Suppress LicensingDocumentationIT.testPutLicense in release builds (#32613)
  [Rollup] Update wire version check after backport
  Suppress Wildfly test in FIPS JVMs (#32543)
  [Rollup] Improve ID scheme for rollup documents (#32558)
  ingest: doc: move Dot Expander Processor doc to correct position (#31743)
  [ML] Add some ML config classes to protocol library (#32502)
  [TEST]Split transport verification mode none tests (#32488)
  Core: Move helper date formatters over to java time (#32504)
  [Rollup] Remove builders from DateHistogramGroupConfig (#32555)
  [TEST} unmutes SearchAsyncActionTests and adds debugging info
  [ML] Add Detector config classes to protocol library (#32495)
  [Rollup] Remove builders from MetricConfig (#32536)
  Tests: Add rolling upgrade tests for watcher (#32428)
  Fix race between replica reset and primary promotion (#32442)
javanna added a commit to javanna/elasticsearch that referenced this pull request Aug 7, 2018
As part of elastic#32608 we made sure that the fully qualified index name is taken from the query shard context whenever creating a new `QueryShardException`. That change introduced a regression as instead of setting the entire `Index` object to the exception, which holds index name and index uuid, we ended up setting only the index name (including cluster alias). With this commit we make sure that the index uuid does not get lost and we try to lower the chances that a similar bug makes it in another time. That's done by making `QueryShardContext` return the fully qualified `Index` (which also holds the uuid) rather than only the fully qualified index name.
javanna added a commit that referenced this pull request Aug 8, 2018
As part of #32608 we made sure that the fully qualified index name is taken from the query shard context whenever creating a new `QueryShardException`. That change introduced a regression as instead of setting the entire `Index` object to the exception, which holds index name and index uuid, we ended up setting only the index name (including cluster alias). With this commit we make sure that the index uuid does not get lost and we try to lower the chances that a similar bug makes it in another time. That's done by making `QueryShardContext` return the fully qualified `Index` (which also holds the uuid) rather than only the fully qualified index name.
javanna added a commit that referenced this pull request Aug 8, 2018
As part of #32608 we made sure that the fully qualified index name is taken from the query shard context whenever creating a new `QueryShardException`. That change introduced a regression as instead of setting the entire `Index` object to the exception, which holds index name and index uuid, we ended up setting only the index name (including cluster alias). With this commit we make sure that the index uuid does not get lost and we try to lower the chances that a similar bug makes it in another time. That's done by making `QueryShardContext` return the fully qualified `Index` (which also holds the uuid) rather than only the fully qualified index name.
@javanna javanna self-assigned this Aug 13, 2018
@javanna javanna removed the review label Aug 24, 2018
javanna added a commit that referenced this pull request Aug 24, 2018
When some remote clusters return shard failures as part of a cross-cluster search request, the cluster alias currently gets lost. As a result, if the shard failures are all caused by the same error, and against indices belonging to different clusters, but with the same index name, only one failure gets returned as part of the search response, meaning that failures are grouped by index name, ignoring the cluster alias.

With this commit we make sure that `ShardSearchFailure` returns the cluster alias as part of the index name. Also, we set the fully qualfied index name when creating a `QueryShardException`. That way shard failures are grouped by cluster:index. Such fixes should cover at least most of the cases where either 1) the shard target is set but we don't have the index in the cause (we were previously reading it only from the cause that did not have the cluster alias) 2) the shard target is missing but if the cause is a `QueryShardException` the cluster alias does not get lost.

We also prevent NPE in case the failure cause is not set and test such scenario.
javanna added a commit that referenced this pull request Aug 24, 2018
As part of #32608 we made sure that the fully qualified index name is taken from the query shard context whenever creating a new `QueryShardException`. That change introduced a regression as instead of setting the entire `Index` object to the exception, which holds index name and index uuid, we ended up setting only the index name (including cluster alias). With this commit we make sure that the index uuid does not get lost and we try to lower the chances that a similar bug makes it in another time. That's done by making `QueryShardContext` return the fully qualified `Index` (which also holds the uuid) rather than only the fully qualified index name.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search/Search Search-related issues that do not fall into other categories v6.4.1 v6.5.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants