Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve exception handling for metrics store errors #1703

Merged
merged 4 commits into from
May 11, 2023

Conversation

b-deam
Copy link
Member

@b-deam b-deam commented Apr 13, 2023

Prior to this commit, Rally could hang indefinitely whenever its respective metrics store failed to flush() successfully.

This is because the respective remote metric store's close() method only attempted to set itself as closed after a successful call to the flush() method. If we failed to flush() successfully, then we never set the metric store's opened attribute to false, eventually leading to a deadlock where Rally couldn't ever shut down successfully:

def close(self):
self.progress_reporter.finish()
if self.metrics_store and self.metrics_store.opened:
self.metrics_store.close()

This commit ensures we properly close the respective metrics store by setting the opened attribute to false prior to attempting a final flush(), as well as ensures we capture and log any BulkIndexErrors raised by the bulk index helper method.

Fixes #1697

You can reproduce this locally with a remote metrics store and by running this specific http_logs revision to generate a mappings exception:

Fails due to metrics store mappings exception:

$ esrally race --telemetry="gc" --target-host="localhost:9200" --track-revision="cffc8641055233485b9d1c9a773b1303c7c5313f" --track-repository="default" --track="http_logs" --challenge="append-index-only-with-ingest-pipeline" --on-error="abort" --client-options="timeout:240,use_ssl:true,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:'changeme'" --track-params='{"bulk_indexing_clients": 1,"source_enabled": false, "ingest_pipeline": "grok" }' --pipeline="benchmark-only"  --kill-running-processes
Output
$ esrally race --telemetry="gc" --target-host="localhost:9200" --track-revision="cffc8641055233485b9d1c9a773b1303c7c5313f" --track-repository="default" --track="http_logs" --challenge="append-index-only-with-ingest-pipeline" --on-error="abort" --client-options="timeout:240,use_ssl:true,verify_certs:false,basic_auth_user:'elastic',basic_auth_password:'changeme'" --track-params='{"bulk_indexing_clients": 1,"source_enabled": false, "ingest_pipeline": "grok" }' --pipeline="benchmark-only"  --kill-running-processes

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Race id is [1415f777-8547-4193-bb0e-c8d02f05483f]
[INFO] Racing on track [http_logs], challenge [append-index-only-with-ingest-pipeline] and car ['external'] with version [8.6.2].

Running delete-index                                                           [100% done]
Running create-index                                                           [100% done]
Running check-cluster-health                                                   [100% done]
Running create-http-log-grok-pipeline                                          [100% done]
Running index-append-with-ingest-grok-pipeline                                 [  0% done]
[ERROR] Cannot race. Traceback (most recent call last):
  File "/home/b/perf/github.com/b-deam/rally/esrally/metrics.py", line 116, in guarded
    return target(*args, **kwargs)
  File "/home/b/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/elasticsearch/helpers/actions.py", line 524, in bulk
    for ok, item in streaming_bulk(
  File "/home/b/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/elasticsearch/helpers/actions.py", line 438, in streaming_bulk
    for data, (ok, info) in zip(
  File "/home/b/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/elasticsearch/helpers/actions.py", line 355, in _process_bulk_chunk
    yield from gen
  File "/home/b/perf/github.com/b-deam/rally/.venv/lib/python3.8/site-packages/elasticsearch/helpers/actions.py", line 274, in _process_bulk_chunk_success
    raise BulkIndexError(f"{len(errors)} document(s) failed to index.", errors)
elasticsearch.helpers.BulkIndexError: 45 document(s) failed to index.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/b/perf/github.com/b-deam/rally/esrally/actor.py", line 92, in guard
    return f(self, msg, sender)
  File "/home/b/perf/github.com/b-deam/rally/esrally/driver/driver.py", line 303, in receiveMsg_WakeupMessage
    self.coordinator.post_process_samples()
  File "/home/b/perf/github.com/b-deam/rally/esrally/driver/driver.py", line 955, in post_process_samples
    self.sample_post_processor(raw_samples)
  File "/home/b/perf/github.com/b-deam/rally/esrally/driver/driver.py", line 1068, in __call__
    self.metrics_store.flush(refresh=False)
  File "/home/b/perf/github.com/b-deam/rally/esrally/metrics.py", line 933, in flush
    self._client.bulk_index(index=self._index, items=self._docs)
  File "/home/b/perf/github.com/b-deam/rally/esrally/metrics.py", line 92, in bulk_index
    self.guarded(elasticsearch.helpers.bulk, self._client, items, index=index, chunk_size=5000)
  File "/home/b/perf/github.com/b-deam/rally/esrally/metrics.py", line 123, in guarded
    raise exceptions.RallyError(
esrally.exceptions.RallyError: Unretryable error encountered when sending metrics to remote metrics store: [mapper_parsing_exception]


Getting further help:
*********************
* Check the log files in /home/b/.rally/logs for errors.
* Read the documentation at https://esrally.readthedocs.io/en/latest/.
* Ask a question on the forum at https://discuss.elastic.co/tags/c/elastic-stack/elasticsearch/rally.
* Raise an issue at https://github.com/elastic/rally/issues and include the log files in /home/b/.rally/logs.

--------------------------------
[INFO] FAILURE (took 63 seconds)
--------------------------------

@b-deam b-deam added bug Something's wrong :Metrics How metrics are stored, calculated or aggregated labels Apr 13, 2023
@b-deam b-deam added this to the 2.8.0 milestone Apr 13, 2023
@b-deam b-deam requested review from pquentin and inqueue April 13, 2023 04:06
@b-deam b-deam self-assigned this Apr 13, 2023
@pquentin
Copy link
Member

pquentin commented May 9, 2023

I'd argue that this is a bug in the Elasticsearch Python client: BulkIndexError should be an ApiError, but it's not.

If it was an ApiError, then this would solve our problem, right? In the meantime, should we catch both ApiError and BulkIndexError in the same except statement?

@inqueue
Copy link
Member

inqueue commented May 9, 2023

I'd argue that this is a bug in the Elasticsearch Python client: BulkIndexError should be an ApiError, but it's not.

I see it both ways, though there is a difference. My interpretation is an ApiError is any non-2xx HTTP request status code whereas a BulkIndexError is an error strictly related to failed actions of the bulk request. The bulk HTTP request could end in a 2xx response even if it contains one or more errors of BulkIndexError. I think the ES Python client is doing the right thing.

Copy link
Member

@inqueue inqueue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending any concerns from @pquentin.

Copy link
Member

@pquentin pquentin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this is an explicit decision from the Python client that is mentioned in the 8.x migration guide: https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/migration.html#migration-error-types:

elasticsearch.helpers.errors.BulkIndexError and elasticsearch.helpers.errors.ScanError now use Exception as a base class instead of ElasticsearchException.

So that sounds good, I just left a few nits.

except elasticsearch.helpers.BulkIndexError as e:
for err in e.errors:
err_type = err.get("index", {}).get("error", {}).get("type", None)
if err.get("index", {}).get("status", None) not in (502, 503, 504, 429):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move that list of error codes to a constant now that it's use twice?

I know you did not come up with the list, but I'm not sure I find the list very convincing. Even 429 seems OK to retry after having slept.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, and I'm not sure I understand your last two sentences:

I know you did not come up with the list, but I'm not sure I find the list very convincing. Even 429 seems OK to retry after having slept.

We do retry on 429s, but I can see that it can be easily misread. A side effect of using a constant is that it's now much more 'readable' (IMHO).

Addressed in 1d88020

},
]

retriable_errors = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also understand this is not your code, but can we please stick to either retriable or retryable? (Both are correct: https://english.stackexchange.com/questions/305273/retriable-or-retryable)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's most certainly a mistake to ask an Australian about proper use of the English language.

Addressed in 1d88020

@@ -114,6 +114,27 @@ def guarded(self, target, *args, **kwargs):

try:
return target(*args, **kwargs)
except elasticsearch.helpers.BulkIndexError as e:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please move this block just before except ApiError to highlight that they are similar?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 1d88020

@b-deam b-deam requested a review from pquentin May 11, 2023 00:13
Copy link
Member

@pquentin pquentin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating and for your patience! LGTM

@b-deam b-deam merged commit 9d771b7 into elastic:master May 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something's wrong :Metrics How metrics are stored, calculated or aggregated
Projects
None yet
Development

Successfully merging this pull request may close these issues.

A bad mapping can cause Rally can get into a bad state during bulk indexing
3 participants