Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A bad mapping can cause Rally can get into a bad state during bulk indexing #1697

Closed
inqueue opened this issue Mar 30, 2023 · 0 comments · Fixed by #1703
Closed

A bad mapping can cause Rally can get into a bad state during bulk indexing #1697

inqueue opened this issue Mar 30, 2023 · 0 comments · Fixed by #1703
Labels
bug Something's wrong

Comments

@inqueue
Copy link
Member

inqueue commented Mar 30, 2023

elastic/rally-tracks@cffc864 led to a strange stream of daily failures in the nightly benchmarks for http_logs where jobs were taking a full 24 hours to fail1. The simplified Rally race invocation:

rally race --track="http_logs" --challenge="append-index-only-with-ingest-pipeline" --on-error="abort" --runtime-jdk="bundled" --track-params="{\"source_enabled\": false, \"ingest_pipeline\": \"grok\"}" --pipeline="from-sources"

Use elastic/rally-tracks@cffc864 without --test-mode to reproduce. The issue does not reproduce in test mode. When left to run, Rally displays Running index-append-with-ingest-grok-pipeline [100% done], then remains in the state forever.

image

2023-03-30 20:38:16,376 ActorAddr-(T|:33597)/PID:8312 esrally.actor ERROR Error in driver
Traceback (most recent call last):

  File "/home/esbench/rally/esrally/actor.py", line 92, in guard
    return f(self, msg, sender)
           ^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/driver.py", line 303, in receiveMsg_WakeupMessage
    self.coordinator.post_process_samples()

  File "/home/esbench/rally/esrally/driver/driver.py", line 955, in post_process_samples
    self.sample_post_processor(raw_samples)

  File "/home/esbench/rally/esrally/driver/driver.py", line 1068, in __call__
    self.metrics_store.flush(refresh=False)

  File "/home/esbench/rally/esrally/metrics.py", line 906, in flush
    self._client.bulk_index(index=self._index, items=self._docs)

  File "/home/esbench/rally/esrally/metrics.py", line 92, in bulk_index
    self.guarded(elasticsearch.helpers.bulk, self._client, items, index=index, chunk_size=5000)

  File "/home/esbench/rally/esrally/metrics.py", line 116, in guarded
    return target(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/.local/lib/python3.11/site-packages/elasticsearch/helpers/actions.py", line 524, in bulk
    for ok, item in streaming_bulk(

  File "/home/esbench/.local/lib/python3.11/site-packages/elasticsearch/helpers/actions.py", line 438, in streaming_bulk
    for data, (ok, info) in zip(

  File "/home/esbench/.local/lib/python3.11/site-packages/elasticsearch/helpers/actions.py", line 355, in _process_bulk_chunk
    yield from gen

  File "/home/esbench/.local/lib/python3.11/site-packages/elasticsearch/helpers/actions.py", line 274, in _process_bulk_chunk_success
    raise BulkIndexError(f"{len(errors)} document(s) failed to index.", errors)

elasticsearch.helpers.BulkIndexError: 118 document(s) failed to index.

2023-03-30 20:38:16,384 ActorAddr-(T|:33597)/PID:8312 esrally.actor ERROR Main driver received a fatal exception from a load generator. Shutting down.

2023-03-30 20:38:22,339 ActorAddr-(T|:33597)/PID:8312 esrally.driver.driver.DriverActor WARNING Actor esrally.driver.driver.DriverActor @ ActorAddr-(T|:33597) retryable exception on message <esrally.actor.BenchmarkFailure object at 0x7f083132c
6d0>
Traceback (most recent call last):

  File "/home/esbench/.local/lib/python3.11/site-packages/thespian/system/actorManager.py", line 163, in _handleOneMessage
    actor_result = self.actorInst.receiveMessage(msg, envelope.sender)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/.local/lib/python3.11/site-packages/thespian/actors.py", line 838, in receiveMessage
    r = getattr(klass, methodName)(self, message, sender)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/rally/esrally/driver/driver.py", line 242, in receiveMsg_BenchmarkFailure
    self.coordinator.close()

  File "/home/esbench/rally/esrally/driver/driver.py", line 921, in close
    self.metrics_store.close()

  File "/home/esbench/rally/esrally/metrics.py", line 454, in close
    self.flush()

  File "/home/esbench/rally/esrally/metrics.py", line 906, in flush
    self._client.bulk_index(index=self._index, items=self._docs)

  File "/home/esbench/rally/esrally/metrics.py", line 92, in bulk_index
    self.guarded(elasticsearch.helpers.bulk, self._client, items, index=index, chunk_size=5000)

  File "/home/esbench/rally/esrally/metrics.py", line 116, in guarded
    return target(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/esbench/.local/lib/python3.11/site-packages/elasticsearch/helpers/actions.py", line 524, in bulk
    for ok, item in streaming_bulk(

  File "/home/esbench/.local/lib/python3.11/site-packages/elasticsearch/helpers/actions.py", line 438, in streaming_bulk
    for data, (ok, info) in zip(

  File "/home/esbench/.local/lib/python3.11/site-packages/elasticsearch/helpers/actions.py", line 355, in _process_bulk_chunk
    yield from gen

  File "/home/esbench/.local/lib/python3.11/site-packages/elasticsearch/helpers/actions.py", line 274, in _process_bulk_chunk_success
    raise BulkIndexError(f"{len(errors)} document(s) failed to index.", errors)

OS: "Ubuntu 18.04.6 LTS

Bad Mapping

The mapping issue was about the date format of @timestamp. The index mapping was set to epoch_second when it should have been strict_date_optional_time. Bulk requests would have encountered this response for every single document in the corpus:

{
  "took": 0,
  "ingest_took": 0,
  "errors": true,
  "items": [
    {
      "index": {
        "_index": "logs-181998",
        "_type": "_doc",
        "_id": "br7lM4cBM1F21Bvv-SHk",
        "status": 400,
        "error": {
          "type": "mapper_parsing_exception",
          "reason": "failed to parse field [@timestamp] of type [date] in document with id 'br7lM4cBM1F21Bvv-SHk'. Preview of field's value: '1998-04-30T14:30:17-05:00'",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "failed to parse date field [1998-04-30T14:30:17-05:00] with format [epoch_second]",
            "caused_by": {
              "type": "date_time_parse_exception",
              "reason": "Failed to parse with all enclosed parsers"
            }
          }
        }
      }
    }
...
    }
}

These result in a 200 status code response, even though the error status of the individual document was a 400. We should consider exiting if the bulk request responds with errors: true.

Footnotes

  1. https://elasticsearch-ci.elastic.co/blue/organizations/jenkins/elastic%2Belasticsearch%2Bmain%2Bmacrobenchmark-periodic-group-2/detail/elastic%2Belasticsearch%2Bmain%2Bmacrobenchmark-periodic-group-2/382/pipeline

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something's wrong
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant