Create an ES client per simulated client instead of per worker. #1516

Merged

Conversation

michaelbaamonde
Contributor

This is a prerequisite for #1067 and also indirectly relates to #1350.

As discussed in the context of API keys, runners shouldn't have to concern themselves with ES transport options; they should just be provided ES client objects that already have the right options set. This PR is the first step in that direction.

Previously, an ES client was created per worker and shared among all of that worker's simulated clients. Now, each worker instead creates a distinct ES client object for every simulated client it is responsible for. This dedicated ES client is what ultimately gets passed to the runner(s) invoked by that simulated client.

In lower-level terms, a worker is represented by an AsyncIoAdapter object, while a simulated client maps to an AsyncExecutor object. An AsyncIoAdapter will now create an ES client for each client_id that it's responsible for and pass it to the AsyncExecutor instantiated for that client_id. The AsyncExecutor ultimately invokes the runner(s) that issue requests via that ES client.
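
To make the new wiring concrete, here is a minimal sketch of the idea. This is not Rally's actual implementation: the class bodies and the create_es_client helper are simplified placeholders.

```python
from elasticsearch import AsyncElasticsearch


def create_es_client(client_options: dict) -> AsyncElasticsearch:
    # Placeholder for Rally's client factory: each simulated client gets its
    # own ES client, built from the options that client needs (auth, TLS, ...).
    return AsyncElasticsearch(**client_options)


class AsyncExecutor:
    """Stands in for a simulated client; it hands its dedicated ES client to runners."""

    def __init__(self, client_id: int, es: AsyncElasticsearch):
        self.client_id = client_id
        self.es = es

    async def execute(self, runner, params):
        # Runners never deal with transport options themselves; they simply
        # use the client they are given.
        return await runner(self.es, params)


class AsyncIoAdapter:
    """Stands in for a worker; it owns several simulated clients (client_ids)."""

    def __init__(self, client_ids, options_per_client: dict):
        # Previously: one shared ES client for all client_ids of this worker.
        # Now: one dedicated ES client per client_id.
        self.executors = [
            AsyncExecutor(client_id, create_es_client(options_per_client[client_id]))
            for client_id in client_ids
        ]
```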

This isn't the only possible implementation, though. As the discussion linked above details, version 8.0+ of the ES Python client offers a new .options() method on clients for overriding transport options on a per-request basis. It doesn't mutate the client, but rather returns a new client object with the provided transport options set.
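
For reference, the 8.x .options() API works roughly as follows; the endpoint, credentials, and index name below are placeholders rather than anything from this PR.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="base-key")

# .options() returns a *new* client with the overrides applied; `es` itself
# is left untouched.
scoped = es.options(request_timeout=10, ignore_status=404)
scoped.indices.delete(index="logs-2022-06")
```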

We originally planned on using this new API, either by upgrading our client dependency (which we've paused for now) or by implementing .options() ourselves in Rally today. However, after getting into the details more, it seems that .options() isn't strictly necessary right now, and may actually make implementation more difficult and risky. Here's why:

  1. While the .options() method is meant for request-level overrides, creating clients with their transport options already defined remains the normal way of doing things. "Overriding default transport settings at request time" is one way to conceptualize what we're trying to do here, but it's just as valid for workers to instantiate ES clients upfront with the transport options they already know they'll need (see the sketch after this list). At a lower level, individual runners may have a genuine need to override those options (e.g. setting a request_timeout or ignore_status), but that isn't necessary at the level of simulated clients in order to support something like client-specific authentication. The downside could be the overhead of creating potentially many more ES client objects, but...

  2. .options() will do that anyway. As mentioned, .options() doesn't mutate the object it's called on, but rather creates a new object of its type. And this leads to the real trickiness: Rally's async ES client objects aren't just vanilla instances of AsyncElasticsearch. They also inherit from RequestContextHolder and are therefore stateful. My initial experiments with porting .options() to Rally quickly resulted in benchmarks hanging forever. I have a sense of why this happens, but in light of (1) above, it doesn't seem worth the time or risk to pursue right now. We will need to figure this out eventually for the client upgrade, however, since .options() is how individual runners will need to set options like request_timeout and ignore_status, as mentioned.
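
As a concrete illustration of point (1), client-specific authentication can be handled entirely at client-creation time. The API keys, URL, and dictionary layout below are hypothetical, not Rally's configuration format.

```python
from elasticsearch import AsyncElasticsearch

# Hypothetical per-simulated-client options; in Rally these would be derived
# from the benchmark configuration.
options_per_client = {
    0: {"hosts": "https://localhost:9200", "api_key": "key-for-client-0"},
    1: {"hosts": "https://localhost:9200", "api_key": "key-for-client-1"},
}

# Each simulated client gets its own client instance, created upfront with the
# transport options it needs; no request-time override is required for this.
clients = {
    client_id: AsyncElasticsearch(**opts)
    for client_id, opts in options_per_client.items()
}
```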

So, to sum up, implementing .options() ourselves doesn't seem to offer obvious advantages at the moment, and making it work would probably require refactoring the code that Rally relies on for measuring timings and scheduling task execution, which is risky to say the least.

Regarding the potential overhead of creating more clients, I'm curious to hear input on this from the likes of @DJRickyB and @dliappis. If we merge as-is, we'll quickly get some feedback from the nightlies, but if this is going to cause problems, I'd expect it to be at very high client counts. I ran a logging indexing benchmark with 1,000 bulk indexing clients against both master and this branch of Rally and found no evidence of this change causing a client-side bottleneck. That's just one data point, so I'm happy to flesh out a larger set of test cases if we feel that's in order, but I wanted to get the team's thoughts before investing a lot of time in that.

Previously, Rally would instantiate an ES client per worker and scale its
connection pool to match the number of simulated clients associated with that
worker. This worker-level ES client would be shared across all of the
simulated clients spawned by the worker.

This meant that any simulated client and its siblings would have to use exactly
the same ES transport options, as they all shared a single ES client object.
There are use cases, however, where sibling clients may each need to set
unique transport options (e.g. for authentication). This commit facilitates
support for these lower-level overrides by creating a distinct ES client per
simulated client, not per worker.
@michaelbaamonde michaelbaamonde added the :Load Driver Changes that affect the core of the load driver such as scheduling, the measurement approach etc. label Jun 13, 2022
@pquentin
Member

Thanks for the detailed explanation! I like the approach but would like to hear from others too.

@michaelbaamonde
Contributor Author

We ran several tests with both this branch and master to determine whether this change introduces any signs of a client-side bottleneck at higher client counts. We did not find any evidence to that effect, but, unfortunately, we discovered that in some cases even master has issues scaling beyond ~2k clients on a single load driver due to #1448.

There's no reason in principle to expect this change to significantly affect resource consumption on the load driver in terms of memory usage, file handles, etc. So while we don't have corroboration at very high client counts, the results align with those expectations: this change does no worse than master.

So, we will merge this change and look into the scalability issues separately.

@michaelbaamonde michaelbaamonde merged commit 0ac68fc into elastic:master Jun 21, 2022
@pquentin pquentin added the enhancement Improves the status quo label Jun 23, 2022
@pquentin pquentin added this to the 2.5.0 milestone Jun 23, 2022