
[Transform] implement throttling in indexer #55011

Merged

hendrikmuhs merged 16 commits into elastic:master on Apr 30, 2020

Conversation

hendrikmuhs
Contributor

@hendrikmuhs hendrikmuhs commented Apr 9, 2020

Implement throttling in the async indexer used by rollup and transform. The added docs_per_second parameter is used to calculate a delay before the next
search request is sent. With re-throttle it's possible to change the parameter at
runtime; at stop it's ensured that despite throttling the indexer stops in
reasonable time.

relates #54862

This PR adds the basics for using throttling; usage/exposure of this feature is planned for separate PRs, which is why I label this as non-issue.

@hendrikmuhs hendrikmuhs added :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data v8.0.0 :ml/Transform Transform v7.8.0 labels Apr 9, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Rollup)

@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml/Transform)

Member

@benwtrent benwtrent left a comment


I like the change. Test coverage is nice :D.

I am a bit concerned around synchronization with the scheduledNextSearch.

Contributor

@jimczi jimczi left a comment


I have mixed feelings about this change. On the one hand I understand why this could be useful, but on the other hand I wonder if this is the right place/method to do it.
I wonder, for instance, if we could use the search option called max_concurrent_shard_requests to limit the impact on the cluster? Instead of throttling the search requests entirely, this option could control the number of shard requests per node that can be executed concurrently.
That seems easier for users than setting a number of requests per second, since this number will depend greatly on the number of unique docs per bucket.
If this is not enough, I think we should at least ensure that cancellation of the scheduled search is handled properly. Without clear synchronization, I am not sure how we can ensure that another scheduled search is not created while calling triggerThrottledSearchNow?

@hendrikmuhs
Contributor Author

I have mixed feelings about this change. On the one hand I understand why this could be useful, but on the other hand I wonder if this is the right place/method to do it.
I wonder, for instance, if we could use the search option called max_concurrent_shard_requests to limit the impact on the cluster? Instead of throttling the search requests entirely, this option could control the number of shard requests per node that can be executed concurrently.
That seems easier for users than setting a number of requests per second, since this number will depend greatly on the number of unique docs per bucket.

I disagree with that, max_concurrent_shard_requests is a much more abstract concept and harder to understand for a user than requests_per_second. It requires deep knowledge of the system, including the number of configured shards, number of nodes, etc. This information might not be available (due to enabled security). The default of 5 does not provide much room for tweaking.

I agree that requests_per_second is not ideal; it highly depends on the system, the shape of the data, and the aggregations used. Note that requests_per_second is a concept from reindex, which suffers the same problem (though it is less problematic there, as reindex does not use aggregations).

In transform you can retrieve the current requests per second using _stats, although it requires some calculation. It would be helpful to calculate it directly, so that a user can see it. A workflow could be:

  1. Realize the transform is causing too much load.
  2. Check _stats for the current requests_per_second.
  3. Call _update on the transform and, e.g., set it to 50% of the output of step 2.

There are some ideas for "smart throttling"; one way could be to automate the three steps above. But this is not planned for the first version. A low-hanging fruit we thought about: suggest a value for requests_per_second in the UI based on data from other transforms.
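As a rough illustration of the workflow above, here is a minimal, hypothetical helper that would implement steps 2 and 3 (the parameter names are assumptions for the sketch, not fields of the actual transform _stats API):

```java
// Hypothetical sketch: derive the currently achieved throughput from stats
// values and suggest 50% of it as the new throttle value, as described in
// the three-step workflow. Not actual Elasticsearch code.
public class ThrottleSuggestion {

    public static float suggestDocsPerSecond(long documentsProcessed, long searchTimeMillis) {
        if (searchTimeMillis <= 0) {
            return -1f; // not enough data to make a suggestion
        }
        // step 2: the effective rate the transform currently achieves
        float currentRate = documentsProcessed / (searchTimeMillis / 1000.0f);
        // step 3: suggest half of the measured rate
        return currentRate * 0.5f;
    }

    public static void main(String[] args) {
        // e.g. 100,000 docs processed over 200s of search time:
        // 500 docs/s measured, so suggest 250 for the _update call
        System.out.println(suggestDocsPerSecond(100_000L, 200_000L));
    }
}
```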

Speaking of alternatives, I think the dream solution is this: #37867 as it would implement load shaping instead of brute force throttling.

Another alternative that would be great from a usage perspective: budgeting the search call. Instead of using size, it would be nice to specify ops or time_spent; a search would then return whatever it got within this budget (somewhat similar to terminate_after). For indexing this must still return correct results (complete buckets); in _preview, where we face some performance issues too, we only need approximate results.

There are definitely better solutions than requests per second, but for now we have to use what's available.

If this is not enough, I think we should at least ensure that cancellation of the scheduled search is handled properly. Without clear synchronization, I am not sure how we can ensure that another scheduled search is not created while calling triggerThrottledSearchNow?

You mean https://github.com/elastic/elasticsearch/pull/55011/files#diff-01a4ee37232ad8becca8990702ebb4f0R489?

The answer is: we do not ensure it. This works optimistically (to avoid overhead) and, to the best of my knowledge, should work:

Assume thread A runs the indexer and a request for stop comes in from thread B. It's possible that although thread B sets the state to STOPPING, thread A schedules a search. That's why I added 7c55d35, which looks a bit strange but handles the described problem. The underlying Java futures are thread-safe afaik, so concurrent access to them shouldn't be a problem.

Note that there is always only one thread A, but there can be multiple threads B.
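The optimistic scheme described here can be sketched roughly as follows (names are illustrative, not the actual AsyncTwoPhaseIndexer code): thread B flips the state, and thread A re-checks the state right before firing, so a search scheduled during the race is still suppressed.

```java
// Hypothetical sketch of the optimistic stop handling described above.
// Thread B sets the state to STOPPING; thread A re-checks the state
// immediately before issuing the next search, catching the race where a
// search was scheduled after the stop request came in.
import java.util.concurrent.atomic.AtomicReference;

public class OptimisticStopSketch {
    public enum State { INDEXING, STOPPING }

    private final AtomicReference<State> state = new AtomicReference<>(State.INDEXING);

    // Thread B (there may be several): request a stop.
    public void stop() {
        state.set(State.STOPPING);
        // ...would also try to cancel a scheduled search here; if cancel
        // races with execution, the check in fireSearch() catches it.
    }

    // Thread A (only ever one): called when the throttle delay elapses.
    public boolean fireSearch() {
        if (state.get() != State.INDEXING) {
            return false; // stop won the race; do not search
        }
        // ...issue the next search request...
        return true;
    }
}
```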

@hendrikmuhs
Contributor Author

I implemented some improvements after an offline discussion:

  • The state is checked more often; that way a query is not sent if the user stopped the job.
  • The waiting time is exposed to the transform/rollup.
    • This allows further optimizations like running with a lower page size, batched_reduce_size, max_concurrent_shard_requests, and other options; this will be done in separate/enabling PRs (rollup/transform specific).

Member

@benwtrent benwtrent left a comment


I have some concurrency concerns. I THINK cancel will simply return true if the thread is already executing, which might cause issues.

@@ -178,6 +208,35 @@ public synchronized boolean maybeTriggerAsyncJob(long now) {
}
}

/**
* Cancels a scheduled search request and issues the search request immediately
Member

Suggested change:
- * Cancels a scheduled search request and issues the search request immediately
+ * Cancels a scheduled search request

* Cancels a scheduled search request and issues the search request immediately
*/
private synchronized void stopThrottledSearch() {
if (scheduledNextSearch != null && scheduledNextSearch.cancel()) {
Member

From what I understand around cancel, the only times it will return false are:

  • If the action has already been completed
  • If the action has already been cancelled.

This means it will return true if the thread is executing.

threadPool.executor(executorName).execute(() -> checkState(getState()));

Could happen in the middle of triggerNextSearch. This MIGHT be ok, but the call to checkState(getState()) might transition from STOPPING -> STOPPED while a search is still in flight. I am not sure this is intended behavior.

@@ -461,4 +562,37 @@ private boolean checkState(IndexerState currentState) {
}
}

private synchronized void reQueueThrottledSearch() {
if (scheduledNextSearch != null && scheduledNextSearch.cancel()) {
Member

Similar comment to above: a current search could be in flight. This means that if the next delay is 0L, we could have two searches occurring in parallel.

As long as this method is NEVER called out of band, I think this might be ok.

@hendrikmuhs
Contributor Author

I have some concurrency concerns. I THINK cancel will simply return true if the thread is already executing, which might cause issues.

Do you have a link? My source (code) says:

This attempt will fail if the task has already completed, has already been cancelled or could not be cancelled for some other reason.
...
@return {@code false} if the task could not be cancelled,
typically because it has already completed normally;
{@code true} otherwise

(same as https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Future.html#cancel(boolean), we call it with false)

So my understanding is that cancel will return false if it could not stop it, e.g. if it's running or has run.

@benwtrent
Member

benwtrent commented Apr 21, 2020

@hendrikmuhs there is nothing in the Javadoc that gives guarantees for the case where the thread is already executing and we are not interrupting.

@return {@code false} if the task could not be cancelled,
typically because it has already completed normally;
{@code true} otherwise

This seems to indicate to me that IF the action is already running, it will NOT be interrupted and cancel will return true.

This seems to be validated here (unless my code is wrong): https://repl.it/@ben_w_trent/WhenIsCancelTrue
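The behavior can be reproduced in a small stand-alone demo (hypothetical code, not from the PR): a FutureTask whose body is still running has not yet transitioned out of its internal NEW state, so cancel(false) succeeds and returns true even though the body keeps executing to completion.

```java
// Demonstrates that FutureTask.cancel(false) returns true while the task
// body is provably still executing, matching the repl.it observation above.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.FutureTask;

public class CancelDemo {

    public static boolean cancelWhileRunning() {
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        FutureTask<Void> task = new FutureTask<>(() -> {
            started.countDown();   // signal: the body has started
            release.await();       // keep the body "in flight"
            return null;
        });
        Thread runner = new Thread(task);
        runner.start();
        try {
            started.await();                        // body is executing now
            boolean cancelled = task.cancel(false); // no interrupt requested
            release.countDown();                    // let the body finish
            runner.join();
            return cancelled;
        } catch (InterruptedException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        // prints true: the task was running, was not interrupted,
        // and yet cancel reported success
        System.out.println(cancelWhileRunning());
    }
}
```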

@hendrikmuhs
Contributor Author

Your code obviously proves it. However, I think I am not the only one who finds the documentation poor, e.g. https://bugs.openjdk.java.net/browse/JDK-8022624. SO has some posts, too.

@hendrikmuhs
Contributor Author

I addressed the test problems in the separate PR #55666, to avoid making this one more complicated.

Member

@benwtrent benwtrent left a comment


Seems OK, now that rescheduling won't kick off another search request if one is in flight.

Would be good to get another approval before merge.

hendrikmuhs pushed a commit that referenced this pull request Apr 24, 2020
Improve tests related to stopping by using a client that answers and can be
synchronized with the test thread, in order to test special situations.

relates #55011
Contributor

@droberts195 droberts195 left a comment


I saw a couple of nits but I stopped reviewing when I saw the throttling formula because it seems to highlight a fundamental issue: are we intending to throttle based on search requests per second or documents retrieved per second? What changes are required will depend on the answer to that.

if (requestsPerSecond <= 0) {
return TimeValue.ZERO;
}
float timeToWaitNanos = (docCount / requestsPerSecond) * TimeUnit.SECONDS.toNanos(1);
Contributor

This formula implies that requestsPerSecond is really desiredDocsPerSecond. If that's correct then requestsPerSecond seems like it will cause confusion in the future because I would assume requestsPerSecond referred to the number of searches, each of which could return many documents.

For example, if I saw a configuration parameter requests_per_second I might decide to set it to 2 so that I'd get a maximum of 2 search requests per second from this functionality. But then if one of my searches returns 1000 documents then I get a 500 second wait until the next search.
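Pulled out as a stand-alone sketch (returning a plain long of nanoseconds instead of the Elasticsearch TimeValue, which is an assumption for this demo), the quoted formula and the 1000-docs-at-2-per-second example look like this:

```java
// Stand-alone sketch of the throttling formula quoted above: with
// docCount = 1000 and requestsPerSecond = 2, the computed wait before
// the next search is roughly 500 seconds, illustrating the naming concern.
import java.util.concurrent.TimeUnit;

public class ThrottleMath {

    public static long waitNanos(long docCount, float requestsPerSecond) {
        if (requestsPerSecond <= 0) {
            return 0L; // throttling disabled
        }
        // same float arithmetic as the quoted snippet
        float timeToWaitNanos = (docCount / requestsPerSecond) * TimeUnit.SECONDS.toNanos(1);
        return (long) timeToWaitNanos;
    }

    public static void main(String[] args) {
        // prints 500: one search returning 1000 docs under a "2 per second"
        // setting waits ~500 seconds before the next search
        System.out.println(Math.round(waitNanos(1000, 2.0f) / 1e9));
    }
}
```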

Contributor Author

I saw a couple of nits but I stopped reviewing when I saw the throttling formula because it seems to highlight a fundamental issue: are we intending to throttle based on search requests per second or documents retrieved per second? What changes are required will depend on the answer to that.

This design and the reasons for it are discussed in the tracking issue: #54862. In a nutshell: you are right that requests_per_second is misleading. The name was still chosen because it's an existing concept from reindex. It's wrong there, too.

Contributor

Ah yes, I missed that requests_per_second is used in this way for reindex already. In that case consistency with reindex is more important.

@hendrikmuhs
Contributor Author

run elasticsearch-ci/2

@hendrikmuhs
Contributor Author

Status update: we are evaluating/discussing renaming requests_per_second; a possible candidate is documents_per_second.

@hendrikmuhs
Contributor Author

The setting got renamed to docs_per_second. This PR is ready from my side.

@droberts195 @jimczi would be nice to get your input.

Contributor

@droberts195 droberts195 left a comment


The setting got renamed to docs_per_second.

I think that is much better as it now reflects how the throttling is done. It would be nice if reindex v2 used the same setting for consistency between similar features.

LGTM

Contributor

@jimczi jimczi left a comment


LGTM2

@hendrikmuhs hendrikmuhs changed the title [Transform] implement throttling in indexer, to be used in transform [Transform] implement throttling in indexer Apr 30, 2020
@hendrikmuhs hendrikmuhs merged commit 72a43dd into elastic:master Apr 30, 2020
@hendrikmuhs hendrikmuhs deleted the transform-throttling-part1 branch April 30, 2020 05:07
hendrikmuhs pushed a commit to hendrikmuhs/elasticsearch that referenced this pull request Apr 30, 2020
Implement throttling in the async indexer used by rollup and transform. The added
docs_per_second parameter is used to calculate a delay before the next
search request is sent. With re-throttle it's possible to change the parameter
at runtime. When stopping a running job, it's ensured that despite throttling
the indexer stops in reasonable time. This change contains the groundwork, but
does not expose the new functionality.

relates elastic#54862
# Conflicts:
#	x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/rollup/job/RollupIndexer.java
#	x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/rollup/job/RollupJobTask.java
#	x-pack/plugin/rollup/src/test/java/org/elasticsearch/xpack/rollup/job/RollupIndexerIndexingTests.java
#	x-pack/plugin/rollup/src/test/java/org/elasticsearch/xpack/rollup/job/RollupIndexerStateTests.java
hendrikmuhs pushed a commit that referenced this pull request Apr 30, 2020
Implement throttling in the async indexer used by rollup and transform. The added
docs_per_second parameter is used to calculate a delay before the next
search request is sent. With re-throttle it's possible to change the parameter
at runtime. When stopping a running job, it's ensured that despite throttling
the indexer stops in reasonable time. This change contains the groundwork, but
does not expose the new functionality.

relates #54862
backport: #55011
@hendrikmuhs hendrikmuhs mentioned this pull request Apr 30, 2020
Labels
:ml/Transform Transform >non-issue :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data v7.8.0 v8.0.0-alpha1