OOM errors caused by large queue of cluster state update tasks #21568

Closed
ywelsch opened this issue Nov 15, 2016 · 4 comments

Labels
:Distributed/Distributed Meta

@ywelsch
Contributor

ywelsch commented Nov 15, 2016

We've had multiple reports of master nodes going OOM when dealing with large numbers of pending cluster state update tasks. I am mainly opening this issue to explain what the implementation issues are and to link to current work being done to fix them.

Before talking about the actual issues, let me quickly recapitulate what these update tasks are. Cluster state update tasks are units of work that are processed on the master node to change the cluster state, for example when creating a new index, updating mappings or changing the routing table (allocating / starting / moving shards). Some tasks have timeouts, others do not. Most tasks that result from user action (e.g. index creation) have timeouts which are configurable on the request with the master_timeout parameter (default is 30 seconds). Tasks are submitted to a single-threaded thread pool executor that considers priorities as well as insertion order when choosing the next task to execute.
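To make the scheduling model concrete, here is a minimal, self-contained sketch of such an executor: a single worker thread draining a priority queue in which ties between equal-priority tasks are broken by insertion order. All names here are illustrative; this is not Elasticsearch's actual PrioritizedEsThreadPoolExecutor, just the same idea in miniature.

```java
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: single-threaded executor ordering work by (priority, insertion order).
final class PrioritizedSingleThreadExecutor {

    enum Priority { URGENT, HIGH, NORMAL, LOW } // enum ordinal doubles as the rank

    private static final AtomicLong INSERTION_COUNTER = new AtomicLong();

    private static final class Task implements Comparable<Task> {
        final Priority priority;
        final long insertionOrder = INSERTION_COUNTER.getAndIncrement();
        final Runnable runnable;

        Task(Priority priority, Runnable runnable) {
            this.priority = priority;
            this.runnable = runnable;
        }

        @Override
        public int compareTo(Task other) {
            int byPriority = priority.compareTo(other.priority);
            // ties broken by insertion order, so equal-priority tasks run FIFO
            return byPriority != 0 ? byPriority
                                   : Long.compare(insertionOrder, other.insertionOrder);
        }
    }

    private final PriorityBlockingQueue<Task> queue = new PriorityBlockingQueue<>();

    PrioritizedSingleThreadExecutor() {
        Thread worker = new Thread(() -> {
            while (true) {
                try {
                    queue.take().runnable.run(); // blocks until a task is available
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }, "cluster-state-update");
        worker.setDaemon(true);
        worker.start();
    }

    void submit(Priority priority, Runnable runnable) {
        queue.add(new Task(priority, runnable));
    }
}
```

A real implementation additionally needs per-task timeout handling (the TieBreakingPrioritizedRunnable mentioned below), which is exactly where the first issue arises.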

  1. The first issue is related to batching of cluster state updates. Even though a task might have timed out, the resources it needs to execute won't be cleaned up right away (in certain situations they won't be cleaned up at all, which amounts to a memory leak). The reason is that with the introduction of task batching, a new map was added to the cluster service that holds a reference to the UpdateTask. The task is only removed from this map once it gets to execute, which can happen much later, or not at all - see TieBreakingPrioritizedRunnable#scheduleTimeout and #runAndClean. A minimal sketch of this leak pattern follows the list.

  2. The second issue is that cluster state update tasks often reference large states. One of the worst offenders is the action listener in TransportMasterNodeAction.AsyncSingleAction, which has a ClusterStateObserver to coordinate the retry mechanism if the action on the master node fails due to the node no longer being master. The ClusterStateObserver in AsyncSingleAction keeps a reference to the full cluster state from when the action was initiated. If the pending tasks queue grows quite large and contains older items, many cluster states can be referenced at once. In ES 5.x this is less of an issue than in 2.x because 5.x has immutable ShardRoutings (Immutable ShardRouting #17821), which lets us reuse ShardRouting objects across cluster states. The solution here, I think, is to look at how ClusterStateObserver is used and to recognize that keeping the full old cluster state around is not really needed (we only care about the version / status and cluster state identity of the old state).

  3. The third issue is that the pending task queue is unbounded. Here we could look at ways to constrain the number of tasks that can be submitted, possibly trading CPU time for memory; the second sketch after this list shows one way submissions could be bounded.
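For point 1, the following is a hypothetical, much-reduced illustration of the leak pattern: a batching map that keeps tasks alive even after their timeout fires, because the timeout handler only notifies the caller and never cleans up the map entry. Class and member names (BatchingLeakSketch, tasksPerBatchingKey, executeBatch) are invented for illustration and do not mirror the actual cluster service code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the leak: the batching map retains tasks past their timeout.
final class BatchingLeakSketch {

    static final class UpdateTask {
        final String source;
        final Object payload; // may transitively reference a large cluster state
        UpdateTask(String source, Object payload) { this.source = source; this.payload = payload; }
    }

    private final Map<Object, List<UpdateTask>> tasksPerBatchingKey = new HashMap<>();
    private final ScheduledExecutorService timeoutScheduler =
            Executors.newSingleThreadScheduledExecutor();

    synchronized void submit(Object batchingKey, UpdateTask task, long timeoutMillis) {
        tasksPerBatchingKey.computeIfAbsent(batchingKey, k -> new ArrayList<>()).add(task);
        timeoutScheduler.schedule(() -> onTimeout(task), timeoutMillis, TimeUnit.MILLISECONDS);
    }

    private void onTimeout(UpdateTask task) {
        // BUG (the leak): the caller is notified of the timeout, but the task is
        // never removed from tasksPerBatchingKey. If the batch never executes,
        // the map retains the task (and everything it references) indefinitely.
        System.out.println("timed out: " + task.source);
    }

    synchronized void executeBatch(Object batchingKey) {
        // Cleanup only happens here, which may be much later, or never.
        List<UpdateTask> batch = tasksPerBatchingKey.remove(batchingKey);
        if (batch != null) {
            batch.forEach(t -> System.out.println("executing: " + t.source));
        }
    }
}
```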
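And for point 3, one possible shape of a bounded queue, sketched under the assumption that rejecting a submission outright (and failing the request back to the client) is an acceptable policy. This illustrates the trade-off being discussed, not a change that was actually made:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.RejectedExecutionException;

// Sketch: cap the pending-task queue instead of letting it grow without bound.
final class BoundedSubmission {
    private final BlockingQueue<Runnable> pending = new ArrayBlockingQueue<>(1_000);

    void submit(Runnable task) {
        if (!pending.offer(task)) {
            // Callers must handle rejection, e.g. by failing the request back
            // to the client, instead of letting the master node go OOM.
            throw new RejectedExecutionException("pending cluster state update queue is full");
        }
    }
}
```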

Fixes:

@ywelsch ywelsch self-assigned this Nov 15, 2016
ywelsch added a commit that referenced this issue Dec 20, 2016
…e detection (#21631)

ClusterStateObserver is a utility class that simplifies interacting with the cluster state in cases where an action takes a decision based on the current cluster state but may want to wait for a new state and retry upon failure. The ClusterStateObserver implements its functionality by keeping a reference to the last cluster state that it observed. When a new ClusterStateObserver is created, it samples a cluster state from the cluster service which is subsequently used for change detection. If actions take a long time to process, however, the cluster observer can reference very old cluster states. Due to cluster observers being created very frequently and cluster states being potentially large, the referenced cluster states can waste a lot of heap space.

A specific example where this can make a node go out of memory is given in point 2 of issue #21568: the action listener in TransportMasterNodeAction.AsyncSingleAction has a ClusterStateObserver to coordinate the retry mechanism if the action on the master node fails due to the node no longer being master. The ClusterStateObserver in AsyncSingleAction keeps a reference to the full cluster state from when the action was initiated. If the pending tasks queue grows quite large and contains older items, many cluster states can be referenced at once.

This commit changes the ClusterStateObserver to hold only onto the part of the cluster state that's needed for change detection.
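The idea behind the fix, reduced to a minimal sketch: remember only the identity of the sampled state (e.g. its version and state UUID) rather than the state itself, so change detection works without pinning the large object on the heap. The class and field names below are assumptions for illustration, not the actual #21631 code.

```java
// Sketch: a lightweight stand-in for the full ClusterState, keeping only
// what change detection needs.
final class ObservedState {
    final long version;
    final String stateUUID;

    ObservedState(long version, String stateUUID) {
        this.version = version;
        this.stateUUID = stateUUID;
    }

    // A new state counts as "changed" if its identity differs from the sample;
    // the large ClusterState object itself is never retained between checks.
    boolean isDifferentFrom(long newVersion, String newStateUUID) {
        return newVersion != version || !newStateUUID.equals(stateUUID);
    }
}
```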
@amazinganshul

Will the changes be merged into the 1.7.4 branch as well? We recently faced the same issue. Our master nodes had heap configured at 8GB.

@jasontedor
Member

> Will the changes be merged into the 1.7.4 branch as well? We recently faced the same issue. Our master nodes had heap configured at 8GB.

No. The 1.7 series has reached end-of-life and will not see any more changes.

@clintongormley clintongormley added :Distributed/Distributed and removed :Cluster labels Feb 13, 2018
@bleskes
Contributor

bleskes commented Mar 20, 2018

@ywelsch can you advise on the state of things here? It seems that the unbounded nature of the queue hasn't been addressed (though it's tricky as to what we should do, imo).

@ywelsch
Contributor Author

ywelsch commented Mar 20, 2018

With the fixes done for 1) and 2), we have not received reports of this kind of OOM recently. As updates no longer hog memory, the remaining issues due to the unbounded nature of the pending cluster tasks queue have manifested as the master node becoming slow/unavailable/CPU-bound when running into a high number of pending tasks. Other open issues capture work that helps with avoiding or better handling a large number of tasks, so we can close this one.

@ywelsch ywelsch closed this as completed Mar 20, 2018