OOM errors caused by large queue of cluster state update tasks #21568

Closed
ywelsch opened this issue Nov 15, 2016 · 4 comments

Labels
:Distributed/Distributed Meta

@ywelsch
Contributor

ywelsch commented Nov 15, 2016

We've had multiple reports of master nodes going OOM when dealing with large numbers of pending cluster state update tasks. I am mainly opening this issue to explain what the implementation issues are and to link to current work being done to fix them.

Before talking about the actual issues, let me quickly recapitulate what these update tasks are. Cluster state update tasks are units of work that are processed on the master node to change the cluster state, for example when creating a new index, updating mappings or changing the routing table (allocating / starting / moving shards). Some tasks have timeouts, others do not. Most tasks that result from user action (e.g. index creation) have timeouts which are configurable on the request with the master_timeout parameter (default is 30 seconds). Tasks are submitted to a single-threaded thread pool executor that considers priorities as well as insertion order when choosing the next task to execute.
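To make the scheduling model concrete, here is a minimal, self-contained sketch of such an executor: a single worker thread draining a priority queue in which ties between equal-priority tasks are broken by insertion order. All names here are illustrative; this is not Elasticsearch's actual PrioritizedEsThreadPoolExecutor, just the same idea in miniature.

```java
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: single-threaded executor ordering work by (priority, insertion order).
final class PrioritizedSingleThreadExecutor {

    enum Priority { URGENT, HIGH, NORMAL, LOW } // enum ordinal doubles as the rank

    private static final AtomicLong INSERTION_COUNTER = new AtomicLong();

    private static final class Task implements Comparable<Task> {
        final Priority priority;
        final long insertionOrder = INSERTION_COUNTER.getAndIncrement();
        final Runnable runnable;

        Task(Priority priority, Runnable runnable) {
            this.priority = priority;
            this.runnable = runnable;
        }

        @Override
        public int compareTo(Task other) {
            int byPriority = priority.compareTo(other.priority);
            // ties broken by insertion order, so equal-priority tasks run FIFO
            return byPriority != 0 ? byPriority
                                   : Long.compare(insertionOrder, other.insertionOrder);
        }
    }

    private final PriorityBlockingQueue<Task> queue = new PriorityBlockingQueue<>();

    PrioritizedSingleThreadExecutor() {
        Thread worker = new Thread(() -> {
            while (true) {
                try {
                    queue.take().runnable.run(); // blocks until a task is available
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }, "cluster-state-update");
        worker.setDaemon(true);
        worker.start();
    }

    void submit(Priority priority, Runnable runnable) {
        queue.add(new Task(priority, runnable));
    }
}
```

A real implementation additionally needs per-task timeout handling (the TieBreakingPrioritizedRunnable mentioned below), which is exactly where the first issue arises.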

  1. The first issue is related to batching of cluster state updates. Even though a task might have timed out, the resources it needs to execute won't be cleaned up right away (in certain situations they won't be cleaned up at all, which amounts to a memory leak). The reason is that with the introduction of task batching, a new map was added to the cluster service that holds a reference to the UpdateTask. The task is only removed from this map once it gets to execute, which can happen much later, or not at all - see TieBreakingPrioritizedRunnable#scheduleTimeout and #runAndClean. A minimal sketch of this leak pattern follows the list.

  2. The second issue is that cluster state update tasks often reference large states. One of the worst offenders is the action listener in TransportMasterNodeAction.AsyncSingleAction, which has a ClusterStateObserver to coordinate the retry mechanism if the action on the master node fails due to the node no longer being master. The ClusterStateObserver in AsyncSingleAction keeps a reference to the full cluster state from when the action was initiated. If the pending tasks queue grows quite large and contains older items, many cluster states can be referenced at once. In ES 5.x this is less of an issue than in 2.x because 5.x has immutable ShardRoutings (Immutable ShardRouting #17821), which lets us reuse ShardRouting objects across cluster states. The solution here, I think, is to look at how ClusterStateObserver is used and to recognize that keeping the full old cluster state around is not really needed (we only care about the version / status and cluster state identity of the old state).

  3. The third issue is that the pending task queue is unbounded. Here we could look at ways to constrain the number of tasks that can be submitted, possibly trading CPU time for memory; the second sketch after this list shows one way submissions could be bounded.
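For point 1, the following is a hypothetical, much-reduced illustration of the leak pattern: a batching map that keeps tasks alive even after their timeout fires, because the timeout handler only notifies the caller and never cleans up the map entry. Class and member names (BatchingLeakSketch, tasksPerBatchingKey, executeBatch) are invented for illustration and do not mirror the actual cluster service code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the leak: the batching map retains tasks past their timeout.
final class BatchingLeakSketch {

    static final class UpdateTask {
        final String source;
        final Object payload; // may transitively reference a large cluster state
        UpdateTask(String source, Object payload) { this.source = source; this.payload = payload; }
    }

    private final Map<Object, List<UpdateTask>> tasksPerBatchingKey = new HashMap<>();
    private final ScheduledExecutorService timeoutScheduler =
            Executors.newSingleThreadScheduledExecutor();

    synchronized void submit(Object batchingKey, UpdateTask task, long timeoutMillis) {
        tasksPerBatchingKey.computeIfAbsent(batchingKey, k -> new ArrayList<>()).add(task);
        timeoutScheduler.schedule(() -> onTimeout(task), timeoutMillis, TimeUnit.MILLISECONDS);
    }

    private void onTimeout(UpdateTask task) {
        // BUG (the leak): the caller is notified of the timeout, but the task is
        // never removed from tasksPerBatchingKey. If the batch never executes,
        // the map retains the task (and everything it references) indefinitely.
        System.out.println("timed out: " + task.source);
    }

    synchronized void executeBatch(Object batchingKey) {
        // Cleanup only happens here, which may be much later, or never.
        List<UpdateTask> batch = tasksPerBatchingKey.remove(batchingKey);
        if (batch != null) {
            batch.forEach(t -> System.out.println("executing: " + t.source));
        }
    }
}
```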
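And for point 3, one possible shape of a bounded queue, sketched under the assumption that rejecting a submission outright (and failing the request back to the client) is an acceptable policy. This illustrates the trade-off being discussed, not a change that was actually made:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.RejectedExecutionException;

// Sketch: cap the pending-task queue instead of letting it grow without bound.
final class BoundedSubmission {
    private final BlockingQueue<Runnable> pending = new ArrayBlockingQueue<>(1_000);

    void submit(Runnable task) {
        if (!pending.offer(task)) {
            // Callers must handle rejection, e.g. by failing the request back
            // to the client, instead of letting the master node go OOM.
            throw new RejectedExecutionException("pending cluster state update queue is full");
        }
    }
}
```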

Fixes:

@ywelsch ywelsch self-assigned this Nov 15, 2016
ywelsch added a commit that referenced this issue Dec 20, 2016
…e detection (#21631)

ClusterStateObserver is a utility class that simplifies interacting with the cluster state in cases where an action takes a decision based on the current cluster state but may want to wait for a new state and retry upon failure. The ClusterStateObserver implements its functionality by keeping a reference to the last cluster state that it observed. When a new ClusterStateObserver is created, it samples a cluster state from the cluster service which is subsequently used for change detection. If actions take a long time to process, however, the cluster observer can reference very old cluster states. Due to cluster observers being created very frequently and cluster states being potentially large, the referenced cluster states can waste a lot of heap space.

A specific example where this can make a node go out of memory is given in point 2 of issue #21568: the action listener in TransportMasterNodeAction.AsyncSingleAction has a ClusterStateObserver to coordinate the retry mechanism if the action on the master node fails due to the node no longer being master. The ClusterStateObserver in AsyncSingleAction keeps a reference to the full cluster state from when the action was initiated. If the pending tasks queue grows quite large and contains older items, many cluster states can be referenced at once.

This commit changes the ClusterStateObserver to hold only onto the part of the cluster state that's needed for change detection.
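The idea behind the fix, reduced to a minimal sketch: remember only the identity of the sampled state (e.g. its version and state UUID) rather than the state itself, so change detection works without pinning the large object on the heap. The class and field names below are assumptions for illustration, not the actual #21631 code.

```java
// Sketch: a lightweight stand-in for the full ClusterState, keeping only
// what change detection needs.
final class ObservedState {
    final long version;
    final String stateUUID;

    ObservedState(long version, String stateUUID) {
        this.version = version;
        this.stateUUID = stateUUID;
    }

    // A new state counts as "changed" if its identity differs from the sample;
    // the large ClusterState object itself is never retained between checks.
    boolean isDifferentFrom(long newVersion, String newStateUUID) {
        return newVersion != version || !newStateUUID.equals(stateUUID);
    }
}
```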
@amazinganshul

Will the changes be merged into the 1.7.4 branch as well? We recently faced the same issue. Our master nodes had heap configured at 8GB.

@jasontedor
Member

> Will the changes be merged into the 1.7.4 branch as well? We recently faced the same issue. Our master nodes had heap configured at 8GB.

No. The 1.7 series has reached end-of-life and will not see any more changes.

@clintongormley clintongormley added :Distributed/Distributed and removed :Cluster labels Feb 13, 2018
@bleskes
Contributor

bleskes commented Mar 20, 2018

@ywelsch can you advise on the state of things here? It seems that the unbounded nature of the queue hasn't been addressed (though it's tricky as to what we should do, imo).

@ywelsch
Contributor Author

ywelsch commented Mar 20, 2018

With the fixes done for 1) and 2), we have not received reports of this kind of OOM recently. As updates no longer hog memory, the remaining issues due to the unbounded nature of the pending cluster tasks queue have manifested as the master node becoming slow/unavailable/CPU-bound when running into a high number of pending tasks. Other open issues capture work that helps with avoiding or better handling a large number of tasks, so we can close this one.

@ywelsch ywelsch closed this as completed Mar 20, 2018