ILM shrink causes cluster to turn red #67957

Open
JohnLyman opened this issue Jan 25, 2021 · 2 comments
Labels
>bug :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team


Elasticsearch version (bin/elasticsearch --version):
Version: 7.7.1, Build: default/tar/ad56dce891c901a492bb1ee393f12dfff473a423/2020-05-28T16:30:01.040088Z, JVM: 14.0.1

Plugins installed: repository-s3

JVM version (java -version):
openjdk version "14.0.1" 2020-04-14
OpenJDK Runtime Environment AdoptOpenJDK (build 14.0.1+7)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 14.0.1+7, mixed mode, sharing)

OS version (uname -a if on a Unix-like system):
RHEL 7.9 / 3.10.0-1160.11.1.el7.x86_64

Description of the problem including expected versus actual behavior:
According to https://www.elastic.co/guide/en/elasticsearch/reference/master/indices-shrink-index.html:

"The node handling the shrink process must have sufficient free disk space to accommodate a second copy of the existing index."

However, when ILM chooses eligible nodes for the shrink process, it only considers nodes that have enough free space for one copy of all shards, not two:
https://github.com/elastic/elasticsearch/blob/v7.10.2/x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/SetSingleNodeAllocateStep.java#L94
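
To make the difference concrete, here is a minimal sketch of the two eligibility rules. It is not the code from SetSingleNodeAllocateStep; the function name, node names, and sizes are all hypothetical, and "headroom" is assumed to mean the bytes a node can still take before reaching the disk watermark.

```python
GIB = 1024 ** 3

def eligible_nodes(headroom_by_node, index_bytes, copies):
    """Return (sorted) the nodes whose headroom fits `copies` copies of the index.

    Hypothetical helper for illustration only, not an Elasticsearch API.
    """
    needed = index_bytes * copies
    return sorted(
        name for name, headroom in headroom_by_node.items() if headroom >= needed
    )

# Assumed cluster: node-c is close to the watermark, the others are not.
headroom = {"node-a": 500 * GIB, "node-b": 400 * GIB, "node-c": 60 * GIB}
index_bytes = 50 * GIB

one_copy = eligible_nodes(headroom, index_bytes, copies=1)    # rule described in this issue
two_copies = eligible_nodes(headroom, index_bytes, copies=2)  # rule the shrink docs imply

print(one_copy)    # ['node-a', 'node-b', 'node-c']
print(two_copies)  # ['node-a', 'node-b']
```

With the one-copy rule, node-c is still a candidate even though it cannot hold both the relocated shards and the new shrink-* index.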

I consider this a major bug because the allocator routinely picks a node close to the low watermark and moves all shards to that node. It then isn't able to allocate the new shrink-* index because it would put the node above the watermark. That results in the cluster turning red and requires manual intervention to remediate.

I know this issue would likely be solved by #63519, but that is not intended to be worked on "in the foreseeable future." I think my issue warrants a bug fix in the meantime. Having ILM routinely turn the cluster red is a major problem. This also seems like a much quicker fix than #63519, even if that work gets re-prioritized.

It looks like curator handles this scenario correctly. It adds up the size of all the primaries, multiplies by two, and adds a small amount of padding. See https://github.com/elastic/curator/blob/v5.8.2/curator/actions.py#L2252-L2253.
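
A rough sketch of that calculation, as I read it, follows. This is not curator's actual code; the function name and the 1 GiB padding are assumptions for illustration only.

```python
GIB = 1024 ** 3

def shrink_space_needed(primary_shard_bytes, padding_bytes):
    """Space a shrink node should have free: twice the primaries, plus padding.

    Hypothetical helper; the padding amount is an assumption, not curator's value.
    """
    return 2 * sum(primary_shard_bytes) + padding_bytes

# Assumed index: five primaries of 10 GiB each, with 1 GiB of padding.
primaries = [10 * GIB] * 5
needed = shrink_space_needed(primaries, padding_bytes=1 * GIB)

print(f"{needed / GIB:.0f} GiB")  # 101 GiB
```

A node would then be eligible only if its free space below the watermark is at least this value.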

I really don't want to switch back to curator when ILM seems poised to replace it.

Steps to reproduce:

This is difficult to reproduce since the allocator picks a random node after building the list of eligible nodes.

  1. Set up a three-node cluster, with shards balanced evenly between the nodes, but with one node close to the disk watermark and the others nowhere near it.
  2. Ingest data into an ILM-managed index.
  3. Wait for ILM to shrink an index.
  4. Hope it randomly picks the high-disk node to perform the shrinks in order to reproduce :)
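
As a back-of-the-envelope estimate of how many shrinks it might take to hit the problem, the sketch below assumes that every shrink picks uniformly at random among the eligible nodes and that picks are independent. Both are assumptions made here, not verified behavior, and the function name is hypothetical.

```python
def chance_of_hitting_node(eligible_node_count, shrink_count):
    """Probability that one specific node is picked at least once.

    Assumes uniform, independent random picks (an assumption, not verified).
    """
    miss_once = 1 - 1 / eligible_node_count
    return 1 - miss_once ** shrink_count

# Three eligible nodes, one of which is the node close to the watermark.
for shrinks in (1, 5, 10):
    print(shrinks, round(chance_of_hitting_node(3, shrinks), 3))
```

Under these assumptions a single shrink hits the high-disk node one time in three, so a cluster that shrinks indices regularly would be expected to run into the problem fairly quickly.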
@JohnLyman JohnLyman added >bug needs:triage Requires assignment of a team area label labels Jan 25, 2021
@jtibshirani jtibshirani added the :Data Management/ILM+SLM Index and Snapshot lifecycle management label Jan 26, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Jan 26, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@jtibshirani jtibshirani removed the needs:triage Requires assignment of a team area label label Jan 26, 2021
@joegallo
Contributor

joegallo commented May 4, 2021

Related to #56062 because this could be one of the potential failure cases mentioned there.
