[ML] Job fail to start with "Invalid alias name [.ml-state-write] ..." #58482

Closed
hendrikmuhs opened this issue Jun 24, 2020 · 5 comments · Fixed by #58999

@hendrikmuhs
Contributor

hendrikmuhs commented Jun 24, 2020

Affected versions: 7.7 and later

Problem

.ml-state-write is supposed to be an index alias, but it can accidentally be created as a concrete index. If .ml-state-write is a concrete index instead of an alias, starting a job can fail because of the index rollover introduced in #52356.

The reason .ml-state-write can end up as an index instead of an alias is explained in #57645.

From 7.9 the job fails with: Detected a problem with the internal machine learning data: the state index alias ... exists as index but must be an alias.
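
One way to tell the two states apart is to look at the response of GET /.ml-state-write: Elasticsearch keys that response by concrete index name, so if .ml-state-write itself appears as a key it is a concrete index, while an alias shows up under the "aliases" of some other index. The following is a minimal Python sketch of that check, assuming the response has already been fetched and parsed into a dict. The function name and the trimmed sample responses are illustrative and are not taken from the Elasticsearch code base.

```python
from typing import Any, Dict

STATE_WRITE = ".ml-state-write"


def classify_state_write(get_index_response: Dict[str, Any], name: str = STATE_WRITE) -> str:
    """Classify `name` from a parsed `GET /<name>` response.

    Returns "index" if `name` is a concrete index, "alias" if it is an alias
    on one or more other indices, and "missing" if neither is found.
    """
    # The response is keyed by concrete index name, so seeing the name itself
    # as a top-level key means it exists as a real index (the broken state).
    if name in get_index_response:
        return "index"
    # Otherwise look for the name among the aliases of the returned indices.
    for index_body in get_index_response.values():
        if name in index_body.get("aliases", {}):
            return "alias"
    return "missing"


# Hypothetical, trimmed responses for the two situations.
broken = {".ml-state-write": {"aliases": {}}}
healthy = {".ml-state-000001": {"aliases": {".ml-state-write": {}}}}

print(classify_state_write(broken))   # index
print(classify_state_write(healthy))  # alias
```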

Mitigation

  • if you are ok with re-creating ML models you can delete .ml-state-write
  • if you want to preserve state:
    • reindex .ml-state-write to .ml-state:
#---
# reindex
# - https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
#---
POST _reindex
{
  "source": {
    "index": ".ml-state-write",
    "size": 100
  },
  "dest": {
    "index": ".ml-state"
  }
}

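After the reindex succeeds, delete the old index so that .ml-state-write can be recreated as an alias (the alias step is not shown here; the comments below discuss which index the alias should point to):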

#---
# delete .ml-state-write
#---
DELETE /.ml-state-write

Now you should be able to start the jobs.
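
Because the delete step cannot be undone, it may be worth confirming that the reindex really completed before running it. The sketch below checks the standard fields of a reindex response (timed_out, failures, version_conflicts, total, created, updated). It assumes the reindex ran synchronously and that its JSON response was parsed into a dict; the function name and the sample values are made up for illustration.

```python
from typing import Any, Dict


def reindex_completed_cleanly(response: Dict[str, Any]) -> bool:
    """Return True if a parsed `POST _reindex` response reports no problems."""
    if response.get("timed_out", False):
        return False
    if response.get("failures"):
        return False
    if response.get("version_conflicts", 0) != 0:
        return False
    # Every source document should have been written to the destination.
    written = response.get("created", 0) + response.get("updated", 0)
    return written == response.get("total", 0)


# Hypothetical responses: one clean run and one that stopped part-way.
clean = {"timed_out": False, "total": 120, "created": 120, "updated": 0,
         "version_conflicts": 0, "failures": []}
partial = {"timed_out": False, "total": 120, "created": 100, "updated": 0,
           "version_conflicts": 0, "failures": [{"index": ".ml-state"}]}

print(reindex_completed_cleanly(clean))    # True
print(reindex_completed_cleanly(partial))  # False
```

Only when the check passes would it be safe to go on and delete .ml-state-write.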

Solution

Issues #57645 and #55267 discuss ways to prevent .ml-state-write from being created as an index. That will fix the root cause of this issue.

It won't help users who already have an .ml-state-write index by mistake. Because reindexing is an expensive operation, reindexing automatically in the background is not an option.

Two possible improvements I can think of:

A: improve log message

The current log message isn't very descriptive and doesn't help users find a solution quickly. We can improve it (exact wording to be discussed): "Expected [.ml-state-write] to be an alias but it is an index, can't start the job. Please reindex [.ml-state-write] to [.ml-state]". It's not possible to put full instructions in a log message, but since the message text appears in this issue, users who search for it should find the instructions here.

B: do not use ILM if .ml-state-write is an index

We could be lenient and simply fall back to the old non-ILM behaviour. We added ILM for a reason, which makes this solution questionable, but this only concerns 7.x. For the upgrade to 8.0 we can require an upgrade tool and reindex as part of the migration, so eventually the state index will be managed. This solution requires that an .ml-state-write index does not cause problems in other parts of the code.

hendrikmuhs added the >bug and :ml (Machine learning) labels Jun 24, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@hendrikmuhs
Contributor Author

We have discussed this issue and decided to improve the log message. For the permanent fix we will follow up in #55267.

hendrikmuhs pushed a commit that referenced this issue Jul 3, 2020
.ml-state-write is supposed to be an index alias, however by accident it can become an index. If
.ml-state-write is a concrete index instead of an alias, ML stops working. This change improves error
handling by setting the job to failed and properly logging and auditing the problem. The user still has to
fix the problem manually. This change should lead to a quicker resolution of the problem.

fixes #58482
@kunisen
Contributor

kunisen commented Oct 29, 2020

Maybe we should also consider the ILM behavior.
After reindex, we will get .ml-state which has rollover_alias to be .ml-state-write.
But we will also have an automatically created .ml-state-000001, which has the same rollover_alias.
This causes a conflict, and the .ml-state index (the one we created via reindex) will report an ILM error:

{
  "indices" : {
    ".ml-state" : {
      "index" : ".ml-state",
      "managed" : true,
      "policy" : "ml-size-based-ilm-policy",
      "lifecycle_date_millis" : 1603784027940,
      "age" : "2.06d",
      "phase" : "hot",
      "phase_time_millis" : 1603960894945,
      "action" : "rollover",
      "action_time_millis" : 1603784495507,
      "step" : "ERROR",
      "step_time_millis" : 1603961494945,
      "failed_step" : "check-rollover-ready",
      "is_auto_retryable_error" : true,
      "failed_step_retry_count" : 147,
      "step_info" : {
        "type" : "illegal_argument_exception",
        "reason" : "index.lifecycle.rollover_alias [.ml-state-write] does not point to index [.ml-state]",
        "stack_trace" : """java.lang.IllegalArgumentException: index.lifecycle.rollover_alias [.ml-state-write] does not point to index [.ml-state]
	at org.elasticsearch.xpack.core.ilm.WaitForRolloverReadyStep.evaluateCondition(WaitForRolloverReadyStep.java:114)
	at org.elasticsearch.xpack.ilm.IndexLifecycleRunner.runPeriodicStep(IndexLifecycleRunner.java:174)
	at org.elasticsearch.xpack.ilm.IndexLifecycleService.triggerPolicies(IndexLifecycleService.java:329)
	at org.elasticsearch.xpack.ilm.IndexLifecycleService.triggered(IndexLifecycleService.java:267)
	at org.elasticsearch.xpack.core.scheduler.SchedulerEngine.notifyListeners(SchedulerEngine.java:183)
	at org.elasticsearch.xpack.core.scheduler.SchedulerEngine$ActiveSchedule.run(SchedulerEngine.java:211)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:832)
"""

To work around this, we need to remove the ILM settings from the index:

PUT .ml-state/_settings
{
  "index.lifecycle.name": null,
  "index.lifecycle.rollover_alias": null
}
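
To find which indices need this treatment, the ILM explain output shown above can be scanned for entries that sit in the ERROR step after failing check-rollover-ready. The following Python sketch does that on an already-parsed response and prints the matching settings request for each hit. The function name and the trimmed sample data are illustrative assumptions; only the field names and the settings body come from this comment.

```python
import json
from typing import Any, Dict, List


def indices_stuck_on_rollover(ilm_explain: Dict[str, Any]) -> List[str]:
    """Return indices whose ILM rollover check ended in the ERROR step."""
    stuck = []
    for name, info in ilm_explain.get("indices", {}).items():
        if info.get("step") == "ERROR" and info.get("failed_step") == "check-rollover-ready":
            stuck.append(name)
    return sorted(stuck)


# Settings body from the workaround: a null value removes the setting.
RELEASE_FROM_ILM = {
    "index.lifecycle.name": None,
    "index.lifecycle.rollover_alias": None,
}

# Trimmed, hypothetical explain response with one failing and one healthy index.
sample = {
    "indices": {
        ".ml-state": {"managed": True, "step": "ERROR",
                      "failed_step": "check-rollover-ready"},
        ".ml-state-000001": {"managed": True, "step": "check-rollover-ready"},
    }
}

for name in indices_stuck_on_rollover(sample):
    print(f"PUT {name}/_settings")
    print(json.dumps(RELEASE_FROM_ILM, indent=2))
```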

Thoughts?

@droberts195
Contributor

After reindex, we will get .ml-state which has rollover_alias to be .ml-state-write.

Yes, you are correct, this is a problem. After deleting the concrete .ml-state-write index, the advice should be to create the .ml-state-write alias against .ml-state-000001, not .ml-state. Also, as you say, the .ml-state index created for the repair needs to be released from ILM.
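
Putting the thread together, the corrected repair could be sketched as an ordered list of requests. The Python below only builds and prints them in Console form and does not talk to a cluster. It assumes that .ml-state-000001 already exists, that the alias is added through the standard _aliases API with a plain "add" action, and that releasing .ml-state from ILM last is acceptable; the helper name and the ordering of the last two steps are assumptions, not something confirmed in this issue.

```python
import json
from typing import Any, Dict, List, Optional, Tuple

Request = Tuple[str, str, Optional[Dict[str, Any]]]


def repair_requests(alias: str = ".ml-state-write",
                    reindex_target: str = ".ml-state",
                    ilm_index: str = ".ml-state-000001") -> List[Request]:
    """Build the repair steps as (method, path, body) tuples, in order."""
    return [
        # 1. Copy the state documents out of the accidental concrete index.
        ("POST", "/_reindex", {"source": {"index": alias, "size": 100},
                               "dest": {"index": reindex_target}}),
        # 2. Remove the concrete index so the name is free for an alias.
        ("DELETE", f"/{alias}", None),
        # 3. Point the alias at the ILM-managed index, not at the repair index.
        ("POST", "/_aliases", {"actions": [{"add": {"index": ilm_index,
                                                    "alias": alias}}]}),
        # 4. Release the repair index from ILM so it stops reporting errors.
        ("PUT", f"/{reindex_target}/_settings",
         {"index.lifecycle.name": None, "index.lifecycle.rollover_alias": None}),
    ]


for method, path, body in repair_requests():
    print(method, path)
    if body is not None:
        print(json.dumps(body, indent=2))
```

Each step should be checked before moving on to the next, in particular the reindex before the delete.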

@kunisen
Contributor

kunisen commented Nov 4, 2020

Thanks @droberts195 for the comment! All clear now!
