[ML] Job fail to start with "Invalid alias name [.ml-state-write] ..." #58482

hendrikmuhs · 2020-06-24T10:35:02Z

Affected version: 7.7 -

Problem

.ml-state-write is supposed to be an index alias, however by accident it can become an index. If .ml-state-write is a concrete index instead of an alias, starting a job can fail due to index rollover introduced in #52356.

The reason for .ml-state-write being an index instead of an alias is explained in #57645

From 7.9 the job fails with: Detected a problem with the internal machine learning data: the state index alias ... exists as index but must be an alias.
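A quick way to tell which state a cluster is in is to inspect the response of a get-index request that covers the ML state indices. The following is a minimal sketch, not part of the ML codebase: `classify_state_write` is a hypothetical helper, and it assumes the input is the parsed JSON body of such a request (for example `GET /.ml-state*`), that is, a mapping from concrete index name to an object with an "aliases" key.

```python
from typing import Any, Dict

STATE_WRITE_ALIAS = ".ml-state-write"


def classify_state_write(
    get_index_response: Dict[str, Any], name: str = STATE_WRITE_ALIAS
) -> str:
    """Return "index", "alias", or "missing" for `name`.

    Assumption: `get_index_response` maps each concrete index name to a
    dict that contains an "aliases" mapping, as the get-index API returns.
    """
    # A top-level key is always a concrete index name, never an alias.
    if name in get_index_response:
        return "index"
    # Otherwise look for the name among the aliases of each index.
    for index_info in get_index_response.values():
        if name in index_info.get("aliases", {}):
            return "alias"
    return "missing"
```

A result of "index" corresponds to the broken setup described in this issue; "alias" is the expected state.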

Mitigation

- If you are OK with re-creating the ML models, you can simply delete .ml-state-write.
- If you want to preserve state, reindex .ml-state-write to .ml-state:

#---
# reindex
# - https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
#---
POST _reindex
{
  "source": {
    "index": ".ml-state-write",
    "size": 100
  },
  "dest": {
    "index": ".ml-state"
  }
}

After the successful reindex, delete the old index and create an alias:

#---
# delete .ml-state-write
#---
DELETE /.ml-state-write
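The alias-creation request itself is not shown in this thread. As a hedged sketch, the body for the aliases API (`POST /_aliases`) could be built as below. `build_alias_request` is a hypothetical helper; the default target `.ml-state-000001` follows the advice given in a later comment on this issue and assumes that index already exists. Any extra alias options ML itself may set are not covered here.

```python
import json
from typing import Any, Dict


def build_alias_request(
    index: str = ".ml-state-000001", alias: str = ".ml-state-write"
) -> Dict[str, Any]:
    """Build a body for `POST /_aliases` that points `alias` at `index`.

    Assumption: `index` exists and no concrete index named `alias`
    remains (it was deleted in the previous step).
    """
    return {"actions": [{"add": {"index": index, "alias": alias}}]}


if __name__ == "__main__":
    # Print the request in the same console style used in this issue.
    print("POST /_aliases")
    print(json.dumps(build_alias_request(), indent=2))
```

The function only builds the payload; sending it is left to whichever client or console is already in use.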

Now you should be able to start the jobs.

Solution

Issues #57645 and #55267 discuss ways to prevent .ml-state-write from being created as an index. This will solve the root cause of this issue.

For users who already have a .ml-state-write index by mistake, this won't help. Because reindexing is an expensive operation, doing it automatically in the background is not an option.

2 possible improvements I can think of:

A: improve log message

The log message isn't very descriptive and doesn't help users find a solution quickly. We can improve the message (concrete wording to be discussed): "Expected [.ml-state-write] to be an alias but it is an index, can't start the job. Please reindex [.ml-state-write] to [.ml-state]". It's not possible to write full instructions in a log message, but since the message text appears in this issue, users who search for it should find these instructions.

~~B: do not use ILM if ml-state-write is an index~~

We could be lenient and simply fall back to the old non-ILM way. We added ILM for a reason, so this solution is questionable; however, this only concerns 7.x. For upgrading to 8.0 we can require using an update tool and reindex as part of the migration to 8, so eventually the state index will be managed. This solution requires that a .ml-state-write index does not cause problems in other parts of the code.

elasticmachine · 2020-06-24T10:35:04Z

Pinging @elastic/ml-core (:ml)

hendrikmuhs · 2020-07-02T08:19:48Z

We have discussed this issue and decided to improve the log message. For the permanent fix we follow up on #55267.

.ml-state-write is supposed to be an index alias, but by accident it can become an index. If .ml-state-write is a concrete index instead of an alias, ML stops working. This change improves error handling by setting the job to failed and properly logging and auditing the problem. The user still has to fix the problem manually, but the change should lead to a quicker resolution. fixes #58482

kunisen · 2020-10-29T09:04:36Z

Maybe we should also consider the ILM behavior.
After the reindex, we get .ml-state, whose rollover_alias is .ml-state-write.
But we also have an auto-created .ml-state-000001 with the same rollover_alias.
This can cause a conflict, and the .ml-state index (the one we created via reindex) reports an ILM error:

{
  "indices" : {
    ".ml-state" : {
      "index" : ".ml-state",
      "managed" : true,
      "policy" : "ml-size-based-ilm-policy",
      "lifecycle_date_millis" : 1603784027940,
      "age" : "2.06d",
      "phase" : "hot",
      "phase_time_millis" : 1603960894945,
      "action" : "rollover",
      "action_time_millis" : 1603784495507,
      "step" : "ERROR",
      "step_time_millis" : 1603961494945,
      "failed_step" : "check-rollover-ready",
      "is_auto_retryable_error" : true,
      "failed_step_retry_count" : 147,
      "step_info" : {
        "type" : "illegal_argument_exception",
        "reason" : "index.lifecycle.rollover_alias [.ml-state-write] does not point to index [.ml-state]",
        "stack_trace" : """java.lang.IllegalArgumentException: index.lifecycle.rollover_alias [.ml-state-write] does not point to index [.ml-state]
	at org.elasticsearch.xpack.core.ilm.WaitForRolloverReadyStep.evaluateCondition(WaitForRolloverReadyStep.java:114)
	at org.elasticsearch.xpack.ilm.IndexLifecycleRunner.runPeriodicStep(IndexLifecycleRunner.java:174)
	at org.elasticsearch.xpack.ilm.IndexLifecycleService.triggerPolicies(IndexLifecycleService.java:329)
	at org.elasticsearch.xpack.ilm.IndexLifecycleService.triggered(IndexLifecycleService.java:267)
	at org.elasticsearch.xpack.core.scheduler.SchedulerEngine.notifyListeners(SchedulerEngine.java:183)
	at org.elasticsearch.xpack.core.scheduler.SchedulerEngine$ActiveSchedule.run(SchedulerEngine.java:211)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:832)
"""

To work around this, we need to remove the ILM settings from the index:

PUT .ml-state/_settings
{
  "index.lifecycle.name": null,
  "index.lifecycle.rollover_alias": null
}
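To find every index affected by this error rather than checking one by one, the explain output can be scanned programmatically. This is a minimal sketch under stated assumptions: `find_rollover_alias_errors` is a hypothetical helper, the input is assumed to be the parsed JSON of an ILM explain response shaped like the output quoted above (for example from `GET .ml-state*/_ilm/explain`), and only the `step` and `failed_step` fields seen in that output are used.

```python
from typing import Any, Dict, List


def find_rollover_alias_errors(explain_response: Dict[str, Any]) -> List[str]:
    """Return the names of indices stuck in the ILM ERROR step after
    failing `check-rollover-ready`, sorted alphabetically.

    Assumption: `explain_response` has an "indices" mapping whose values
    carry "step" and "failed_step", as in the output quoted above.
    """
    stuck = []
    for name, info in explain_response.get("indices", {}).items():
        if (
            info.get("step") == "ERROR"
            and info.get("failed_step") == "check-rollover-ready"
        ):
            stuck.append(name)
    return sorted(stuck)
```

Each name returned is a candidate for the settings update shown above; whether to release a given index from ILM remains a manual decision.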

Thoughts?

droberts195 · 2020-11-02T10:24:42Z

> After reindex, we will get .ml-state which has rollover_alias to be .ml-state-write.

Yes, you are correct, this is a problem. After deleting the concrete .ml-state-write index, the advice should be to create the .ml-state-write alias against .ml-state-000001, not .ml-state. Also, as you say, the .ml-state index created for the repair needs to be released from ILM.

kunisen · 2020-11-04T00:09:25Z

Thanks @droberts195 for the comment! All clear now!

hendrikmuhs added >bug :ml Machine learning labels Jun 24, 2020

hendrikmuhs added the team-discuss label Jun 24, 2020

hendrikmuhs mentioned this issue Jul 3, 2020

[ML] handle broken setup with state alias being an index #58999

Merged

hendrikmuhs closed this as completed in #58999 Jul 3, 2020
