
Upgrade Assistant - Phase 2 - Reindexing #26368

Closed
17 of 19 tasks
joshdover opened this issue Nov 28, 2018 · 14 comments
Labels: Team:Operations, v6.7.0

@joshdover (Contributor) commented Nov 28, 2018

As part of Phase 2 of #20890, we need to add a UI and state layer to allow users to reindex old indices (created before 6.x) in order to be compatible with 7.0.

Left to Implement

In first PR:

  • Add confirmation textbox for destructive changes
  • Handle conflicting index names
  • Design cleanup
  • Ensure indices are writable if reindexing fails
  • Handle pausing ML jobs when reindexing ML indices
  • Stop/start Watcher when reindexing .watches

In follow-up PR(s):

Other nice-to-haves:


Details

This feature will be similar in flow to the upgrade assistant in 5.6 and will:

  • Make the old index read-only
  • Create new index with the same settings and mappings
  • Begin the reindexing using the Reindex API
  • Wait for reindex to finish
  • Alias old index name to point to new index and delete old index
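The five steps above can be sketched against Elasticsearch's REST endpoints. The `EsRequest` wrapper, index names, and missing error handling are assumptions for illustration, not the eventual implementation (which will also copy settings and mappings, elided here):

```typescript
// Minimal request function shape (method + path + optional body), similar to
// what Kibana's cluster client exposes; this signature is an assumption.
type EsRequest = (method: string, path: string, body?: object) => Promise<any>;

async function reindexFlow(es: EsRequest, oldIndex: string, newIndex: string) {
  // 1. Make the old index read-only.
  await es('PUT', `/${oldIndex}/_settings`, {
    index: { 'blocks.write': true },
  });

  // 2. Create the new index. (Fetching the old index's settings and mappings
  // and filtering out the non-copyable ones is elided in this sketch.)
  await es('PUT', `/${newIndex}`);

  // 3. Start the reindex as an async task via the Reindex API.
  const { task } = await es('POST', '/_reindex?wait_for_completion=false', {
    source: { index: oldIndex },
    dest: { index: newIndex },
  });

  // 4. Poll the Tasks API until the reindex finishes.
  // (A real implementation would wait between polls.)
  let completed = false;
  while (!completed) {
    ({ completed } = await es('GET', `/_tasks/${task}`));
  }

  // 5. Atomically point the old name at the new index and delete the old one.
  await es('POST', '/_aliases', {
    actions: [
      { add: { index: newIndex, alias: oldIndex } },
      { remove_index: { index: oldIndex } },
    ],
  });
}
```

Because the flow only depends on the request function, it can be exercised against a fake client that records calls, without a live cluster.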

One issue with this flow last time was persistence: almost all of the logic was driven by client-side code, so if you left the page in the browser the process would stop. This time around we want to persist the reindex process in a saved object and leverage the Task Manager (#24356) to poll Elasticsearch's Task API (naming is fun) for the status of the reindex task and resume the flow once the reindex is done. We've decided to persist this state in saved objects that we will update using optimistic concurrency. We are going to break this work into two parts: first get this working ONLY while the browser is on the page, and then, if we have time, add a worker that can handle it in the background. We should also be able to offer a reindex progress indicator and the ability to abort or reset a reindex process.

Browser-driven iteration

For each reindex operation, we will create a saved object that acts as a state-machine to track the steps of the reindex process. To update this object, we will utilize the version parameter in Elasticsearch to ensure that there are not two browser tabs (or workers) attempting to update the object simultaneously.
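A hypothetical shape for that saved object and its step guard follows; the field names and step names here are illustrative, not the shipped schema. The `version` field is the Elasticsearch document version that gets passed back on every update, so that two browser tabs (or workers) cannot both advance the same operation:

```typescript
// Ordered steps of the reindex state machine (names are assumptions).
enum ReindexStep {
  Created,
  ReadonlySet,
  NewIndexCreated,
  ReindexStarted,
  ReindexCompleted,
  AliasSwitched,
  Done,
}

interface ReindexOperation {
  indexName: string;
  newIndexName: string;
  step: ReindexStep;
  // Elasticsearch document version, used for optimistic concurrency: an
  // update with a stale version is rejected, so only one writer advances
  // the operation.
  version: number;
}

// Steps may only advance one at a time, in order.
function canAdvance(op: ReindexOperation, next: ReindexStep): boolean {
  return next === op.step + 1;
}
```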

Reindex flow:

  • User clicks "reindex", browser makes API call to server to begin reindexing for the given index.
  • Server creates a saved object to track this reindex operation with a status. Begins the first steps of the reindex: set old index as readonly, create new index, start the reindex operation. For each step of the way, we update the saved object's status field to track the state machine.
  • While the browser tab is on the Upgrade Assistant page, the browser will continue to poll for known reindexes in progress.
  • Once the reindex has finished, the server will complete the reindex process: alias the new index, delete the old index, mark the reindex operation as completed.

If the user leaves the page while the browser is polling, the alias switchover will not complete until they return to the upgrade assistant.

Worker-driven iteration

Largely the same flow, but we will have an in-process worker on the server side that will look for in-progress reindex operations and continue to poll for their completion.

To reduce overhead from polling Elasticsearch, we could boot up this worker only if there are any known reindexes in progress. This check will be done at startup and when a new reindex operation is started.

Potential problem:

  • kibana1 starts up, no reindex operations in progress, does not start worker.
  • kibana2 starts up, receives request to start reindex operation, starts worker.
  • kibana2 crashes before reindex is complete
  • kibana1 never starts worker, reindex operation is not shown as completed (and aliases not swapped over).

We could address this issue by either:

  • Polling for in-progress reindex operation saved objects on a regular but infrequent basis (say, every 5 minutes). If a new one is found, start polling its progress frequently (every 10s).
  • Polling for in-progress reindex operation saved objects whenever the user visits the Upgrade Assistant.
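The first option amounts to a two-speed poll: an infrequent sweep to discover in-progress operations, and a fast poll while any are known. A sketch of the cadence decision, using the intervals suggested above (the function itself is hypothetical):

```typescript
const SWEEP_INTERVAL_MS = 5 * 60 * 1000; // infrequent sweep: every 5 minutes
const FAST_POLL_MS = 10 * 1000; // fast poll while work is in progress: every 10s

// Decide how long to wait before the next check, based on how many
// in-progress reindex operations the last sweep found.
function nextPollDelay(inProgressCount: number): number {
  return inProgressCount > 0 ? FAST_POLL_MS : SWEEP_INTERVAL_MS;
}
```

This way a Kibana instance that started no reindexes still notices one started elsewhere within five minutes, which covers the kibana1/kibana2 crash scenario above.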

Known Unknowns

  • Which settings should be copied from the original index to the new index? So far, I know these cannot be copied:
    • index.uuid
    • index.creation_date
    • index.version.created
    • index.version.upgraded
    • index.provided_name
    • index.blocks
    • index.legacy
  • Can we intelligently block the user from using this tool for large indices? If so, how do we decide this? Can ES's reindex API tell us whether or not this process should succeed?
  • UI Design
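The settings list above can be applied as a filter before creating the new index. A sketch, assuming the settings come back as flattened dotted keys (as with `flat_settings`); the list is taken from this issue and may be incomplete:

```typescript
// Settings known (so far) to be non-copyable to the new index.
const UNCOPYABLE_SETTINGS = [
  'index.uuid',
  'index.creation_date',
  'index.version.created',
  'index.version.upgraded',
  'index.provided_name',
  'index.blocks',
  'index.legacy',
];

// Return only the settings that are safe to copy, dropping each blocked
// key and anything nested under it (e.g. index.blocks.write).
function copyableSettings(settings: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(settings).filter(
      ([key]) => !UNCOPYABLE_SETTINGS.some((s) => key === s || key.startsWith(`${s}.`))
    )
  );
}
```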

Possible Improvements

  • Should we offer an option to reindex many small indices in a single action (done in serial, not in parallel)?
@joshdover joshdover added Team:Operations Team label for Operations Team v6.7.0 labels Nov 28, 2018
@joshdover joshdover self-assigned this Nov 28, 2018
@elasticmachine (Contributor)

Pinging @elastic/kibana-operations

@alexfrancoeur

@joshdover whenever you have a UI around this and would like some feedback, let me know and I can take a look.

@joshdover (Contributor, Author)

I've updated the issue to include our plan to use saved objects + browser-driven polling for the first iteration and how we'll add background polling in the second if time permits.

cc @epixa @tylersmalley


@alexfrancoeur Will do!

@joshdover (Contributor, Author)

@droberts195 Can you add information to this ticket about any special handling that ML indices need during the reindex? As of right now, I know that ML jobs will need to be paused while reindexing and then resumed.

  • How do I identify an ML index?
  • How do I identify which jobs are indexing into that index?
  • Which APIs would I use to pause and resume jobs?

If there's anything else that needs to be handled, please add that here as well.

@joshdover joshdover changed the title Add reindex feature to Upgrade Assistant Upgrade Assistant - Phase 2 - Reindexing Jan 14, 2019
@droberts195 (Contributor)

@joshdover we started off along the path of upgrading ML indices without pausing the ML jobs - elastic/elasticsearch#36643. This is more complex but nicer for users who are running real-time anomaly detection and have large ML indices that date back to 5.x. If we were to continue along that path, the Kibana-side logic would not reindex ML indices using the Kibana functionality, but would instead call that endpoint. The idea of pausing jobs by cancelling allocations of ML persistent tasks only came up last week. We'll decide within the next couple of days whether to switch to that approach.

  • How do I identify an ML index?

They all start with .ml-.

  • How do I identify which jobs are indexing into that index?

We have different types of ML indices:

  1. Results (pattern .ml-anomalies*)
  2. State (.ml-state)
  3. Metadata (.ml-meta)
  4. Notifications (.ml-notifications)
  5. Config (.ml-config - cannot possibly need reindexing in 6.7 as it was only added in 6.6)
  6. Annotations (.ml-annotations - cannot possibly need reindexing in 6.7 as it was only added in 6.6)

.ml-state, .ml-meta and .ml-notifications are shared by all jobs.

.ml-meta and .ml-notifications are small, infrequently written, and failure to write to them won't cause running jobs to fail, so I think they can just be reindexed using the standard migration assistant procedure.

Reindexing .ml-state would require all jobs to be paused while it is reindexed. But if we carry on along the ML upgrade endpoint path instead then the UI should not allow the standard migration procedure to run against it, but instead call the ML upgrade endpoint.

For the results indices, .ml-anomalies*, there is an alias for each job that points to its results index. You could use these aliases to work out which jobs are using each index. (Also, all these aliases need to be switched over to the new index after reindexing is complete - does the migration assistant already switch over arbitrary amounts of existing aliases?) The work already done in elastic/elasticsearch#36643 can handle migration of these indices and aliases while jobs remain running, so if we continue along that path then the UI should not allow the standard migration procedure to run against any index matching .ml-anomalies*, but instead call the ML upgrade endpoint.

  • Which APIs would I use to pause and resume jobs?

There are no APIs to do this currently. If we decide to switch from online upgrade to pause/resume upgrade then we'll have to add these APIs into 6.7.

Given the work that's been done so far I'm not convinced that the pause/resume option is the easiest way forward.

To summarise there are two ways forward:

  1. Continue with the ML migration endpoint. Kibana disallows standard migration for .ml-state* and .ml-anomalies* and if either or both is from 5.x then calls the ML migration endpoint instead.
  2. Add ES endpoints to pause and resume ML jobs. Standard Kibana index migration is used for .ml-state* and .ml-anomalies*, pausing jobs before the reindex and resuming them after.

In either case, .ml-meta and .ml-notifications can be reindexed using the standard procedure.

@joshdover (Contributor, Author)

@droberts195 Thanks for writing this up. I think the best course for us right now is to wait on your decision and then jump on a video call to work out the details, depending on which path the ML team decides to move forward with.

From my perspective, it may actually be simpler for Kibana to use the ML-specific reindexing endpoint rather than pausing/resuming jobs. I think it's most likely too late for this upgrade cycle, but we should probably explore using this approach with other user indices in the 8.0 upgrade cycle. If we can accomplish zero-downtime reindexing that would be great for many use-cases.

Also, all these aliases need to be switched over to the new index after reindexing is complete - does the migration assistant already switch over arbitrary amounts of existing aliases?

This is not something that is handled right now by the Upgrade Assistant and actually something we hadn't considered. I'm going to take a look at this today and see how the current logic would behave when reindexing an index that already has an alias. I agree that moving any aliases should be handled by the Upgrade Assistant.

@droberts195 (Contributor)

@joshdover I spoke to @bleskes this morning and we're going to go with the pause/resume option. We're going to discuss exactly how in the ES distributed team's weekly meeting tomorrow, so I'll update this issue after that. @benwtrent will probably do work for this.

I'm going to take a look at this today and see how the current logic would behave when reindexing an index that already has an alias

I'm surprised that no customers complained about that in the 5.6 to 6.x upgrade. It should be possible to add arbitrarily many aliases to the new index in the same operation where you delete the old index. It would be similar to what's in the "It is also possible to swap an index with an alias in one operation" example in https://www.elastic.co/guide/en/elasticsearch/reference/6.x/indices-aliases.html, but you can have many add actions in the same request so as well as adding an alias to the new index with the same name as the old index you could add additional aliases to the new index to replace all the aliases that the old index had.
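The "many add actions in one request" idea can be sketched as a small builder for the `actions` array of the Update Aliases API: alias the old name to the new index, re-add every alias the old index had, and delete the old index, all in one atomic call. The function name is an assumption:

```typescript
// Build the actions for a single atomic POST /_aliases request that swaps
// an old index out for its reindexed replacement.
function buildAliasActions(
  oldIndex: string,
  newIndex: string,
  existingAliases: string[]
): object[] {
  return [
    // The old index name becomes an alias to the new index.
    { add: { index: newIndex, alias: oldIndex } },
    // Every alias that pointed at the old index is re-created on the new one.
    ...existingAliases.map((alias) => ({ add: { index: newIndex, alias } })),
    // Deleting the old index in the same request makes the swap atomic.
    { remove_index: { index: oldIndex } },
  ];
}
```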

@joshdover (Contributor, Author)

@droberts195 @benwtrent Here's the plan I went over with Ben yesterday, written out for clarity:

  • ML will provide two APIs in Elasticsearch. One will stop/pause all ML jobs and the other will resume/restart all ML jobs.
  • When Kibana is reindexing any .ml-state* or .ml-anomalies* indices it will:
    • Call the ML stop endpoint in ES
    • Set the index as read only
    • Reindex the data, with no transformations.
    • Create an alias from the original index name to the new index name, copy any aliases pointing to the old index over to the new index, and delete the old index. This will all happen in a single atomic Update Aliases call.
    • Resume the ML jobs only if this is the last ML index still being reindexed. If there are others in progress, this step will be skipped so that only the last index to finish resumes the ML jobs.

Note, with this plan, we are not pausing/resuming specific ML jobs, but instead pausing and resuming all ML jobs. If we need to target specific jobs we could, but I'm not sure that optimization is needed at this time.
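The resume decision in the last step can be sketched as a small predicate: only the final ML index to finish reindexing turns the jobs back on. The in-progress list would be read from the reindex saved objects; the helper name is an assumption:

```typescript
// Decide whether finishing `indexName` should resume the ML jobs: resume
// only when no OTHER ML reindex operation is still in progress.
function shouldResumeMlJobs(indexName: string, inProgressMlIndices: string[]): boolean {
  return inProgressMlIndices.every((name) => name === indexName);
}
```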

@droberts195 (Contributor)

Thanks @joshdover that plan sounds good to me.

The pause/resume endpoints we're thinking of using at the moment are:

_ml/set_upgrade_mode?enabled=true
_ml/set_upgrade_mode?enabled=false

These still aren't implemented so it's possible someone will object to that naming and we'll have to change it, but the difficulty in calling the endpoints will not be any higher than that.

@joshdover (Contributor, Author)

Great! @benwtrent is there a PR to follow for this? I didn't see one when I briefly poked around the ES repo. Also, with this API, is it guaranteed that Kibana can set indices to read-only as soon as we've gotten a response back?

@benwtrent (Member)

@joshdover I am currently writing tests for the API. The PR should be opened this week or early next week.

Yes, once the API returns, the indices can be set to read-only and reindexing can begin.

@benwtrent (Member)

@joshdover PR: elastic/elasticsearch#37837

It's a biggie; there's a lot of stuff going on to enable this change. Some reviewers should take a gander tomorrow/Monday, and hopefully it will be finished early next week :)

@benwtrent (Member) commented Jan 28, 2019

elastic/elasticsearch#37942

This PR fixes a bug with the set-upgrade-mode API. Apparently I did not account for the situation when there were no tasks to worry about :/

@joshdover (Contributor, Author)

All the planned work on this is complete.
