create unique task ID per task launch #94

Closed
wants to merge 7 commits into from


erikdw
Collaborator

@erikdw erikdw commented Feb 22, 2016

This is needed to avoid a problem with mesos-slave recovery resulting in LOST tasks.

We discovered that if you relaunch a topology's task onto the same worker slot (so that 2 different instances with the same "task ID" have run), then when the mesos-slave process recovers, it terminates the task upon finding a "terminal" update in the task's recorded state. That terminal state was recorded the 1st time the task with that task ID stopped.

To solve this, we ensure all task IDs are unique by appending a millisecond-granularity timestamp to each task ID.
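As a rough illustration of the timestamp approach described above, here is a minimal sketch. The class name `UniqueTaskIds`, the `<nodeId>-<port>-<epochMillis>` layout, and the `portOf` helper are all hypothetical and not taken from this PR. The bump that keeps two launches within the same millisecond distinct is also an addition for the sketch, not something the PR description mentions.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not code from this PR: builds a task ID that differs on every launch.
final class UniqueTaskIds {
    // Last timestamp handed out. Assumption of this sketch: bump it by one
    // if two launches land in the same millisecond, so IDs never repeat.
    private static final AtomicLong LAST_MILLIS = new AtomicLong();

    private UniqueTaskIds() {}

    // Assumed ID layout: "<nodeId>-<port>-<epochMillis>".
    static String forLaunch(String nodeId, int port) {
        long millis = LAST_MILLIS.updateAndGet(
                prev -> Math.max(prev + 1, System.currentTimeMillis()));
        return nodeId + "-" + port + "-" + millis;
    }

    // Recovers the port from an ID produced by forLaunch.
    // Works even if nodeId itself contains hyphens, since it scans from the end.
    static int portOf(String taskId) {
        int end = taskId.lastIndexOf('-');
        int start = taskId.lastIndexOf('-', end - 1);
        return Integer.parseInt(taskId.substring(start + 1, end));
    }
}
```

With IDs built this way, relaunching onto the same slot produces a fresh ID, so a terminal update recorded for an earlier instance no longer matches the running task.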

@erikdw
Collaborator Author

erikdw commented Feb 22, 2016

@dsKarthick : please 👀

@erikdw erikdw changed the title from "create unique task ID per launch of a task onto a slot" to "create unique task ID per task launch" on Feb 22, 2016

LOG.info("killTask: killing task " + id.getValue() +
" which is running on port " + port);
_state.remove(Integer.toString(port));
Collaborator Author


Unfortunately, this breaks killedWorker(). The intent of removing the port here is to let MesosSupervisor tell the storm-core supervisor that this port is no longer assigned, so that it kills the associated worker process. But by removing the port from _state, we prevent killedWorker() from issuing the status update later, since that update needs the TaskID. So I'm going to stop using LocalState _state for this [1] and instead use an in-memory map pointing to a tuple <TaskID, Boolean> (the boolean indicating whether the port is currently assigned here). Then I'll change the methods that interact with _state to deal with that tuple.

[1] LocalState persists its contents to disk, which has no value for the MesosSupervisor, since it cannot be restarted.
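A minimal sketch of the in-memory bookkeeping proposed in the comment above might look like the following. Everything here is hypothetical: the class `PortAssignments`, its method names, keying the map by port, and the use of a plain `String` in place of the Mesos `TaskID` type are assumptions made for illustration, not the change that was eventually made.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the PR's code: port -> <TaskID, assigned?> kept in memory only.
final class PortAssignments {
    // Stand-in for the tuple <TaskID, Boolean>; taskId is a plain String in this sketch.
    static final class Entry {
        final String taskId;
        final boolean assigned;

        Entry(String taskId, boolean assigned) {
            this.taskId = taskId;
            this.assigned = assigned;
        }
    }

    private final Map<Integer, Entry> byPort = new ConcurrentHashMap<>();

    // Called when a task is launched on a port.
    void launch(int port, String taskId) {
        byPort.put(port, new Entry(taskId, true));
    }

    // killTask path: mark the port unassigned but keep the TaskID around.
    void markKilled(int port) {
        byPort.computeIfPresent(port, (p, e) -> new Entry(e.taskId, false));
    }

    // What the storm-core supervisor would consult to decide whether to keep the worker.
    boolean isAssigned(int port) {
        Entry e = byPort.get(port);
        return e != null && e.assigned;
    }

    // killedWorker path: forget the port and hand back the TaskID for the status update.
    Optional<String> workerKilled(int port) {
        Entry e = byPort.remove(port);
        return Optional.ofNullable(e).map(x -> x.taskId);
    }
}
```

The point of the boolean is that killTask can signal "no longer assigned" without discarding the TaskID, so the later status update still has the ID it needs.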

drewrobb and others added 7 commits March 16, 2016 01:36
@erikdw
Collaborator Author

erikdw commented Mar 16, 2016

This is being continued in #106

@erikdw erikdw deleted the unique-task-ids branch May 25, 2016 08:44