create unique task ID per task launch #94
Conversation
@dsKarthick: please 👀
Force-pushed from a6f1d7e to 92ac096.
LOG.info("killTask: killing task " + id.getValue() + | ||
" which is running on port " + port); | ||
_state.remove(Integer.toString(port)); |
Unfortunately, this breaks `killedWorker()`. The intent is to allow MesosSupervisor to communicate to the storm-core supervisor that it no longer has this port assigned, and thus that the associated worker process should be killed. But by removing the port from `_state`, we prevent `killedWorker()` from issuing the status update later (which needs the TaskID). So I'm going to change from using LocalState `_state` for this [1] to an in-memory map pointing to a tuple `<TaskID, Boolean>` (the boolean indicating whether the port is currently assigned here). Then I'll change the methods that interact with `_state` to deal with that tuple.

[1] LocalState persists to disk, which has no value for the MesosSupervisor, since it cannot be restarted.
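A minimal sketch of what that in-memory replacement could look like; the class and member names here (`PortAssignments`, `PortRecord`) are illustrative assumptions, not the actual patch:

```java
import org.apache.mesos.Protos.TaskID;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only; names and structure are assumptions. */
final class PortAssignments {
  /** The proposed tuple of <TaskID, assigned?> for a single worker port. */
  static final class PortRecord {
    final TaskID taskId;
    volatile boolean assigned;

    PortRecord(TaskID taskId, boolean assigned) {
      this.taskId = taskId;
      this.assigned = assigned;
    }
  }

  // port -> record; in-memory only, since the MesosSupervisor cannot be
  // restarted and so gains nothing from LocalState's disk persistence.
  private final Map<Integer, PortRecord> records = new ConcurrentHashMap<>();

  void recordLaunch(int port, TaskID taskId) {
    records.put(port, new PortRecord(taskId, true));
  }

  // killTask would mark the port unassigned instead of deleting the entry,
  // so killedWorker() can still find the TaskID for its status update.
  void markUnassigned(int port) {
    PortRecord rec = records.get(port);
    if (rec != null) {
      rec.assigned = false;
    }
  }

  TaskID taskIdForPort(int port) {
    PortRecord rec = records.get(port);
    return rec == null ? null : rec.taskId;
  }
}
```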
This is needed to avoid a problem with mesos-slave recovery resulting in LOST tasks.

That is, we discovered that if you relaunch a topology's task onto the same worker slot (so two different instances with the same "task ID" have run), then when the mesos-slave process recovers, it terminates the task upon finding a "terminal" update in the task's recorded state; that terminal state was recorded the first time a task with that task ID stopped.

To solve this, we ensure all task IDs are unique by appending a millisecond-granularity timestamp to each task ID.
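For illustration, a hedged sketch of one way to build such a unique task ID; the helper name, arguments, and ID format are assumptions, not necessarily what this PR implements:

```java
import org.apache.mesos.Protos.TaskID;

final class TaskIds {
  // Hypothetical helper: buildTaskId and its ID format are assumptions.
  static TaskID buildTaskId(String topologyId, int port) {
    // Appending System.currentTimeMillis() makes each launch's ID unique,
    // so mesos-slave recovery never matches a prior terminal update
    // recorded under the same task ID.
    String unique = topologyId + "-" + port + "-" + System.currentTimeMillis();
    return TaskID.newBuilder().setValue(unique).build();
  }
}
```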
This is being continued in #106.