create unique task ID per task launch #94
Conversation
@dsKarthick: please 👀
Force-pushed from a6f1d7e to 92ac096.
LOG.info("killTask: killing task " + id.getValue() + | ||
" which is running on port " + port); | ||
_state.remove(Integer.toString(port)); |
Unfortunately, this breaks `killedWorker()`. The intent is to allow MesosSupervisor to communicate to the storm-core supervisor that it no longer has this port assigned, and thus that the associated worker process should be killed. But by removing the port from `_state`, we prevent `killedWorker()` from issuing the status update later (which needs the TaskID). So I'm going to change from using LocalState `_state` for this [1] to an in-memory map pointing to a tuple `<TaskID, Boolean>` (the boolean indicating whether the port is currently assigned here). Then I'll change the methods that interact with `_state` to deal with that tuple.

[1] LocalState persists to disk, which has no value for the MesosSupervisor, since it cannot be restarted.
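A minimal sketch of what that in-memory replacement could look like; the class and member names here (`PortAssignments`, `PortRecord`) are illustrative assumptions, not the actual patch:

```java
import org.apache.mesos.Protos.TaskID;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only; names and structure are assumptions. */
final class PortAssignments {
  /** The proposed tuple of <TaskID, assigned?> for a single worker port. */
  static final class PortRecord {
    final TaskID taskId;
    volatile boolean assigned;

    PortRecord(TaskID taskId, boolean assigned) {
      this.taskId = taskId;
      this.assigned = assigned;
    }
  }

  // port -> record; in-memory only, since the MesosSupervisor cannot be
  // restarted and so gains nothing from LocalState's disk persistence.
  private final Map<Integer, PortRecord> records = new ConcurrentHashMap<>();

  void recordLaunch(int port, TaskID taskId) {
    records.put(port, new PortRecord(taskId, true));
  }

  // killTask would mark the port unassigned instead of deleting the entry,
  // so killedWorker() can still find the TaskID for its status update.
  void markUnassigned(int port) {
    PortRecord rec = records.get(port);
    if (rec != null) {
      rec.assigned = false;
    }
  }

  TaskID taskIdForPort(int port) {
    PortRecord rec = records.get(port);
    return rec == null ? null : rec.taskId;
  }
}
```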
This is needed to avoid a problem with mesos-slave recovery resulting in LOST tasks.

That is, we discovered that if you relaunch a topology's task onto the same worker slot (so two different instances with the same "task ID" have run), then when the mesos-slave process recovers, it terminates the task upon finding a "terminal" update in the task's recorded state; that terminal state was recorded the first time a task with that task ID stopped.

To solve this, we ensure all task IDs are unique by appending a millisecond-granularity timestamp to each task ID.
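For illustration, a hedged sketch of one way to build such a unique task ID; the helper name, arguments, and ID format are assumptions, not necessarily what this PR implements:

```java
import org.apache.mesos.Protos.TaskID;

final class TaskIds {
  // Hypothetical helper: buildTaskId and its ID format are assumptions.
  static TaskID buildTaskId(String topologyId, int port) {
    // Appending System.currentTimeMillis() makes each launch's ID unique,
    // so mesos-slave recovery never matches a prior terminal update
    // recorded under the same task ID.
    String unique = topologyId + "-" + port + "-" + System.currentTimeMillis();
    return TaskID.newBuilder().setValue(unique).build();
  }
}
```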
This is being continued in #106.