Unique TaskIDs #106

Merged
merged 2 commits into mesos:master from unique-task-ids-take-2 on Mar 24, 2016

Conversation

@erikdw (Collaborator) commented Mar 16, 2016

create unique task ID per task launch

This is needed to avoid a problem with mesos-slave recovery resulting in LOST tasks.

That is, we discovered that if you relaunch a topology's task onto the same worker slot (so two different task instances with the same "task ID" have run there), then when the mesos-slave process recovers, it terminates the task upon finding a "terminal" update in the task's recorded state. That terminal state was recorded the first time a task with that task ID stopped.

To solve this, we ensure all task IDs are unique by appending a millisecond-granularity timestamp to each task ID.
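
For illustration, here is a minimal sketch of that approach. The class and method names are hypothetical, not the actual storm-mesos implementation:

```java
// Hypothetical sketch: make each launch's task ID unique by appending a
// millisecond-granularity timestamp. Names here are illustrative; the real
// change lives in the storm-mesos TaskAssignments refactor.
public final class UniqueTaskIds {
  private UniqueTaskIds() {}

  // e.g. "host1.example.com-31000" -> "host1.example.com-31000-1458777600123"
  public static String uniqueTaskId(String baseTaskId) {
    return baseTaskId + "-" + System.currentTimeMillis();
  }

  public static void main(String[] args) {
    // Two launches onto the same worker slot now get distinct task IDs, so a
    // recovering mesos-slave cannot match a new instance against the terminal
    // state recorded for an earlier instance.
    System.out.println(uniqueTaskId("host1.example.com-31000"));
  }
}
```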

@erikdw (Collaborator, Author) commented Mar 16, 2016

Continuation of #94: had problems merging and rebasing and gave up. :) Just cherry-picked my changes into this new branch. @dsKarthick & @JessicaLHartog: please 👀 at your convenience.

@dsKarthick (Collaborator)

:shipit:

@erikdw force-pushed the unique-task-ids-take-2 branch 2 times, most recently from 3ff2816 to a772308, on March 24, 2016 01:39
@erikdw merged commit a9b907b into mesos:master on Mar 24, 2016
@erikdw deleted the unique-task-ids-take-2 branch on March 24, 2016 02:07
erikdw added a commit to erikdw/storm-mesos that referenced this pull request Apr 2, 2016
This is intended to fix issue mesos#119.

With the introduction of the TaskAssignments refactor for creating unique
task IDs (mesos#106), I introduced a couple of bugs in the implementation of
MesosSupervisor.getMetadata:

1. The slot counts in the Storm UI were broken -- the return from
   getMetadata was always a single element vector, due to using a
   Set instead of a java array previously. This was causing the
   PersistentVector.create(Object ... object) method to be matched,
   which just puts the passed objects into a vector without iterating
   over their constituent elements.  Since we are passing a single
   Set object, we are getting a single element in the resultant vector.
   So the fix is to just create a List and pass that to
   PersistentVector.create().
2. The returned Object must be serializable.  Depending on the
   build and runtime environment, the serialization done by the storm
   supervisor during initialization will fail, crashing the supervisor.
   That was happening because we were passing back the
   ConcurrentHashMap$KeySetView object, which is not serializable.
   Here too, the fix is to just create a List and pass that to
   PersistentVector.create().

NOTE: I haven't been able to reproduce problem 2, unfortunately. (Both fixes are sketched below.)
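
For illustration, a minimal sketch of the overload pitfall and the fix. `clojure.lang.PersistentVector` comes from Storm's Clojure dependency; the demo class and port values are made up, and the overload behavior described applies to the Clojure versions Storm shipped at the time (before 1.7 added a `create(Iterable)` overload):

```java
import java.util.ArrayList;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import clojure.lang.PersistentVector;

// Hypothetical demo of the two getMetadata bugs; port values are made up.
public class GetMetadataFixDemo {
  public static void main(String[] args) {
    // getMetadata tracked assigned ports in a ConcurrentHashMap-backed key
    // set, whose runtime type is ConcurrentHashMap$KeySetView.
    Set<String> ports = ConcurrentHashMap.newKeySet();
    ports.add("31000");
    ports.add("31001");

    // Bug 1: with the Clojure version in use, a Set argument resolved to
    // create(Object... items), which wraps the whole Set as a single vector
    // element. (The cast forces that overload here for demonstration.)
    PersistentVector broken = PersistentVector.create((Object) ports);
    System.out.println(broken.count());  // 1 -- one element: the Set itself

    // Bug 2: that lone element is a ConcurrentHashMap$KeySetView; per the
    // commit message, serializing it can fail depending on the build and
    // runtime environment, crashing the supervisor.

    // Fix for both: copy into a plain List so create(List) iterates the
    // elements, yielding a vector of plain, serializable Strings.
    PersistentVector fixed = PersistentVector.create(new ArrayList<>(ports));
    System.out.println(fixed.count());   // 2 -- "31000" and "31001"
  }
}
```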