Unique TaskIDs #106

Merged
merged 2 commits into mesos:master from unique-task-ids-take-2 on Mar 24, 2016

Conversation

@erikdw (Collaborator) commented Mar 16, 2016

create unique task ID per task launch

This is needed to avoid a problem with mesos-slave recovery resulting in LOST tasks.

That is, we discovered that if you relaunch a topology's task onto the same worker slot (so two different task instances with the same "task ID" have run there), then when the mesos-slave process recovers, it terminates the task upon finding a "terminal" update in the task's recorded state. That terminal state was recorded the first time a task with that task ID stopped.

To solve this, we ensure all task IDs are unique by appending a millisecond-granularity timestamp to each task ID.
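
For illustration, here is a minimal sketch of that approach. The class and method names are hypothetical, not the actual storm-mesos implementation:

```java
// Hypothetical sketch: make each launch's task ID unique by appending a
// millisecond-granularity timestamp. Names here are illustrative; the real
// change lives in the storm-mesos TaskAssignments refactor.
public final class UniqueTaskIds {
  private UniqueTaskIds() {}

  // e.g. "host1.example.com-31000" -> "host1.example.com-31000-1458777600123"
  public static String uniqueTaskId(String baseTaskId) {
    return baseTaskId + "-" + System.currentTimeMillis();
  }

  public static void main(String[] args) {
    // Two launches onto the same worker slot now get distinct task IDs, so a
    // recovering mesos-slave cannot match a new instance against the terminal
    // state recorded for an earlier instance.
    System.out.println(uniqueTaskId("host1.example.com-31000"));
  }
}
```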

@erikdw (Collaborator, Author) commented Mar 16, 2016

Continuation of #94: had problems merging and rebasing and gave up. :) Just cherry-picked my changes into this new branch. @dsKarthick & @JessicaLHartog: please 👀 at your convenience.

@dsKarthick (Collaborator)

:shipit:

@erikdw force-pushed the unique-task-ids-take-2 branch 2 times, most recently from 3ff2816 to a772308, on March 24, 2016 01:39
@erikdw merged commit a9b907b into mesos:master on Mar 24, 2016
@erikdw deleted the unique-task-ids-take-2 branch on March 24, 2016 02:07
erikdw added a commit to erikdw/storm-mesos that referenced this pull request Apr 2, 2016
This is intended to fix issue mesos#119.

With the introduction of the TaskAssignments refactor for creating unique
task IDs (mesos#106), I introduced a couple of bugs in the implementation of
MesosSupervisor.getMetadata:

1. The slot counts in the Storm UI were broken -- the return from
   getMetadata was always a single element vector, due to using a
   Set instead of a java array previously. This was causing the
   PersistentVector.create(Object ... object) method to be matched,
   which just puts the passed objects into a vector without iterating
   over their constituent elements.  Since we are passing a single
   Set object, we are getting a single element in the resultant vector.
   So the fix is to just create a List and pass that to
   PersistentVector.create().
2. The returned Object must be serializable.  Depending on the
   build and runtime environment, the serialization done by the storm
   supervisor during initialization will fail, crashing the supervisor.
   That was happening because we were passing back the
   ConcurrentHashMap$KeySetView object, which is not serializable.
   Here too, the fix is to just create a List and pass that to
   PersistentVector.create().

NOTE: I haven't been able to reproduce problem 2, unfortunately. (Both fixes are sketched below.)
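
For illustration, a minimal sketch of the overload pitfall and the fix. `clojure.lang.PersistentVector` comes from Storm's Clojure dependency; the demo class and port values are made up, and the overload behavior described applies to the Clojure versions Storm shipped at the time (before 1.7 added a `create(Iterable)` overload):

```java
import java.util.ArrayList;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import clojure.lang.PersistentVector;

// Hypothetical demo of the two getMetadata bugs; port values are made up.
public class GetMetadataFixDemo {
  public static void main(String[] args) {
    // getMetadata tracked assigned ports in a ConcurrentHashMap-backed key
    // set, whose runtime type is ConcurrentHashMap$KeySetView.
    Set<String> ports = ConcurrentHashMap.newKeySet();
    ports.add("31000");
    ports.add("31001");

    // Bug 1: with the Clojure version in use, a Set argument resolved to
    // create(Object... items), which wraps the whole Set as a single vector
    // element. (The cast forces that overload here for demonstration.)
    PersistentVector broken = PersistentVector.create((Object) ports);
    System.out.println(broken.count());  // 1 -- one element: the Set itself

    // Bug 2: that lone element is a ConcurrentHashMap$KeySetView; per the
    // commit message, serializing it can fail depending on the build and
    // runtime environment, crashing the supervisor.

    // Fix for both: copy into a plain List so create(List) iterates the
    // elements, yielding a vector of plain, serializable Strings.
    PersistentVector fixed = PersistentVector.create(new ArrayList<>(ports));
    System.out.println(fixed.count());   // 2 -- "31000" and "31001"
  }
}
```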