Framework not declining offers #132

Closed
drewrobb opened this issue Apr 19, 2016 · 26 comments

Comments

@drewrobb
Contributor

We observed in production that our Storm framework had outstanding offers persist for much longer than the expected 2 minutes (the code linked below is what should decline them):

```java
Number filterSeconds = Optional.fromNullable((Number) _conf.get(CONF_MESOS_OFFER_FILTER_SECONDS)).or(120);
final Protos.Filters filters = Protos.Filters.newBuilder()
    .setRefuseSeconds(filterSeconds.intValue())
    .build();
_offers = new RotatingMap<>(
    new RotatingMap.ExpiredCallback<Protos.OfferID, Protos.Offer>() {
      @Override
      public void expire(Protos.OfferID key, Protos.Offer val) {
        driver.declineOffer(
            val.getId(),
            filters
        );
      }
    });
```

I verified that the offerId of the offer stayed the same for at least 10 minutes, so it wasn't just a different offer for the exact same resources. I only noticed this because another framework was unable to schedule tasks while Storm was holding an offer for a particular slave that the other framework needed resources from. I'm also aware of #50, but this is different and potentially more severe.

Fortunately, it seems like an easy workaround is to just set the mesos-master option --offer_timeout (we didn't have this set before). However, whatever underlying cause there might be in the storm scheduler that makes it forget about an outstanding offer will probably need to be addressed eventually. At least in the meantime, I think it would be good to warn people to use --offer_timeout, because that might be easier than trying to figure out how to reproduce and fix this issue.

@erikdw
Collaborator

erikdw commented Apr 21, 2016

Hey @drewrobb, sorry for the delayed response. Can you please let me know the values of these config options in your storm.yaml?

nimbus.monitor.freq.secs
mesos.offer.expiry.multiplier

@drewrobb
Contributor Author

nimbus.monitor.freq.secs = 10
mesos.offer.expiry.multiplier is not set

This is from the nimbus UI configuration section. Neither is set in my storm.yaml.

@erikdw
Collaborator

erikdw commented Apr 22, 2016

Can you please clarify which commit SHA you are running the framework from? The one you linked isn't at HEAD, and there have been some changes that might affect behavior in this area. (I have a lot more that I've written that I'll send when I get this info.)

@drewrobb
Contributor Author

I'm running 6b31108 (I think that link was just stolen from #50). Also, in case my initial report wasn't clear: the vast majority of offers are declined; there are just a few that get stuck.

@erikdw erikdw closed this as completed Apr 22, 2016
@erikdw erikdw reopened this Apr 22, 2016
@erikdw
Collaborator

erikdw commented Apr 22, 2016

@drewrobb : hmmm, I'm surprised by the behavior you are reporting.

Let me clarify that there are two behaviors involved in the lines you originally pointed at.

  1. The "offer filter refused_seconds" timeout. This is set as 120 seconds by default now, and is the amount of time that the Mesos Master will delay re-offering _declined_ resources to our framework. i.e., after the MesosNimbus declines an offer, the Mesos Master will not resend those resources to us for 120 seconds.
    • At first blush, this doesn't seem related to the reported behavior of some offers being hoarded by the MesosNimbus for 10 minutes.
    • Side note: notice how I highlighted that the configured filter timeout of 120 seconds was for _declined_ offers? That's because the setting isn't applied to partially used resources, since we don't set a filter when launching our tasks. i.e., if we use only part of an offer's resources when we launch tasks, then the default of 5 seconds is applied to the unused resources, per the comment in mesos.proto.
  2. The 2nd behavior is setting up the callback which allows the RotatingMap to decline offers when they "expire". This is what controls how long the MesosNimbus will hoard an offer. If an offer received by MesosNimbus is not used for N seconds it will be "expired", which means declined and sent back to the Mesos Master, applying the 120 second filter timeout to those declined resources.

So no offer should be held onto for more than 75 seconds before the MesosNimbus declines it, and once declined the resources of that offer should not be seen again by this MesosNimbus for 120 seconds.
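To make those two behaviors concrete, here is a minimal sketch of the two driver calls involved, using the Mesos Java API. The class and method names are made up for illustration; this is not the storm-mesos implementation:

```java
import java.util.Arrays;
import java.util.Collections;

import org.apache.mesos.Protos;
import org.apache.mesos.SchedulerDriver;

// Illustrative only: names here are invented, not taken from storm-mesos.
public class OfferHandlingSketch {

  // Behavior 2's expiry path: declining an offer with an explicit filter asks
  // the Mesos Master not to re-offer those resources to this framework for
  // refuse_seconds (120 by default, as discussed above).
  static void declineExpiredOffer(SchedulerDriver driver, Protos.Offer offer) {
    Protos.Filters filters = Protos.Filters.newBuilder()
        .setRefuseSeconds(120)
        .build();
    driver.declineOffer(offer.getId(), filters);
  }

  // The side note under behavior 1: launching tasks WITHOUT a filter means any
  // unused portion of the offer falls back to the 5-second default from
  // mesos.proto, so the leftovers are re-offered to this framework quickly.
  static void launchWithoutFilter(SchedulerDriver driver, Protos.Offer offer,
                                  Protos.TaskInfo task) {
    driver.launchTasks(Collections.singletonList(offer.getId()),
                       Arrays.asList(task));
  }
}
```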

Another clarification: which mesos version are you running?

@drewrobb
Contributor Author

Thanks for clarifying all that. I think the issue is with 2) (although maybe something outside of that exact code and configuration). Total speculation, but say nimbus is killed while it has a pending offer: would it know about that offer when it starts back up? Is there any other place where state about pending offers could get out of sync between nimbus and mesos?

I was running mesos 0.27.1 when this was observed, but we have since upgraded to 0.28, and now with --offer_timeout set to 150 on our cluster, it may be difficult to reproduce this again.

@drewrobb
Contributor Author

Small update: even with --offer_timeout we were running into problems again. From some more observations, it appears that many of the offers are refreshed immediately instead of after 120 seconds (as you described for behavior 1). The way I observed this was by noticing an offer for a particular host go away and then immediately come back (the host is the same but the offerID changes). The framework hasn't scheduled any additional tasks using that offer. I think it is possible that #134 would help with this?

@erikdw
Collaborator

erikdw commented May 18, 2016

thanks for the update @drewrobb. So if I'm understanding you correctly, you are saying that the framework is declining an offer and then its constituent resources are sent back immediately (in a new offer) to this same framework? You're not talking about the leftover resources from a task that used part of an offer, right?

I'm not super familiar with all the vagaries of the mesos resource handling. Can you confirm whether you are using anything like reservations or other non-default resource handling mechanisms?

I would think #134 would not prevent this -- it specifically is meant to lower the amount of time an offer takes to get back to the framework, in the case where we declined it due to expiring in the rotating map.

@drewrobb
Contributor Author

I can't confirm that the framework is declining the offer; the offer may have been timed out by our --offer_timeout setting, and I'm not sure how to tell. I'm not using any non-default mesos resource handling.

@erikdw
Collaborator

erikdw commented May 18, 2016

@drewrobb : hmm, I'm a bit confused about what the issue is, then. What are you expecting to happen in this case? Can you confirm some exact timings? I'm assuming you're running some other frameworks in the cluster and the behavior you're seeing is that the offers are largely (or all?) getting hoarded by the storm-mesos framework? Maybe we can jump on a call or something tomorrow. My email is my GitHub handle @ gmail.com.

@drewrobb
Contributor Author

@erikdw I'd be happy to jump on a call but I did some more debugging and I keep on learning more about what might be happening, so I think I should jot down what I have learned so far.

First, some more background on the actual problem: We are also running marathon. Our cluster gets into a state where there is only one slave in the cluster with enough resources to launch some big task on marathon. I observe that marathon is stuck trying to launch that task for a very long time (at least an hour). I then look at the mesos UI's outstanding offers page and see that storm has an offer for that slave. Marathon doesn't seem to ever get the offer; it is reoffered to Storm immediately. If I stop the storm scheduler, marathon will quickly be able to launch its task.

At this point in my debugging it is clear that the following is happening:

  • There are topologies that need assignments on our cluster during all of this (they are running on fewer workers than the topology requests; for example, a topology is running on 4 workers and requested 5).
  • There are enough available slots for assignments for the topology: Number of available slots: 5
  • No tasks get launched, and storm seems to hold on to the offer until it is rescinded due to the mesos master --offer_timeout. I'm increasing our offer_timeout setting from 150sec to 600sec to see whether that setting was simply too small. I'm not sure why MesosNimbus does not decide to launch additional tasks to satisfy the topology that needs assignments. I'm also not sure whether the fact that the topology needs assignments is the reason why offers are being held by storm for longer than the 75 seconds you described.

@dsKarthick
Collaborator

dsKarthick commented May 20, 2016

@drewrobb Is your problem the same as the "An app stays in 'Waiting' forever" case listed in https://mesosphere.github.io/marathon/docs/troubleshooting.html? If so, are you running the storm framework on marathon or outside of marathon? If you are running the storm framework outside of marathon, are you using the dynamic/static reservation features of marathon?

@dsKarthick
Collaborator

@drewrobb Another question: do you use the "constraints" feature as mentioned in https://mesosphere.github.io/marathon/?

@drewrobb
Contributor Author

drewrobb commented May 20, 2016

@dsKarthick that sort of describes my issue, but I have confirmed that storm is hoarding offers in a way that I don't think it is meant to, and in a way that is independent of my marathon usage. I've verified the following multiple times:

  1. Some marathon apps are stuck waiting.

  2. Storm has outstanding offers.

  3. Stopping the storm scheduler manually allows the marathon apps to finally get scheduled, using resources that were previously locked up in offers to storm. (Note: I just mean stopping the scheduler, so the storm framework is still registered and using resources, it is just not receiving offers).

I do run the storm framework on marathon. Some apps do use constraints.

@dsKarthick
Collaborator

dsKarthick commented May 20, 2016

@drewrobb I see that @erikdw already pointed out the offer filter refused_seconds. But you are observing that the resources are being re-offered immediately. Is the storm framework using any of the offers that you think are being hoarded? If so, as @erikdw mentioned in his earlier comments, the offer filter refused_seconds isn't applicable, and it's less of a surprise that the resources are being re-offered to storm itself.

Quoting what you said in the previous comment:

There are topologies that need assignments on our cluster during all of this

Does this mean your storm tasks are constantly dying and not getting launched, or that the offer is not enough for launching the tasks? If your storm tasks are constantly dying, then it could imply that the resources are being used by the storm framework, and therefore the [offer filter refused_seconds](https://github.com/apache/mesos/blob/0.25.0/include/mesos/mesos.proto#L1161-L1173) setting is not applicable.

@drewrobb
Contributor Author

I found some nimbus logs that may be very helpful. See https://gist.github.com/drewrobb/78d0cbea14e2b78cacfa7e3a50f3578c

Here is what I observe in this log:

  • The log shows everything happening during a ~10-minute period. We have offer_timeout set to 10 minutes.
  • The activity_grouping topology is running on 5 workers when it should be running on 8, thus nimbus is not immediately declining offers: Topologies that need assignments: #{"activity_grouping-14-1463439910"}
  • At the beginning of the log, nimbus gets offers and they are enough to launch 8 workers, from what I can tell: Number of available slots for activity_grouping-14-1463439910 : 8. Also, for the topology, topology.workers = 8.
  • Nimbus doesn't launch any new tasks or use these offers at all, even though the topology needs assignments.
  • At the end of the log, about 10 minutes from the start, the offers are finally rescinded by mesos: Offer rescinded. offerId: 07f29506-8a09-4b00-83f2-7d084dba7e3c-O27693.
  • Similar offers for the same set of resources are immediately sent to storm, and the process repeats again with new offers.

This raises two questions:

  1. Why does nimbus not assign new tasks to the topology that needs assignments during this period, when there are apparently enough available slots?

  2. Why are the offers not declined after 75 seconds, regardless of whether tasks are launched or not?

@dsKarthick
Collaborator

dsKarthick commented May 20, 2016

The logs are interesting indeed. They helped me identify a bug at https://github.com/mesos/storm/blob/master/storm/src/main/storm/mesos/schedulers/DefaultScheduler.java#L48: it should be private final Logger log = LoggerFactory.getLogger(DefaultScheduler.class);
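For reference, the corrected declaration would look like the following sketch; only the logger field is shown, and the rest of the class body and its implemented interfaces are elided:

```java
package storm.mesos.schedulers;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DefaultScheduler {
  // Bind the logger to DefaultScheduler itself so log lines are attributed
  // to the correct class.
  private final Logger log = LoggerFactory.getLogger(DefaultScheduler.class);
}
```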

  1. Why does nimbus not assign new tasks to the topology that needs assignments during this period, when there are apparently enough available slots?

I am assuming that your storm-core logs, like the ones below, are going to a different location.

2016-05-16T11:37:29.409+0000 b.s.d.nimbus [INFO] Executor sampleTopology-17-1463093449:[2 2] not alive
2016-05-16T11:37:29.409+0000 b.s.d.nimbus [INFO] Executor sampleTopology-17-1463093449:[3 3] not alive
2016-05-16T11:37:29.409+0000 b.s.d.nimbus [INFO] Executor sampleTopology-17-1463093449:[35 35] not alive

Let me try to reproduce it.

@dsKarthick
Collaborator

dsKarthick commented May 24, 2016

@drewrobb I couldn't reproduce the problem on Friday. But earlier today, I inadvertently reproduced a problem with the symptom that you described. While talking to @erikdw, I realized that I had built the storm-mesos archive wrong (I used mvn package and deployed the resultant jar rather than the one produced by bin/build-release.sh). The moment I re-deployed the correct version, I saw the expected behavior. Do you want to do a hangout session with us (@erikdw and me) tomorrow? We are available between 3PM and 7PM PDT tomorrow.

@erikdw
Collaborator

erikdw commented May 24, 2016

@dsKarthick : Warriors game is tomorrow, so I'll be unavailable from 5:30pm on.

@drewrobb
Contributor Author

@erikdw @dsKarthick today may work, I think I will have some time around 3-4pm

@dsKarthick
Collaborator

@drewrobb 3-4pm it is then! Let's do a hangout. Could you email us at {d.s.karthick@, erikdw@}gmail.com?

@erikdw
Collaborator

erikdw commented May 25, 2016

Thanks for meeting with us today, @drewrobb. Our understanding of this issue is:

  1. The core Nimbus logic is calling INimbus.allSlotsAvailableForScheduling() with a topology that "needs assignment", thus triggering the storm-mesos logic to hold onto offers in order to formulate slots out of them, keeping them until INimbus.assignSlots() is called by the core Nimbus (see the sketch after this list).
  2. However, the core Nimbus counterintuitively never actually calls INimbus.assignSlots() for this needy topology.
  3. This seems to happen in cases where the Nimbus has settled on having fewer than the requested number of workers for a topology.
    • So our plan of action is to understand this case, and potentially implement a change to storm (in erikdw/storm) to add a flag that disables this logic in core Nimbus.
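To make that interplay concrete, here is a minimal sketch of the hold-offers-until-assignment pattern described above. The signatures are simplified (the real INimbus methods take more parameters), and all field and type choices here are illustrative, not the storm-mesos source:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; names and signatures are simplified.
class HoardUntilAssignSketch {
  // Offers received from Mesos, held (hoarded) while some topology needs assignment.
  private final Map<String, String> heldOffersByOfferId = new HashMap<>();

  // Called periodically by the core Nimbus. If a topology is missing
  // assignments, slots are formulated out of the held offers and returned,
  // on the expectation that assignSlots() will follow and consume them.
  Collection<String> allSlotsAvailableForScheduling(Set<String> topologiesMissingAssignments) {
    Collection<String> slots = new ArrayList<>();
    if (!topologiesMissingAssignments.isEmpty()) {
      slots.addAll(heldOffersByOfferId.keySet()); // keep hoarding; don't decline yet
    }
    return slots;
  }

  // The step that never happened in the reported case: only here would the
  // held offers actually be used to launch tasks (with leftovers declined).
  void assignSlots(Map<String, Collection<String>> newSlotsByTopologyId) {
    // launch tasks against heldOffersByOfferId, then decline/clear the remainder
    heldOffersByOfferId.clear();
  }
}
```

If assignSlots() never arrives for the needy topology, the offers stay in the held map indefinitely, which matches the hoarding behavior seen in the logs.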

@erikdw
Collaborator

erikdw commented Jul 30, 2016

@drewrobb : we discovered a corner case in the code prior to #154 which might explain the behavior above. In our own storm-mesos-specific scheduler logic, we had a pathological case:

  • if all executors got assigned to fewer than the requested number of workers, then we would never explicitly try to spread those executors out, even if we had enough available slots to reassign all workers (see the sketch below).
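Illustratively, the missing check amounted to something like the following; this is a hypothetical simplification, not the actual storm-mesos code or the #154 diff:

```java
// Hypothetical simplification of the corner case described above.
final class ReassignmentSketch {
  static boolean shouldReassignTopology(int assignedWorkers,
                                        int requestedWorkers,
                                        int availableSlots) {
    if (assignedWorkers >= requestedWorkers) {
      return false; // the topology already has all the workers it asked for
    }
    // Old behavior: once some workers were running, the executors were never
    // re-spread, so the framework kept holding offers it never used.
    // Fixed behavior: re-spread when there are enough available slots to
    // reassign all requested workers.
    return availableSlots >= requestedWorkers;
  }
}
```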

We have fixed that behavior as part of the (numerous) changes in #154. I've kicked off the build of release 0.1.5 of this project, so assuming it goes smoothly there will be a new version for you to try out within the hour. #154 was a large effort by our team (@dsKarthick, @JessicaLHartog, and myself), and it should unblock much quicker changes that will improve various things in the project, such as finally addressing #50, which relates strongly to this issue.

@drewrobb
Contributor Author

drewrobb commented Aug 8, 2016

@erikdw awesome work by your team. I've been on vacation but am back this week. I'll look to try out release 0.1.5 in our environment this week.

@erikdw
Collaborator

erikdw commented Nov 6, 2016

@drewrobb : can you please confirm whether this behavior is gone with 0.1.5 or 0.1.7? I presume we have fixed this issue now.

@erikdw
Collaborator

erikdw commented Apr 19, 2017

We believe this is resolved so I'm closing this issue. Please reopen if the issue persists.

@erikdw erikdw closed this as completed Apr 19, 2017