
Document pge dev best practice for hard/soft time limits #6

Open
ldang opened this issue Jun 20, 2018 · 2 comments

ldang commented Jun 20, 2018

Update the job spec (hysds-io) page on the HySDS GitHub wiki to highlight a gotcha with hard/soft time limits.

PGEs can specify soft and hard time limits, which cause the worker to time out a job if it runs too long.

We have been seeing an inconsistency where Mozart shows a job in the started state even though it is no longer in the system. This appears to happen when the soft and hard time limits are the same.

Our design was:
Soft time limit should send SIGTERM
Hard time limit should send SIGKILL (kill -9)

We think there may be a race condition where Celery attempts to kill the task based on both the soft and hard timeouts: it sent SIGKILL to verdi too quickly after the SIGTERM.
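
For reference, here is a minimal sketch of how Celery's two limits are meant to behave (assuming the prefork pool); the task body, limit values, and cleanup step are illustrative placeholders, not the actual verdi worker code:

```python
from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded

app = Celery("example", broker="redis://localhost:6379/0")

@app.task(soft_time_limit=600, time_limit=900)  # soft: 10 min, hard: 15 min
def run_pge():
    try:
        run_pge_steps()  # placeholder for the long-running PGE work
    except SoftTimeLimitExceeded:
        # Soft limit hit: Celery raises this inside the task so it can clean up.
        cleanup_partial_products()  # placeholder for graceful-shutdown work
        raise
    # If the task is still alive at time_limit, Celery hard-kills the worker
    # process (SIGKILL) and no cleanup code gets a chance to run. With soft
    # equal to hard, the kill can land before the cleanup path above even starts.

def run_pge_steps():
    """Placeholder for the actual PGE execution."""

def cleanup_partial_products():
    """Placeholder for cleanup on soft timeout."""
```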

We could recommend that PGE developers never set the soft and hard time limits to the same value.
We could further add a check to the container builder that fails the build if the soft and hard time limits are the same (a sketch of such a check is below).
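
A minimal sketch of what that builder check could look like, assuming the job spec JSON exposes the limits under `soft_time_limit` and `time_limit` keys (the key names and script shape are assumptions, not the actual builder code):

```python
import json
import sys

def check_time_limits(job_spec_path):
    """Fail the build if the soft and hard limits are equal (or inverted)."""
    with open(job_spec_path) as f:
        spec = json.load(f)
    soft = spec.get("soft_time_limit")  # assumed key name
    hard = spec.get("time_limit")       # assumed key name
    if soft is not None and hard is not None and soft >= hard:
        sys.exit(
            f"{job_spec_path}: soft_time_limit ({soft}) must be strictly less "
            f"than time_limit ({hard}) to avoid the SIGTERM/SIGKILL race."
        )

if __name__ == "__main__":
    for path in sys.argv[1:]:
        check_time_limits(path)
```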


ldang commented Jun 22, 2018

A new thought is to only require the PGE developer to set the soft limit, and have the system calculate a default hard limit from the soft limit plus a fudge factor of, say, 5 minutes.
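
For example, such a default could be computed like this (values in seconds; the function name and exact fudge factor are illustrative):

```python
FUDGE_SECONDS = 5 * 60  # proposed padding between the soft and hard limits

def resolve_time_limits(soft_time_limit, time_limit=None):
    """Return (soft, hard), deriving the hard limit when only the soft one is set."""
    if time_limit is None:
        time_limit = soft_time_limit + FUDGE_SECONDS
    return soft_time_limit, time_limit

# A PGE declaring only a 1-hour soft limit gets a 1h05m hard limit:
# resolve_time_limits(3600) -> (3600, 3900)
```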

@pymonger brought up a concern that we need to be careful, or make a judgement call, about the soft time limit for sciflo jobs that block on spawned jobs. We can set a very high soft time limit to avoid prematurely terminating the entire workflow, but that risks hung jobs keeping us from running more sciflo jobs. Alternatively, we can increase the number of workers on the queue to support more concurrent sciflo jobs.


ldang commented Jun 22, 2018

Another concern when estimating the soft time limit is to make sure the developer accounts for the maximum runtime rather than the average; otherwise, the job may never be able to complete.

Examples might be crawler jobs that pull data from a data provider. Although these jobs normally do not take long to run, we may experience occasional downtimes where a backlog accumulates and the crawler takes much longer than usual.
