
Document pge dev best practice for hard/soft time limits #6

Open
ldang opened this issue Jun 20, 2018 · 2 comments

ldang commented Jun 20, 2018

Update the job spec (hysds-io) page on the HySDS GitHub wiki to highlight a gotcha with hard/soft time limits.

PGEs can specify soft and hard time limits, which cause the worker to time out a job if it runs too long.

We have been seeing an inconsistency where Mozart shows a job in the started state even though it is no longer in the system. This appears to happen when the soft and hard time limits are the same.

Our design was:
Soft time limit should send SIGTERM
Hard time limit should send SIGKILL (kill -9)

We think there may be a race condition where Celery attempts to kill the task based on both the soft and hard timeouts: it sent SIGKILL to verdi too quickly after the SIGTERM.
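
For reference, here is a minimal sketch of how Celery's two limits are meant to behave (assuming the prefork pool); the task body, limit values, and cleanup step are illustrative placeholders, not the actual verdi worker code:

```python
from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded

app = Celery("example", broker="redis://localhost:6379/0")

@app.task(soft_time_limit=600, time_limit=900)  # soft: 10 min, hard: 15 min
def run_pge():
    try:
        run_pge_steps()  # placeholder for the long-running PGE work
    except SoftTimeLimitExceeded:
        # Soft limit hit: Celery raises this inside the task so it can clean up.
        cleanup_partial_products()  # placeholder for graceful-shutdown work
        raise
    # If the task is still alive at time_limit, Celery hard-kills the worker
    # process (SIGKILL) and no cleanup code gets a chance to run. With soft
    # equal to hard, the kill can land before the cleanup path above even starts.

def run_pge_steps():
    """Placeholder for the actual PGE execution."""

def cleanup_partial_products():
    """Placeholder for cleanup on soft timeout."""
```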

We could recommend that PGE developers never set the soft and hard time limits to the same value.
We could further add a check to the container builder that fails the build if the soft and hard time limits are the same (a sketch of such a check is below).
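
A minimal sketch of what that builder check could look like, assuming the job spec JSON exposes the limits under `soft_time_limit` and `time_limit` keys (the key names and script shape are assumptions, not the actual builder code):

```python
import json
import sys

def check_time_limits(job_spec_path):
    """Fail the build if the soft and hard limits are equal (or inverted)."""
    with open(job_spec_path) as f:
        spec = json.load(f)
    soft = spec.get("soft_time_limit")  # assumed key name
    hard = spec.get("time_limit")       # assumed key name
    if soft is not None and hard is not None and soft >= hard:
        sys.exit(
            f"{job_spec_path}: soft_time_limit ({soft}) must be strictly less "
            f"than time_limit ({hard}) to avoid the SIGTERM/SIGKILL race."
        )

if __name__ == "__main__":
    for path in sys.argv[1:]:
        check_time_limits(path)
```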


ldang commented Jun 22, 2018

A new thought is to only require the PGE developer to set the soft limit, and have the system calculate a default hard limit from the soft limit plus a fudge factor of, say, 5 minutes.
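
For example, such a default could be computed like this (values in seconds; the function name and exact fudge factor are illustrative):

```python
FUDGE_SECONDS = 5 * 60  # proposed padding between the soft and hard limits

def resolve_time_limits(soft_time_limit, time_limit=None):
    """Return (soft, hard), deriving the hard limit when only the soft one is set."""
    if time_limit is None:
        time_limit = soft_time_limit + FUDGE_SECONDS
    return soft_time_limit, time_limit

# A PGE declaring only a 1-hour soft limit gets a 1h05m hard limit:
# resolve_time_limits(3600) -> (3600, 3900)
```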

@pymonger brought up a concern that we need to be careful, or make a judgement call, about the soft time limit for sciflo jobs that block on spawned jobs. We can set a very high soft time limit to avoid prematurely terminating the entire workflow, but that risks hung jobs keeping us from running more sciflo jobs. Alternatively, we can increase the number of workers on the queue to support more concurrent sciflo jobs.


ldang commented Jun 22, 2018

Another concern when estimating the soft time limit is to make sure the developer accounts for the maximum runtime rather than the average; otherwise, the job may never be able to complete.

Examples might be crawler jobs that pull data from a data provider. Although these jobs normally do not take long to run, we may experience occasional downtimes where a backlog accumulates and the crawler takes much longer than usual.
