Merge branch 'ug-graceful-shutdown' of https://github.com/marian-nmt/…
…marian-dev into ug-graceful-shutdown
ugermann committed Aug 18, 2020
2 parents 0a0b83b + d4102cb commit 9ab0be5
Showing 2 changed files with 22 additions and 27 deletions.
37 changes: 18 additions & 19 deletions src/common/signal_handling.h
@@ -3,25 +3,24 @@

// SIGNAL HANDLING

// The Marian signal handlers set global flags that threads can
// consider when a signal is received. This can be used for a graceful
// shutdown instead of a hard abandonment, e.g. after receiving
// SIGTERM during training.

// When SIGTERM is received, the global (static member) flag sigterm_
// (false by default) is set to true by signalHandler(). When sigterm_
// is true, keepGoing() returns false, and the current state of
// training models is saved prior to exiting. This functionality is
// helpful when training on clusters with time limits on compute
// slots, e.g., on clusters managed by slurm. Slurm can be asked to
// send a (custom) warning signal to a process at a given point in
// time prior to the hard "time's up".
//
// Correspondingly, fetchBatches in the batch generator checks the flag
// frequently and quits after the overall process receives a SIGTERM.

// The Marian signal handler setSignalFlag is a general purpose signal handler
// that sets a global flag upon receiving a signal (with SIGNAL No. < 32) in line
// with the recommendations for signal handling in the SEI CERT C Coding Standard, specifically
// - SIG30-C: https://wiki.sei.cmu.edu/confluence/display/c/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers
// - SIG31-C: https://wiki.sei.cmu.edu/confluence/display/c/SIG31-C.+Do+not+access+shared+objects+in+signal+handlers
// Usage:
// - install the signal handler for a specific signal with signal(SIGNAL, setSignalFlag),
// e.g. signal(SIGTERM, setSignalFlag)
// - check the flag wherever appropriate with getSignalFlag(SIGNAL),
// e.g. getSignalFlag(SIGTERM)
//
// This mechanism is currently used in marian training to ensure a graceful shutdown after receiving
// SIGTERM, saving the current state of training before exiting. This behavior is particularly desirable
// when training on clusters with time limits on compute slots, e.g., on certain clusters managed by slurm.
// Slurm can be asked to send a (custom) warning signal to a process at a certain time prior to the
// hard end of the time slot.

namespace marian {
bool getSignalFlag(int sig); // return true if sig was received, false otherwise
void setSignalFlag(int sig); // set custom handler (set flag) for sig
}
void setSignalFlag(int sig); // custom handler (set flag) for sig
} // end of namespace marian
12 changes: 4 additions & 8 deletions src/data/batch_generator.h
@@ -138,9 +138,8 @@ class BatchGenerator : public RNGEngine {

size_t sets = 0;
while(current_ != data_->end() && maxiBatch->size() < maxSize) { // loop over data
if (getSignalFlag(SIGTERM)) { // received SIGTERM, abandon ship ...
return tempBatches;
}
if (getSignalFlag(SIGTERM)) // received SIGTERM, abandon ship ...
return std::deque<BatchPtr>();
maxiBatch->push(*current_);
sets = current_->size();
// do not consume more than required for the maxi batch as this causes
@@ -165,11 +164,8 @@ class BatchGenerator : public RNGEngine {
cachedStatsIter = stats_->begin();

while(!maxiBatch->empty()) { // while there are sentences in the queue

if (getSignalFlag(SIGTERM)) { // received SIGTERM, abandon ship ...
return tempBatches;
}

if (getSignalFlag(SIGTERM)) // received SIGTERM, abandon ship ...
return std::deque<BatchPtr>();
// push item onto batch
batchVector.push_back(maxiBatch->top());
maxiBatch->pop(); // fetch next-shortest
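
The usage described in the header comment (install the handler, then poll the flag) and the early return from the batch loops can be sketched as follows. This is a minimal illustration, not Marian's actual implementation, which is not part of this diff. The `sketch` namespace, the `flags` array, `kMaxSignals`, and `fetchItems` are hypothetical names; only `setSignalFlag`, `getSignalFlag`, and the "signal number < 32" limit come from the header. The handler only writes a `volatile std::sig_atomic_t`, in line with the SIG30-C and SIG31-C recommendations cited above.

```cpp
#include <csignal>
#include <deque>

namespace sketch {  // hypothetical namespace, not marian

// Assumption: one flag per signal number below 32, as the header comment suggests.
constexpr int kMaxSignals = 32;
volatile std::sig_atomic_t flags[kMaxSignals] = {};

// Signal handler: does nothing except set the flag for the received signal.
void setSignalFlag(int sig) {
  if (sig >= 0 && sig < kMaxSignals)
    flags[sig] = 1;
}

// Returns true if sig was received, false otherwise.
bool getSignalFlag(int sig) {
  return sig >= 0 && sig < kMaxSignals && flags[sig] != 0;
}

// Hypothetical stand-in for a batch-fetching loop. It polls the flag on every
// iteration and returns an empty container once SIGTERM has been seen.
// raiseAt simulates the arrival of SIGTERM after item raiseAt was queued;
// pass -1 to never raise it.
std::deque<int> fetchItems(int total, int raiseAt) {
  std::deque<int> items;
  for (int i = 0; i < total; ++i) {
    if (getSignalFlag(SIGTERM))  // received SIGTERM, abandon ship ...
      return std::deque<int>();
    items.push_back(i);
    if (i == raiseAt)
      std::raise(SIGTERM);  // handler runs synchronously and sets the flag
  }
  return items;
}

}  // end of namespace sketch
```

A caller would install the handler once with `std::signal(SIGTERM, sketch::setSignalFlag)` and treat an empty result together with `getSignalFlag(SIGTERM)` as the cue to save state and exit. Returning an empty container rather than the partially filled one, as this commit does, means the caller never receives a truncated set of batches.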