Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved handling of SIGTERM #660

Merged
merged 72 commits into from
Sep 1, 2020
Merged
Show file tree
Hide file tree
Changes from 59 commits
Commits
Show all changes
72 commits
Select commit Hold shift + click to select a range
73bdb1f
Return exit code 15 (SIGTERM) after SIGTERM.
ugermann Nov 19, 2019
653b13d
Added explanatory comment about exiting marian_train with non-zero st…
ugermann Nov 22, 2019
8a44759
Merge branch 'ug-graceful-shutdown' of https://github.com/marian-nmt/…
ugermann Apr 3, 2020
2586af7
Bug fix: better handling of SIGTERM for graceful shutdown during trai…
ugermann Apr 6, 2020
66711b5
Merged PR 11929: Move around code to make later comparison with FP16 …
emjotde Mar 14, 2020
dd06542
bump version
emjotde Mar 14, 2020
68581a6
Merged PR 11831: Change the weight matrix quantization to use 7-bit m…
ykim362 Mar 25, 2020
d0fa14e
Merged PR 12243: For int8 quantized model, use int8 quantization for …
ykim362 Mar 27, 2020
71e0f0b
Support tab-separated inputs (#617)
snukky Apr 10, 2020
c95676e
bump version
emjotde Apr 10, 2020
71cc43a
actually save the merge file
emjotde Apr 10, 2020
09904e0
use float values for catch::Approx
emjotde Apr 10, 2020
4d12ffa
Fix TSV training with mini-batch-fit after the last merge
snukky Apr 11, 2020
855c94a
Update submodule regression-tests
snukky Apr 11, 2020
c18fc71
fix 0 * nan behavior in concatention
emjotde Apr 11, 2020
0ba438c
Fix 0 * nan behavior due to using -O3 instead of -OFast (#630)
emjotde Apr 11, 2020
93a27dc
Update submodule regression-tests
snukky Apr 11, 2020
733cb50
Support relative paths in shortlist and sqlite options (#612)
snukky Apr 12, 2020
7bf486a
Fix Iris example on CPU (#623)
snukky Apr 12, 2020
34bc47c
Dump version
snukky Apr 12, 2020
bc8b6fa
Merged PR 12442: cherry pick a few improvements/fixes from Frank's br…
emjotde Apr 14, 2020
ce94fe9
update changelog and version
emjotde Apr 14, 2020
59dad14
python3 shebang from #620 (#621)
kpu Apr 16, 2020
342db58
Update submodule regression-tests
snukky Apr 26, 2020
3f7b459
Update Simple-WebSocket-Server and move it to submodules (#639)
snukky Apr 27, 2020
9ae1951
Batched gemm (#633)
XapaJIaMnu May 14, 2020
1603d2f
update version
emjotde May 14, 2020
ae1dd47
Update submodule regression-tests
snukky May 17, 2020
9cd1623
Bug fix: better handling of SIGTERM for graceful shutdown during trai…
ugermann Apr 6, 2020
63006db
Merged PR 11831: Change the weight matrix quantization to use 7-bit m…
ykim362 Mar 25, 2020
128e1fc
Merged PR 12243: For int8 quantized model, use int8 quantization for …
ykim362 Mar 27, 2020
98dff9d
Support tab-separated inputs (#617)
snukky Apr 10, 2020
b06531d
actually save the merge file
emjotde Apr 10, 2020
5ce67c6
use float values for catch::Approx
emjotde Apr 10, 2020
709522c
Fix TSV training with mini-batch-fit after the last merge
snukky Apr 11, 2020
17167dd
Fix 0 * nan behavior due to using -O3 instead of -OFast (#630)
emjotde Apr 11, 2020
65c9c44
Support relative paths in shortlist and sqlite options (#612)
snukky Apr 12, 2020
1312c18
update changelog and version
emjotde Apr 14, 2020
2f2a00b
Update Simple-WebSocket-Server and move it to submodules (#639)
snukky Apr 27, 2020
d2d3563
Post-rebase fixes.
ugermann May 22, 2020
f2d9f1e
Merge branch 'ug-graceful-shutdown' of https://github.com/marian-nmt/…
ugermann Jun 17, 2020
69f8192
Update training.h
ugermann Jul 26, 2020
9a7e3a0
Merge branch 'master' into ug-graceful-shutdown
ugermann Jul 26, 2020
3dc0795
Merge branch 'master' into ug-graceful-shutdown
snukky Jul 26, 2020
fdc519a
Merge branch 'master' into ug-graceful-shutdown
ugermann Aug 2, 2020
27cba8f
Fix space after comma.
ugermann Aug 2, 2020
75459e3
Merge branch 'ug-graceful-shutdown' of https://github.com/marian-nmt/…
ugermann Aug 2, 2020
286b980
Update batch_generator.h
ugermann Aug 3, 2020
0a406d8
Update batch_generator.h
ugermann Aug 3, 2020
521560b
Update signal_handling.h
ugermann Aug 4, 2020
d4102cb
Update comments in signal_handling.h
ugermann Aug 5, 2020
0a0b83b
Merge branch 'master' into ug-graceful-shutdown
ugermann Aug 18, 2020
9ab0be5
Merge branch 'ug-graceful-shutdown' of https://github.com/marian-nmt/…
ugermann Aug 18, 2020
5d80ab4
Configurable signal handlers for SIGTERM
ugermann Aug 19, 2020
3bcd5f8
Merge remote-tracking branch 'origin' into ug-graceful-shutdown
ugermann Aug 23, 2020
bdda7bd
Updated CHANGELOG.md.
ugermann Aug 23, 2020
3c1656d
Fixed comment in scheduler.h.
ugermann Aug 23, 2020
30cf713
Squashed commit of the following:
ugermann Aug 23, 2020
cb4b42f
Merge branch 'ug-graceful-shutdown-squashed' into ug-graceful-shutdown
ugermann Aug 26, 2020
6a753d6
Update config_parser.cpp
ugermann Aug 26, 2020
deafe6a
Update config_parser.cpp
ugermann Aug 26, 2020
8e41f22
Update signal_handling.cpp
ugermann Aug 26, 2020
69b10cf
Update config_parser.cpp
ugermann Aug 26, 2020
1a5fbbc
Fix trailing whitespace.
ugermann Aug 27, 2020
506eb71
Cosmetic cleanup.
ugermann Aug 27, 2020
139052a
Comment code.
ugermann Aug 28, 2020
56e38b3
Change values for --sigterm in response to code review.
ugermann Aug 28, 2020
c7ad4f1
Bug fix in variable name.
ugermann Aug 28, 2020
8d50a85
Update signal_handling.cpp.
ugermann Aug 28, 2020
0406025
Rename 'graceful exit' to 'save and exit'.
ugermann Aug 28, 2020
d7044f7
Update CHANGELOG.md.
ugermann Aug 28, 2020
6a4d887
rename function name to saveAndExitRequested()
emjotde Sep 1, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
- Properly record cmake variables in the cmake build directory instead of the source tree.
- Added default "none" for option shuffle in BatchGenerator, so that it works in executables where shuffle is not an option.
- Added a few missing header files in shortlist.h and beam_search.h.
- Improved handling for graceful shutdown upon receiving SIGTERM. SIGTERM now also interrupts batch prefetching, which runs in a separate thread. Graceful shutdown can be disabled with --sigterm 'immediate'.
ugermann marked this conversation as resolved.
Show resolved Hide resolved

### Changed
- Move Simple-WebSocket-Server to submodule
Expand Down
2 changes: 1 addition & 1 deletion src/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@ add_library(marian STATIC
common/filesystem.cpp
common/file_stream.cpp
common/file_utils.cpp
common/signal_handling.cpp
common/types.cpp

data/alignment.cpp
Expand Down Expand Up @@ -99,7 +100,6 @@ add_library(marian STATIC
training/graph_group_singleton.cpp
training/validator.cpp
training/communicator.cpp
training/scheduler.cpp

# this is only compiled to catch build errors, but not linked
microsoft/quicksand.cpp
Expand Down
11 changes: 5 additions & 6 deletions src/command/marian_train.cpp
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
#include <signal.h>
#include "marian.h"

#include "common/signal_handling.h"
#include "training/graph_group_async.h"
#include "training/graph_group_singleton.h"
#include "training/graph_group_sync.h"
Expand Down Expand Up @@ -42,14 +43,12 @@ int main(int argc, char** argv) {
New<Train<AsyncGraphGroup>>(options)->run();
}
}

// If we exit due to SIGTERM, exit with 128 + the signal number, as suggested
// for bash in http://tldp.org/LDP/abs/html/exitcodes.html. This allows parent
// If we exit due to a graceful exit request via SIGTERM, exit with 128 + SIGTERM,
// as suggested for bash in http://tldp.org/LDP/abs/html/exitcodes.html. This allows parent
// scripts to determine if training terminated naturally or via SIGTERM.
// Whith this approach we can accommodate additional signals in the future.
// An alternative would be to return 124, which is what the timeout command
// An alternative would be to exit with code 124, which is what the timeout command
// returns for timeout -s SIGTERM <seconds> ...., because exiting after SIGTERM
ugermann marked this conversation as resolved.
Show resolved Hide resolved
// is not technically a fatal error (which is what the 128+x convention usually
// stands for).
return getSigtermFlag() ? (128 + SIGTERM) : 0;
exit(getSignalFlag(SIGTERM) ? 128 + SIGTERM : EXIT_SUCCESS);
}
12 changes: 11 additions & 1 deletion src/common/config_parser.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -324,13 +324,23 @@ void ConfigParser::addOptionsModel(cli::CLIWrapper& cli) {
"Dropout for transformer attention (0 = no dropout)");
cli.add<float>("--transformer-dropout-ffn",
"Dropout for transformer filter (0 = no dropout)");

ugermann marked this conversation as resolved.
Show resolved Hide resolved
}
cli.switchGroup(previous_group);
// clang-format on
}

void ConfigParser::addOptionsTraining(cli::CLIWrapper& cli) {
auto previous_group = cli.switchGroup("Training options");
auto previous_group = cli.switchGroup("Signal Handling");
ugermann marked this conversation as resolved.
Show resolved Hide resolved
// --sigterm is deliberately not a boolean, to allow for a consistent
// pattern of specifying custom signal handling in the future.
// (e.g., dump model but continue training upon SIGUSR1, or report current
// training status upon SIGINFO.)
cli.add<std::string>("--sigterm",
"What to do with SIGTERM: 'graceful' => save and exit (default); "
"'immediate' => exit immediately.", "graceful");
ugermann marked this conversation as resolved.
Show resolved Hide resolved

cli.switchGroup("Training options");
// clang-format off
cli.add<std::string>("--cost-type", // @TODO: rename to loss-type
"Optimization criterion: ce-mean, ce-mean-words, ce-sum, perplexity", "ce-mean");
Expand Down
33 changes: 33 additions & 0 deletions src/common/signal_handling.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
#include "common/logging.h"
#include "signal_handling.h"

// We use signal() here instead of the usual strong recommendation for
// using sigaction, which apparently is not available for Windows (cf.
// https://stackoverflow.com/questions/231912/what-is-the-difference-between-sigaction-and-signal).

namespace marian{
volatile std::sig_atomic_t sigflags_{0};
volatile std::sig_atomic_t gracefulExitRequested_{0};

bool getSignalFlag(const int sig) {
// sig_atomic_t has 32 bits. We don't accommodate signals beyond that.
ABORT_IF(sig >= 32, "Signal out of range (must be < 32, is {}).", sig);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this check? I suggest to remove both the comment and the ABORT, since sig_atomic_t is, by its name, clearly meant for signals, and therefore defines the valid signal ranges. Marian is not an operating system, we don't need to hedge against every possible invalid argument. You can also remove the check and comment in setSignalFlag

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sig_atomic_t is meant for signals in the sense that certain operations on it are guaranteed to be atomic even during signal handling. Essentially it is an int type that I'm using here as a bit field, so that we have a generic signal handler that we can install for arbitrary signals (lower than 30). That check is a check against programming errors. Signals can be higher than 30, but this code does not accommodate installing setSignalFlag for such signals.

return sigflags_ & (1<<sig);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

& is an integer operation, but you return bool.

Can you please also explain what this function does? It seems it does two things: It converts the signal, but then also masks it with a class member. I cannot understand the meaning of the function without a comment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explanation added in code.

}

void requestGracefulExit(int sig) {
setSignalFlag(sig); // keep track of triggering signal
gracefulExitRequested_ = 1; // set flag to exit gracefully
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gracefulExitRequested() returns a bool, but here you set it to an integer. Please choose one (bool if possible)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This relies on implicit type conversion from int types to bool. sig_atomic_t is specifically meant to be used in signal handlers. This is run-of-the-mill signal handling, see "Compliant Solution (Writing volatile sig_atomic_t)" here: https://wiki.sei.cmu.edu/confluence/display/c/SIG31-C.+Do+not+access+shared+objects+in+signal+handlers.

LOG(debug, "Graceful exit requested via signal {}.", sig);
ugermann marked this conversation as resolved.
Show resolved Hide resolved
}

bool gracefulExitRequested() {
return gracefulExitRequested_;
}

void setSignalFlag(int sig) {
// sig_atomic_t has 32 bits. We don't accommodate signals beyond that.
if (sig < 32) sigflags_ |= (1<<sig);
}

}
39 changes: 39 additions & 0 deletions src/common/signal_handling.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
#pragma once
#include <csignal>
#include <string>

// SIGNAL HANDLING

// The signal handlers (and checkers) here are implemented in line with with the recommendations
// for signal handling in the SEI CERT C Coding Standard, specifically
//
// - SIG30-C:
// https://wiki.sei.cmu.edu/confluence/display/c/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers
//
// - SIG31-C:
frankseide marked this conversation as resolved.
Show resolved Hide resolved
// https://wiki.sei.cmu.edu/confluence/display/c/SIG31-C.+Do+not+access+shared+objects+in+signal+handlers
//
// The exact behavior of 'graceful exit' depends on the application; for training, it means 'save model and exit',
// for a server (not implemented yet): 'block new requests but serve pending requests and then exit'.
//
// Graceful exit for training is useful for training on clusters with time limits on jobs. Slurm, for example, can be
// set up to send a custom signal at a set time before the end of the time slot, giving Marian time to save its current
// state before getting killed.

namespace marian {


/// Request graceful exit (signal handler)
void requestGracefulExit(const int sig);

/// Check if graceful exit was requested.
bool gracefulExitRequested();

/// General purpose signal handler that simply sets a flag when a signal is received.
// (only for SIGNAL No. < 32).
void setSignalFlag(const int sig); // custom handler (set flag) for sig
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to not use const in these function signatures. It exposes an internal behavior of the function to the caller. And the functions are so simple that adding const does not make them easier to udnerstand or verify.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Const removed.


/// Check if a setSignalFlag was triggered for this signal
bool getSignalFlag(const int sig);

} // End of namespace marian
9 changes: 7 additions & 2 deletions src/data/batch_generator.h
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
#pragma once

#include "common/options.h"
#include "common/signal_handling.h"
#include "data/batch_stats.h"
#include "data/rng_engine.h"
#include "training/training_state.h"
Expand Down Expand Up @@ -136,6 +137,8 @@ class BatchGenerator : public RNGEngine {
}
size_t sets = 0;
while(current_ != data_->end() && maxiBatch->size() < maxSize) { // loop over data
if (gracefulExitRequested()) // stop generating batches
return std::deque<BatchPtr>();
maxiBatch->push(*current_);
sets = current_->size();
// do not consume more than required for the maxi batch as this causes
Expand All @@ -161,6 +164,8 @@ class BatchGenerator : public RNGEngine {
if (stats_)
cachedStatsIter = stats_->begin();
while(!maxiBatch->empty()) { // while there are sentences in the queue
if (gracefulExitRequested()) // stop generating batches
return std::deque<BatchPtr>();
// push item onto batch
batchVector.push_back(maxiBatch->top());
maxiBatch->pop(); // fetch next-shortest
Expand Down Expand Up @@ -249,15 +254,15 @@ class BatchGenerator : public RNGEngine {
"If you have changed the training corpus, add --no-restore-corpus to the training command and run it again.");
bufferedBatches_ = std::move(futureBufferedBatches_.get());
// if bg thread returns an empty swath, we hit the end of the epoch
if (bufferedBatches_.empty()) {
if (bufferedBatches_.empty() || gracefulExitRequested()) {
frankseide marked this conversation as resolved.
Show resolved Hide resolved
return nullptr;
}
// and kick off the next bg operation
fetchBatchesAsync();
} else { // don't spawn any threads, i.e. batch fetching is blocking.
bufferedBatches_ = fetchBatches();
// if bufferedBatches is empty we hit the end of the epoch
if (bufferedBatches_.empty()) {
if (bufferedBatches_.empty() || gracefulExitRequested()) {
frankseide marked this conversation as resolved.
Show resolved Hide resolved
return nullptr;
}
}
Expand Down
43 changes: 0 additions & 43 deletions src/training/scheduler.cpp

This file was deleted.

19 changes: 8 additions & 11 deletions src/training/scheduler.h
Original file line number Diff line number Diff line change
@@ -1,16 +1,14 @@
#pragma once

#include "common/options.h"
#include "common/signal_handling.h"
#include "training/training_state.h"
#include "training/validator.h"
#include "training/communicator.h"
#include "layers/loss.h"

namespace marian {

bool getSigtermFlag();
void installSignalHandlers();

class Scheduler : public TrainingObserver {
private:
Ptr<Options> options_;
Expand Down Expand Up @@ -154,11 +152,10 @@ class Scheduler : public TrainingObserver {
: options_(options), state_(state) {
ABORT_IF(state_->factor != 1, "state.factor unexpectedly not 1 at this point??");
updateLearningRate(*state);
installSignalHandlers();
}

bool keepGoing() {
if(getSigtermFlag()) // received signal SIGERM => exit gracefully
if(gracefulExitRequested()) // via SIGTERM
return false;

// stop if it reached the maximum number of epochs
Expand Down Expand Up @@ -192,13 +189,12 @@ class Scheduler : public TrainingObserver {

void started() { LOG(info, "Training started"); }
void finished() {
if (getSigtermFlag())
LOG(info, "Training interrupted (SIGTERM).");
if (gracefulExitRequested())
ugermann marked this conversation as resolved.
Show resolved Hide resolved
LOG(info, "Training interrupted (via signal).");
else
LOG(info, "Training finished");
}


void addValidator(Ptr<ValidatorBase> validator) {
validators_.push_back(validator);

Expand All @@ -223,9 +219,10 @@ class Scheduler : public TrainingObserver {

void validate(const std::vector<Ptr<ExpressionGraph>>& graphs,
bool isFinal = false) {
// Do not validate if already validated (for instance, after the model is
// loaded) or if validation is scheduled for another update, or when signal SIGTERM was received
if(getSigtermFlag() // SIGTERM was received
// Do not validate if already validated (for instance, after the model is loaded)
// or if validation is scheduled for another update, or when a graceful shutdown
// was requested.
if(gracefulExitRequested()
|| state_->validated // already validated (in resumed training, for example)
|| (!state_->enteredNewPeriodOf(options_->get<std::string>("valid-freq")) && !isFinal)) // not now
return;
Expand Down
17 changes: 17 additions & 0 deletions src/training/training.h
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ template <class ModelWrapper>
class Train : public ModelTask {
private:
Ptr<Options> options_;
void installCustomSignalHandlers();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it installing multiple signal handlers?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially in the future. I prefer to keep API changes rare.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename to installEarlyExitSignalHandler()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly disagree on this one. installCustomSignalHandlers() is like parseOptions() a generic concept: install whatever custom signal handlers may be specified in the options. Either my pull request goes in as-is now, or I'll withdraw it. @emjotde


public:
Train(Ptr<Options> options) : options_(options) {}
Expand Down Expand Up @@ -77,6 +78,9 @@ class Train : public ModelTask {
bool restored = !options_->get<bool>("no-restore-corpus")
&& batchGenerator->restore(trainState);

// We only want custom behavior once training starts.
installCustomSignalHandlers();

// -- main training loop
scheduler->started();
while(scheduler->keepGoing()) {
Expand Down Expand Up @@ -107,4 +111,17 @@ class Train : public ModelTask {
finalizeMPI(std::move(mpi));
}
};

template <class ModelWrapper>
void Train<ModelWrapper>::installCustomSignalHandlers()
{
const std::string sigTermAction = options_->get<std::string>("sigterm");
if (sigTermAction == "graceful") {
LOG(debug, "Enabling graceful shutdown for SIGTERM.");
signal(SIGTERM, requestGracefulExit);
}
else if (sigTermAction != "immediate")
ABORT("Unrecognized value '{}' for --sigterm", sigTermAction);
}

} // namespace marian