Support tab-separated inputs #617

Merged
merged 25 commits into from
Apr 10, 2020
25 commits
33b8a26
Add basic support for TSV inputs
snukky Mar 26, 2020
4abea30
Fix mini-batch-fit for TSV inputs
snukky Mar 26, 2020
12f427b
Abort if shuffling data from stdin
snukky Mar 26, 2020
1e1fbf3
Fix terminating training with data from STDIN
snukky Mar 26, 2020
2cdee18
Allow creating vocabs from TSV files
snukky Mar 27, 2020
3d3953c
Add comments; clean creation of vocabs from TSV files
snukky Mar 28, 2020
da7676a
Guess --tsv-size based on the model type
snukky Mar 28, 2020
ba6f50b
Add shortcut for STDIN inputs
snukky Mar 28, 2020
33f67a1
Rename --tsv-size to --tsv-fields
snukky Mar 29, 2020
81558ef
Allow only one 'stdin' in --train-sets
snukky Mar 29, 2020
d2e8b09
Properly create separate vocabularies from a TSV file
snukky Apr 2, 2020
3c449cb
Clearer logging message
snukky Apr 2, 2020
cc1ca18
Add error message for wrong number of valid sets if --tsv is used
snukky Apr 2, 2020
be0e431
Use --no-shuffle instead of --shuffle in the error message
snukky Apr 2, 2020
1d6da8b
Fix continuing training from STDIN
snukky Apr 2, 2020
08a2900
Update CHANGELOG
snukky Apr 3, 2020
baebd8f
Support both 'stdin' and '-'
snukky Apr 6, 2020
b90b140
Guess --tsv-fields from dim-vocabs if special:model.yml available
snukky Apr 6, 2020
8352c94
Update error messages
snukky Apr 6, 2020
74d7b74
Move variable outside the loop
snukky Apr 6, 2020
6da35a2
Refactorize utils::splitTsv; add unit tests
snukky Apr 6, 2020
2fccd8f
Support '-' as stdin; refactorize; add comments
snukky Apr 6, 2020
92cc1f8
Abort if excessive field(s) in the TSV input
snukky Apr 8, 2020
614a9fe
Add a TODO on passing one vocab with fully-tied embeddings
snukky Apr 8, 2020
6b4281a
Remove the unit test with excessive tab-separated fields
snukky Apr 9, 2020
16 changes: 9 additions & 7 deletions CHANGELOG.md
@@ -14,6 +14,8 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
## [1.9.0] - 2020-03-10

### Added
- Training and scoring from STDIN
- Support for tab-separated inputs, added options --tsv and --tsv-fields
- An option to print cached variables from CMake
- Add support for compiling on Mac (and clang)
- An option for resetting stalled validation metrics
@@ -34,15 +36,15 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
- Support for 16-bit packed models with FBGEMM
- Multiple separated parameter types in ExpressionGraph, currently inference-only
- Safe handling of sigterm signal
- Automatic vectorization of elementwise operations on CPU for tensors dims that
are divisible by 4 (AVX) and 8 (AVX2)
- Replacing std::shared_ptr<T> with custom IntrusivePtr<T> for small objects like
Tensors, Hypotheses and Expressions.
- Fp16 inference working for translation
- Gradient-checkpointing

### Fixed
- Replace value for INVALID_PATH_SCORE with std::numeric_limits<float>::lowest()
to avoid overflow with long sequences
- Break up potential circular references for GraphGroup*
- Fix empty source batch entries with batch purging
@@ -53,16 +55,16 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
- FastOpt now reads "n" and "y" values as strings, not as boolean values
- Fixed multiple reduction kernels on GPU
- Fixed guided-alignment training with cross-entropy
- Replace IntrusivePtr with std::unique_ptr in FastOpt, fixes random segfaults
due to thread-non-safety of reference counting.
- Make sure that items are 256-byte aligned during saving
- Make explicit matmul functions respect setting of cublasMathMode
- Fix memory mapping for mixed parameter models
- Removed naked pointer and potential memory-leak from file_stream.{cpp,h}
- Compilation for GCC >= 7 due to exception thrown in destructor
- Sort parameters by lexicographical order during allocation to ensure consistent
memory-layout during allocation, loading, saving.
- Output empty line when input is empty line. Previous behavior might result in
hallucinated outputs.
- Compilation with CUDA 10.1

@@ -73,7 +75,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
- Return error signal on SIGTERM
- Dropped support for CUDA 8.0, CUDA 9.0 is now minimal requirement
- Removed autotuner for now, will be switched back on later
- Boost dependency is now optional and only required for marian_server
- Dropped support for g++-4.9
- Simplified file stream and temporary file handling
- Unified node initializers, same function API.
5 changes: 3 additions & 2 deletions src/CMakeLists.txt
@@ -24,8 +24,9 @@ add_library(marian STATIC
common/io.cpp
common/filesystem.cpp
common/file_stream.cpp
common/file_utils.cpp
common/types.cpp

data/alignment.cpp
data/vocab.cpp
data/default_vocab.cpp
@@ -139,7 +140,7 @@ cuda_add_library(marian_cuda
tensors/gpu/algorithm.cu
tensors/gpu/prod.cpp
tensors/gpu/element.cu
tensors/gpu/add.cu
tensors/gpu/add_all.cu
tensors/gpu/tensor_operators.cu
tensors/gpu/cudnn_wrappers.cu
40 changes: 36 additions & 4 deletions src/common/config.cpp
@@ -49,11 +49,12 @@ void Config::initialize(ConfigParser const& cp) {
}

// load model parameters
bool loaded = false;
if(mode == cli::mode::translation || mode == cli::mode::server) {
auto model = get<std::vector<std::string>>("models")[0];
try {
if(!get<bool>("ignore-model-config"))
- loadModelParameters(model);
+ loaded = loadModelParameters(model);
} catch(std::runtime_error& ) {
LOG(info, "[config] No model configuration found in model file");
}
@@ -64,13 +65,42 @@ void Config::initialize(ConfigParser const& cp) {
if(filesystem::exists(model) && !get<bool>("no-reload")) {
try {
if(!get<bool>("ignore-model-config"))
- loadModelParameters(model);
+ loaded = loadModelParameters(model);
} catch(std::runtime_error&) {
LOG(info, "[config] No model configuration found in model file");
}
}
}

// guess --tsv-fields (the number of streams) if not set
if(get<bool>("tsv") && get<size_t>("tsv-fields") == 0) {
size_t tsvFields = 0;
if(loaded) {
// model.npz has properly set vocab dimensions in special:model.yml,
// so we may use them to determine the number of streams
for(auto dim : get<std::vector<size_t>>("dim-vocabs"))
if(dim != 0) // language models have a fake extra vocab
++tsvFields;
// For translation there is no target stream
if((mode == cli::mode::translation || mode == cli::mode::server) && tsvFields > 1)
--tsvFields;
} else {
// TODO: This is very brittle, find a better solution
// If parameters from model.npz special:model.yml were not loaded,
// guess the number of inputs and outputs based on the model type name.
auto modelType = get<std::string>("type");

tsvFields = 1;
if(modelType.find("multi-", 0) != std::string::npos) // is a dual-source model
tsvFields += 1;
if(mode == cli::mode::training || mode == cli::mode::scoring)
if(modelType.rfind("lm", 0) != 0) // unless it is a language model
tsvFields += 1;
}

config_["tsv-fields"] = tsvFields;
}

// echo full configuration
log();

@@ -124,16 +154,18 @@ void Config::save(const std::string& name) {
out << *this;
}

- void Config::loadModelParameters(const std::string& name) {
+ bool Config::loadModelParameters(const std::string& name) {
YAML::Node config;
io::getYamlFromModel(config, "special:model.yml", name);
override(config);
return true;
}

- void Config::loadModelParameters(const void* ptr) {
+ bool Config::loadModelParameters(const void* ptr) {
YAML::Node config;
io::getYamlFromModel(config, "special:model.yml", ptr);
override(config);
return true;
}

void Config::override(const YAML::Node& params) {
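The `--tsv-fields` guessing logic added to `Config::initialize` in this file can be read in isolation as a small pure function. The following is a minimal sketch rather than Marian's code: `Mode` and `guessTsvFields` are hypothetical names introduced here for illustration, and server mode is folded into `Mode::Translation` for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for cli::mode; server mode is treated like translation.
enum class Mode { Training, Scoring, Translation };

// Returns the guessed number of tab-separated fields (input streams).
// `loaded` says whether vocabulary dimensions were read from the model file.
std::size_t guessTsvFields(const std::string& modelType,
                           Mode mode,
                           const std::vector<std::size_t>& dimVocabs,
                           bool loaded) {
  if(loaded) {
    std::size_t n = 0;
    for(std::size_t dim : dimVocabs)
      if(dim != 0)  // language models have a fake extra vocab of size 0
        ++n;
    if(mode == Mode::Translation && n > 1)
      --n;  // there is no target stream at translation time
    return n;
  }

  // Fallback: guess from the model type name only.
  std::size_t n = 1;
  if(modelType.find("multi-") != std::string::npos)  // dual-source model
    ++n;
  if(mode != Mode::Translation && modelType.rfind("lm", 0) != 0)
    ++n;  // training and scoring also read a target, unless it is a language model
  assert(n >= 1 && n <= 3);  // the name-based heuristic can only yield 1 to 3
  return n;
}
```

Under these assumptions, a plain `transformer` in training mode is guessed to have two fields, and the same model in translation mode only one. Note that the name-based branch depends on the `multi-` and `lm` naming conventions, which is why the diff marks it with a TODO.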
4 changes: 2 additions & 2 deletions src/common/config.h
@@ -77,8 +77,8 @@ class Config {
}

YAML::Node getModelParameters();
- void loadModelParameters(const std::string& name);
- void loadModelParameters(const void* ptr);
+ bool loadModelParameters(const std::string& name);
+ bool loadModelParameters(const void* ptr);

std::vector<DeviceId> getDevices(size_t myMPIRank = 0, size_t numRanks = 1);

35 changes: 32 additions & 3 deletions src/common/config_parser.cpp
@@ -375,6 +375,7 @@ void ConfigParser::addOptionsTraining(cli::CLIWrapper& cli) {
"10000u");

addSuboptionsInputLength(cli);
addSuboptionsTSV(cli);

// data management options
cli.add<std::string>("--shuffle",
@@ -497,8 +498,10 @@ void ConfigParser::addOptionsTraining(cli::CLIWrapper& cli) {
{"float32", "float32", "float32"});
cli.add<std::vector<std::string>>("--cost-scaling",
"Dynamic cost scaling for mixed precision training: "
- "power of 2, scaling window, scaling factor, tolerance, range, minimum factor")->implicit_val("7.f 2000 2.f 0.05f 10 1.f");
- cli.add<bool>("--normalize-gradient", "Normalize gradient by multiplying with no. devices / total labels");
+ "power of 2, scaling window, scaling factor, tolerance, range, minimum factor")
+ ->implicit_val("7.f 2000 2.f 0.05f 10 1.f");
+ cli.add<bool>("--normalize-gradient",
+ "Normalize gradient by multiplying with no. devices / total labels");

// multi-node training
cli.add<bool>("--multi-node",
@@ -623,8 +626,9 @@ void ConfigParser::addOptionsTranslation(cli::CLIWrapper& cli) {
"Keep the output segmented into SentencePiece subwords");
#endif

- addSuboptionsDevices(cli);
  addSuboptionsInputLength(cli);
+ addSuboptionsTSV(cli);
+ addSuboptionsDevices(cli);
addSuboptionsBatching(cli);

cli.add<bool>("--optimize",
@@ -684,6 +688,7 @@ void ConfigParser::addOptionsScoring(cli::CLIWrapper& cli) {
->implicit_val("1"),

addSuboptionsInputLength(cli);
addSuboptionsTSV(cli);
addSuboptionsDevices(cli);
addSuboptionsBatching(cli);

@@ -791,6 +796,15 @@ void ConfigParser::addSuboptionsInputLength(cli::CLIWrapper& cli) {
// clang-format on
}

void ConfigParser::addSuboptionsTSV(cli::CLIWrapper& cli) {
// clang-format off
cli.add<bool>("--tsv",
"Tab-separated input");
cli.add<size_t>("--tsv-fields",
"Number of fields in the TSV input, guessed based on the model type");
// clang-format on
}

void ConfigParser::addSuboptionsULR(cli::CLIWrapper& cli) {
// clang-format off
// support for universal encoder ULR https://arxiv.org/pdf/1802.05368.pdf
@@ -861,6 +875,21 @@ Ptr<Options> ConfigParser::parseOptions(int argc, char** argv, bool doValidate){
cli::processPaths(config_, cli::InterpolateEnvVars, PATHS);
}

// Option shortcuts for input from STDIN for trainer and scorer
if(mode_ == cli::mode::training || mode_ == cli::mode::scoring) {
auto trainSets = get<std::vector<std::string>>("train-sets");
YAML::Node config;
// Assume the input will come from STDIN if --tsv is set but no --train-sets are given
if(get<bool>("tsv") && trainSets.empty()) {
config["train-sets"].push_back("stdin");
// Assume the input is in TSV format if --train-sets is set to "stdin"
} else if(trainSets.size() == 1 && (trainSets[0] == "stdin" || trainSets[0] == "-")) {
config["tsv"] = true;
}
if(!config.IsNull())
cli_.updateConfig(config, cli::OptionPriority::CommandLine, "A shortcut for STDIN failed.");
}

if(doValidate) {
ConfigValidator(config_).validateOptions(mode_);
}
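The STDIN shortcut added to `ConfigParser::parseOptions` fills in whichever of `--tsv` and `--train-sets` is implied by the other. A minimal sketch of that decision, assuming the two options are passed as plain variables instead of a YAML config; `isStdin` and `applyStdinShortcut` are hypothetical helper names, not part of Marian:

```cpp
#include <cassert>
#include <string>
#include <vector>

// True for the two spellings of standard input accepted in --train-sets.
bool isStdin(const std::string& path) {
  return path == "stdin" || path == "-";
}

// Updates at most one of the two settings, mirroring the if/else-if in the diff.
void applyStdinShortcut(bool& tsv, std::vector<std::string>& trainSets) {
  if(tsv && trainSets.empty())
    trainSets.push_back("stdin");  // --tsv without --train-sets: read from STDIN
  else if(trainSets.size() == 1 && isStdin(trainSets[0]))
    tsv = true;                    // a single stdin input: assume TSV format
}
```

With this shape, `--tsv` alone and `--train-sets stdin` alone both end up as the same configuration, while a regular file list without `--tsv` is left untouched.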
1 change: 1 addition & 0 deletions src/common/config_parser.h
@@ -135,6 +135,7 @@ class ConfigParser {
void addSuboptionsDevices(cli::CLIWrapper&);
void addSuboptionsBatching(cli::CLIWrapper&);
void addSuboptionsInputLength(cli::CLIWrapper&);
void addSuboptionsTSV(cli::CLIWrapper&);
void addSuboptionsULR(cli::CLIWrapper&);

// Extract paths to all config files found in the config object.
25 changes: 20 additions & 5 deletions src/common/config_validator.cpp
@@ -70,9 +70,20 @@ void ConfigValidator::validateOptionsParallelData() const {
auto trainSets = get<std::vector<std::string>>("train-sets");
ABORT_IF(trainSets.empty(), "No train sets given in config file or on command line");

- auto vocabs = get<std::vector<std::string>>("vocabs");
- ABORT_IF(!vocabs.empty() && vocabs.size() != trainSets.size(),
- "There should be as many vocabularies as training sets");
+ auto numVocabs = get<std::vector<std::string>>("vocabs").size();
+ ABORT_IF(!get<bool>("tsv") && numVocabs > 0 && numVocabs != trainSets.size(),
+ "There should be as many vocabularies as training files");

// disallow, for example --tsv --train-sets file1.tsv file2.tsv
ABORT_IF(get<bool>("tsv") && trainSets.size() != 1,
"A single file must be provided with --train-sets (or stdin) for a tab-separated input");

// disallow, for example --train-sets stdin stdin or --train-sets stdin file.tsv
ABORT_IF(trainSets.size() > 1
&& std::any_of(trainSets.begin(),
trainSets.end(),
[](const std::string& s) { return (s == "stdin") || (s == "-"); }),
"Only one 'stdin' or '-' in --train-sets is allowed");
}

void ConfigValidator::validateOptionsScoring() const {
@@ -94,7 +105,7 @@ void ConfigValidator::validateOptionsTraining() const {
ABORT_IF(has("embedding-vectors")
&& get<std::vector<std::string>>("embedding-vectors").size() != trainSets.size()
&& !get<std::vector<std::string>>("embedding-vectors").empty(),
- "There should be as many embedding vector files as training sets");
+ "There should be as many embedding vector files as training files");

filesystem::Path modelPath(get<std::string>("model"));

@@ -105,10 +116,14 @@ void ConfigValidator::validateOptionsTraining() const {
ABORT_IF(!modelDir.empty() && !filesystem::isDirectory(modelDir),
"Model directory does not exist");

std::string errorMsg = "There should be as many validation files as training files";
if(get<bool>("tsv"))
errorMsg += ". If the training set is in the TSV format, validation sets have to also be a single TSV file";

ABORT_IF(has("valid-sets")
&& get<std::vector<std::string>>("valid-sets").size() != trainSets.size()
&& !get<std::vector<std::string>>("valid-sets").empty(),
- "There should be as many validation sets as training sets");
+ errorMsg);

// validations for learning rate decaying
ABORT_IF(get<float>("lr-decay") > 1.f, "Learning rate decay factor greater than 1.0 is unusual");
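The new checks in `validateOptionsParallelData` can be summarized as a function that returns the first violated rule. This is a sketch under the assumption that the options are passed in directly; `checkTrainSets` is a hypothetical name, and it returns the message instead of aborting as `ABORT_IF` does:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns an empty string if the options are consistent, otherwise the error text.
std::string checkTrainSets(bool tsv,
                           const std::vector<std::string>& trainSets,
                           std::size_t numVocabs) {
  if(trainSets.empty())
    return "No train sets given in config file or on command line";

  // Without --tsv, each training file needs its own vocabulary (if any are given).
  if(!tsv && numVocabs > 0 && numVocabs != trainSets.size())
    return "There should be as many vocabularies as training files";

  // With --tsv, all streams come from one file or from stdin.
  if(tsv && trainSets.size() != 1)
    return "A single file must be provided with --train-sets (or stdin) for a tab-separated input";

  // stdin cannot be mixed with other inputs or listed twice.
  bool hasStdin = std::any_of(trainSets.begin(), trainSets.end(),
                              [](const std::string& s) { return s == "stdin" || s == "-"; });
  if(trainSets.size() > 1 && hasStdin)
    return "Only one 'stdin' or '-' in --train-sets is allowed";

  return "";
}
```

The vocabulary count is only compared against the number of files when `--tsv` is off, because a single TSV file can legitimately feed several vocabularies.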
6 changes: 5 additions & 1 deletion src/common/file_stream.cpp
@@ -22,7 +22,7 @@ InputFileStream::InputFileStream(const std::string &file)
ABORT_IF(!marian::filesystem::exists(file_), "File '{}' does not exist", file);

streamBuf1_.reset(new std::filebuf());
auto ret = static_cast<std::filebuf*>(streamBuf1_.get())->open(file.c_str(), std::ios::in | std::ios::binary);
ABORT_IF(!ret, "File '{}' cannot be opened", file);
ABORT_IF(ret != streamBuf1_.get(), "Return value is not equal to streambuf pointer, that is weird");

@@ -84,6 +84,10 @@ OutputFileStream::~OutputFileStream() {
this->flush();
}

std::string OutputFileStream::getFileName() const {
return file_.string();
}

///////////////////////////////////////////////////////////////////////////////////////////////
TemporaryFile::TemporaryFile(const std::string &base, bool earlyUnlink)
: OutputFileStream(), unlink_(earlyUnlink) {
2 changes: 2 additions & 0 deletions src/common/file_stream.h
@@ -62,6 +62,8 @@ class OutputFileStream : public std::ostream {
explicit OutputFileStream(const std::string& file);
virtual ~OutputFileStream();

std::string getFileName() const;

template <typename T>
size_t write(const T* ptr, size_t num = 1) {
std::ostream::write((char*)ptr, num * sizeof(T));
28 changes: 28 additions & 0 deletions src/common/file_utils.cpp
@@ -0,0 +1,28 @@
#include "common/file_utils.h"
#include "common/utils.h"

namespace marian {
namespace fileutils {

void cut(const std::string& tsvIn,
Ptr<io::TemporaryFile> tsvOut,
const std::vector<size_t>& fields,
size_t numFields,
const std::string& sep /*= "\t"*/) {
std::vector<std::string> tsvFields(numFields);
std::string line;
io::InputFileStream ioIn(tsvIn);
while(getline(ioIn, line)) {
tsvFields.clear();
utils::splitTsv(line, tsvFields, numFields); // split tab-separated fields
for(size_t i = 0; i < fields.size(); ++i) {
*tsvOut << tsvFields[fields[i]];
if(i < fields.size() - 1)
*tsvOut << sep; // concatenating fields with the custom separator
}
*tsvOut << std::endl;
}
}

} // namespace fileutils
} // namespace marian
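To see what `fileutils::cut` does to a single line, here is a standalone sketch that works on strings instead of Marian's file streams. `splitFields` and `cutLine` are hypothetical helpers written for this illustration; `splitFields` is not `utils::splitTsv`, whose implementation is not shown in this part of the diff, and unlike the PR it does not abort on excessive fields.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits `line` on tabs into exactly `numFields` entries. Missing trailing
// fields become empty strings; the remainder of the line (including any
// extra tabs) stays in the last field that was read.
std::vector<std::string> splitFields(const std::string& line, std::size_t numFields) {
  std::vector<std::string> out;
  std::size_t begin = 0;
  while(out.size() + 1 < numFields) {
    std::size_t pos = line.find('\t', begin);
    if(pos == std::string::npos)
      break;
    out.push_back(line.substr(begin, pos - begin));
    begin = pos + 1;
  }
  out.push_back(line.substr(begin));
  out.resize(numFields);
  return out;
}

// Selects the given field indices from one TSV line and joins them with `sep`.
std::string cutLine(const std::string& line,
                    const std::vector<std::size_t>& fields,
                    std::size_t numFields,
                    const std::string& sep = "\t") {
  std::vector<std::string> all = splitFields(line, numFields);
  std::string out;
  for(std::size_t i = 0; i < fields.size(); ++i) {
    assert(fields[i] < numFields);  // every requested index must exist
    if(i > 0)
      out += sep;  // join selected fields with the custom separator
    out += all[fields[i]];
  }
  return out;
}
```

Selecting fields 0 and 2 from a three-field line keeps the first and third columns, which is the kind of projection needed when separate vocabularies are built from one TSV file.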
18 changes: 18 additions & 0 deletions src/common/file_utils.h
@@ -0,0 +1,18 @@
#pragma once

#include <string>
#include <vector>

#include "common/file_stream.h"

namespace marian {
namespace fileutils {

void cut(const std::string& tsvIn,
Ptr<io::TemporaryFile> tsvOut,
const std::vector<size_t>& fields,
size_t numFields,
const std::string& sep = "\t");

} // namespace fileutils
} // namespace marian
4 changes: 2 additions & 2 deletions src/common/filesystem.h
@@ -115,5 +115,5 @@ namespace filesystem {

using FilesystemError = Pathie::PathieError;

- }
- }
+ } // namespace filesystem
+ } // namespace marian