MeMAD subtitle translation pipeline

This repository contains scripts that implement the subtitle translation pipeline developed as part of the MeMAD project. The pipeline makes use of subalign and OpusTools-perl for converting between subtitle and plain text formats, the Moses toolkit for pre- and post-processing, SentencePiece for subword segmentation, and Marian NMT transformers trained as (1) restoration models, which substitute for normalizing punctuation and truecasing, and (2) translation models.

Pre-trained restoration, translation, and segmentation models for the pipeline are available for download from the pipeline models repository on Zenodo. The translation models distributed from there are the default bilingual translation models. Multilingual translation models, other bilingual models fine-tuned on in-domain data, and further models that support guided alignment based on subword segments can all be found in the supplementary models repository. These models collectively support Dutch, English, Finnish, French, German and Swedish, and allow subtitle translation between any two of these languages.

Dependencies

perl >= 5.30
python >= 3.0
marian >= 1.8.0
Data processing scripts from moses
Data processing scripts from OpusTools-perl
- Required Perl libraries listed in the repository
Data processing scripts from subalign

To check for and install the required Python libraries, navigate to the directory where you cloned the repository, and run the following command:

pip install --user -r requirements.txt
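If you only want to check which libraries are missing before installing anything, a small script along these lines may help. It is a minimal sketch and not part of the repository: the function name `missing_requirements` is hypothetical, it assumes a conventional `requirements.txt` with one requirement per line, and it only checks that each distribution is installed, not that the installed version satisfies the constraint.

```python
import re
from importlib import metadata


def missing_requirements(requirements_path="requirements.txt"):
    """Return the distribution names from the file that are not installed."""
    missing = []
    with open(requirements_path, encoding="utf-8") as handle:
        for raw in handle:
            # Drop trailing comments and surrounding whitespace.
            line = raw.split("#", 1)[0].strip()
            # Skip blank lines and pip options such as "-r other.txt".
            if not line or line.startswith("-"):
                continue
            # Keep only the name, dropping version specifiers, extras, and markers.
            name = re.split(r"[<>=!~;\[\s]", line, maxsplit=1)[0]
            try:
                metadata.version(name)
            except metadata.PackageNotFoundError:
                missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_requirements():
        print(f"missing: {name}")
```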

Models

Usage

The pipeline requires segmentation models (sentencepiece-models.tar.gz), restoration models for the source language (restore-xx.tar.gz), and translation models from the source language to the target language (xx-yy.tar.gz). Download the desired pre-trained models from the Zenodo repository, and extract the archives into the models folder.
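The extraction step can be scripted if you download several archives at once. The following is a sketch under assumptions, not a script shipped with the pipeline: `extract_model_archives` is a hypothetical helper, the download directory is whatever folder you saved the archives to, and the archives are assumed to unpack into the layout the pipeline expects. Only extract archives from a source you trust.

```python
import tarfile
from pathlib import Path


def extract_model_archives(download_dir, models_dir="models"):
    """Extract every *.tar.gz archive in download_dir into models_dir."""
    target = Path(models_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(download_dir).glob("*.tar.gz")):
        # "r:gz" opens a gzip-compressed tar archive for reading.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target)
        extracted.append(archive.name)
    return extracted


if __name__ == "__main__":
    # "downloads" is a placeholder for the folder holding the archives.
    for name in extract_model_archives("downloads"):
        print(f"extracted {name}")
```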

Before you run the pipeline, make sure you edit the configuration file paths.conf, so that the variables point to the correct paths for your system.

The script translate.py runs the entire pipeline from source language subtitles to target language subtitles. Run ./translate.py --help to read about supported options.

Quickstart

Marian on CPU

./translate.py --src-lang de \
               --tgt-lang en \
               --input data/sample.de.srt \
               --output sample.en.srt \
               --cpu-threads 1 \
               --verbose \
               --log process.log \
               --strict-sentence-parsing

Marian on GPU

./translate.py --src-lang de \
               --tgt-lang en \
               --input data/sample.de.srt \
               --output sample.en.srt \
               --gpu-devices 4 \
               --verbose \
               --log process.log \
               --strict-sentence-parsing
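
To translate several files in a row, the calls above can be driven from Python. This is a minimal sketch that uses only options appearing in the quickstart commands. The helper names `build_translate_command` and `translate_file` are hypothetical, and the sketch assumes that you pass either `--gpu-devices` or `--cpu-threads`, as the two examples do. The value given for `gpu_devices` is passed through unchanged; check `./translate.py --help` for its exact meaning.

```python
import subprocess


def build_translate_command(src_lang, tgt_lang, input_path, output_path,
                            gpu_devices=None, cpu_threads=1, log_path=None):
    """Assemble a translate.py call from the options shown in the quickstart."""
    cmd = ["./translate.py",
           "--src-lang", src_lang,
           "--tgt-lang", tgt_lang,
           "--input", str(input_path),
           "--output", str(output_path)]
    if gpu_devices is not None:
        # GPU run: forward the value as given.
        cmd += ["--gpu-devices", str(gpu_devices)]
    else:
        # CPU run: forward the requested number of threads.
        cmd += ["--cpu-threads", str(cpu_threads)]
    if log_path is not None:
        cmd += ["--log", str(log_path)]
    return cmd


def translate_file(*args, **kwargs):
    """Run the pipeline; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_translate_command(*args, **kwargs), check=True)
```

For example, `translate_file("de", "en", "data/sample.de.srt", "sample.en.srt")` would run the CPU variant with one thread, provided it is called from the repository root.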

Translating plain text data

To bypass the subtitle segmentation steps and run the rest of the pipeline (restoration, translation, and other auxiliary processing) on plain text data, use the --plain-text-mode option. This option requires an input file containing one sentence per line instead of SRT-formatted subtitles. The output is produced in the same format as the input.

./translate.py --src-lang de \
               --tgt-lang en \
               ...
               --plain-text-mode

When using the pipeline exclusively in this mode, the OpusTools-perl and subalign software dependencies are no longer required.
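
If your sentences come from another program, they may contain line breaks or stray whitespace that would violate the one-sentence-per-line format. The sketch below shows one way to write a suitable input file. The helper `write_plain_text_input` is hypothetical, and rejecting empty sentences is a design choice of this example rather than a documented requirement: the README does not say how the pipeline treats empty lines, so the example refuses them to keep input and output lines aligned.

```python
from pathlib import Path


def write_plain_text_input(sentences, path):
    """Write one sentence per line, the format --plain-text-mode expects."""
    lines = []
    for i, sentence in enumerate(sentences):
        # Collapse internal newlines and repeated whitespace so that each
        # sentence occupies exactly one line.
        line = " ".join(sentence.split())
        if not line:
            raise ValueError(f"sentence {i} is empty")
        lines.append(line)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```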

Models

Training and test data

The models provided for download have been trained on a snapshot of the entire OPUS collection from October 2020, except for a small portion of data held out as a test set. This test set was sampled from a multi-parallel selection of movie subtitles (~100k sentence pairs per language pair) from the OpenSubtitles corpus.
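
To illustrate what holding out a test set from multi-parallel data means, the sketch below removes the same sentence positions from every language, so that the held-out sentences stay aligned across all language pairs. It is an illustration only and not the project's actual sampling script. The function name, the in-memory data layout, and the fixed random seed are all assumptions of this example.

```python
import random


def split_multiparallel(corpus, test_size, seed=0):
    """Hold out the same sentence indices in every language.

    corpus maps a language code to a list of sentences; all lists are
    aligned, so index i refers to the same sentence in every language.
    """
    lengths = {len(sentences) for sentences in corpus.values()}
    if len(lengths) != 1:
        raise ValueError("all languages must have the same number of sentences")
    n = lengths.pop()
    # Sample the held-out positions once and reuse them for every language.
    held_out = set(random.Random(seed).sample(range(n), test_size))
    train = {lang: [s for i, s in enumerate(sents) if i not in held_out]
             for lang, sents in corpus.items()}
    test = {lang: [s for i, s in enumerate(sents) if i in held_out]
            for lang, sents in corpus.items()}
    return train, test
```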

Translation model benchmarks

For reference, we report BLEU scores for the translation models.

OpenSubtitles held-out test set

src → tgt	`→ de`	`→ en`	`→ fi`	`→ fr`	`→ nl`	`→ sv`
`de →`	—	25.60	16.07	19.20	21.20	19.20
`en →`	29.16	—	22.90	27.11	29.86	28.56
`fi →`	14.08	16.96	—	14.16	16.57	17.78
`fr →`	21.39	25.28	16.76	—	20.85	19.26
`nl →`	22.27	27.22	18.79	20.62	—	22.20
`sv →`	21.21	26.88	21.36	20.21	23.70	—

WMT news translation test sets

Test sets from WMT news translation tasks up to 2019 were part of the training data for the pre-trained translation models, so they cannot serve as unbiased benchmarks. We therefore provide additional benchmarking results on the WMT 2020 news translation task test sets below.

WMT 2020

src → tgt	`→ de`	`→ en`	`→ fr`
`de →`	—	32.97	29.28
`en →`	28.41	—	—
`fr →`	24.31	—	—

Tatoeba-MT challenge models

Additional models have been trained on the Tatoeba-MT challenge datasets. Those models cover the same six language pairs and have been benchmarked on MeMAD-internal test sets, which include translations between Finnish (FIN) and Swedish (SWE) subtitles from selected YLE TV programs, with variants for the hearing impaired (indicated by the labels FIH and SWH, respectively). We have also fine-tuned each model on disjoint training data from general movie subtitles (OpenSubtitles) and MeMAD-internal YLE data. The results are listed below:

tune / test	FIN-SWE	FIH-SWE	FIN-SWH	SWE-FIN	SWH-FIN	SWE-FIH
baseline	22.3	17.0	18.2	20.8	15.9	12.2
OpenSubtitles (1M)	22.0	16.8	17.9	20.9	15.7	12.5
YLE-all (2M)	24.7	19.6	19.5	22.7	17.4	13.6
YLE-FIN-SWE (1.1M)	24.9	18.9	19.5	23.1	17.3	13.9
YLE-FIH-SWE (47k)	23.6	19.7	18.4	21.5	16.0	14.8
YLE-FIN-SWH (850k)	23.8	18.5	19.5	23.0	17.7	13.9
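
The effect of fine-tuning is easier to read as a difference from the baseline row. The sketch below copies the scores from the table above and subtracts the baseline from each fine-tuned model. The variable and function names are chosen for this example only.

```python
TEST_SETS = ["FIN-SWE", "FIH-SWE", "FIN-SWH", "SWE-FIN", "SWH-FIN", "SWE-FIH"]

# Scores copied from the table above, in the column order of TEST_SETS.
SCORES = {
    "baseline":            [22.3, 17.0, 18.2, 20.8, 15.9, 12.2],
    "OpenSubtitles (1M)":  [22.0, 16.8, 17.9, 20.9, 15.7, 12.5],
    "YLE-all (2M)":        [24.7, 19.6, 19.5, 22.7, 17.4, 13.6],
    "YLE-FIN-SWE (1.1M)":  [24.9, 18.9, 19.5, 23.1, 17.3, 13.9],
    "YLE-FIH-SWE (47k)":   [23.6, 19.7, 18.4, 21.5, 16.0, 14.8],
    "YLE-FIN-SWH (850k)":  [23.8, 18.5, 19.5, 23.0, 17.7, 13.9],
}


def gains_over_baseline(scores, baseline="baseline"):
    """Return each fine-tuned model's score minus the baseline, per test set."""
    base = scores[baseline]
    return {
        name: [round(value - ref, 1) for value, ref in zip(row, base)]
        for name, row in scores.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, deltas in gains_over_baseline(SCORES).items():
        print(name, dict(zip(TEST_SETS, deltas)))
```

For instance, fine-tuning on YLE-all (2M) improves the FIN-SWE score by 2.4 points (24.7 versus 22.3), while fine-tuning on OpenSubtitles (1M) lowers it by 0.3.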

Publications

MT for Subtitling: Investigating professional translators’ user experience and feedback

@inproceedings{koponen-etal-2020-mt,
    title = "{MT} for Subtitling: Investigating professional translators{'} user experience and feedback",
    author = {Koponen, Maarit  and
      Sulubacak, Umut  and
      Vitikainen, Kaisa  and
      Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of 1st Workshop on Post-Editing in Modern-Day Translation",
    month = oct,
    year = "2020",
    address = "Virtual",
    publisher = "Association for Machine Translation in the Americas",
    url = "https://www.aclweb.org/anthology/2020.amta-pemdt.6",
    pages = "79--92",
}

MT for subtitling: User evaluation of post-editing productivity

@inproceedings{koponen-etal-2020-mt-subtitling,
    title = "{MT} for subtitling: User evaluation of post-editing productivity",
    author = {Koponen, Maarit  and
      Sulubacak, Umut  and
      Vitikainen, Kaisa  and
      Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://www.aclweb.org/anthology/2020.eamt-1.13",
    pages = "115--124",
}
