
Neural Machine Translation

Apr 4 2023 · 8 min read
#nlp #neural-machine-translation

Disclaimer: this post is for my bachelor and master students at UvA, to give them an introduction to PRACTICAL NMT from top to bottom. There is a gap between theory and practice that is not always addressed by books.


Since you are a student of an informatics or AI programme, let's assume you know something about machine learning, or maybe even machine translation. You've heard about the Transformer. You probably read The Illustrated Transformer blog post. Maybe you even took an NLP course (if not, see resources). You know Python and PyTorch (I hope). And now you have the task of training an NMT system. You can of course write the code yourself, but that's tedious. So instead people tend to use ML frameworks such as fairseq, Hugging Face transformers, or joeynmt. Here you can find a good overview of the frameworks. We will look into each part and see examples with the fairseq framework. Why frameworks? Training/evaluation loops are common to all seq2seq models; frameworks deal with them, with data loading, and with many more things. It's not worth spending time writing your own code for routine tasks when you can focus on the research part.

To make things easier, we can split the NMT pipeline into 3 parts (pre-processing, training, and generation/evaluation) and look at each of them separately:

Pre-processing: It’s all about data

One day you wake up and think "I need to train the best source-to-target translation model!". First and foremost, hold your horses and check whether you have access to parallel data.

What is parallel data? It is a set of sentences in language X (the source) paired with the same sentences in language Y (the target). In practice you need 2 files: source and target. The source file contains only sentences in the source language, one sentence per line. Each line of the target file contains the same sentence in the target language (i.e., aligned sentences share the same line number).

Good morning.
i used to be an adventurer like you, then i took an arrow in the knee.
Goede morgen.
Ik was vroeger een avonturier zoals jij toen nam ik een pijl in de knie
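Reading such a pair of files in Python is a matter of zipping them line by line. A minimal sketch (the function name and file paths are hypothetical):

```python
# Read a parallel corpus: line i of the source file is aligned
# with line i of the target file.
def read_parallel(src_path, tgt_path):
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        return [(s.strip(), t.strip()) for s, t in zip(src, tgt)]
```

A sanity check worth doing at this point: both files must have the same number of lines, otherwise your alignment is silently broken.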

Where to find parallel data? There are several publicly available NMT datasets. The main benefit of such curated datasets is that the validation and test sets are kept fixed, so you can compare your results with other works.

Sometimes the data comes in one csv file, sometimes it's xml files. Then you need to parse them. I hope you know some bash. If not, just google and copy from Stack Overflow.

Surprisingly (no), we cannot feed raw text into a neural network: it works with numbers, not words. So pre-processing aims to map text into (some) numerical representation.

Pre-processing usually includes tokenization, truecasing, vocabulary building, and subword splitting.

There are also some framework-specific modifications, but let’s talk about it later.

One example of a pre-processing pipeline:


Tools: moses, Hugging Face tokenizers, etc.

  1. Tokenization: needed to split a sentence into individual tokens, e.g. to split off punctuation.
i used to be an adventurer like you , then i took an arrow in the knee .

You see? The main difference is that punctuation is now separated from the words, so each token can be assigned an individual index.
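To make the idea concrete, here is a toy tokenizer (not what you should use in practice, moses or a Hugging Face tokenizer do this properly):

```python
import re

# Toy tokenizer: keep runs of word characters together and split
# every punctuation mark into its own token.
def toy_tokenize(sentence):
    return re.findall(r"\w+|[^\w\s]", sentence)
```

For example, `toy_tokenize("like you, then")` separates the comma into its own token.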

  2. Truecasing: restore capitalization where necessary.
I used to be an adventurer like you , then I took an arrow in the knee .
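A toy sketch of what a truecaser does (in practice you would use the moses truecase.perl script, which is trained on your corpus): remember the most frequent surface form of each word, then rewrite the sentence-initial token to that form.

```python
from collections import Counter

# Toy truecaser: map each lowercased word to its most frequent
# surface form in the training corpus.
def train_truecaser(sentences):
    forms = {}
    for sent in sentences:
        for tok in sent.split():
            forms.setdefault(tok.lower(), Counter())[tok] += 1
    return {w: c.most_common(1)[0][0] for w, c in forms.items()}

# Rewrite the first token of a sentence to its preferred casing.
def truecase(sentence, model):
    toks = sentence.split()
    if toks:
        toks[0] = model.get(toks[0].lower(), toks[0])
    return " ".join(toks)
```

The point of truecasing is that "I" at the start of a sentence and "I" in the middle end up as the same vocabulary entry.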


A vocabulary is important for an NMT system. It defines how your text will be encoded and how the generation process will go. With a larger vocabulary you get slower training, BUT you can generate a bigger variety of tokens. It's always about compromises and best practices. Example of vocabulary building:

from collections import Counter

def get_vocab(train_file, vocab_file):
    # Count word frequencies over the (tokenized) training data.
    c = Counter()
    for line in train_file:
        for word in line.strip('\r\n ').split(' '):
            if word:
                c[word] += 1

    # Write one "word frequency" pair per line, most frequent first.
    for key, freq in sorted(c.items(), key=lambda x: x[1], reverse=True):
        vocab_file.write(key + " " + str(freq) + "\n")

The most frequent tokens usually get the lowest indices. Each word can then be encoded with the index of this word in the vocabulary.

I used to be an adventurer like you , then I took an arrow in the knee .
13 300 9 40 45 1 50 16 5 85 13 401 45 1 14 7 1 6

If a word is not in the vocabulary, it can be encoded with the unk token (index 1 in the example). The same procedure applies to the target language.
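The encoding step itself is a simple lookup with a fallback. A sketch (the mapping `word_to_id` is assumed to come from the vocabulary file built above; index 1 for unk matches the example):

```python
# Encode a tokenized sentence as vocabulary indices; words not in
# the vocabulary fall back to the unk index.
def encode(tokens, word_to_id, unk_id=1):
    return [word_to_id.get(tok, unk_id) for tok in tokens]
```

Any out-of-vocabulary word maps to `unk_id`, which is exactly the information loss the next section is about.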

Subword splitting

Tools: sentencepiece (spm), subword-nmt (bpe)

As you can see from the example above, a word-level vocabulary results in multiple unknowns, which are called out-of-vocabulary (OOV) words in the NLP literature. To reduce the number of OOVs, subword splitting was suggested in Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., ACL 2016).

You can train your spm/bpe model on the training data or on external data. You can also train bpe jointly on source and target.

I used to be an ad@@ vent@@ ur@@ er like you , then I took an ar@@ row in the k@@ nee .
13 300 9 40 45 271 1458 716 187 50 16 5 85 13 401 45 207 1632 14 7 255 4811 6

Amazing! With subwords we do not have any unknowns!
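To give an intuition for what BPE does under the hood, here is a toy sketch of applying an ordered list of learned merge rules to one word (real systems learn the merges from the corpus and apply the best-ranked merge first; use subword-nmt or sentencepiece, not this):

```python
# Toy BPE: start from characters and apply merge rules in their
# learned order, fusing every adjacent occurrence of the pair.
def apply_bpe(word, merges):
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols
```

In the subword-nmt output above, `@@` marks subwords that are not word-final, which is what makes post-processing (gluing the pieces back) possible.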

In the end, your files can look like this:

├── test.nl-en.bpe.en
├── test.nl-en.bpe.nl
├── train.nl-en.bpe.en
├── train.nl-en.bpe.nl
├── valid.nl-en.bpe.en
└── valid.nl-en.bpe.nl

Framework-specific preprocessing

An additional step might be needed before we start training the model. It always depends on what you use for training. Fairseq requires data in a binary format, which can be prepared via fairseq-preprocess:

$ fairseq-preprocess \
    --source-lang nl --target-lang en \
    --trainpref my_awesome_nmt_data/train.nl-en \
    --validpref my_awesome_nmt_data/valid.nl-en \
    --testpref my_awesome_nmt_data/test.nl-en \
    --destdir data-bin/my_awesome_nmt_data_bin \
    --thresholdtgt 0 --thresholdsrc 0 --workers 20

Fairseq will automatically build a vocabulary. You can use a joint vocabulary (meaning source AND target share the same vocabulary, which contains tokens from both sides) with the --joined-dictionary parameter; sometimes it helps to boost performance.

You can also build your own vocabulary and pass the files to fairseq-preprocess using --srcdict and --tgtdict. But if you need a custom vocabulary, I trust you know what you are doing.


Before you start training, you need to make sure you have the data (see previous step), a model, and a criterion implemented. Do you understand what your input and output are? Do you understand your model? Do you understand the criterion you optimize?

I want to make this part a bit more framework-specific to show examples. But I will give some tips on how to start with a framework.

Case 1 (easy): Everything is implemented (in a framework)

Only the training parameters are your responsibility. Example from fairseq:

Personally I prefer using .yaml configs and [fairseq-hydra-train](https://github.com/facebookresearch/fairseq/blob/main/docs/hydra_integration.md) rather than CLI training. But you can do what suits you best.

Example config:

# @package _group_
task:
  _name: translation
  data: my_awesome_nmt_data_bin
  source_lang: nl
  target_lang: en
  eval_bleu: true
  eval_bleu_args: '{"beam":1}'
  eval_bleu_detok: moses
  eval_bleu_remove_bpe: sentencepiece
  eval_bleu_print_samples: false
criterion:
  _name: label_smoothed_cross_entropy
  label_smoothing: 0.1
model:
  _name: transformer
  encoder:
    learned_pos: true
  decoder:
    learned_pos: true
  dropout: 0.3
  share_decoder_input_output_embed: true
optimizer:
  _name: adam
  adam_betas: (0.9,0.98)
lr_scheduler:
  _name: inverse_sqrt
  warmup_updates: 10000
  warmup_init_lr: 1e-07
dataset:
  max_tokens: 8000
  validate_interval_updates: 2000
optimization:
  lr: [0.0005]
  update_freq: [8]
  max_update: 50000
  stop_min_lr: 1e-09
checkpoint:
  no_epoch_checkpoints: true
  best_checkpoint_metric: bleu
  maximize_best_checkpoint_metric: true
  save_dir: my_awesome_model
common:
  wandb_project: ???
  log_format: simple
  log_interval: 100

Then you just run:

$ fairseq-hydra-train \
    --config-dir /path/to/external/configs \
    --config-name my_awesome_config.yaml

If you want to override a config parameter:

$ fairseq-hydra-train \
    --config-dir /path/to/external/configs \
    --config-name my_awesome_config.yaml common.wandb_project=MYPROJECT

If you want to add a new param:

$ fairseq-hydra-train \
    --config-dir /path/to/external/configs \
    --config-name my_awesome_config.yaml +common.new_param=paramvalue

How to tune parameters?

You can of course try different combinations; however, there are MANY, so it's not feasible unless you have unlimited GPU resources. If we fix the data and model architecture, there are two important parameters: learning rate (with its warmup schedule) and batch size. Other parameters can give a performance boost too, of course, but these two are the most likely to affect training.
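One number worth computing explicitly is the effective batch size: with max_tokens as the per-GPU batch size in tokens and update_freq as the gradient accumulation factor, the tokens seen per optimizer update are their product times the number of GPUs. With the values from the config above:

```python
# Effective tokens per optimizer update =
#   max_tokens (per GPU) * update_freq (gradient accumulation) * num_gpus
max_tokens = 8000
update_freq = 8
num_gpus = 1    # assumption: single-GPU training
effective_batch = max_tokens * update_freq * num_gpus
print(effective_batch)  # 64000 tokens per update
```

This matters because learning rates reported in papers implicitly assume a particular effective batch size; if yours is much smaller, the same learning rate may diverge.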

If you use a well-known dataset, you can look up the papers that report results on it and check their hyperparameters. E.g., you can use Papers with Code to sort papers based on test sets.

Please also refer to tuning playbook for general tips on tuning.


Case 2 (not so easy): You need to implement some parts

The most exciting part. Some tips on extending fairseq can be found here.

Case 3 (hero mode): You need to implement everything. Good luck.

You have reached the world’s edge, none but devils play past here

Generation (+Evaluation)

At this point you have your model file. You want to generate translations for the source sentences in the test subset and evaluate them using an automated evaluation metric. Easy.

$ fairseq-generate my_awesome_data_bin \
    --path checkpoints/checkpoint_best.pt \
    --batch-size 128 --beam 5 --remove-bpe > hypothesis

If your subword splitting is spm, use --remove-bpe sentencepiece

Here comes the tricky part. You need to do post-processing, meaning the reverse of pre-processing, last step first: you first undo truecasing, then tokenization, and only then do you assess your translations using e.g. https://github.com/mjpost/sacrebleu

$ cat hypothesis | grep -P "^H" | sort -V | cut -f3- | $MOSES/scripts/recaser/detruecase.perl | $MOSES/scripts/tokenizer/detokenizer.perl -l {trg} > eval_ready_hyps
$ cat eval_ready_hyps | sacrebleu target_test_file
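For intuition, detokenization is just the inverse of the tokenization step: punctuation gets reattached to the preceding word. A toy sketch (the moses detokenizer.perl in the pipeline above handles many more language-specific cases):

```python
import re

# Toy detokenizer: join tokens with spaces, then remove the space
# before common punctuation marks.
def toy_detokenize(tokens):
    text = " ".join(tokens)
    return re.sub(r"\s+([.,!?;:])", r"\1", text)
```

Forgetting this step is a classic mistake: sacrebleu expects detokenized hypotheses, and scoring tokenized text inflates BLEU.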

The BLEU score is a common metric for automated evaluation of translation, but it's not the best one. There are many other metrics you can use, e.g. chrF or COMET.


Courses and tutorials:






Under construction. Ask me questions, I will add them here.