Disclaimer: this post is for my bachelor and master students at UvA, to give them an introduction to PRACTICAL NMT from top to bottom. There is a gap between theory and practice that is not always addressed by books.
Since you are a student in an informatics or AI programme, let's assume you know something about machine learning, or maybe even machine translation. You've heard about the Transformer. You probably read The Illustrated Transformer blog post. Maybe you even took an NLP course (if not, see the resources). You know python and pytorch (I hope). And now you have the task of training an NMT system. You can of course write all the code yourself, but that's tedious. So instead people tend to use ML frameworks such as fairseq, hugging face transformers, or joeynmt. Here you can find a good overview of the frameworks.
We will look into each part and see examples with the fairseq framework. Why frameworks? Training/evaluation loops are common to seq2seq models, and frameworks deal with them, with data loading, and with many more things. It's not worth spending time writing your own code for routine tasks when you can focus on the research part.
To make things easier, we can split the NMT pipeline into 3 parts and look at them separately:
1. Data and pre-processing
2. Training
3. Generation and evaluation
One day you wake up and think “I need to train the best NMT model ever”. For that, you first need parallel data.
What is parallel data? It is a set of sentences in language X (the source) paired with the same sentences in language Y (the target). In practice you need 2 files: source data and target data. The source file contains only sentences in the source language, one sentence per line. Each line of the target file contains the same sentence in the target language (i.e., corresponding sentences have the same line number).
Hello.
Good morning.
i used to be an adventurer like you, then i took an arrow in the knee.
Hallo.
Goede morgen.
Ik was vroeger een avonturier zoals jij toen nam ik een pijl in de knie
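If you build these files yourself, it is worth checking that they really are line-aligned. A minimal sketch, assuming hypothetical file names train.nl and train.en:

# sanity check for parallel data: both files must have the same number of lines,
# and line i of the source file must correspond to line i of the target file
from itertools import zip_longest

def check_parallel(src_path, tgt_path):
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for i, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files are not aligned: one of them ends before line {i}")
            if not s.strip() or not t.strip():
                print(f"warning: empty sentence on line {i}")

check_parallel("train.nl", "train.en")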
Where to find parallel data? There are several publicly available NMT datasets. The main advantage of such curated datasets is that the validation and test sets are kept the same, so you can compare your results with other work.
Sometimes the data comes as one csv file, sometimes as xml files. Then you need to parse them. I hope you know some bash. If not, just google and copy from stackoverflow.
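If bash is not your thing, a few lines of Python do the job as well. A sketch for the csv/tsv case; the file names and the "source<TAB>target" column layout are assumptions, adjust them to your data:

# split a tab-separated file with "source<TAB>target" rows into the two
# plain-text files that NMT tooling expects (one sentence per line)
import csv

with open("raw_data.tsv", encoding="utf-8", newline="") as f, \
     open("train.nl", "w", encoding="utf-8") as src_out, \
     open("train.en", "w", encoding="utf-8") as tgt_out:
    for src, tgt in csv.reader(f, delimiter="\t"):
        src_out.write(src.strip() + "\n")
        tgt_out.write(tgt.strip() + "\n")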
Surprisingly (not), we cannot feed raw text into a neural network. Neural networks work with numbers, not words. So pre-processing aims to map text into (some) numerical representation.
Pre-processing usually includes:
- normalization and tokenization
- truecasing (or lowercasing)
- sub-word splitting
- building a vocabulary
There are also some framework-specific modifications, but let's talk about those later.
One example of a pre-processing pipeline:
Normalization and tokenization
Tools: moses (or its Python port sacremoses), hugging face tokenizers, etc.
i used to be an adventurer like you , then i took an arrow in the knee .
You see? The main difference is that punctuation is now separated from the words, so each token can be assigned an individual index. After truecasing, the sentence looks like this:
I used to be an adventurer like you , then I took an arrow in the knee .
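If you prefer Python over the original moses perl scripts, here is a minimal sketch with sacremoses; lowercasing stands in for proper truecasing, just to reproduce the example above:

# normalization + tokenization with sacremoses; .lower() stands in for truecasing
from sacremoses import MosesPunctNormalizer, MosesTokenizer

mpn = MosesPunctNormalizer(lang="en")
mt = MosesTokenizer(lang="en")

line = "I used to be an adventurer like you, then I took an arrow in the knee."
normalized = mpn.normalize(line)                   # clean up punctuation and whitespace
tokens = mt.tokenize(normalized, return_str=True)  # "... like you , then I took an arrow in the knee ."
print(tokens.lower())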
Vocabulary
The vocabulary is important for an NMT system. It defines how your text will be encoded and how the generation process will go. With a larger vocabulary training is slower, BUT you can generate a bigger variety of tokens. It's always about compromises and best practices. Example of vocabulary building:
# https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/get_vocab.py#L40
from collections import Counter

def get_vocab(train_file, vocab_file):
    # count token frequencies over the (tokenized) training file
    c = Counter()
    for line in train_file:
        for word in line.strip('\r\n ').split(' '):
            if word:
                c[word] += 1
    # write "token frequency" pairs, most frequent first
    for key, f in sorted(c.items(), key=lambda x: x[1], reverse=True):
        vocab_file.write(key + " " + str(f) + "\n")
The most frequent tokens usually get the lowest indices. Each word can then be encoded as its index in the vocabulary.
I used to be an adventurer like you , then I took an arrow in the knee .
13 300 9 40 45 1 50 16 5 85 13 401 45 1 14 7 1 6
If a word is not in the vocabulary, it is encoded with the unk token (index 1 in the example). The same procedure applies to the target language.
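As a toy illustration of that mapping (the indices are taken from the example above; the encode helper itself is hypothetical):

# encode tokens as vocabulary indices; anything unseen becomes the unk index (1 here)
def encode(tokens, word2idx, unk_idx=1):
    return [word2idx.get(tok, unk_idx) for tok in tokens]

word2idx = {"I": 13, "used": 300, "to": 9, "be": 40, "an": 45, "like": 50,
            "you": 16, ",": 5, "then": 85, "took": 401, "in": 14, "the": 7, ".": 6}
sentence = "I used to be an adventurer like you , then I took an arrow in the knee ."
print(encode(sentence.split(), word2idx))
# -> [13, 300, 9, 40, 45, 1, 50, 16, 5, 85, 13, 401, 45, 1, 14, 7, 1, 6]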
Sub-word splitting
Tools: sentencepiece (spm), subword-nmt (bpe)
As you can see from the example above, a word-level vocabulary results in multiple unknowns, which are called out-of-vocabulary (OOV) words in the NLP literature. To reduce the number of OOVs, sub-word splitting was suggested in Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., ACL 2016).
You can train your spm/bpe model on the training data or on external data. You can also train bpe jointly on source and target.
I used to be an ad@@ vent@@ ur@@ er like you , then I took an ar@@ row in the k@@ nee .
13 300 9 40 45 271 1458 716 187 50 16 5 85 13 401 45 207 1632 14 7 255 4811 6
Amazing! With sub-words we do not have any unknowns!
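A minimal sketch of training and applying a sentencepiece model in Python; the file names and vocab size are assumptions, and note that sentencepiece marks word boundaries with "▁" instead of the "@@" continuation marker that subword-nmt produced in the example above:

import sentencepiece as spm

# train a BPE model on the (tokenized) source-side training data;
# passing a list of files, e.g. ["train.nl", "train.en"], trains it jointly
spm.SentencePieceTrainer.train(
    input="train.nl-en.tok.en",
    model_prefix="spm_en",        # writes spm_en.model and spm_en.vocab
    vocab_size=8000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="spm_en.model")
pieces = sp.encode("I used to be an adventurer like you , then I took an arrow in the knee .",
                   out_type=str)
print(" ".join(pieces))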
In the end, your files can look like this:
my_awesome_nmt_data
├── test.nl-en.bpe.en
├── test.nl-en.bpe.nl
├── train.nl-en.bpe.en
├── train.nl-en.bpe.nl
├── valid.nl-en.bpe.en
└── valid.nl-en.bpe.nl
Framework-specific preprocessing
An additional step might be needed before we start training the model; it depends on what you use for training. Fairseq requires data in a binary format, which can be prepared via fairseq-preprocess:
$ fairseq-preprocess \
--source-lang nl --target-lang en \
--trainpref my_awesome_nmt_data/train.nl-en \
--validpref my_awesome_nmt_data/valid.nl-en \
--testpref my_awesome_nmt_data/test.nl-en \
--destdir data-bin/my_awesome_nmt_data_bin \
--thresholdtgt 0 --thresholdsrc 0 --workers 20
Fairseq will automatically build a vocabulary. You can use a joint vocabulary (meaning source and target share the same vocabulary, which contains tokens from both sides) by adding the --joined-dictionary parameter; sometimes it helps to boost performance.
You can also build your own vocabulary and pass the files to fairseq-preprocess using --srcdict and --tgtdict. But if you need a custom vocabulary, I trust you know what you are doing.
Before you start training, you need to make sure you have the data (see the previous step), a model, and a criterion implemented. Do you understand what your input and output are? Do you understand your model? Do you understand the criterion you optimize?
I have to make this part a bit more framework-specific to show examples, but I will also give some general tips on how to start with a framework.
When you use a framework, only the training parameters are your responsibility. Here is an example with fairseq.
Personally, I prefer using .yaml configs and [fairseq-hydra-train](https://github.com/facebookresearch/fairseq/blob/main/docs/hydra_integration.md) rather than CLI training, but you can do what suits you best.
Example config:
# @package _group_
task:
  _name: translation
  data: data-bin/my_awesome_nmt_data_bin
  source_lang: nl
  target_lang: en
  eval_bleu: true
  eval_bleu_args: '{"beam":1}'
  eval_bleu_detok: moses
  eval_bleu_remove_bpe: sentencepiece
  eval_bleu_print_samples: false
criterion:
  _name: label_smoothed_cross_entropy
  label_smoothing: 0.1
model:
  _name: transformer
  decoder:
    learned_pos: true
  encoder:
    learned_pos: true
  dropout: 0.3
  share_decoder_input_output_embed: true
optimizer:
  _name: adam
  adam_betas: (0.9,0.98)
lr_scheduler:
  _name: inverse_sqrt
  warmup_updates: 10000
  warmup_init_lr: 1e-07
dataset:
  max_tokens: 8000
  validate_interval_updates: 2000
optimization:
  lr: [0.0005]
  update_freq: [8]
  max_update: 50000
  stop_min_lr: 1e-09
checkpoint:
  no_epoch_checkpoints: true
  best_checkpoint_metric: bleu
  maximize_best_checkpoint_metric: true
  save_dir: my_awesome_model
common:
  wandb_project: ???
  log_format: simple
  log_interval: 100
Then you just run:
$ fairseq-hydra-train \
--config-dir /path/to/external/configs \
--config-name my_awesome_config.yaml
If you want to override a config parameter:
$ fairseq-hydra-train \
--config-dir /path/to/external/configs \
--config-name my_awesome_config.yaml common.wandb_project=MYPROJECT
If you want to add a new parameter:
$ fairseq-hydra-train \
--config-dir /path/to/external/configs \
--config-name my_awesome_config.yaml +common.new_param=paramvalue
How to tune parameters?
You can of course try different combinations, but there are MANY, so it's not feasible unless you have unlimited GPU resources. If we fix the data and the model architecture, there are two important parameters: the learning rate (with its warmup schedule) and the batch size. Other parameters can give a performance boost too, of course, but these two are most likely to affect training.
If you use a well-known dataset, you can look up the papers that report results on it and check their hyperparameters. E.g., you can use Papers with Code to sort papers by their test-set scores.
Please also refer to the tuning playbook for general tips on tuning.
Tips:
The most exciting part. Some tips on extending fairseq can be found here.
You have reached the world’s edge, none but devils play past here
At this point you have your model file. You want to generate translations for the source sentences of the test subset and evaluate them with an automated evaluation metric. Easy.
$ fairseq-generate data-bin/my_awesome_nmt_data_bin \
--path my_awesome_model/checkpoint_best.pt \
--batch-size 128 --beam 5 --remove-bpe > hypothesis
If your subword splitting is spm, use --remove-bpe sentencepiece.
Here comes the tricky part. You need to do post-processing, meaning the reverse of pre-processing, last step first: e.g., you first undo the truecasing, then the tokenization, and only then you assess your translation, e.g. with https://github.com/mjpost/sacrebleu
$ cat hypothesis | grep -P "^H" | sort -V | cut -f3- | $MOSES/scripts/recaser/detruecase.perl | $MOSES/scripts/tokenizer/detokenizer.perl -l {trg} > eval_ready_hyps
$ cat eval_ready_hyps | sacrebleu target_test_file
The BLEU score is a common metric for automated evaluation of translation, but it's not the best one. There are many other metrics you can use.
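The same evaluation can also be done from Python with the sacrebleu package, which implements chrF and TER among others. A sketch, using the file names from the commands above:

import sacrebleu

hypotheses = [line.strip() for line in open("eval_ready_hyps", encoding="utf-8")]
references = [line.strip() for line in open("target_test_file", encoding="utf-8")]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])   # a single reference set
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}, chrF = {chrf.score:.1f}")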
Courses and tutorials:
Books:
(Blog)posts:
https://github.com/neubig/nmt-tips
https://jalammar.github.io/illustrated-transformer/
Under construction. Ask me questions, I will add them here.