Disclaimer: this post is for my bachelor and master students at UvA, to give them an introduction to PRACTICAL NMT from top to bottom. There is a gap between theory and practice that is not always addressed by books.
Since you are a student in an informatics or AI programme, let's assume you know something about machine learning, or maybe even machine translation. You've heard about the Transformer. You probably read The Illustrated Transformer blog post. Maybe you even took an NLP course (if not, see the resources). You know python and pytorch (I hope). And now you have the task of training an NMT system. You can of course write all the code yourself, but that's tedious. So instead people tend to use ML frameworks such as fairseq, hugging face transformers, or joeynmt. Here you can find a good overview of the frameworks.
We will look into each part and see examples with the fairseq framework. Why frameworks? Training/evaluation loops are common to seq2seq models, and frameworks deal with them, with data loading, and with many more things. It's not worth spending time writing your own code for routine tasks when you can focus on the research part.
To make things easier, we can split the NMT pipeline into 3 parts and look at them separately:
1. Data and pre-processing
2. Training
3. Generation and evaluation
One day you wake up and think “I need to train the best NMT model ever”. For that, you first need parallel data.
What is parallel data? It is a set of sentences in language X (the source) paired with the same sentences in language Y (the target). In practice you need 2 files: source data and target data. The source file contains only sentences in the source language, one sentence per line. Each line of the target file contains the same sentence in the target language (i.e., corresponding sentences have the same line number).
Hello.
Good morning.
i used to be an adventurer like you, then i took an arrow in the knee.
Hallo.
Goede morgen.
Ik was vroeger een avonturier zoals jij toen nam ik een pijl in de knie
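If you build these files yourself, it is worth checking that they really are line-aligned. A minimal sketch, assuming hypothetical file names train.nl and train.en:

# sanity check for parallel data: both files must have the same number of lines,
# and line i of the source file must correspond to line i of the target file
from itertools import zip_longest

def check_parallel(src_path, tgt_path):
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for i, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files are not aligned: one of them ends before line {i}")
            if not s.strip() or not t.strip():
                print(f"warning: empty sentence on line {i}")

check_parallel("train.nl", "train.en")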
Where to find parallel data? There are several publicly available NMT datasets. The main advantage of such curated datasets is that the validation and test sets are kept the same, so you can compare your results with other work.
Sometimes the data comes as one csv file, sometimes as xml files. Then you need to parse them. I hope you know some bash. If not, just google and copy from stackoverflow.
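If bash is not your thing, a few lines of Python do the job as well. A sketch for the csv/tsv case; the file names and the "source<TAB>target" column layout are assumptions, adjust them to your data:

# split a tab-separated file with "source<TAB>target" rows into the two
# plain-text files that NMT tooling expects (one sentence per line)
import csv

with open("raw_data.tsv", encoding="utf-8", newline="") as f, \
     open("train.nl", "w", encoding="utf-8") as src_out, \
     open("train.en", "w", encoding="utf-8") as tgt_out:
    for src, tgt in csv.reader(f, delimiter="\t"):
        src_out.write(src.strip() + "\n")
        tgt_out.write(tgt.strip() + "\n")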
Surprisingly (not), we cannot feed raw text into a neural network. Neural networks work with numbers, not words. So pre-processing aims to map text into (some) numerical representation.
Pre-processing usually includes:
- normalization and tokenization
- truecasing (or lowercasing)
- sub-word splitting
- building a vocabulary
There are also some framework-specific modifications, but let's talk about those later.
One example of a pre-processing pipeline:
Normalization and tokenization
Tools: moses (or its Python port sacremoses), hugging face tokenizers, etc.
i used to be an adventurer like you , then i took an arrow in the knee .
You see? The main difference is that punctuation is now separated from the words, so each token can be assigned an individual index. After truecasing, the sentence looks like this:
I used to be an adventurer like you , then I took an arrow in the knee .
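If you prefer Python over the original moses perl scripts, here is a minimal sketch with sacremoses; lowercasing stands in for proper truecasing, just to reproduce the example above:

# normalization + tokenization with sacremoses; .lower() stands in for truecasing
from sacremoses import MosesPunctNormalizer, MosesTokenizer

mpn = MosesPunctNormalizer(lang="en")
mt = MosesTokenizer(lang="en")

line = "I used to be an adventurer like you, then I took an arrow in the knee."
normalized = mpn.normalize(line)                   # clean up punctuation and whitespace
tokens = mt.tokenize(normalized, return_str=True)  # "... like you , then I took an arrow in the knee ."
print(tokens.lower())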
Vocabulary
The vocabulary is important for an NMT system. It defines how your text will be encoded and how the generation process will go. With a larger vocabulary training is slower, BUT you can generate a bigger variety of tokens. It's always about compromises and best practices. Example of vocabulary building:
# https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/get_vocab.py#L40
from collections import Counter

def get_vocab(train_file, vocab_file):
    # count token frequencies over the (tokenized) training file
    c = Counter()
    for line in train_file:
        for word in line.strip('\r\n ').split(' '):
            if word:
                c[word] += 1
    # write "token frequency" pairs, most frequent first
    for key, f in sorted(c.items(), key=lambda x: x[1], reverse=True):
        vocab_file.write(key + " " + str(f) + "\n")
The most frequent tokens usually get the lowest indices. Each word can then be encoded as its index in the vocabulary.
I used to be an adventurer like you , then I took an arrow in the knee .
13 300 9 40 45 1 50 16 5 85 13 401 45 1 14 7 1 6
If a word is not in the vocabulary, it is encoded with the unk token (index 1 in the example). The same procedure applies to the target language.
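As a toy illustration of that mapping (the indices are taken from the example above; the encode helper itself is hypothetical):

# encode tokens as vocabulary indices; anything unseen becomes the unk index (1 here)
def encode(tokens, word2idx, unk_idx=1):
    return [word2idx.get(tok, unk_idx) for tok in tokens]

word2idx = {"I": 13, "used": 300, "to": 9, "be": 40, "an": 45, "like": 50,
            "you": 16, ",": 5, "then": 85, "took": 401, "in": 14, "the": 7, ".": 6}
sentence = "I used to be an adventurer like you , then I took an arrow in the knee ."
print(encode(sentence.split(), word2idx))
# -> [13, 300, 9, 40, 45, 1, 50, 16, 5, 85, 13, 401, 45, 1, 14, 7, 1, 6]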
Sub-word splitting
Tools: sentencepiece (spm), subword-nmt (bpe)
As you can see from the example above, a word-level vocabulary results in multiple unknowns, which are called out-of-vocabulary (OOV) words in the NLP literature. To reduce the number of OOVs, sub-word splitting was suggested in Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., ACL 2016).
You can train your spm/bpe model on the training data or on external data. You can also train bpe jointly on source and target.
I used to be an ad@@ vent@@ ur@@ er like you , then I took an ar@@ row in the k@@ nee .
13 300 9 40 45 271 1458 716 187 50 16 5 85 13 401 45 207 1632 14 7 255 4811 6
Amazing! With sub-words we do not have any unknowns!
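A minimal sketch of training and applying a sentencepiece model in Python; the file names and vocab size are assumptions, and note that sentencepiece marks word boundaries with "▁" instead of the "@@" continuation marker that subword-nmt produced in the example above:

import sentencepiece as spm

# train a BPE model on the (tokenized) source-side training data;
# passing a list of files, e.g. ["train.nl", "train.en"], trains it jointly
spm.SentencePieceTrainer.train(
    input="train.nl-en.tok.en",
    model_prefix="spm_en",        # writes spm_en.model and spm_en.vocab
    vocab_size=8000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="spm_en.model")
pieces = sp.encode("I used to be an adventurer like you , then I took an arrow in the knee .",
                   out_type=str)
print(" ".join(pieces))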
In the end, your files can look like this:
my_awesome_nmt_data
├── test.nl-en.bpe.en
├── test.nl-en.bpe.nl
├── train.nl-en.bpe.en
├── train.nl-en.bpe.nl
├── valid.nl-en.bpe.en
└── valid.nl-en.bpe.nl
Framework-specific preprocessing
An additional step might be needed before we start training the model; it depends on what you use for training. Fairseq requires data in a binary format, which can be prepared via fairseq-preprocess:
$ fairseq-preprocess \
--source-lang nl --target-lang en \
--trainpref my_awesome_nmt_data/train.nl-en \
--validpref my_awesome_nmt_data/valid.nl-en \
--testpref my_awesome_nmt_data/test.nl-en \
--destdir data-bin/my_awesome_nmt_data_bin \
--thresholdtgt 0 --thresholdsrc 0 --workers 20
Fairseq will automatically build a vocabulary. You can use a joint vocabulary (meaning source and target share the same vocabulary, which contains tokens from both sides) by adding the --joined-dictionary parameter; sometimes it helps to boost performance.
You can also build your own vocabulary and pass the files to fairseq-preprocess using --srcdict and --tgtdict. But if you need a custom vocabulary, I trust you know what you are doing.
Before you start training, you need to make sure you have the data (see the previous step), a model, and a criterion implemented. Do you understand what your input and output are? Do you understand your model? Do you understand the criterion you optimize?
I have to make this part a bit more framework-specific to show examples, but I will also give some general tips on how to start with a framework.
When you use a framework, only the training parameters are your responsibility. Here is an example with fairseq.
Personally, I prefer using .yaml configs and [fairseq-hydra-train](https://github.com/facebookresearch/fairseq/blob/main/docs/hydra_integration.md) rather than CLI training, but you can do what suits you best.
Example config:
# @package _group_
task:
  _name: translation
  data: data-bin/my_awesome_nmt_data_bin
  source_lang: nl
  target_lang: en
  eval_bleu: true
  eval_bleu_args: '{"beam":1}'
  eval_bleu_detok: moses
  eval_bleu_remove_bpe: sentencepiece
  eval_bleu_print_samples: false
criterion:
  _name: label_smoothed_cross_entropy
  label_smoothing: 0.1
model:
  _name: transformer
  decoder:
    learned_pos: true
  encoder:
    learned_pos: true
  dropout: 0.3
  share_decoder_input_output_embed: true
optimizer:
  _name: adam
  adam_betas: (0.9,0.98)
lr_scheduler:
  _name: inverse_sqrt
  warmup_updates: 10000
  warmup_init_lr: 1e-07
dataset:
  max_tokens: 8000
  validate_interval_updates: 2000
optimization:
  lr: [0.0005]
  update_freq: [8]
  max_update: 50000
  stop_min_lr: 1e-09
checkpoint:
  no_epoch_checkpoints: true
  best_checkpoint_metric: bleu
  maximize_best_checkpoint_metric: true
  save_dir: my_awesome_model
common:
  wandb_project: ???
  log_format: simple
  log_interval: 100
Then you just run:
$ fairseq-hydra-train \
--config-dir /path/to/external/configs \
--config-name my_awesome_config.yaml
If you want to override a config parameter:
$ fairseq-hydra-train \
--config-dir /path/to/external/configs \
--config-name my_awesome_config.yaml common.wandb_project=MYPROJECT
If you want to add a new parameter:
$ fairseq-hydra-train \
--config-dir /path/to/external/configs \
--config-name my_awesome_config.yaml +common.new_param=paramvalue
How to tune parameters?
You can of course try different combinations, but there are MANY, so it's not feasible unless you have unlimited GPU resources. If we fix the data and the model architecture, there are two important parameters: the learning rate (with its warmup schedule) and the batch size. Other parameters can give a performance boost too, of course, but these two are most likely to affect training.
If you use a well-known dataset, you can look up the papers that report results on it and check their hyperparameters. E.g., you can use Papers with Code to sort papers by their test-set scores.
Please also refer to the tuning playbook for general tips on tuning.
Tips:
The most exciting part. Some tips on extending fairseq can be found here.
You have reached the world’s edge, none but devils play past here
At this point you have your model file. You want to generate translations for the source sentences of the test subset and evaluate them with an automated evaluation metric. Easy.
$ fairseq-generate data-bin/my_awesome_nmt_data_bin \
--path my_awesome_model/checkpoint_best.pt \
--batch-size 128 --beam 5 --remove-bpe > hypothesis
If your subword splitting is spm, use --remove-bpe sentencepiece.
Here comes the tricky part. You need to do post-processing, meaning the reverse of pre-processing, last step first: e.g., you first undo the truecasing, then the tokenization, and only then you assess your translation, e.g. with https://github.com/mjpost/sacrebleu
$ cat hypothesis | grep -P "^H" | sort -V | cut -f3- | $MOSES/scripts/recaser/detruecase.perl | $MOSES/scripts/tokenizer/detokenizer.perl -l {trg} > eval_ready_hyps
$ cat eval_ready_hyps | sacrebleu target_test_file
The BLEU score is a common metric for automated evaluation of translation, but it's not the best one. There are many other metrics you can use.
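The same evaluation can also be done from Python with the sacrebleu package, which implements chrF and TER among others. A sketch, using the file names from the commands above:

import sacrebleu

hypotheses = [line.strip() for line in open("eval_ready_hyps", encoding="utf-8")]
references = [line.strip() for line in open("target_test_file", encoding="utf-8")]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])   # a single reference set
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}, chrF = {chrf.score:.1f}")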
Courses and tutorials:
Books:
(Blog)posts:
https://github.com/neubig/nmt-tips
https://jalammar.github.io/illustrated-transformer/
Under construction. Ask me questions, I will add them here.