# Deep Transformers with Latent Depth (Li et al., 2020)

[https://arxiv.org/abs/2009.13102](https://arxiv.org/abs/2009.13102)

## Introduction

We present a probabilistic framework to automatically learn which layer(s) to use by learning the posterior distributions of layer selection. As an extension of this framework, we propose a novel method to train one shared Transformer network for multilingual machine translation with different layer selection posteriors for each language pair.
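
To make the idea concrete, below is a minimal, illustrative sketch of how such a layer-selection posterior can be relaxed for gradient-based training: each layer keeps a learnable selection logit, a soft gate in (0, 1) is sampled with the Gumbel-sigmoid trick, and each layer's output is interpolated with its input according to the gate. In the multilingual setting, each language pair would hold its own logits over the shared stack. This is a simplified sketch, not the fairseq implementation; `LatentLayerGate` and `forward_with_latent_depth` are hypothetical names.

```python
import torch
import torch.nn as nn


class LatentLayerGate(nn.Module):
    """Illustrative relaxed Bernoulli gate over layers (not fairseq's code)."""

    def __init__(self, num_layers: int, temperature: float = 1.0):
        super().__init__()
        # One learnable selection logit per layer.
        self.logits = nn.Parameter(torch.zeros(num_layers))
        self.temperature = temperature

    def sample(self) -> torch.Tensor:
        # Gumbel-sigmoid (binary concrete): add logistic noise, then squash,
        # giving a differentiable soft sample z_i in (0, 1) per layer.
        u = torch.rand_like(self.logits)
        logistic_noise = torch.log(u) - torch.log1p(-u)
        return torch.sigmoid((self.logits + logistic_noise) / self.temperature)


def forward_with_latent_depth(layers, x, gate):
    z = gate.sample()
    for i, layer in enumerate(layers):
        # z[i] ~ 1 applies the layer; z[i] ~ 0 skips it (residual pass-through).
        x = z[i] * layer(x) + (1 - z[i]) * x
    return x
```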
## Training a multilingual model with latent depth

Below is an example of training with latent depth in the decoder for one-to-many (O2M) related languages. We use the same preprocessed (numberized and binarized) TED8 dataset as in [Balancing Training for Multilingual Neural Machine Translation (Wang et al., 2020)](https://github.com/cindyxinyiwang/multiDDS), which can be generated with [the script](https://github.com/cindyxinyiwang/multiDDS/blob/multiDDS/util_scripts/prepare_multilingual_data.sh) provided by the authors.
```bash
lang_pairs_str="eng-aze,eng-bel,eng-ces,eng-glg,eng-por,eng-rus,eng-slk,eng-tur"
databin_dir=<path to binarized data>

fairseq-train ${databin_dir} \
  --user-dir examples/latent_depth/latent_depth_src \
  --lang-pairs "${lang_pairs_str}" \
  --arch multilingual_transformer_iwslt_de_en \
  --task multilingual_translation_latent_depth \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --share-encoders \
  --share-decoders \
  --decoder-langtok \
  --share-decoder-input-output-embed \
  --dropout 0.3 --attention-dropout 0.3 \
  --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
  --lr-scheduler inverse_sqrt --stop-min-lr 1e-9 --warmup-init-lr 1e-7 --warmup-updates 8000 \
  --max-tokens 4096 --update-freq 1 \
  --lr 0.0015 \
  --clip-norm 1.0 \
  --seed 2 \
  --ddp-backend=legacy_ddp \
  --encoder-layers 12 \
  --decoder-layers 24 \
  --decoder-latent-layer \
  --sparsity-weight 0.1 \
  --anneal-updates 5000 \
  --soft-update 500 \
  --target-layers 12 \
  --share-weight 0.1
```
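
The last block of flags configures the latent-depth machinery. Roughly: the decoder is built with 24 shared layers, `--decoder-latent-layer` turns on per-language-pair layer selection in the decoder, `--target-layers` sets the desired expected depth, `--sparsity-weight` scales the regularizer pulling the expected depth toward that target, `--anneal-updates` and `--soft-update` control the annealing schedule from soft to increasingly discrete selections, and `--share-weight` scales a term that encourages language pairs to share layers. As an illustration of the budget-style idea only (the exact regularizer in the fairseq code may differ), such a penalty could look like:

```python
import torch


def depth_budget_penalty(select_probs: torch.Tensor,
                         target_layers: int,
                         sparsity_weight: float) -> torch.Tensor:
    """Illustrative sketch: penalize the gap between the expected number
    of selected layers and the target depth. Not fairseq's exact loss."""
    expected_depth = select_probs.sum()  # sum of per-layer selection probabilities
    return sparsity_weight * (expected_depth - target_layers) ** 2
```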
## Inference command

```bash
lang_pairs_str="eng-aze,eng-bel,eng-ces,eng-glg,eng-por,eng-rus,eng-slk,eng-tur"
databin_dir=<path to binarized data>
model_path=<path to checkpoint>
src_lang=<source language to translate from>
tgt_lang=<target language to translate to>
gen_data=<name of data split, e.g. valid, test, etc>

# --task and --decoder-latent-layer must match the settings used at training time.
fairseq-generate ${databin_dir} \
  --path ${model_path} \
  --task multilingual_translation_latent_depth \
  --decoder-latent-layer \
  --lang-pairs "${lang_pairs_str}" \
  -s ${src_lang} -t ${tgt_lang} \
  --gen-subset $gen_data \
  --scoring sacrebleu \
  --remove-bpe 'sentencepiece' \
  --lenpen 1.0 \
  --beam 5 \
  --decoder-langtok \
  --max-tokens 4096
```
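
With `--scoring sacrebleu`, BLEU is printed at the end of generation. To score offline instead, the generate log can be parsed directly: post-processed hypotheses appear on `D-` lines and references on `T-` lines of the fairseq-generate output. Below is a minimal sketch using the sacrebleu Python package (`gen.out` is a hypothetical path for the redirected generation log):

```python
import sacrebleu

hyps, refs = {}, {}
with open("gen.out") as f:  # fairseq-generate output redirected to a file
    for line in f:
        if line.startswith(("D-", "T-")):
            tag, rest = line.rstrip("\n").split("\t", 1)
            idx = int(tag[2:])
            if tag[0] == "D":
                # D-<id> <tab> <score> <tab> <detokenized hypothesis>
                hyps[idx] = rest.split("\t", 1)[1]
            else:
                # T-<id> <tab> <reference>
                refs[idx] = rest

order = sorted(refs)
bleu = sacrebleu.corpus_bleu([hyps[i] for i in order],
                             [[refs[i] for i in order]])
print(bleu.score)
```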
## Citation

```bibtex
@article{li2020deep,
  title={Deep Transformers with Latent Depth},
  author={Li, Xian and Stickland, Asa Cooper and Tang, Yuqing and Kong, Xiang},
  journal={arXiv preprint arXiv:2009.13102},
  year={2020}
}
```