Neural Machine Translation for EN-MS and MS-EN
9 October 2022 | Open Source
- Neural Machine Translation for EN-MS and MS-EN
- Introduction
- Available NLLB models
- Problem with NLLB
- Good thing about NLLB
- How we fine tuned baseline T5 multitask models
- EN-MS translation
- NLLB benchmarks
- Google Translate benchmarks
- Mesolitica T5
- How to translate using Mesolitica T5
- MS-EN translation
- NLLB benchmarks
- Google Translate benchmarks
- Mesolitica T5
- How to translate using Mesolitica T5
- Go beyond
Introduction
Neural Machine Translation, or NMT, is taking over the world for automatic translation tasks. Google Translate, Facebook and other big companies are using NMT to close the language gap with their multinational users.
Recently, META AI released No Language Left Behind, or NLLB, an NMT model that is able to translate up to 200 languages, read more about NLLB at https://ai.facebook.com/research/no-language-left-behind/
The good thing about NLLB is that the models are 100% open sourced, from 600M parameters up to 54.5B parameters, https://github.com/facebookresearch/fairseq/tree/nllb#open-sourced-models-and-community-integrations.
Available NLLB models
For an easier interface to run NLLB, you can try it using the HuggingFace Transformers library, but only the 600M to 3.3B parameter checkpoints are available there.
- 3.3B parameters, model size 17.58 GB, https://huggingface.co/facebook/nllb-200-3.3B
- distilled 1.3B parameters, model size 5.48 GB, https://huggingface.co/facebook/nllb-200-distilled-1.3B
- 1.3B parameters, model size 5.48 GB, https://huggingface.co/facebook/nllb-200-1.3B
- distilled 600M parameters, model size 2.46 GB, https://huggingface.co/facebook/nllb-200-distilled-600M
All NLLB models are able to translate EN-MS and MS-EN by simply setting the source and target language codes. You can check an example of how to use HuggingFace Transformers for NLLB at https://huggingface.co/spaces/Geonmo/nllb-translation-demo
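For example, a minimal sketch of the Transformers interface using the distilled 600M checkpoint; the FLORES-200 codes eng_Latn and zsm_Latn (Standard Malay) and the generation settings here are our assumptions, not an official recipe,
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# load the smallest NLLB checkpoint, src_lang tells the tokenizer the input language
tokenizer = AutoTokenizer.from_pretrained('facebook/nllb-200-distilled-600M', src_lang = 'eng_Latn')
model = AutoModelForSeq2SeqLM.from_pretrained('facebook/nllb-200-distilled-600M')
string = 'Neural machine translation is taking over the world.'
inputs = tokenizer(string, return_tensors = 'pt')
# force the decoder to start with the target language token, zsm_Latn is Standard Malay in FLORES-200
outputs = model.generate(**inputs, forced_bos_token_id = tokenizer.lang_code_to_id['zsm_Latn'], max_length = 100)
print(tokenizer.batch_decode(outputs, skip_special_tokens = True)[0])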
Problem with NLLB
The sizes are huge, the smallest model is 2.46 GB, and most Malaysians only want to translate EN-MS and MS-EN; being able to translate up to 200 languages is unnecessary for the Malaysian daily context.
Good thing about NLLB
The good thing about NLLB is that they also released the dataset used to train those 200 languages. META AI mined bitexts in 1613 directions and released the steps and data at https://github.com/facebookresearch/LASER/tree/main/data/nllb200
But if you want to download the dataset with the actual texts included, AllenAI did a great job populating the texts, https://huggingface.co/datasets/allenai/nllb
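For example, a minimal sketch using the HuggingFace datasets library; the pair config name eng_Latn-zsm_Latn is our assumption, check the dataset card for the exact name,
from datasets import load_dataset
# assumption: eng_Latn-zsm_Latn is the English-Standard Malay pair, check https://huggingface.co/datasets/allenai/nllb for the exact config name
dataset = load_dataset('allenai/nllb', 'eng_Latn-zsm_Latn', split = 'train')
print(dataset[0])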
Because Mesolitica already has baseline T5 multitask models,
- Base size, 892 MB, https://huggingface.co/mesolitica/t5-base-standard-bahasa-cased
- Small size, 242 MB, https://huggingface.co/mesolitica/t5-small-standard-bahasa-cased
- Tiny size, 139 MB, https://huggingface.co/mesolitica/t5-tiny-standard-bahasa-cased
- Super Tiny size, 50.7 MB, https://huggingface.co/mesolitica/t5-super-tiny-bahasa-cased
- Super Super Tiny size, 23.3 MB, https://huggingface.co/mesolitica/t5-super-super-tiny-standard-bahasa-cased
- 3x Super Tiny size, 9.68 MB, https://huggingface.co/mesolitica/t5-3x-super-tiny-standard-bahasa-cased
We can use these models to train our own EN-MS and MS-EN NMT models on the NLLB dataset combined with our own dataset.
How we fine tuned baseline T5 multitask models
- Download the dataset and filter out texts longer than 256 words (this reduces training computation cost and still covers ~81% of the dataset), then do simple text preparation (see the preparation sketch after these steps),
{"translation": {"src": "the human rights violations inflicted by the Zionists on mostly", "tgt": "kezaliman yang tidak berperikemanusiaan tentera-tentera zionis terutamanya", "prefix": "terjemah Inggeris ke Melayu: "}}
{"translation": {"src": "kezaliman yang tidak berperikemanusiaan tentera-tentera zionis terutamanya", "tgt": "the human rights violations inflicted by the Zionists on mostly", "prefix": "terjemah Melayu ke Inggeris: "}}
From here we can see that we have 2 prefixes,
terjemah Inggeris ke Melayu:
terjemah Melayu ke Inggeris:
If we want to translate EN → MS, we need to use the prefix terjemah Inggeris ke Melayu:, and if we want to translate MS → EN, we need to use the prefix terjemah Melayu ke Inggeris:.
- Run the fine-tuning script, you can get it from https://github.com/huseinzol05/malaya/tree/master/session/translation/hf-t5
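A minimal sketch of the preparation step from the first bullet above; prepare, pairs and the output file name are hypothetical names for illustration, the actual pipeline may differ,
import json

def prepare(pairs, output_file = 'train.json', max_words = 256):
    # pairs is a list of (english, malay) string tuples
    with open(output_file, 'w') as fopen:
        for en, ms in pairs:
            # skip pairs where either side is longer than 256 words to reduce training cost
            if len(en.split()) > max_words or len(ms.split()) > max_words:
                continue
            # one line per direction, each with its own prefix
            en_ms = {'translation': {'src': en, 'tgt': ms, 'prefix': 'terjemah Inggeris ke Melayu: '}}
            ms_en = {'translation': {'src': ms, 'tgt': en, 'prefix': 'terjemah Melayu ke Inggeris: '}}
            fopen.write(json.dumps(en_ms) + '\n')
            fopen.write(json.dumps(ms_en) + '\n')

prepare([('hello world', 'helo dunia')])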
EN-MS translation
NLLB benchmarks
Meta released their benchmarks based on the dev set they provided, you can get it at https://github.com/facebookresearch/fairseq/tree/nllb#multilingual-translation-models. The scores below are chrF2++,
- NLLB-200, MoE, 54.5B parameters, 66.5
- NLLB-200, Dense, 3.3B parameters, 17.58 GB, 66.3
- NLLB-200, Dense, 1.3B parameters, 5.48 GB, 65.2
- NLLB-200-Distilled, Dense, 1.3B parameters, 5.48 GB, 65.5
- NLLB-200-Distilled, Dense, 600M parameters, 2.46 GB, 63.5
Google Translate benchmarks
We use the https://github.com/ssut/py-googletrans library as a Python interface to Google Translate. You can check the benchmark notebook at https://github.com/huseinzol05/malay-dataset/blob/master/translation/malay-english/flores200-en-ms-google-translate.ipynb, and the final benchmarks,
{'name': 'BLEU',
'score': 39.12728212969207,
'_mean': -1.0,
'_ci': -1.0,
'_verbose': '71.1/47.2/32.7/22.8 (BP = 0.984 ratio = 0.984 hyp_len = 21679 ref_len = 22027)',
'bp': 0.9840757522087613,
'counts': [15406, 9770, 6435, 4256],
'totals': [21679, 20682, 19685, 18688],
'sys_len': 21679,
'ref_len': 22027,
'precisions': [71.0641634761751,
47.2391451503723,
32.68986537973076,
22.773972602739725],
'prec_str': '71.1/47.2/32.7/22.8',
'ratio': 0.9842012076088437}
chrF2++ = 64.45
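The BLEU dictionary above follows the sacrebleu output format. A minimal sketch of how such scores can be computed; hypotheses and references here are placeholder lists, the exact setup is in the linked notebook,
import sacrebleu
# hypotheses: translation outputs, references: FLORES-200 reference sentences (placeholders here)
hypotheses = ['saya suka kucing itu']
references = ['saya sayang kucing itu']
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
# word_order = 2 gives chrF++ (reported as chrF2++ with the default beta = 2)
chrf = sacrebleu.corpus_chrf(hypotheses, [references], word_order = 2)
print(bleu.score, chrf.score)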
Mesolitica T5
- Base size, 892 MB, https://huggingface.co/mesolitica/t5-base-standard-bahasa-cased, 67.60
- Small size, 242 MB, https://huggingface.co/mesolitica/t5-small-standard-bahasa-cased, 67.43
- Tiny size, 139 MB, https://huggingface.co/mesolitica/t5-tiny-standard-bahasa-cased, 65.70
- Super Tiny size, 50.7 MB, https://huggingface.co/mesolitica/t5-super-tiny-bahasa-cased, 64.03
- Super Super Tiny size, 23.3 MB, https://huggingface.co/mesolitica/t5-super-super-tiny-standard-bahasa-cased, 61.89
The Base and Small models score slightly better than even the largest META NLLB model and Google Translate!
How to translate using Mesolitica T5
We use https://huggingface.co/ to store the models, and the easiest interface is the HuggingFace Transformers library, https://huggingface.co/docs/transformers/index
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# load the Small model, any of the sizes listed above can be swapped in
tokenizer = AutoTokenizer.from_pretrained('mesolitica/t5-small-standard-bahasa-cased')
model = AutoModelForSeq2SeqLM.from_pretrained('mesolitica/t5-small-standard-bahasa-cased')
string = 'Hi guys! I noticed semalam & harini dah ramai yang dapat cookies ni kan. So harini i nak share some post mortem of our first batch:'
# EN -> MS uses the `terjemah Inggeris ke Melayu: ` prefix
input_ids = tokenizer.encode(f'terjemah Inggeris ke Melayu: {string}', return_tensors = 'pt')
outputs = model.generate(input_ids, max_length = 100)
print(tokenizer.decode(outputs[0]))
-> <pad> Hai lelaki! Saya perhatikan semalam & harini dah ramai yang dapat kuki ni kan. Jadi harini saya nak kongsi beberapa bedah siasat kumpulan pertama kami:</s>
Simple as that, and it can also read mixed-language input and translate it into the output language that we want.
You can also use beam search decoding, top-k sampling text generation and much more, read more at https://huggingface.co/blog/how-to-generate
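For example, using the same model and input_ids as above; the generation parameters here are just illustrative values,
# beam search decoding
outputs = model.generate(input_ids, max_length = 100, num_beams = 5, early_stopping = True)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))
# top-k sampling
outputs = model.generate(input_ids, max_length = 100, do_sample = True, top_k = 50)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))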
MS-EN translation
NLLB benchmarks
Meta released their benchmarks based on the dev set they provided, you can get it at https://github.com/facebookresearch/fairseq/tree/nllb#multilingual-translation-models. The scores below are chrF2++,
- NLLB-200, MoE, 54.5B parameters, 68
- NLLB-200, Dense, 3.3B parameters, 17.58 GB, 67.8
- NLLB-200, Dense, 1.3B parameters, 5.48 GB, 66.4
- NLLB-200-Distilled, Dense, 1.3B parameters, 5.48 GB, 66.2
- NLLB-200-Distilled, Dense, 600M parameters, 2.46 GB, 64.3
Google Translate benchmarks
We use the https://github.com/ssut/py-googletrans library as a Python interface to Google Translate. You can check the benchmark notebook at https://github.com/huseinzol05/malay-dataset/blob/master/translation/malay-english/flores200-ms-en-google-translate.ipynb, and the final benchmarks,
{'name': 'BLEU',
'score': 36.152220848177286,
'_mean': -1.0,
'_ci': -1.0,
'_verbose': '68.2/43.5/29.7/20.5 (BP = 0.986 ratio = 0.986 hyp_len = 23243 ref_len = 23570)',
'bp': 0.9860297505310752,
'counts': [15841, 9688, 6318, 4147],
'totals': [23243, 22246, 21249, 20252],
'sys_len': 23243,
'ref_len': 23570,
'precisions': [68.15385277287785,
43.54940213971051,
29.733163913595934,
20.476989926920798],
'prec_str': '68.2/43.5/29.7/20.5',
'ratio': 0.986126431904964}
chrF2++ = 60.27
Mesolitica T5
- Base size, 892 MB, https://huggingface.co/mesolitica/t5-base-standard-bahasa-cased, 65.44
- Small size, 242 MB, https://huggingface.co/mesolitica/t5-small-standard-bahasa-cased, 64.67
- Tiny size, 139 MB, https://huggingface.co/mesolitica/t5-tiny-standard-bahasa-cased, 61.29
- Super Tiny size, 50.7 MB, https://huggingface.co/mesolitica/t5-super-tiny-bahasa-cased, 59.18
- Super Super Tiny size, 23.3 MB, https://huggingface.co/mesolitica/t5-super-super-tiny-standard-bahasa-cased, 56.46
The Base, Small and Tiny models are better than Google Translate, but score lower than the larger META NLLB models.
How to translate using Mesolitica T5
We use https://huggingface.co/ to store the models, and the easiest interface is the HuggingFace Transformers library, https://huggingface.co/docs/transformers/index
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# load the Small model, any of the sizes listed above can be swapped in
tokenizer = AutoTokenizer.from_pretrained('mesolitica/t5-small-standard-bahasa-cased')
model = AutoModelForSeq2SeqLM.from_pretrained('mesolitica/t5-small-standard-bahasa-cased')
string = 'Hi guys! I noticed semalam & harini dah ramai yang dapat cookies ni kan. So harini i nak share some post mortem of our first batch:'
# MS -> EN uses the `terjemah Melayu ke Inggeris: ` prefix
input_ids = tokenizer.encode(f'terjemah Melayu ke Inggeris: {string}', return_tensors = 'pt')
outputs = model.generate(input_ids, max_length = 100)
print(tokenizer.decode(outputs[0]))
-> <pad> Hi guys! I noticed yesterday & today many got cookies. So today I want to share some post mortem of our first batch:</s>
Simple as that, and it can also read mixed-language input and translate it into the output language that we want.
Go beyond
Malaysians use short forms, slang and abbreviations on local social media, so translating local texts end-to-end is a very hard task, but that is not going to stop us from releasing better models. Stay tuned for our updates!