---
language:
- eng
- tig
tags:
- tokenizer
- machine-translation
- low-resource
- geez-script
license: mit
datasets:
- nllb
- opus
metrics:
- bleu
---

# English–Tigrinya Machine Translation & Tokenizer

### 📌 Conference
Accepted at the **3rd International Conference on Foundation and Large Language Models (FLLM2025)**
📍 25–28 November 2025 | Vienna, Austria

**Paper Title**: *Low-Resource English–Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks*

---
## 📝 Model Summary

This repository provides a **custom tokenizer** and a **fine-tuned MarianMT model** for **English ↔ Tigrinya machine translation**.
It is trained on the NLLB English–Tigrinya parallel data and evaluated on OPUS parallel corpora, with BLEU as the primary metric.

- **Languages:** English (eng), Tigrinya (tig)
- **Tokenizer:** SentencePiece, customized for Geez-script representation
- **Model:** MarianMT (multilingual transformer) fine-tuned for English–Tigrinya translation
- **License:** MIT

---
## 🔍 Model Details

### Tokenizer
- **Type**: SentencePiece-based subword tokenizer
- **Purpose**: Handles Geez-script-specific tokenization for Tigrinya
- **Training Data**: NLLB English–Tigrinya subset
- **Evaluation Data**: OPUS parallel corpus
### Translation Model
- **Base Model**: MarianMT
- **Frameworks**: Hugging Face Transformers, PyTorch
- **Task**: Bidirectional English ↔ Tigrinya MT

---
## ⚙️ Training Details

- **Training Dataset**: NLLB Parallel Corpus (English ↔ Tigrinya)
- **Testing Dataset**: OPUS Parallel Corpus
- **Epochs**: 3
- **Batch Size**: 8
- **Max Sequence Length**: 128 tokens
- **Learning Rate**: `1.44e-07` with decay
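The full training script is not published here; a minimal sketch of how the listed hyperparameters could be expressed with Hugging Face `Seq2SeqTrainingArguments`. The `output_dir` and the linear scheduler are assumptions ("with decay" in the card), and the 128-token limit is applied at tokenization time rather than in these arguments.

```python
from transformers import Seq2SeqTrainingArguments

# Hyperparameters taken from the card; output_dir is a hypothetical path.
training_args = Seq2SeqTrainingArguments(
    output_dir="marianmt-eng-tig",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=1.44e-07,
    lr_scheduler_type="linear",   # assumed form of the "with decay" schedule
    predict_with_generate=True,   # generate translations during evaluation
)
```

These arguments would then be passed to a `Seq2SeqTrainer` together with the model, tokenizer, and tokenized NLLB dataset.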
**Training Loss**
- Epoch 1: 0.443
- Epoch 2: 0.4077
- Epoch 3: 0.4379
- Final Loss: 0.4756

**Gradient Norms**
- Epoch 1: 1.14
- Epoch 2: 1.11
- Epoch 3: 1.06

**Performance**
- Training Time: ~12 hours (43,376.7 s)
- Speed: 96.7 samples/sec | 12.08 steps/sec

---
## 📊 Evaluation

- **Metric**: BLEU score
- **Evaluation Dataset**: OPUS parallel English–Tigrinya corpus

---
## 🚀 Usage

This model can be used directly for **English → Tigrinya** and **Tigrinya → English** translation.

### Example (Python)
```python
from transformers import MarianMTModel, MarianTokenizer

# Load the model and tokenizer
model_name = "Hailay/MachineT_TigEng"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Translate English → Tigrinya; max_length matches the 128-token training limit
english_text = "We must obey the Lord and leave them alone"
inputs = tokenizer(english_text, return_tensors="pt", padding=True,
                   truncation=True, max_length=128)
translated = model.generate(**inputs)
translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)

print("Translated text:", translated_text)
```
## 📌 Citation

If you use this model or tokenizer in your work, please cite:

```bibtex
@inproceedings{hailay2025lowres,
  title     = {Low-Resource English–Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks},
  author    = {Hailay Kidu and collaborators},
  booktitle = {Proceedings of the 3rd International Conference on Foundation and Large Language Models (FLLM2025)},
  year      = {2025},
  location  = {Vienna, Austria}
}
```