MICWEN PRO

cesear64

AI & ML interests

None yet

Recent Activity

updated a model about 8 hours ago

MEYNG/nllb-sango-finetuned-600m

replied to their post about 18 hours ago


published an article about 18 hours ago

Scaling Zero-Resource Vocabulary: A Data Pipeline for Sango


Organizations

updated a model about 8 hours ago

MEYNG/nllb-sango-finetuned-600m

Translation • 0.6B • Updated about 8 hours ago • 1k

replied to their post about 18 hours ago

Exactly right on the bootstrapping loop — that's precisely the progression we're running.

One clarification on the mechanism: the model has seen some Sango during pretraining (it appears in Common Crawl), but not enough to produce coherent translations cold. The vocabulary injection doesn't teach the language from scratch; it gives the model enough anchoring signal to activate what it learned only weakly. The grammar rules and orthography notes handle the parts pretraining didn't cover reliably (tonal distinctions, diacritics, Sango-specific syntax).
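To make the injection step concrete, here is a minimal sketch of how such a prompt could be assembled. It is not the code from the blog post: `LexiconEntry`, `select_entries`, `build_prompt`, and the prompt wording are all hypothetical, and the lexicon values in any example should be treated as placeholders rather than entries from the published dataset.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconEntry:
    """One verified pair (hypothetical schema, not the dataset's actual columns)."""
    french: str
    sango: str


def select_entries(source_text: str, lexicon: list[LexiconEntry]) -> list[LexiconEntry]:
    """Return lexicon entries whose French headword appears as a token in the source.

    Simplification: only single-word headwords are matched; multi-word
    expressions would need phrase matching.
    """
    tokens = set(re.findall(r"\w+", source_text.lower()))
    return [entry for entry in lexicon if entry.french.lower() in tokens]


def build_prompt(source_text: str, lexicon: list[LexiconEntry], grammar_notes: list[str]) -> str:
    """Assemble a prompt that injects matched vocabulary plus grammar/orthography notes."""
    matched = select_entries(source_text, lexicon)
    vocab_lines = "\n".join(f"- {e.french} = {e.sango}" for e in matched)
    note_lines = "\n".join(f"- {note}" for note in grammar_notes)
    return (
        "Translate the French text into Sango.\n"
        f"Verified vocabulary:\n{vocab_lines}\n"
        f"Grammar and orthography notes:\n{note_lines}\n"
        f"French: {source_text}\n"
        "Sango:"
    )
```

Only the entries relevant to the input are injected, which keeps the prompt short; the resulting string would then be sent to whichever LLM is being used.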

And yes, the loop you're describing is live: vocabulary-augmented outputs → native-speaker verification → parallel corpus → fine-tuned NMT model. We just published BENCH-001 results on the fine-tune: +5.70 BLEU over baseline on French→Sango, +9.10 on Sango→French. The vocabulary-augmented prompting approach (BLEU 2.92 on the same task, zero fine-tuning) is the floor; the fine-tune is what you get once the dataset is big enough.
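The verification-to-corpus step of that loop can be sketched roughly as follows. This is an illustration under assumptions, not the published pipeline: `Candidate`, `build_parallel_corpus`, and the `verify` callback (standing in for a native speaker's accept/reject decision) are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Candidate:
    """A model-produced translation awaiting review (hypothetical structure)."""
    source: str
    translation: str


def build_parallel_corpus(
    candidates: Iterable[Candidate],
    verify: Callable[[Candidate], bool],
) -> list[tuple[str, str]]:
    """Keep only verified, non-empty, de-duplicated (source, translation) pairs."""
    corpus: list[tuple[str, str]] = []
    seen: set[tuple[str, str]] = set()
    for cand in candidates:
        pair = (cand.source.strip(), cand.translation.strip())
        # Skip empty sides and exact duplicates before asking for verification.
        if not pair[0] or not pair[1] or pair in seen:
            continue
        if verify(cand):  # stand-in for the native-speaker check
            seen.add(pair)
            corpus.append(pair)
    return corpus
```

The returned pairs are what a fine-tuning run would consume; rejected candidates never enter the corpus.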

The data pipeline post documenting that second step just went up here: https://huggingface.co/blog/MEYNG/sango-vocabulary-pipeline

The interesting open question is where the ceiling is for a 600M-parameter model on a language with ~5M speakers and sparse digitized text. We're nowhere near it yet.

published an article about 18 hours ago

Article

Scaling Zero-Resource Vocabulary: A Data Pipeline for Sango

MEYNG • about 18 hours ago

updated a model 3 days ago

MEYNG/nllb-sango-finetuned-600m-3ep

Updated 3 days ago • 559

published a model 3 days ago

MEYNG/nllb-sango-finetuned-600m-3ep

Updated 3 days ago • 559

published a model 4 days ago

MEYNG/nllb-sango-finetuned-600m

Translation • 0.6B • Updated about 8 hours ago • 1k

posted an update 8 days ago

Post

4098

Just published: how we built production Sango (Central African Republic) translation without fine-tuning, parallel corpus, or training compute.

The method — vocabulary-augmented prompting with a 581-entry native-speaker-verified lexicon — generalizes to any of the ~2,000 African languages at the same data-poverty level. Recipe, dataset, and code template all included.

📄 Blog: https://huggingface.co/blog/MEYNG/sangoai
📦 Dataset: https://huggingface.co/datasets/MEYNG/sango-vocabulary

We would especially value feedback from anyone working on other low-resource African languages. Ewondo, Lingala, and Wolof are next on our roadmap.

2 replies

published an article 9 days ago

Article

Vocabulary-Augmented Prompting for Sango — Production African Language AI Without a Parallel Corpus

MEYNG • 9 days ago • 2

published an article 16 days ago

Article

Vocabulary-Augmented Prompting for Sango — Production African Language AI Without a Parallel Corpus

MEYNG • 16 days ago

updated a dataset 18 days ago

MEYNG/sango-vocabulary

Viewer • Updated 18 days ago • 971 • 141 • 1

updated a Space about 2 months ago

MEYNG

🌍

published a Space about 2 months ago

MEYNG

🌍

published a dataset about 2 months ago

MEYNG/sango-vocabulary

Viewer • Updated 18 days ago • 971 • 141 • 1
