π£οΈ SALAMA-STT β Swahili Whisper ASR Model
Developer: AI4NNOV
Authors: AI4NNOV
Version: v1.0
License: Apache 2.0
Model Type: Automatic Speech Recognition (ASR)
Base Model: openai/whisper-small (fine-tuned for Swahili)
π Overview
SALAMA-STT (Speech-to-Text) is the first module of the SALAMA Framework β a modular end-to-end speech-to-speech AI system built for African languages.
This model is fine-tuned from OpenAI's Whisper-small architecture for Swahili speech recognition, improving performance on African accents and conversational speech.
The model converts Swahili audio input into accurate transcriptions and serves as the entry point for downstream LLM and TTS modules.
π§± Model Architecture
SALAMA-STT uses the Whisper-small transformer encoder-decoder architecture, fine-tuned end to end for low-resource Swahili transcription.
The model was fine-tuned on the Swahili subset of Mozilla Common Voice 17.0 to improve robustness to diverse accents and recording conditions; a minimal training sketch follows the hyperparameter table below.
| Parameter | Value |
|---|---|
| Base Model | openai/whisper-small |
| Fine-Tuning | Full model fine-tuning (fp16 precision) |
| Optimizer | AdamW |
| Learning Rate | 1e-5 |
| Batch Size | 16 |
| Epochs | 10 |
| Frameworks | Transformers + Datasets + TorchAudio |
| Languages | Swahili (sw), English (en) |
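For reference, a minimal fine-tuning sketch assuming the hyperparameters above (the output directory and dataset variables are placeholders; audio preprocessing and the data collator are omitted for brevity):

```python
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

# Load the base checkpoint and a processor configured for Swahili transcription
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="swahili", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Hyperparameters from the table above; AdamW is the Trainer's default optimizer
training_args = Seq2SeqTrainingArguments(
    output_dir="./salama-stt",        # placeholder output path
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    num_train_epochs=10,
    fp16=True,
    predict_with_generate=True,
)

# train_dataset / eval_dataset would hold the preprocessed Common Voice splits
# trainer = Seq2SeqTrainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     eval_dataset=eval_dataset,
#     tokenizer=processor,
# )
# trainer.train()
```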
π Dataset
| Dataset | Description | Purpose |
|---|---|---|
| mozilla-foundation/common_voice_17_0 | 20 hours of Swahili speech data | Supervised fine-tuning |
| Custom local Swahili recordings | Conversational + accent-rich data | Accent robustness |
| Common Voice validation split | 2.3 hours | Evaluation |
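Common Voice 17.0 is distributed via the Hugging Face Hub. A minimal loading sketch, assuming you have accepted the dataset's terms and are authenticated:

```python
from datasets import Audio, load_dataset

# "sw" is the Swahili configuration of Common Voice 17.0 (gated dataset:
# requires accepting the terms on the Hub and an authenticated token)
common_voice = load_dataset(
    "mozilla-foundation/common_voice_17_0", "sw", split="train"
)

# Whisper's feature extractor expects 16 kHz audio
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))

sample = common_voice[0]
print(sample["sentence"])               # reference transcription
print(sample["audio"]["array"].shape)   # resampled waveform
```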
π§ Model Capabilities
- Speech-to-text transcription in Swahili
- Recognition of African-accented Swahili
- Handles short and long-form audio
- Supports integration with SALAMA-LLM for full voice assistants
- Provides timestamped segment transcriptions (see the sketch below)
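A short sketch of the timestamp capability, using the `return_timestamps` option of the Transformers ASR pipeline (the audio file name is a placeholder):

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="EYEDOL/salama-stt",
    chunk_length_s=30,
)

# return_timestamps=True yields segment-level (start, end) tuples in seconds
result = asr("swahili_audio_sample.wav", return_timestamps=True)
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```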
π Evaluation Metrics
| Metric | Baseline (Whisper-small) | Fine-tuned (SALAMA-STT) | Improvement |
|---|---|---|---|
| WER (Word Error Rate) | 1.15 | 0.43 | π» 62% |
| CER (Character Error Rate) | 0.39 | 0.18 | π» 54% |
| Accuracy | 85.2% | 95.4% | +10.2 pts |
Evaluation was conducted on the 2.3-hour held-out Swahili validation split from Common Voice (see the dataset table above).
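WER and CER of the kind reported above can be computed with the `evaluate` library (backed by `jiwer`); the predictions and references below are placeholders:

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

references = ["karibu kwenye mfumo wa salama"]    # ground-truth transcripts
predictions = ["karibu kwenye mfumo ya salama"]   # model outputs

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```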
βοΈ Usage (Python Example)
Below is a quick example for Swahili speech transcription using this model:
```python
from transformers import pipeline

# Load the Swahili Whisper ASR pipeline
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="EYEDOL/salama-stt",
    chunk_length_s=30,
    device_map="auto",
)

# Example audio file (replace with your own recording)
audio_path = "swahili_audio_sample.wav"

# Transcribe the audio
result = asr_pipeline(audio_path)
print("🗣️ Transcription:")
print(result["text"])
```
Example output:

```
Karibu kwenye mfumo wa SALAMA unaosaidia kutambua na kuelewa sauti ya Kiswahili kwa usahihi mkubwa.
```
π Model Performance Summary
| Dataset | Metric | Score |
|---|---|---|
| Common Voice 17.0 (test) | WER | 0.43 |
| Common Voice 17.0 (test) | CER | 0.18 |
| Local Swahili Test Set | Accuracy | 95.4% |
β‘ Key Features
- ποΈ Accurate Swahili ASR trained on diverse voices
- π Adapted for African speech variations and dialects
- π§© Lightweight and compatible with SALAMA-LLM
- π Handles long-form recordings (β₯30s)
- π Fast inference optimized with FP16 precision
π« Limitations
- May misinterpret code-mixed (Swahili-English) speech
- Background noise and poor microphone quality reduce accuracy
- Domain-specific (medical/legal) terms may be transcribed inaccurately
- Performance may decline on non-native Swahili speakers
π Related Models
| Model | Description |
|---|---|
| EYEDOL/salama-llm | Swahili instruction-tuned LLM for reasoning and dialogue |
| EYEDOL/salama-tts | Swahili text-to-speech (VITS) model for natural speech synthesis |
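As an illustration of the full SALAMA loop, a hedged sketch chaining the three modules; the `text-generation` and `text-to-speech` pipeline tasks assumed for the LLM and TTS models are not confirmed by this card:

```python
from transformers import pipeline

# Assumed pipeline tasks for the companion models (not confirmed by this card)
stt = pipeline("automatic-speech-recognition", model="EYEDOL/salama-stt")
llm = pipeline("text-generation", model="EYEDOL/salama-llm")
tts = pipeline("text-to-speech", model="EYEDOL/salama-tts")

text = stt("swahili_question.wav")["text"]                   # speech -> text
reply = llm(text, max_new_tokens=128)[0]["generated_text"]   # text -> response
speech = tts(reply)                                          # response -> audio
print(speech["sampling_rate"], speech["audio"].shape)
```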