πŸ—£οΈ SALAMA-STT β€” Swahili Whisper ASR Model

Developer: AI4NNOV
Authors: AI4NNOV
Version: v1.0
License: Apache 2.0
Model Type: Automatic Speech Recognition (ASR)
Base Model: openai/whisper-small (fine-tuned for Swahili)


🌍 Overview

SALAMA-STT (Speech-to-Text) is the first module of the SALAMA Framework β€” a modular end-to-end speech-to-speech AI system built for African languages.
This model is fine-tuned from OpenAI’s Whisper-small architecture for Swahili speech recognition, enhancing performance on African accents and conversational data.

The model converts Swahili audio input into accurate transcriptions and serves as the entry point for downstream LLM and TTS modules.


🧱 Model Architecture

SALAMA-STT leverages the Whisper-small architecture with a transformer encoder-decoder optimized for low-resource Swahili audio transcription tasks.
The model was fine-tuned on the Mozilla Common Voice 17.0 Swahili dataset, improving robustness to diverse accents and varying speech clarity.

| Parameter | Value |
|---|---|
| Base Model | openai/whisper-small |
| Fine-Tuning | Full model fine-tuning (fp16 precision) |
| Optimizer | AdamW |
| Learning Rate | 1e-5 |
| Batch Size | 16 |
| Epochs | 10 |
| Frameworks | Transformers + Datasets + TorchAudio |
| Languages | Swahili (sw), English (en) |
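
These hyperparameters map directly onto the standard Hugging Face `Seq2SeqTrainer` setup. The sketch below shows a minimal configuration matching the table; it assumes the usual Whisper fine-tuning workflow and is illustrative, not the authors' actual training script:

```python
# Minimal sketch of a fine-tuning configuration matching the table above.
# Assumes the standard Hugging Face Seq2SeqTrainer workflow for Whisper;
# this is NOT the authors' verified training code.
from transformers import (
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="swahili", task="transcribe"
)

training_args = Seq2SeqTrainingArguments(
    output_dir="./salama-stt",
    per_device_train_batch_size=16,  # batch size from the table
    learning_rate=1e-5,              # learning rate from the table
    num_train_epochs=10,             # epochs from the table
    fp16=True,                       # fp16 precision from the table
    predict_with_generate=True,      # generate transcripts during evaluation
)
# Trainer's default optimizer is AdamW, matching the table.
```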

πŸ“š Dataset

| Dataset | Description | Purpose |
|---|---|---|
| mozilla-foundation/common_voice_17_0 | 20 hours of Swahili speech data | Supervised fine-tuning |
| Custom local Swahili recordings | Conversational + accent-rich data | Accent robustness |
| Common Voice validation split | 2.3 hours | Evaluation |
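
For reference, the Common Voice portion of the data can be loaded with the `datasets` library as sketched below (the dataset is gated on the Hub, so authentication may be required):

```python
# Sketch: load the Swahili split of Common Voice 17.0 and resample to the
# 16 kHz input rate Whisper expects. Hub authentication may be required.
from datasets import Audio, load_dataset

cv_sw = load_dataset("mozilla-foundation/common_voice_17_0", "sw", split="train")
cv_sw = cv_sw.cast_column("audio", Audio(sampling_rate=16_000))

sample = cv_sw[0]
print(sample["sentence"])               # reference transcription
print(sample["audio"]["array"].shape)   # waveform resampled to 16 kHz
```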

🧠 Model Capabilities

  • Speech-to-text transcription in Swahili
  • Recognition of African-accented Swahili
  • Handles short and long-form audio
  • Supports integration with SALAMA-LLM for full voice assistants
  • Provides timestamped segment transcriptions (see the example after this list)
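
Timestamped output uses the standard `return_timestamps` option of the Transformers ASR pipeline, as in this sketch (the model id follows the usage example below):

```python
# Sketch: segment-level timestamps via the Transformers ASR pipeline.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="EYEDOL/salama-stt")
result = asr("swahili_audio_sample.wav", return_timestamps=True)

for chunk in result["chunks"]:
    start, end = chunk["timestamp"]  # end may be None at a chunk boundary
    print(f"[{start} - {end}] {chunk['text']}")
```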

πŸ“Š Evaluation Metrics

| Metric | Baseline (Whisper-small) | Fine-tuned (SALAMA-STT) | Improvement |
|---|---|---|---|
| WER (Word Error Rate) | 1.15 | 0.43 | πŸ”» 62% (relative) |
| CER (Character Error Rate) | 0.39 | 0.18 | πŸ”» 54% (relative) |
| Accuracy | 85.2% | 95.4% | +10.2 points |

Evaluation was conducted on the 2.3-hour held-out Swahili validation split from Common Voice.
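
WER and CER figures like these can be reproduced with the `evaluate` library (which wraps `jiwer`). The authors' exact evaluation script is not part of this card, so the snippet below is only a sketch:

```python
# Sketch: computing WER/CER with the `evaluate` library.
# Requires: pip install evaluate jiwer
import evaluate

wer = evaluate.load("wer")
cer = evaluate.load("cer")

references = ["karibu kwenye mfumo wa salama"]   # ground-truth transcripts
predictions = ["karibu kwenye mfumo ya salama"]  # model outputs

print("WER:", wer.compute(references=references, predictions=predictions))
print("CER:", cer.compute(references=references, predictions=predictions))
```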


βš™οΈ Usage (Python Example)

Below is a quick example for Swahili speech transcription using this model:

```python
from transformers import pipeline

# Load Swahili Whisper ASR
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="EYEDOL/salama-stt",
    chunk_length_s=30,
    device_map="auto"
)

# Example audio file (replace with your file)
audio_path = "swahili_audio_sample.wav"

# Transcribe audio
result = asr_pipeline(audio_path)

print("πŸ—£οΈ Transcription:")
print(result["text"])
```

Example Output:

β€œKaribu kwenye mfumo wa SALAMA unaosaidia kutambua na kuelewa sauti ya Kiswahili kwa usahihi mkubwa.”
(English: β€œWelcome to the SALAMA system, which helps recognize and understand Swahili speech with high accuracy.”)


πŸ” Model Performance Summary

| Dataset | Metric | Score |
|---|---|---|
| Common Voice 17.0 (test) | WER | 0.43 |
| Common Voice 17.0 (test) | CER | 0.18 |
| Local Swahili Test Set | Accuracy | 95.4% |

⚑ Key Features

  • πŸŽ™οΈ Accurate Swahili ASR trained on diverse voices
  • 🌍 Adapted for African speech variations and dialects
  • 🧩 Lightweight and compatible with SALAMA-LLM
  • πŸ”Š Handles long-form recordings beyond Whisper's 30 s window via chunking
  • πŸš€ Fast inference optimized with FP16 precision (see the sketch after this list)
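
The FP16 path corresponds to loading the model in half precision, as in this sketch (it assumes a CUDA GPU; half precision is generally not useful on CPU):

```python
# Sketch: FP16 inference on a CUDA GPU for faster transcription.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="EYEDOL/salama-stt",
    torch_dtype=torch.float16,  # load weights in half precision
    device=0,                   # first CUDA device
)
print(asr("swahili_audio_sample.wav")["text"])
```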

🚫 Limitations

  • May misinterpret code-mixed (Swahili-English) speech
  • Background noise and poor microphone quality reduce accuracy
  • Domain-specific (medical/legal) terms may be transcribed inaccurately
  • Performance may decline for non-native Swahili speakers

πŸ”— Related Models

| Model | Description |
|---|---|
| EYEDOL/salama-llm | Swahili instruction-tuned LLM for reasoning and dialogue |
| EYEDOL/salama-tts | Swahili text-to-speech (VITS) model for natural speech synthesis |
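
A minimal sketch of the speech-to-speech chaining the SALAMA Framework describes is shown below. Treating EYEDOL/salama-llm as a standard `text-generation` checkpoint is an assumption based on its description above, not a documented interface:

```python
# Sketch: feeding SALAMA-STT transcripts into the companion LLM.
# The text-generation usage of EYEDOL/salama-llm is an assumption.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="EYEDOL/salama-stt")
llm = pipeline("text-generation", model="EYEDOL/salama-llm")

transcript = asr("swahili_audio_sample.wav")["text"]
reply = llm(transcript, max_new_tokens=128)[0]["generated_text"]
print(reply)
```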