πŸ—£οΈ SALAMA-STT β€” Swahili Whisper ASR Model

Developer: AI4NNOV
Authors: AI4NNOV
Version: v1.0
License: Apache 2.0
Model Type: Automatic Speech Recognition (ASR)
Base Model: openai/whisper-small (fine-tuned for Swahili)


🌍 Overview

SALAMA-STT (Speech-to-Text) is the first module of the SALAMA Framework β€” a modular end-to-end speech-to-speech AI system built for African languages.
This model is fine-tuned from OpenAI’s Whisper-small architecture for Swahili speech recognition, enhancing performance on African accents and conversational data.

The model converts Swahili audio input into accurate transcriptions and serves as the entry point for downstream LLM and TTS modules.


🧱 Model Architecture

SALAMA-STT leverages the Whisper-small architecture with a transformer encoder-decoder optimized for low-resource Swahili audio transcription tasks.
The model was fine-tuned on the Mozilla Common Voice 17.0 Swahili dataset, improving robustness to diverse accents and varying speech clarity.

| Parameter | Value |
|---|---|
| Base Model | openai/whisper-small |
| Fine-Tuning | Full model fine-tuning (fp16 precision) |
| Optimizer | AdamW |
| Learning Rate | 1e-5 |
| Batch Size | 16 |
| Epochs | 10 |
| Frameworks | Transformers + Datasets + TorchAudio |
| Languages | Swahili (sw), English (en) |
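
These hyperparameters map directly onto the standard Hugging Face `Seq2SeqTrainer` setup. The sketch below shows a minimal configuration matching the table; it assumes the usual Whisper fine-tuning workflow and is illustrative, not the authors' actual training script:

```python
# Minimal sketch of a fine-tuning configuration matching the table above.
# Assumes the standard Hugging Face Seq2SeqTrainer workflow for Whisper;
# this is NOT the authors' verified training code.
from transformers import (
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="swahili", task="transcribe"
)

training_args = Seq2SeqTrainingArguments(
    output_dir="./salama-stt",
    per_device_train_batch_size=16,  # batch size from the table
    learning_rate=1e-5,              # learning rate from the table
    num_train_epochs=10,             # epochs from the table
    fp16=True,                       # fp16 precision from the table
    predict_with_generate=True,      # generate transcripts during evaluation
)
# Trainer's default optimizer is AdamW, matching the table.
```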

πŸ“š Dataset

| Dataset | Description | Purpose |
|---|---|---|
| mozilla-foundation/common_voice_17_0 | 20 hours of Swahili speech data | Supervised fine-tuning |
| Custom local Swahili recordings | Conversational + accent-rich data | Accent robustness |
| Common Voice validation split | 2.3 hours | Evaluation |
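
For reference, the Common Voice portion of the data can be loaded with the `datasets` library as sketched below (the dataset is gated on the Hub, so authentication may be required):

```python
# Sketch: load the Swahili split of Common Voice 17.0 and resample to the
# 16 kHz input rate Whisper expects. Hub authentication may be required.
from datasets import Audio, load_dataset

cv_sw = load_dataset("mozilla-foundation/common_voice_17_0", "sw", split="train")
cv_sw = cv_sw.cast_column("audio", Audio(sampling_rate=16_000))

sample = cv_sw[0]
print(sample["sentence"])               # reference transcription
print(sample["audio"]["array"].shape)   # waveform resampled to 16 kHz
```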

🧠 Model Capabilities

  • Speech-to-text transcription in Swahili
  • Recognition of African-accented Swahili
  • Handles short and long-form audio
  • Supports integration with SALAMA-LLM for full voice assistants
  • Provides timestamped segment transcriptions (see the example after this list)
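
Timestamped output uses the standard `return_timestamps` option of the Transformers ASR pipeline, as in this sketch (the model id follows the usage example below):

```python
# Sketch: segment-level timestamps via the Transformers ASR pipeline.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="EYEDOL/salama-stt")
result = asr("swahili_audio_sample.wav", return_timestamps=True)

for chunk in result["chunks"]:
    start, end = chunk["timestamp"]  # end may be None at a chunk boundary
    print(f"[{start} - {end}] {chunk['text']}")
```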

πŸ“Š Evaluation Metrics

| Metric | Baseline (Whisper-small) | Fine-tuned (SALAMA-STT) | Improvement |
|---|---|---|---|
| WER (Word Error Rate) | 1.15 | 0.43 | πŸ”» 62% (relative) |
| CER (Character Error Rate) | 0.39 | 0.18 | πŸ”» 54% (relative) |
| Accuracy | 85.2% | 95.4% | +10.2 points |

Evaluation was conducted on the 2.3-hour held-out Swahili validation split from Common Voice.
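
WER and CER figures like these can be reproduced with the `evaluate` library (which wraps `jiwer`). The authors' exact evaluation script is not part of this card, so the snippet below is only a sketch:

```python
# Sketch: computing WER/CER with the `evaluate` library.
# Requires: pip install evaluate jiwer
import evaluate

wer = evaluate.load("wer")
cer = evaluate.load("cer")

references = ["karibu kwenye mfumo wa salama"]   # ground-truth transcripts
predictions = ["karibu kwenye mfumo ya salama"]  # model outputs

print("WER:", wer.compute(references=references, predictions=predictions))
print("CER:", cer.compute(references=references, predictions=predictions))
```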


βš™οΈ Usage (Python Example)

Below is a quick example for Swahili speech transcription using this model:

```python
from transformers import pipeline

# Load Swahili Whisper ASR
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="EYEDOL/salama-stt",
    chunk_length_s=30,
    device_map="auto"
)

# Example audio file (replace with your file)
audio_path = "swahili_audio_sample.wav"

# Transcribe audio
result = asr_pipeline(audio_path)

print("πŸ—£οΈ Transcription:")
print(result["text"])
```

Example Output:

β€œKaribu kwenye mfumo wa SALAMA unaosaidia kutambua na kuelewa sauti ya Kiswahili kwa usahihi mkubwa.”
(English: β€œWelcome to the SALAMA system, which helps recognize and understand Swahili speech with high accuracy.”)


πŸ” Model Performance Summary

| Dataset | Metric | Score |
|---|---|---|
| Common Voice 17.0 (test) | WER | 0.43 |
| Common Voice 17.0 (test) | CER | 0.18 |
| Local Swahili Test Set | Accuracy | 95.4% |

⚑ Key Features

  • πŸŽ™οΈ Accurate Swahili ASR trained on diverse voices
  • 🌍 Adapted for African speech variations and dialects
  • 🧩 Lightweight and compatible with SALAMA-LLM
  • πŸ”Š Handles long-form recordings beyond Whisper's 30 s window via chunking
  • πŸš€ Fast inference optimized with FP16 precision (see the sketch after this list)
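
The FP16 path corresponds to loading the model in half precision, as in this sketch (it assumes a CUDA GPU; half precision is generally not useful on CPU):

```python
# Sketch: FP16 inference on a CUDA GPU for faster transcription.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="EYEDOL/salama-stt",
    torch_dtype=torch.float16,  # load weights in half precision
    device=0,                   # first CUDA device
)
print(asr("swahili_audio_sample.wav")["text"])
```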

🚫 Limitations

  • May misinterpret code-mixed (Swahili-English) speech
  • Background noise and poor microphone quality reduce accuracy
  • Domain-specific (medical/legal) terms may be transcribed inaccurately
  • Performance may decline for non-native Swahili speakers

πŸ”— Related Models

| Model | Description |
|---|---|
| EYEDOL/salama-llm | Swahili instruction-tuned LLM for reasoning and dialogue |
| EYEDOL/salama-tts | Swahili text-to-speech (VITS) model for natural speech synthesis |
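
A minimal sketch of the speech-to-speech chaining the SALAMA Framework describes is shown below. Treating EYEDOL/salama-llm as a standard `text-generation` checkpoint is an assumption based on its description above, not a documented interface:

```python
# Sketch: feeding SALAMA-STT transcripts into the companion LLM.
# The text-generation usage of EYEDOL/salama-llm is an assumption.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="EYEDOL/salama-stt")
llm = pipeline("text-generation", model="EYEDOL/salama-llm")

transcript = asr("swahili_audio_sample.wav")["text"]
reply = llm(transcript, max_new_tokens=128)[0]["generated_text"]
print(reply)
```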