GitaWhisper
Fine-tuned Whisper-tiny model for Sanskrit śloka transcription with IAST transliteration output.
Model Description
This model is a fine-tuned version of openai/whisper-tiny specifically trained for transcribing Sanskrit ślokas (verses) into IAST (International Alphabet of Sanskrit Transliteration) format.
- Base Model: openai/whisper-tiny
- Parameters: 37.76M (37.18M trainable - 98.47%)
- Training Method: Full fine-tuning
- Training Data: 671 Sanskrit audio-text pairs from Bhagavad-Gita
- Test Data: 30 samples
📖 For detailed technical information, training methodology, and comprehensive analysis, see the Technical Report.
Performance
Evaluation Results (Test Set - 30 samples)
| Metric | Original OpenAI Model | Fine-tuned Model | Absolute Improvement | Relative Improvement |
|---|---|---|---|---|
| WER | 140.96% | 91.81% | -49.15% | 34.9% reduction |
| CER | 43.84% | 22.19% | -21.65% | 49.4% reduction |
Key Improvements Over OpenAI's Whisper-tiny
- ✅ 34.9% WER reduction - Significantly fewer word errors
- ✅ 49.4% CER reduction - Nearly 50% better character-level accuracy
- ✅ Accurate diacritic transcription - Proper handling of Sanskrit diacritics (ā, ī, ū, ṛ, ṃ, ḥ)
- ✅ Better Sanskrit phonetics - Improved recognition of Sanskrit-specific sounds
- ✅ Reduced hallucinations - Less extra text generation compared to base model
Why This Matters
The original OpenAI Whisper-tiny model was trained on multilingual data but struggles with Sanskrit:
- High WER (140.96%): Generates many incorrect words and extra text
- Poor diacritic handling: Often misses or incorrectly transcribes Sanskrit diacritics
- Language confusion: Tries to transcribe Sanskrit as English or other languages
Our fine-tuned model:
- Lower WER (91.81%): Much more accurate word-level transcription
- Excellent diacritic accuracy: Correctly transcribes complex Sanskrit diacritics
- Domain-specific: Optimized specifically for Sanskrit śloka transcription
Comparison Examples
Example 1: Shloka 18_50
Reference:
siddhiṃ prāpto yathā brahma tathāpnoti nibodha me samāsenaiva kaunteya niṣṭhā jñānasya yā parā
OpenAI Whisper-tiny (Original):
sidhim praptu yatha bramhan tathap notini budhami samas enai vakanteya nishthag nyanasiyayā para
❌ Missing diacritics (ā, ṃ, ṭhā)
❌ Incorrect word segmentation
❌ Poor Sanskrit phonetics recognition
Fine-tuned Model:
siddhimprāpto yathā brahmān tathaapnoti nibodha me samāsenaiva kaunteya niṣṭhā jñānasyayā parāmīr
✅ Correct diacritics (ā, ṃ, ṭhā, jñā)
✅ Better word recognition
✅ Accurate Sanskrit phonetics
Example 2: Shloka 18_53
Reference:
ahaṃkāraṃ balaṃ darpaṃ kāmaṃ krodhaṃ parigraham vimucya nirmamaḥ śānto brahmabhūyāya kalpate
OpenAI Whisper-tiny (Original):
ahaṁ karam balandarpam kāmam kurodham parigraham vimu chyanir mamah shantū bramhab huyaya kalpate
❌ Missing diacritics (ṃ, ā, ḥ)
❌ Incorrect word boundaries
❌ Phonetic errors (kurodham vs krodhaṃ)
Fine-tuned Model:
ahaṃ kāraṃ balandarpaṃ kaamakrodhaṃ parigraham vi mucyanirmamaḥ śānto brahma bhūyāya kalpate
✅ Correct diacritics (ṃ, ā, ḥ)
✅ Better word segmentation
✅ Accurate phonetics
Example 3: Shloka 18_51
Reference:
buddhyā viśuddhayā yukto dhṛtyātmānaṃ niyamya ca śabdādīnviṣayāṃstyaktvā rāgadveṣau vyudasya ca
OpenAI Whisper-tiny (Original):
buddha vishuddhaya yuktu, dhrityatmanam miyam yachan. shabdhādin vishayān tiktvah raga dhyeshav vudasya chak
❌ Missing diacritics (yā, ś, ṣ, ṭ, vā)
❌ Incorrect punctuation
❌ Poor word recognition
Fine-tuned Model:
buddhyā viśuddhayā yukto dṛtyātmānammyaṃya ca śabdādiṃ vyṣayāntyaktvā rāgadveśau vyu daśya ca
✅ Correct diacritics (yā, ś, ṣ, ṭ, vā)
✅ No punctuation (as expected)
✅ Much better word recognition
Quantitative Comparison
Character-Level Accuracy (CER):
- Original: 43.84% error rate
- Fine-tuned: 22.19% error rate
- Improvement: 49.4% reduction in character errors
Word-Level Accuracy (WER):
- Original: 140.96% error rate (generates many extra words)
- Fine-tuned: 91.81% error rate
- Improvement: 34.9% reduction in word errors
Diacritic Accuracy:
- Original: Frequently misses or incorrectly transcribes diacritics
- Fine-tuned: Accurately handles complex diacritics (ā, ī, ū, ṛ, ṃ, ḥ, ṣ, ṭ, etc.)
Text Generation:
- Original: Often generates excessive text beyond the reference
- Fine-tuned: More controlled generation with repetition penalty
Installation
pip install transformers torch librosa soundfile
Usage
works best with audio length less than ~25-30 seconds
Basic Inference
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa
import numpy as np
# Load model and processor
model_name = "diabolic6045/GitaWhisper-tiny" # Replace with your HF username
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)
# Disable forced decoder ids for clean output
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()
# Load and preprocess audio (must be 16 kHz mono)
audio_path = "path/to/your/audio.wav"
audio, sr = librosa.load(audio_path, sr=16000, mono=True)
# Process audio
inputs = processor.feature_extractor(
audio,
sampling_rate=16000,
return_tensors="pt"
).input_features.to(device)
# Generate transcription with repetition penalty
with torch.no_grad():
generated_ids = model.generate(
inputs,
max_length=448,
repetition_penalty=1.2, # Prevents repetitive text
no_repeat_ngram_size=3, # Prevents 3-gram repetition
length_penalty=1.0,
)
# Decode
transcription = processor.tokenizer.batch_decode(
generated_ids,
skip_special_tokens=True
)[0]
print(transcription.strip().lower())
Advanced Inference with Audio Preprocessing
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa
import numpy as np
def transcribe_sanskrit_audio(audio_path, model, processor, device):
"""
Transcribe Sanskrit audio to IAST transliteration.
Args:
audio_path: Path to audio file
model: Loaded Whisper model
processor: WhisperProcessor
device: torch device
Returns:
str: IAST transliteration text
"""
# Load audio
audio, sr = librosa.load(audio_path, sr=None, mono=False)
# Convert stereo to mono if needed
if len(audio.shape) > 1:
audio = np.mean(audio, axis=0)
# Resample to 16 kHz (Whisper requirement)
if sr != 16000:
audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
# Process audio
inputs = processor.feature_extractor(
audio,
sampling_rate=16000,
return_tensors="pt"
).input_features.to(device)
# Generate transcription
with torch.no_grad():
generated_ids = model.generate(
inputs,
max_length=448,
repetition_penalty=1.2,
no_repeat_ngram_size=3,
length_penalty=1.0,
)
# Decode
transcription = processor.tokenizer.batch_decode(
generated_ids,
skip_special_tokens=True
)[0]
return transcription.strip().lower()
# Usage
model_name = "diabolic6045/GitaWhisper-tiny"
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()
# Transcribe
result = transcribe_sanskrit_audio("your_audio.wav", model, processor, device)
print(result)
Using with HuggingFace Pipeline
from transformers import pipeline
import torch
# Create ASR pipeline
asr = pipeline(
"automatic-speech-recognition",
model="diabolic6045/GitaWhisper-tiny",
device=0 if torch.cuda.is_available() else -1,
)
# Transcribe audio
result = asr(
"path/to/audio.wav",
generate_kwargs={
"max_length": 448,
"repetition_penalty": 1.2,
"no_repeat_ngram_size": 3,
}
)
print(result["text"].strip().lower())
Training Details
Training Configuration
- Epochs: 10
- Batch Size: 4 (effective: 16 with gradient accumulation)
- Learning Rate: 5e-6
- Warmup Steps: 50
- Optimizer: AdamW
- LR Schedule: Linear decay
- Mixed Precision: FP16
- Training Time: ~75 minutes (10 epochs)
- GPU: NVIDIA RTX 4090 (24 GB)
Training Data
- Source: JDhruv14/Bhagavad-Gita_Audio
- Training Samples: 671
- Test Samples: 30
- Audio Format: 16 kHz mono WAV
- Text Format: IAST transliteration (lowercase, no punctuation)
- Total Audio Duration: ~2-2.5 hours
Data Preprocessing
The transliteration text underwent normalization:
- Replace dots between words with spaces:
.X→X - Remove all remaining dots
- Remove vertical bars (
|) - Normalize multiple spaces to single space
- Convert to lowercase
Example:
Original: dhṛtarāṣṭra uvāca .dharmakṣetre kurukṣetre
Cleaned: dhṛtarāṣṭra uvāca dharmakṣetre kurukṣetre
Technical Specifications
Model Architecture
- Base: Transformer encoder-decoder
- Encoder: 4 transformer blocks
- Decoder: 4 transformer blocks
- Feature Extractor: 80 mel-spectrogram bins
- Input: 16 kHz mono audio
- Output: IAST transliteration (max 448 tokens)
Generation Parameters
- Max Length: 448 tokens
- Repetition Penalty: 1.2
- No Repeat N-gram Size: 3
- Length Penalty: 1.0
- Decoding: Greedy (beam_size=1)
Limitations
- Word Spacing: Some words may be concatenated (e.g.,
siddhimprāptoinstead ofsiddhiṃ prāpto) - Extra Generation: Model may occasionally generate text beyond the reference length
- Dataset Size: Trained on 671 samples - more data would improve performance
- Domain: Optimized for Bhagavad-Gita style chanting/recitation
- Audio Requirements: Best results with 16 kHz mono audio
Training Pipeline
The model was trained using the following workflow:
Data Preparation (
prepare_data.py):- Load dataset from HuggingFace
- Clean transliteration text
- Split into train/test (671/30)
Training (
train.py):- Full fine-tuning (all parameters trainable)
- Custom Whisper data collator
- Memory-efficient evaluation
Evaluation (
evaluate.py):- WER/CER calculation
- Detailed results export
Comparison (
compare_models.py):- Original vs fine-tuned comparison
- Performance metrics
📚 For complete training details, hyperparameters, and technical implementation, refer to the Technical Report.
Inference Performance
- Speed: ~2.5-3 samples/second
- Latency: ~300-400ms per sample
- GPU Memory: ~1-2 GB
- CPU Memory: ~500 MB
Citation
If you use this model, please cite:
@misc{GitaWhisper,
title={Whisper-tiny Fine-tuned for Sanskrit Transliteration},
author={Your Name},
year={2024},
howpublished={\url{https://huggingface.co/diabolic6045/GitaWhisper-tiny}}
}
License
This model is released under the MIT License, same as the base Whisper model.
Documentation
- 📖 Technical Report: Comprehensive documentation including:
- Detailed training methodology
- Hyperparameter analysis
- Technical implementation details
- Error analysis and performance benchmarks
- Reproducibility guide
Acknowledgments
- Base model: OpenAI Whisper
- Training data: JDhruv14/Bhagavad-Gita_Audio
- Framework: HuggingFace Transformers
Contact
For questions or issues, please open an issue on the model repository.
- Downloads last month
- 35
Model tree for diabolic6045/GitaWhisper-tiny
Base model
openai/whisper-tinyDataset used to train diabolic6045/GitaWhisper-tiny
Space using diabolic6045/GitaWhisper-tiny 1
Collection including diabolic6045/GitaWhisper-tiny
Evaluation results
- Word Error Rate on Bhagavad-Gita Audio (Test Set)test set self-reported91.810
- Character Error Rate on Bhagavad-Gita Audio (Test Set)test set self-reported22.190