GitaWhisper

Fine-tuned Whisper-tiny model for Sanskrit śloka transcription with IAST transliteration output.

Model Description

This model is a fine-tuned version of openai/whisper-tiny specifically trained for transcribing Sanskrit ślokas (verses) into IAST (International Alphabet of Sanskrit Transliteration) format.

  • Base Model: openai/whisper-tiny
  • Parameters: 37.76M (37.18M trainable - 98.47%)
  • Training Method: Full fine-tuning
  • Training Data: 671 Sanskrit audio-text pairs from Bhagavad-Gita
  • Test Data: 30 samples

📖 For detailed technical information, training methodology, and comprehensive analysis, see the Technical Report.

Performance

Evaluation Results (Test Set - 30 samples)

| Metric | Original OpenAI Model | Fine-tuned Model | Absolute Improvement | Relative Improvement |
|--------|-----------------------|------------------|----------------------|----------------------|
| WER    | 140.96%               | 91.81%           | -49.15 pp            | 34.9% reduction      |
| CER    | 43.84%                | 22.19%           | -21.65 pp            | 49.4% reduction      |
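
For reference, word error rate (WER) and character error rate (CER) can be computed with the jiwer library as in the minimal sketch below; this illustrates the metric definitions and is not the project's evaluate.py.

# Minimal WER/CER sketch using jiwer (pip install jiwer); not the project's evaluate.py
import jiwer

reference = "siddhiṃ prāpto yathā brahma tathāpnoti nibodha me"
hypothesis = "siddhimprāpto yathā brahmān tathaapnoti nibodha me"

wer = jiwer.wer(reference, hypothesis)   # word-level error rate
cer = jiwer.cer(reference, hypothesis)   # character-level error rate
print(f"WER: {wer:.2%}  CER: {cer:.2%}")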

Key Improvements Over OpenAI's Whisper-tiny

  • 34.9% WER reduction - Significantly fewer word errors
  • 49.4% CER reduction - Nearly 50% better character-level accuracy
  • Accurate diacritic transcription - Proper handling of Sanskrit diacritics (ā, ī, ū, ṛ, ṃ, ḥ)
  • Better Sanskrit phonetics - Improved recognition of Sanskrit-specific sounds
  • Reduced hallucinations - Less extra text generation compared to base model

Why This Matters

The original OpenAI Whisper-tiny model was trained on multilingual data but struggles with Sanskrit:

  • High WER (140.96%): Generates many incorrect words and extra text
  • Poor diacritic handling: Often misses or incorrectly transcribes Sanskrit diacritics
  • Language confusion: Tries to transcribe Sanskrit as English or other languages

Our fine-tuned model:

  • Lower WER (91.81%): Much more accurate word-level transcription
  • Excellent diacritic accuracy: Correctly transcribes complex Sanskrit diacritics
  • Domain-specific: Optimized specifically for Sanskrit śloka transcription

Comparison Examples

Example 1: Shloka 18_50

Reference:

siddhiṃ prāpto yathā brahma tathāpnoti nibodha me samāsenaiva kaunteya niṣṭhā jñānasya yā parā

OpenAI Whisper-tiny (Original):

sidhim praptu yatha bramhan tathap notini budhami samas enai vakanteya nishthag nyanasiyayā para

❌ Missing diacritics (ā, ṃ, ṭhā)
❌ Incorrect word segmentation
❌ Poor Sanskrit phonetics recognition

Fine-tuned Model:

siddhimprāpto yathā brahmān tathaapnoti nibodha me samāsenaiva kaunteya niṣṭhā jñānasyayā parāmīr

✅ Correct diacritics (ā, ṃ, ṭhā, jñā)
✅ Better word recognition
✅ Accurate Sanskrit phonetics

Example 2: Shloka 18_53

Reference:

ahaṃkāraṃ balaṃ darpaṃ kāmaṃ krodhaṃ parigraham vimucya nirmamaḥ śānto brahmabhūyāya kalpate

OpenAI Whisper-tiny (Original):

ahaṁ karam balandarpam kāmam kurodham parigraham vimu chyanir mamah shantū bramhab huyaya kalpate

❌ Missing diacritics (ṃ, ā, ḥ)
❌ Incorrect word boundaries
❌ Phonetic errors (kurodham vs krodhaṃ)

Fine-tuned Model:

ahaṃ kāraṃ balandarpaṃ kaamakrodhaṃ parigraham vi mucyanirmamaḥ śānto brahma bhūyāya kalpate

✅ Correct diacritics (ṃ, ā, ḥ)
✅ Better word segmentation
✅ Accurate phonetics

Example 3: Shloka 18_51

Reference:

buddhyā viśuddhayā yukto dhṛtyātmānaṃ niyamya ca śabdādīnviṣayāṃstyaktvā rāgadveṣau vyudasya ca

OpenAI Whisper-tiny (Original):

buddha vishuddhaya yuktu, dhrityatmanam miyam yachan. shabdhādin vishayān tiktvah raga dhyeshav vudasya chak

❌ Missing diacritics (yā, ś, ṣ, ṭ, vā)
❌ Incorrect punctuation
❌ Poor word recognition

Fine-tuned Model:

buddhyā viśuddhayā yukto dṛtyātmānammyaṃya ca śabdādiṃ vyṣayāntyaktvā rāgadveśau vyu daśya ca

✅ Correct diacritics (yā, ś, ṣ, ṭ, vā)
✅ No punctuation (as expected)
✅ Much better word recognition

Quantitative Comparison

Character-Level Accuracy (CER):

  • Original: 43.84% error rate
  • Fine-tuned: 22.19% error rate
  • Improvement: 49.4% reduction in character errors

Word-Level Accuracy (WER):

  • Original: 140.96% error rate (generates many extra words)
  • Fine-tuned: 91.81% error rate
  • Improvement: 34.9% reduction in word errors

Diacritic Accuracy:

  • Original: Frequently misses or incorrectly transcribes diacritics
  • Fine-tuned: Accurately handles complex diacritics (ā, ī, ū, ṛ, ṃ, ḥ, ṣ, ṭ, etc.)

Text Generation:

  • Original: Often generates excessive text beyond the reference
  • Fine-tuned: More controlled generation with repetition penalty

Installation

pip install transformers torch librosa soundfile

Usage

Note: the model works best with clips shorter than roughly 25-30 seconds, since Whisper processes audio in fixed 30-second windows. For longer recordings, see the chunked pipeline example at the end of this section.

Basic Inference

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa
import numpy as np

# Load model and processor
model_name = "diabolic6045/GitaWhisper-tiny"  # Replace with your HF username
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)

# Disable forced decoder ids for clean output
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

# Load and preprocess audio (must be 16 kHz mono)
audio_path = "path/to/your/audio.wav"
audio, sr = librosa.load(audio_path, sr=16000, mono=True)

# Process audio
inputs = processor.feature_extractor(
    audio,
    sampling_rate=16000,
    return_tensors="pt"
).input_features.to(device)

# Generate transcription with repetition penalty
with torch.no_grad():
    generated_ids = model.generate(
        inputs,
        max_length=448,
        repetition_penalty=1.2,  # Prevents repetitive text
        no_repeat_ngram_size=3,  # Prevents 3-gram repetition
        length_penalty=1.0,
    )

# Decode
transcription = processor.tokenizer.batch_decode(
    generated_ids,
    skip_special_tokens=True
)[0]

print(transcription.strip().lower())

Advanced Inference with Audio Preprocessing

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import librosa
import numpy as np

def transcribe_sanskrit_audio(audio_path, model, processor, device):
    """
    Transcribe Sanskrit audio to IAST transliteration.
    
    Args:
        audio_path: Path to audio file
        model: Loaded Whisper model
        processor: WhisperProcessor
        device: torch device
    
    Returns:
        str: IAST transliteration text
    """
    # Load audio
    audio, sr = librosa.load(audio_path, sr=None, mono=False)
    
    # Convert stereo to mono if needed
    if len(audio.shape) > 1:
        audio = np.mean(audio, axis=0)
    
    # Resample to 16 kHz (Whisper requirement)
    if sr != 16000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    
    # Process audio
    inputs = processor.feature_extractor(
        audio,
        sampling_rate=16000,
        return_tensors="pt"
    ).input_features.to(device)
    
    # Generate transcription
    with torch.no_grad():
        generated_ids = model.generate(
            inputs,
            max_length=448,
            repetition_penalty=1.2,
            no_repeat_ngram_size=3,
            length_penalty=1.0,
        )
    
    # Decode
    transcription = processor.tokenizer.batch_decode(
        generated_ids,
        skip_special_tokens=True
    )[0]
    
    return transcription.strip().lower()

# Usage
model_name = "diabolic6045/GitaWhisper-tiny"
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

# Transcribe
result = transcribe_sanskrit_audio("your_audio.wav", model, processor, device)
print(result)

Using with HuggingFace Pipeline

from transformers import pipeline
import torch

# Create ASR pipeline
asr = pipeline(
    "automatic-speech-recognition",
    model="diabolic6045/GitaWhisper-tiny",
    device=0 if torch.cuda.is_available() else -1,
)

# Transcribe audio
result = asr(
    "path/to/audio.wav",
    generate_kwargs={
        "max_length": 448,
        "repetition_penalty": 1.2,
        "no_repeat_ngram_size": 3,
    }
)

print(result["text"].strip().lower())

Training Details

Training Configuration

  • Epochs: 10
  • Batch Size: 4 (effective: 16 with gradient accumulation)
  • Learning Rate: 5e-6
  • Warmup Steps: 50
  • Optimizer: AdamW
  • LR Schedule: Linear decay
  • Mixed Precision: FP16
  • Training Time: ~75 minutes (10 epochs)
  • GPU: NVIDIA RTX 4090 (24 GB)
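
For illustration, a Seq2SeqTrainingArguments setup consistent with the configuration above might look like the sketch below; this is an approximation under stated assumptions, not the exact contents of train.py, and output_dir is a placeholder.

# Hedged sketch of training arguments matching the reported configuration;
# the actual train.py may differ in details.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./gitawhisper-tiny",    # placeholder path
    num_train_epochs=10,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,      # effective batch size 16
    learning_rate=5e-6,
    warmup_steps=50,
    lr_scheduler_type="linear",
    optim="adamw_torch",                # AdamW
    fp16=True,                          # mixed precision
    predict_with_generate=True,
)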

Training Data

  • Source: JDhruv14/Bhagavad-Gita_Audio
  • Training Samples: 671
  • Test Samples: 30
  • Audio Format: 16 kHz mono WAV
  • Text Format: IAST transliteration (lowercase, no punctuation)
  • Total Audio Duration: ~2-2.5 hours
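
For context, the source dataset can be loaded and its audio resampled to 16 kHz roughly as follows; the split and column names ("train", "audio") are assumptions about the dataset schema, not verified values.

# Hedged sketch: loading the source dataset and resampling audio to 16 kHz.
# The split and "audio" column names are assumptions about the dataset schema.
from datasets import load_dataset, Audio

ds = load_dataset("JDhruv14/Bhagavad-Gita_Audio", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
print(ds)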

Data Preprocessing

The transliteration text underwent the following normalization (a sketch of the cleaning steps appears after the example below):

  1. Replace dots between words with spaces (e.g. ".dharmakṣetre" → " dharmakṣetre")
  2. Remove all remaining dots
  3. Remove vertical bars (|)
  4. Normalize multiple spaces to single space
  5. Convert to lowercase

Example:

Original:  dhṛtarāṣṭra uvāca .dharmakṣetre kurukṣetre
Cleaned:   dhṛtarāṣṭra uvāca dharmakṣetre kurukṣetre
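
A minimal sketch of these cleaning steps (a hypothetical helper, not the exact code in prepare_data.py):

import re

def clean_transliteration(text: str) -> str:
    """Normalize IAST transliteration text following the steps above (sketch)."""
    text = re.sub(r"\.(\S)", r" \1", text)  # dot joining words -> space
    text = text.replace(".", "")            # remove remaining dots
    text = text.replace("|", "")            # remove vertical bars
    text = re.sub(r"\s+", " ", text)        # collapse multiple spaces
    return text.strip().lower()

print(clean_transliteration("dhṛtarāṣṭra uvāca .dharmakṣetre kurukṣetre"))
# -> dhṛtarāṣṭra uvāca dharmakṣetre kurukṣetre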

Technical Specifications

Model Architecture

  • Base: Transformer encoder-decoder
  • Encoder: 4 transformer blocks
  • Decoder: 4 transformer blocks
  • Feature Extractor: 80 mel-spectrogram bins
  • Input: 16 kHz mono audio
  • Output: IAST transliteration (max 448 tokens)
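
These architecture details and the parameter counts reported above can be checked directly from the loaded model; a small sketch:

# Sketch: inspecting the architecture and counting parameters.
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("diabolic6045/GitaWhisper-tiny")

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("encoder layers:", model.config.encoder_layers)  # 4 for whisper-tiny
print("decoder layers:", model.config.decoder_layers)  # 4 for whisper-tiny
print("mel bins:      ", model.config.num_mel_bins)    # 80
print(f"total params:     {total / 1e6:.2f}M")
print(f"trainable params: {trainable / 1e6:.2f}M")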

Generation Parameters

  • Max Length: 448 tokens
  • Repetition Penalty: 1.2
  • No Repeat N-gram Size: 3
  • Length Penalty: 1.0
  • Decoding: Greedy (beam_size=1)

Limitations

  • Word Spacing: Some words may be concatenated (e.g., siddhimprāpto instead of siddhiṃ prāpto)
  • Extra Generation: Model may occasionally generate text beyond the reference length
  • Dataset Size: Trained on 671 samples - more data would improve performance
  • Domain: Optimized for Bhagavad-Gita style chanting/recitation
  • Audio Requirements: Best results with 16 kHz mono audio

Training Pipeline

The model was trained using the following workflow:

  1. Data Preparation (prepare_data.py):

    • Load dataset from HuggingFace
    • Clean transliteration text
    • Split into train/test (671/30)
  2. Training (train.py):

    • Full fine-tuning (all parameters trainable)
    • Custom Whisper data collator
    • Memory-efficient evaluation
  3. Evaluation (evaluate.py):

    • WER/CER calculation
    • Detailed results export
  4. Comparison (compare_models.py):

    • Original vs fine-tuned comparison
    • Performance metrics

📚 For complete training details, hyperparameters, and technical implementation, refer to the Technical Report.

Inference Performance

  • Speed: ~2.5-3 samples/second
  • Latency: ~300-400ms per sample
  • GPU Memory: ~1-2 GB
  • CPU Memory: ~500 MB
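
These figures are hardware-dependent. As a rough way to measure latency on your own setup, the sketch below times repeated generation on a single preprocessed clip, reusing model and inputs from the Usage section; absolute numbers will vary with GPU, audio length, and decoding settings.

# Hedged sketch: timing single-sample generation latency.
import time
import torch

def measure_latency(model, inputs, n_runs=10):
    """Average generation time per sample over n_runs warm runs."""
    with torch.no_grad():
        model.generate(inputs, max_length=448)  # warm-up
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model.generate(inputs, max_length=448, repetition_penalty=1.2)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return elapsed / n_runs

# latency = measure_latency(model, inputs)
# print(f"~{latency * 1000:.0f} ms per sample")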

Citation

If you use this model, please cite:

@misc{GitaWhisper,
  title={Whisper-tiny Fine-tuned for Sanskrit Transliteration},
  author={Your Name},
  year={2024},
  howpublished={\url{https://huggingface.co/diabolic6045/GitaWhisper-tiny}}
}

License

This model is released under the MIT License, same as the base Whisper model.

Documentation

  • 📖 Technical Report: Comprehensive documentation including:
    • Detailed training methodology
    • Hyperparameter analysis
    • Technical implementation details
    • Error analysis and performance benchmarks
    • Reproducibility guide

Acknowledgments

Contact

For questions or issues, please open an issue on the model repository.
