
FastConformer Multilingual ASR

A multilingual automatic speech recognition model supporting Kazakh, Russian, Uzbek, and English. Built on NVIDIA NeMo's FastConformer-CTC architecture.

Languages

Language        Code
Kazakh          kk
Russian         ru
Uzbek (Latin)   uz
English         en

Results

Full test set (76,739 samples):

Language   Samples   CER      WER
Russian    10,203    2.34%    9.16%
Kazakh     33,964    8.27%    14.09%
Uzbek      16,184    7.10%    28.82%
English    16,388    9.53%    22.29%
Overall    76,739    7.73%    16.86%
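The CER and WER figures above follow the standard definitions: Levenshtein edit distance over characters or words, divided by the reference length. A minimal sketch for reproducing such scores (the function names are illustrative, not part of NeMo, which ships its own metric utilities):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (rolling 1-D DP row)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                              # deletion
                        dp[j - 1] + 1,                          # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))      # substitution
            prev = cur
    return dp[n]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why the Uzbek WER (28.82%) can sit far above its CER (7.10%) on long words with small spelling errors.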

Usage

Requirements

pip install "nemo_toolkit[asr]"

Transcribe Audio Files

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.restore_from("fastconformer_multilingual.nemo")
model.freeze()

transcriptions = model.transcribe(["audio.wav"])
# Depending on the NeMo version, transcribe() returns plain strings or Hypothesis objects
text = transcriptions[0] if isinstance(transcriptions[0], str) else transcriptions[0].text
print(text)

Real-Time Streaming Transcription

import nemo.collections.asr as nemo_asr
import sounddevice as sd
import numpy as np
import queue
import tempfile
import os
import soundfile as sf

SAMPLE_RATE = 16000
CHUNK_SEC = 3
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SEC

model = nemo_asr.models.ASRModel.restore_from("fastconformer_multilingual.nemo")
model.freeze()

audio_queue = queue.Queue()


def audio_callback(indata, frames, time, status):
    # Runs on sounddevice's audio thread; hand the mono channel to the main loop
    if status:
        print(status)
    audio_queue.put(indata[:, 0].copy())


def transcribe_stream():
    buffer = np.array([], dtype=np.float32)
    with sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=1,
        callback=audio_callback,
        blocksize=SAMPLE_RATE,
    ):
        print("Listening... (Ctrl+C to stop)")
        while True:
            chunk = audio_queue.get()
            buffer = np.concatenate([buffer, chunk])
            if len(buffer) >= CHUNK_SAMPLES:
                tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
                sf.write(tmp.name, buffer[:CHUNK_SAMPLES], SAMPLE_RATE)
                result = model.transcribe([tmp.name])
                text = result[0] if isinstance(result[0], str) else result[0].text
                if text.strip():
                    print(f"> {text}")
                os.unlink(tmp.name)
                buffer = buffer[CHUNK_SAMPLES:]


if __name__ == "__main__":
    transcribe_stream()

Batch Transcription

import nemo.collections.asr as nemo_asr
from pathlib import Path

model = nemo_asr.models.ASRModel.restore_from("fastconformer_multilingual.nemo")
model.freeze()

audio_files = list(Path("audio_dir").glob("*.wav"))
transcriptions = model.transcribe([str(f) for f in audio_files], batch_size=32)

for path, text in zip(audio_files, transcriptions):
    t = text if isinstance(text, str) else text.text
    print(f"{path.name}: {t}")
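For larger corpora, NeMo tooling commonly works with JSON-lines manifests containing an `audio_filepath` and `text` field per line. A sketch of writing one from the `(path, transcript)` pairs produced above (`write_manifest` is an illustrative helper, not a NeMo API):

```python
import json

def write_manifest(pairs, out_path):
    """Write (audio_path, transcript) pairs as a NeMo-style JSONL manifest."""
    with open(out_path, "w", encoding="utf-8") as f:
        for path, text in pairs:
            entry = {"audio_filepath": str(path), "text": text}
            # ensure_ascii=False keeps Cyrillic and other non-Latin text readable
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

One JSON object per line keeps the manifest streamable and easy to shard, which matters once the file lists tens of thousands of utterances.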

Model Details

  • Architecture: FastConformer-CTC
  • Framework: NVIDIA NeMo
  • Audio: 16kHz mono WAV
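Because the model expects 16 kHz mono input, audio at other rates or channel counts needs conversion first. A rough numpy-only sketch using linear interpolation (`to_mono_16k` is an illustrative name; for production quality, prefer a proper resampler such as soxr, librosa, or torchaudio):

```python
import numpy as np

def to_mono_16k(audio: np.ndarray, sr: int, target_sr: int = 16000) -> np.ndarray:
    """Down-mix to mono and resample via linear interpolation.

    Linear interpolation is a rough stand-in without anti-aliasing;
    it is fine for a quick test, not for production pipelines.
    """
    if audio.ndim == 2:                      # (samples, channels) -> mono
        audio = audio.mean(axis=1)
    if sr == target_sr:
        return audio.astype(np.float32)
    duration = len(audio) / sr
    n_out = int(round(duration * target_sr))
    t_in = np.linspace(0.0, duration, num=len(audio), endpoint=False)
    t_out = np.linspace(0.0, duration, num=n_out, endpoint=False)
    return np.interp(t_out, t_in, audio).astype(np.float32)
```

The converted array can then be written to a 16 kHz WAV (e.g. with soundfile) before being passed to `model.transcribe`.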

Limitations

  • Optimized for clear speech; performance may degrade on noisy audio
  • No punctuation or capitalization in output
  • Language is auto-detected, not explicitly specified

License

This model is released under CC BY-NC 4.0. Non-commercial use only.
