# FastConformer Multilingual ASR

A multilingual automatic speech recognition model supporting Kazakh, Russian, Uzbek, and English, built on NVIDIA NeMo's FastConformer-CTC architecture.
## Languages
| Language | Code |
|---|---|
| Kazakh | kk |
| Russian | ru |
| Uzbek (Latin) | uz |
| English | en |
## Results

Full test set (76,739 samples):
| Language | Samples | CER | WER |
|---|---|---|---|
| Russian | 10,203 | 2.34% | 9.16% |
| Kazakh | 33,964 | 8.27% | 14.09% |
| Uzbek | 16,184 | 7.10% | 28.82% |
| English | 16,388 | 9.53% | 22.29% |
| Overall | 76,739 | 7.73% | 16.86% |
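CER and WER above are standard edit-distance metrics: the number of character- (or word-) level insertions, deletions, and substitutions needed to turn the hypothesis into the reference, divided by the reference length. A minimal sketch of the computation (not the scoring script used for these numbers):

```python
def edit_distance(ref, hyp):
    # Classic single-row dynamic-programming Levenshtein distance
    # over token sequences (words or characters).
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[len(hyp)]

def wer(ref, hyp):
    # Word error rate: edit distance over words / reference word count.
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    # Character error rate: edit distance over characters / reference length.
    return edit_distance(list(ref), list(hyp)) / len(ref)

print(wer("the cat sat", "the cat sit"))  # 1 substitution over 3 words
```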
## Usage

### Requirements

```bash
pip install "nemo_toolkit[asr]"
```
### Transcribe Audio Files

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.restore_from("fastconformer_multilingual.nemo")
model.freeze()

transcriptions = model.transcribe(["audio.wav"])
print(transcriptions[0])
```
### Real-Time Streaming Transcription

```python
import os
import queue
import tempfile

import numpy as np
import sounddevice as sd
import soundfile as sf
import nemo.collections.asr as nemo_asr

SAMPLE_RATE = 16000
CHUNK_SEC = 3
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SEC

model = nemo_asr.models.ASRModel.restore_from("fastconformer_multilingual.nemo")
model.freeze()

audio_queue = queue.Queue()

def audio_callback(indata, frames, time, status):
    # Push the mono channel of each incoming block onto the queue.
    audio_queue.put(indata[:, 0].copy())

def transcribe_stream():
    buffer = np.array([], dtype=np.float32)
    with sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=1,
        callback=audio_callback,
        blocksize=SAMPLE_RATE,
    ):
        print("Listening... (Ctrl+C to stop)")
        while True:
            chunk = audio_queue.get()
            buffer = np.concatenate([buffer, chunk])
            if len(buffer) >= CHUNK_SAMPLES:
                # Write the chunk to a temporary WAV file, since
                # model.transcribe() expects file paths. Close the handle
                # first so the write also works on Windows.
                tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
                tmp.close()
                sf.write(tmp.name, buffer[:CHUNK_SAMPLES], SAMPLE_RATE)
                result = model.transcribe([tmp.name])
                # Newer NeMo versions return Hypothesis objects, not strings.
                text = result[0] if isinstance(result[0], str) else result[0].text
                if text.strip():
                    print(f"> {text}")
                os.unlink(tmp.name)
                buffer = buffer[CHUNK_SAMPLES:]

if __name__ == "__main__":
    transcribe_stream()
```
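Note that fixed 3-second windows can clip a word at a chunk boundary; keeping a small overlap between consecutive windows mitigates this. A sketch of the buffering logic in isolation (a hypothetical helper, not part of the model; no audio hardware needed):

```python
def drain_chunks(buffer, chunk_samples, overlap=0):
    """Split full chunks off the front of the buffer; return (chunks, leftover).

    The last `overlap` samples of each chunk are kept at the head of the
    next one, so words straddling a boundary appear in both windows.
    """
    chunks = []
    step = chunk_samples - overlap
    while len(buffer) >= chunk_samples:
        chunks.append(buffer[:chunk_samples])
        buffer = buffer[step:]
    return chunks, buffer

audio = list(range(10))
chunks, rest = drain_chunks(audio, chunk_samples=4, overlap=1)
# chunks → [[0,1,2,3], [3,4,5,6], [6,7,8,9]]; rest → [9]
```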
### Batch Transcription

```python
from pathlib import Path

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.restore_from("fastconformer_multilingual.nemo")
model.freeze()

audio_files = list(Path("audio_dir").glob("*.wav"))
transcriptions = model.transcribe([str(f) for f in audio_files], batch_size=32)

for path, text in zip(audio_files, transcriptions):
    # Newer NeMo versions return Hypothesis objects, not strings.
    t = text if isinstance(text, str) else text.text
    print(f"{path.name}: {t}")
```
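To keep batch results instead of printing them, the loop above can write a TSV with the standard-library `csv` module (the helper and the file name `transcripts.tsv` are illustrative, not part of the model's API):

```python
import csv

def save_transcripts(pairs, out_path):
    # pairs: iterable of (file_name, transcript) tuples.
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["file", "transcript"])
        writer.writerows(pairs)

save_transcripts([("a.wav", "salem alem"), ("b.wav", "hello world")],
                 "transcripts.tsv")
```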
## Model Details

- Architecture: FastConformer-CTC
- Framework: NVIDIA NeMo
- Audio input: 16 kHz mono WAV
## Limitations

- Optimized for clear speech; performance may degrade on noisy audio
- Output contains no punctuation or capitalization
- Language is detected automatically and cannot be specified explicitly
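Because the output is unpunctuated and lowercase, references should be normalized the same way before computing error rates against it. A minimal sketch (the Unicode-aware `\w` class keeps Cyrillic and Kazakh letters intact):

```python
import re

def normalize(text):
    # Lowercase, strip punctuation, and collapse whitespace so references
    # match the model's unpunctuated, uncased output.
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

print(normalize("Hello, World!"))  # hello world
```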
## License

This model is released under CC BY-NC 4.0 (non-commercial use only).