TTS Evaluation Models
This repository contains models for the objective evaluation of text-to-speech (TTS) systems, as presented in the papers ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching; ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching; and OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models.
- Code: k2-fsa/ZipVoice and k2-fsa/OmniVoice
Evaluation Metrics
This repository specifically supports the following evaluation metrics:
- WER: a HuBERT-based ASR model for the LibriSpeech-PC test set, a Paraformer-based ASR model for Chinese test sets, the Whisper-large-v3 model for general English and other-language test sets, and the WhisperD model for English dialogue speech.
- cpWER: the WhisperD model is used to compute the concatenated minimum-permutation word error rate (cpWER) for English dialogue speech.
- SIM-o: a WavLM-based speaker verification model is used to compute the speaker similarity between the prompt and the generated speech.
- cpSIM: a speaker diarization model is used together with the above WavLM-based model to compute the concatenated maximum-permutation speaker similarity (cpSIM).
- UTMOS: the MOS prediction model UTMOS is used to predict naturalness scores.
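To make the cpWER metric above concrete, here is a minimal sketch (not the repository's actual implementation): concatenate each speaker's words, try every assignment of hypothesis speakers to reference speakers, and keep the permutation with the lowest word error rate. The `wer` and `cp_wer` helper names are illustrative, not part of this repository's API.

```python
from itertools import permutations

def wer(ref_words, hyp_words):
    """Word error rate via word-level Levenshtein distance."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / max(len(ref_words), 1)

def cp_wer(ref_speakers, hyp_speakers):
    """Concatenated minimum-permutation WER: concatenate each speaker's
    words, try every speaker assignment, return the lowest WER."""
    ref = [w for spk in ref_speakers for w in spk]
    best = float("inf")
    for perm in permutations(range(len(hyp_speakers))):
        hyp = [w for i in perm for w in hyp_speakers[i]]
        best = min(best, wer(ref, hyp))
    return best
```

For example, if the hypothesis assigns the right words to swapped speaker labels, cpWER is still 0, because some permutation realigns the speakers.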
For more details, please refer to the ZipVoice and OmniVoice repositories.
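As a rough sketch of how the SIM-o metric above is typically scored: the WavLM-based verifier maps the prompt and the generated speech to fixed-size speaker embeddings, and similarity is their cosine. The function below assumes embeddings are already extracted; it is illustrative, not this repository's code.

```python
import numpy as np

def speaker_similarity(emb_prompt, emb_generated):
    """Cosine similarity between two speaker embeddings
    (e.g. from a WavLM-based speaker verification model)."""
    a = np.asarray(emb_prompt, dtype=float)
    b = np.asarray(emb_generated, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```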
Citation
@article{zhu2025zipvoice,
title={ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching},
author={Zhu, Han and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Li, Zhaoqing and Zhuang, Weiji and Lin, Long and Povey, Daniel},
journal={arXiv preprint arXiv:2506.13053},
year={2025}
}
@article{zhu2025zipvoicedialog,
title={ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching},
author={Zhu, Han and Kang, Wei and Guo, Liyong and Yao, Zengwei and Kuang, Fangjun and Zhuang, Weiji and Li, Zhaoqing and Han, Zhifeng and Zhang, Dong and Zhang, Xin and Song, Xingchen and Lin, Long and Povey, Daniel},
journal={arXiv preprint arXiv:2507.09318},
year={2025}
}
@article{zhu2026omnivoice,
title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
journal={arXiv preprint arXiv:2604.00688},
year={2026}
}