metadata
title: Omnilingual ASR Media Transcription
emoji: 🌍
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
suggested_hardware: a100-large

Experimental Omnilingual ASR Media Transcription Demo

A media transcription tool with a web interface for multilingual audio and video transcription using Meta's Omnilingual ASR model. Transcription is supported for 1600+ languages.

This application is designed primarily as a web-based media transcription tool with an intuitive frontend interface. While you can interact directly with the API endpoints, the recommended usage is through the web interface at http://localhost:7860.

HuggingFace Space Configuration

This application is configured to run as a HuggingFace Space; however, the public shared Space has resource limitations. To run your own dedicated Space, clone this one with the following recommended specifications:

  • Hardware: A100 GPU (80GB) - Required for loading the 7B parameter Omnilingual ASR model
    • Alternative: Machines with lower GPU memory can use smaller models by setting the MODEL_NAME environment variable in HuggingFace Space settings, e.g. omniASR_LLM_300M (requires ~8GB GPU memory)
  • Persistent Storage: Medium (150GB), enabled for model caching and faster loading times
  • Docker Runtime: Uses custom Dockerfile for fairseq2 and PyTorch integration
  • Port: 7860 (HuggingFace standard)

The A100 machine is specifically chosen to accommodate the large Omnilingual ASR model (~14GB) in GPU memory, ensuring fast inference and real-time transcription capabilities.

Running Outside HuggingFace

While this application is designed for HuggingFace Spaces, it can run on any machine with Docker and GPU support, with hardware requirements similar to the HuggingFace machines described above.
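Before building the image, it is worth confirming that Docker can see your GPU. A minimal check, assuming the NVIDIA Container Toolkit is installed (the CUDA image tag below is only an example; any available CUDA base image works):

# Confirm the host driver sees the GPU
nvidia-smi

# Confirm Docker passes the GPU through to containers (example CUDA base image tag)
docker run --rm --gpus all nvidia/cuda:12.2.2-base-ubuntu22.04 nvidia-smi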

Getting Started

Running with Docker

  1. Build and run the container:
docker build -t omnilingual-asr-transcriptions .
docker run --rm -p 7860:7860 --gpus all \
  -e MODEL_NAME=omniASR_LLM_300M \
  -v {your cache directory}:/home/user/app/models \
  omnilingual-asr-transcriptions

The media transcription app will be available at http://localhost:7860

Docker Run Parameters Explained:

  • --rm: Automatically remove the container when it exits
  • -p 7860:7860: Map host port 7860 to container port 7860
  • --gpus all: Enable GPU access for CUDA acceleration
  • -e MODEL_NAME=omniASR_LLM_300M: Set the Omnilingual ASR model variant to use
    • Options: omniASR_LLM_1B (default, 1B parameters), omniASR_LLM_300M (300M parameters, faster)
  • -e ENABLE_TOXIC_FILTERING=true: Enable filtering of toxic words from transcription results (optional)
  • -v {your cache directory}:/home/user/app/models: Mount local models directory
    • Purpose: Persist downloaded models between container runs (14GB+ cache)
    • Benefits: Avoid re-downloading models on each container restart
    • Path: Adjust {your cache directory} to your local models directory
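
Putting the options together, a complete invocation that keeps the default 1B model, enables toxic-word filtering, and persists the model cache might look like the following sketch (the ~/omniasr-models path is only an example; substitute your own cache directory):

docker run --rm -p 7860:7860 --gpus all \
  -e MODEL_NAME=omniASR_LLM_1B \
  -e ENABLE_TOXIC_FILTERING=true \
  -v ~/omniasr-models:/home/user/app/models \
  omnilingual-asr-transcriptions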

Available API Endpoints

Core Transcription Routes

  • GET /health - Comprehensive health check with GPU/CUDA status, FFmpeg availability, and transcription status
  • GET /status - Get current transcription status (busy/idle, progress, operation type)
  • POST /transcribe - Audio transcription with automatic chunking for files of any length

Additional Routes

  • POST /combine-video-subtitles - Combine video files with subtitle tracks
  • GET / - Serve the web application frontend
  • GET /assets/<filename> - Serve frontend static assets
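
The API Examples section below covers the core routes. For /combine-video-subtitles, the multipart field names are defined in server/transcriptions_blueprint.py; the names used here are placeholders only:

# Hypothetical call - "video" and "subtitles" are placeholder field names;
# check transcriptions_blueprint.py for the actual ones
curl -X POST http://localhost:7860/combine-video-subtitles \
  -F "video=@path/to/your/video.mp4" \
  -F "subtitles=@path/to/your/subtitles.srt"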

Usage Recommendations

Primary Usage: Access the web interface at http://localhost:7860 for an intuitive media transcription experience with drag-and-drop file upload, real-time progress tracking, and downloadable results.

API Usage: For programmatic access or integration with other tools, you can call the API endpoints directly as shown in the examples below.

Environment Variables

You are free to change these if you clone the Space and set them in the HuggingFace Space settings, or in your own server environment. In the public shared demo they are controlled for an optimal experience.

Server Environment Variables

  • API_LOG_LEVEL - Set logging level (DEBUG, INFO, WARNING, ERROR)
  • MODEL_NAME - Omnilingual ASR model to use (default: omniASR_LLM_1B)
  • USE_CHUNKING - Enable/disable audio chunking (default: true)
  • ENABLE_TOXIC_FILTERING - Enable toxic word filtering from transcription results (default: false)
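
As an example, the container can be run with verbose logging, chunking disabled, and the smaller 300M model (only a sketch; combine with the volume mount shown earlier if you want model caching):

docker run --rm -p 7860:7860 --gpus all \
  -e API_LOG_LEVEL=DEBUG \
  -e USE_CHUNKING=false \
  -e MODEL_NAME=omniASR_LLM_300M \
  omnilingual-asr-transcriptions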

Frontend Environment Variables

  • VITE_ALLOW_ALL_LANGUAGES - Set to true to show all 1,400+ supported languages in the language selector, or false to show only languages with error rates below 10%, as used for the public demo (default: false)
  • VITE_ENABLE_ANALYTICS - Set to true to enable Google Analytics tracking, or false to disable analytics (default: false)
  • VITE_REACT_APP_GOOGLE_ANALYTICS_ID - Your Google Analytics measurement ID (e.g., G-XXXXXXXXXX) for tracking usage when analytics are enabled
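
These are Vite variables, so they must be available when the frontend is built. Assuming a standard Vite setup, one way to set them is a frontend/.env file read at build time:

# frontend/.env (assumes standard Vite .env handling)
VITE_ALLOW_ALL_LANGUAGES=true
VITE_ENABLE_ANALYTICS=false
# Only needed if analytics are enabled; G-XXXXXXXXXX is a placeholder
VITE_REACT_APP_GOOGLE_ANALYTICS_ID=G-XXXXXXXXXX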

API Examples (For Developers)

For programmatic access or integration with other tools, you can call the API endpoints directly:

# Health check
curl http://localhost:7860/health

# Get transcription status
curl http://localhost:7860/status

# Transcribe audio file
curl -X POST http://localhost:7860/transcribe \
  -F "audio=@path/to/your/audio.wav"

Project Structure

omnilingual-asr-transcriptions/
├── Dockerfile                      # Multi-stage build with frontend + backend
├── README.md
├── requirements.txt               # Python dependencies
├── deploy.sh                      # Deployment script
├── run_docker.sh                  # Local Docker run script
├── frontend/                      # Web interface (React/Vite)
│   ├── package.json
│   ├── src/
│   └── dist/                      # Built frontend (served by Flask)
├── models/                       # Model files (automatically downloaded)
│   ├── ctc_alignment_mling_uroman_model.pt
│   ├── ctc_alignment_mling_uroman_model_dict.txt
│   └── [Additional model files downloaded at runtime]
└── server/                       # Flask API backend
    ├── server.py                 # Main Flask application
    ├── transcriptions_blueprint.py  # API routes
    ├── audio_transcription.py    # Core transcription logic
    ├── media_transcription_processor.py  # Media processing
    ├── transcription_status.py   # Status tracking
    ├── env_vars.py              # Environment configuration
    ├── run.sh                   # Production startup script
    ├── download_models.sh       # Model download script
    ├── wheels/                  # Pre-built Omnilingual ASR wheel packages
    └── inference/               # Model inference components
        ├── mms_model_pipeline.py    # Omnilingual ASR model wrapper
        ├── audio_chunker.py         # Audio chunking logic
        └── audio_sentence_alignment.py  # Forced alignment

Key Features

  • Simplified Architecture: Single Docker container with built-in model management
  • Auto Model Download: Models are downloaded automatically during container startup
  • Omnilingual ASR Integration: Uses the latest Omnilingual ASR library with 1600+ language support
  • GPU Acceleration: CUDA-enabled inference with automatic device detection
  • Web Interface: Modern React frontend for easy testing and usage
  • Smart Transcription: Single endpoint handles files of any length with automatic chunking
  • Intelligent Processing: Automatic audio format detection and conversion

Note: Model files are large (14GB+ total) and are downloaded automatically when the container starts. The first run may take longer due to model downloads.