# FlashSR: One-step Versatile Audio Super-Resolution
This is a convenience redistribution, not the original repository. All credit for the model architecture, research, training, and weights belongs to the original authors. This repository is not affiliated with or endorsed by them.
|  |  |
|---|---|
| **Authors** | Jaekwon Im and Juhan Nam (KAIST) |
| **Paper** | [FlashSR: One-step Versatile Audio Super-resolution via Diffusion Distillation](https://arxiv.org/abs/2501.10807) |
| **Demo** | [jakeoneijk.github.io/flashsr-demo](https://jakeoneijk.github.io/flashsr-demo) |
| **Original code** | jakeoneijk/FlashSR_Inference |
| **Original weights** | jakeoneijk/FlashSR_weights |
**Note:** Other, unrelated projects are also named "FlashSR" (for other super-resolution tasks).
## About this repository
The original code and weights are split across GitHub and Hugging Face and have dependencies (torchcodec, FFmpeg) that can be difficult to set up. This repository bundles everything into one place with a standalone inference script that only needs PyTorch, soundfile, and scipy.
**From the original authors:** The model code (`FlashSR/`, `TorchJaekwon/`) and the pretrained weights (`weights/`) come from the original repositories linked above.

**New in this redistribution:** The inference script (`enhance.py`), `setup.py`, and this README were written independently. The code in this repository (excluding the model weights) is released under the Apache License 2.0.
## What FlashSR does
FlashSR restores high-frequency audio components in a single forward pass. It takes audio at any sample rate, resamples it to 48 kHz, and reconstructs the missing high-frequency detail. This is useful for:
- Upscaling low-sample-rate recordings to full bandwidth
- Enhancing audio that has been through lossy processing (codecs, vocoders, etc.)
- Post-processing TTS or voice conversion outputs
The model handles speech, music, and sound effects.
## Repository structure

```
weights/
  student_ldm.pth   (986 MB)  - distilled latent diffusion model
  sr_vocoder.pth    (599 MB)  - super-resolution vocoder
  vae.pth           (1.6 GB)  - variational autoencoder
FlashSR/                      - model code (from the original repo)
TorchJaekwon/                 - utility library (from the original repo)
Assets/ExampleInput/          - example audio files (speech, music, sound effects)
enhance.py                    - standalone inference script
setup.py                      - package installer
```
## Installation

**Requirements:** Python 3.10+, PyTorch 2.0+ with CUDA, and ~6 GB of GPU memory.
```bash
# Clone this repository
git clone https://huggingface.co/laion/FlashSR_One-step_Versatile_Audio_Super-resolution
cd FlashSR_One-step_Versatile_Audio_Super-resolution

# Install
pip install -e .
pip install einops librosa soundfile tqdm scipy
```
### Verify

```bash
python enhance.py --input Assets/ExampleInput/speech.wav --output output.wav
```
**Tip:** If your conda environment contains conflicting cuDNN libraries, clear `LD_LIBRARY_PATH` before running: `LD_LIBRARY_PATH="" python enhance.py ...`
## Usage

### Command line
```bash
# Single file
python enhance.py --input my_audio.wav --output enhanced.wav

# Entire directory
python enhance.py --input ./audio_folder/ --output ./enhanced_folder/

# With lowpass filter (can help when the input was not originally bandwidth-limited)
python enhance.py --input my_audio.wav --output enhanced.wav --lowpass

# Specify GPU
CUDA_VISIBLE_DEVICES=0 python enhance.py --input my_audio.wav --output enhanced.wav
```
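For intuition, the `--lowpass` option band-limits the input before enhancement. The sketch below shows what such a filter could look like with scipy; it is illustrative only, and `apply_lowpass`, the cutoff, and the filter order are assumptions, not the exact design used by `enhance.py`:

```python
# Illustrative only: a possible pre-enhancement lowpass, NOT the exact
# filter used by FlashSR. The cutoff and order here are assumptions.
from scipy.signal import butter, sosfiltfilt

def apply_lowpass(samples, rate=48000, cutoff=12000, order=8):
    sos = butter(order, cutoff, btype="low", fs=rate, output="sos")
    return sosfiltfilt(sos, samples)  # zero-phase filtering, no group delay
```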
### Python API
```python
import torch
import soundfile as sf
from pathlib import Path

from FlashSR.FlashSR import FlashSR

WEIGHTS_DIR = Path("./weights")
WINDOW_SIZE = 245760  # 5.12 seconds at 48 kHz

# Initialize the model
model = FlashSR(
    student_ldm_ckpt_path=str(WEIGHTS_DIR / "student_ldm.pth"),
    sr_vocoder_ckpt_path=str(WEIGHTS_DIR / "sr_vocoder.pth"),
    autoencoder_ckpt_path=str(WEIGHTS_DIR / "vae.pth"),
)
model = model.to("cuda").eval()

# Load and prepare audio (must be mono, 48 kHz)
samples, rate = sf.read("input.wav", dtype="float32")
if samples.ndim > 1:
    samples = samples.mean(axis=1)  # downmix to mono

# The model accepts exactly 245,760 samples per call.
# Pad short audio; for longer audio, see enhance.py for chunk-based processing.
waveform = torch.from_numpy(samples).unsqueeze(0)  # shape: (1, num_samples)
n = waveform.shape[-1]
if n < WINDOW_SIZE:
    waveform = torch.nn.functional.pad(waveform, (0, WINDOW_SIZE - n))
waveform = waveform.to("cuda")

with torch.no_grad():
    result = model(waveform, lowpass_input=False)

# Trim the padding and save
result = result[:, :n].squeeze(0).cpu().numpy()
sf.write("output.wav", result, 48000)
```
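The snippet above assumes the input file is already at 48 kHz. If it is not, you can resample before the padding step; here is a minimal sketch using `scipy.signal.resample_poly` (scipy is already a dependency of this repo's inference script, but `to_48k` is a hypothetical helper, not part of this repository):

```python
# Hypothetical helper: resample a mono float32 signal to 48 kHz.
# enhance.py handles resampling for you; this shows one way to do it yourself.
from math import gcd
from scipy.signal import resample_poly

def to_48k(samples, rate, target=48000):
    if rate == target:
        return samples
    g = gcd(int(rate), target)
    return resample_poly(samples, target // g, rate // g)

samples = to_48k(samples, rate)  # call before building `waveform`
```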
## Notes
- **Fixed input length:** The model processes exactly 245,760 samples (5.12 seconds at 48 kHz) per call. The `enhance.py` script handles longer audio automatically using overlapping chunks with crossfading (see the sketch below).
- **Sample rate:** Input audio at any sample rate is resampled to 48 kHz. Output is always 48 kHz.
- **Channels:** Mono and stereo are both supported. Stereo files are processed channel by channel.
- **`lowpass_input` flag:** Set to `True` if your input was not originally bandwidth-limited. This applies a lowpass filter before enhancement to better match the model's training distribution.
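For reference, overlap-crossfade chunking looks roughly like the following. This is a minimal sketch under assumed parameters (50% overlap, a Hann crossfade window); the actual hop size, window, and edge handling in `enhance.py` may differ:

```python
import numpy as np
import torch

WINDOW = 245760    # fixed model input length (5.12 s at 48 kHz)
HOP = WINDOW // 2  # assumed 50% overlap; enhance.py may use a different hop

def enhance_long(model, audio, device="cuda"):
    """Enhance a 1-D float32 array (48 kHz) of arbitrary length."""
    n = len(audio)
    out = np.zeros(n + WINDOW, dtype=np.float32)     # padded accumulator
    weight = np.zeros(n + WINDOW, dtype=np.float32)  # crossfade normalizer
    fade = np.hanning(WINDOW).astype(np.float32)
    for start in range(0, n, HOP):
        chunk = audio[start:start + WINDOW]
        chunk = np.pad(chunk, (0, WINDOW - len(chunk)))  # zero-pad last chunk
        x = torch.from_numpy(chunk).unsqueeze(0).to(device)
        with torch.no_grad():
            y = model(x, lowpass_input=False)
        y = y.squeeze(0).cpu().numpy()[:WINDOW]
        out[start:start + WINDOW] += y * fade
        weight[start:start + WINDOW] += fade
    return out[:n] / np.maximum(weight[:n], 1e-8)  # normalize overlaps
```

Stereo input can then be handled by running `enhance_long` once per channel and stacking the results, matching the channel-by-channel behavior noted above.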
## License

The inference script (`enhance.py`), `setup.py`, and this README are released under the Apache License 2.0.

The model weights and the original model code (`FlashSR/`, `TorchJaekwon/`) come from the original authors' repositories linked above. Please refer to those repositories for their licensing terms.
## Citation

If you use FlashSR in your work, please cite the original paper:

```bibtex
@article{im2025flashsr,
  title   = {FlashSR: One-step Versatile Audio Super-resolution via Diffusion Distillation},
  author  = {Im, Jaekwon and Nam, Juhan},
  journal = {arXiv preprint arXiv:2501.10807},
  year    = {2025}
}
```