Multi-Head Low-Rank Attention (MLRA)

Official pretrained weights for Multi-Head Low-Rank Attention (MLRA), a novel attention mechanism that natively supports 4-way tensor parallelism and significantly reduces the key-value (KV) cache size, enabling efficient long-context inference at scale.

Resources

Model Description

Long-context inference in large language models is often bottlenecked by KV cache loading during the decoding stage. While Multi-Head Latent Attention (MLA) reduces the total KV cache size, it suffers from a sharding bottleneck during distributed decoding via Tensor Parallelism (TP).

MLRA addresses this by enabling partitionable latent states for efficient 4-way TP decoding. Experimental results show that MLRA achieves state-of-the-art perplexity and downstream task performance, while delivering a 2.8× decoding speedup over MLA.
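To make the sharding argument concrete, here is a rough back-of-the-envelope sketch of how many cached elements each device has to load per token per layer. It compares a latent state that must be replicated on every TP rank with one that can be split evenly across ranks. The function name `per_device_cache_elems` and the latent width of 512 are illustrative assumptions, not values taken from the MLRA release, and the sketch ignores details such as any positional-encoding component of the cache.

```python
def per_device_cache_elems(latent_dim: int, tp_degree: int, partitionable: bool) -> int:
    """Cached elements one device loads per token per layer.

    Hypothetical helper for illustration only; it is not part of the MLRA code.
    """
    if tp_degree < 1:
        raise ValueError("tp_degree must be at least 1")
    if not partitionable:
        # A latent state that cannot be sharded is replicated on every rank,
        # so each device loads the full latent regardless of the TP degree.
        return latent_dim
    if latent_dim % tp_degree != 0:
        raise ValueError("latent_dim must be divisible by tp_degree")
    # A partitionable latent state is split evenly across the TP ranks.
    return latent_dim // tp_degree


if __name__ == "__main__":
    latent_dim = 512  # assumed latent width, chosen only for the example
    tp_degree = 4     # the 4-way TP setting described above

    replicated = per_device_cache_elems(latent_dim, tp_degree, partitionable=False)
    sharded = per_device_cache_elems(latent_dim, tp_degree, partitionable=True)
    print(f"replicated latent:    {replicated} elements per device")
    print(f"partitionable latent: {sharded} elements per device")
```

The ratio between the two numbers describes only the per-device cache traffic under these assumptions. It is not the same quantity as the end-to-end decoding speedup reported above, which also depends on compute, communication, and kernel efficiency.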

Citation

If you find this work useful, please cite:

@inproceedings{liu2026multi,
  title     = {Multi-Head Low-Rank Attention},
  author    = {Liu, Songtao and Peng, Hongwu and Zhang, Zhiwei and Chen, Zhengyu and Guo, Yue},
  booktitle = {International Conference on Learning Representations},
  year      = {2026}
}