---
library_name: transformers
tags:
- multimodal
- reasoning
- sft
- rl
datasets:
- multimodal-reasoning-lab/Zebra-CoT
- ModalityDance/Omni-Bench
base_model:
- GAIR/Anole-7b-v0.1
pipeline_tag: any-to-any
---

# Omni-R1

[Paper (arXiv)](https://arxiv.org/abs/2601.09536)
[Code (GitHub)](https://github.com/ModalityDance/Omni-R1)
[Omni-Bench (Hugging Face dataset)](https://huggingface.co/datasets/ModalityDance/Omni-Bench)

## Overview

**Omni-R1** is trained with multimodal interleaved supervision. It first applies **PeSFT** for stable functional image generation, then **PeRPO** for RL refinement on unified tasks, enabling interleaved multimodal reasoning trajectories.

## Usage

```python
import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

# 1) Import & load
model_id = "ModalityDance/Omni-R1"  # or "ModalityDance/Omni-R1-Zero"
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# 2) Prepare a single input (prompt contains <image>)
prompt = "What is the smiling man in the image wearing? <image>"
image = Image.open("image.png").convert("RGB")

inputs = processor(
    prompt,
    images=[image],
    padding=False,
    return_for_text_completion=True,
    return_tensors="pt",
).to(model.device)

# --- minimal image token preprocessing: replace the <image> placeholder with discrete image tokens ---
input_ids = inputs["input_ids"].long()
pixel_values = inputs["pixel_values"].to(model.dtype)  # cast to the model's dtype (bfloat16) before VQ encoding

placeholder_id = processor.tokenizer.encode("<image>", add_special_tokens=False)[0]
image_tokens = model.get_image_tokens(pixel_values)  # discrete VQ token ids, shape [1, image_seq_len]

mask = (input_ids == placeholder_id)
input_ids = input_ids.clone()
input_ids[mask] = image_tokens.reshape(-1).to(dtype=torch.long, device=input_ids.device)

# 3) Call the model
outputs = model.generate(
    input_ids=input_ids,
    max_length=4096,
    do_sample=True,
    temperature=0.5,
    top_p=0.9,
    pad_token_id=1,
    multimodal_generation_mode="unrestricted",  # no constraint on output modality: text and image tokens may be interleaved
)

# 4) Get results
text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
print(text)
```
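
The decoded string above contains the prompt (including its expanded image tokens) followed by the model's continuation. A minimal follow-up sketch, using only standard `transformers`/PyTorch slicing (nothing Omni-R1-specific), to keep just the newly generated tokens:

```python
# Slice off the prompt portion so only newly generated tokens are decoded.
generated_ids = outputs[:, input_ids.shape[1]:]
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(generated_text)
```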

For full scripts (batch JSONL inference, interleaved decoding, and vLLM-based evaluation), please refer to the official GitHub repository:
https://github.com/ModalityDance/Omni-R1
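
For quick local experiments before adopting those scripts, here is a minimal sketch of batch inference over a JSONL file, reusing the `model` and `processor` loaded above. The field names (`"prompt"`, `"image"`) and the output layout are illustrative assumptions, not the repository's official format:

```python
import json

import torch
from PIL import Image


def run_one(prompt: str, image_path: str) -> str:
    """Run a single prompt/image pair through the same steps as the snippet above."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        prompt,
        images=[image],
        padding=False,
        return_for_text_completion=True,
        return_tensors="pt",
    ).to(model.device)

    input_ids = inputs["input_ids"].long()
    pixel_values = inputs["pixel_values"].to(model.dtype)

    # Replace the <image> placeholder tokens with discrete image tokens.
    placeholder_id = processor.tokenizer.encode("<image>", add_special_tokens=False)[0]
    image_tokens = model.get_image_tokens(pixel_values)
    input_ids = input_ids.clone()
    input_ids[input_ids == placeholder_id] = image_tokens.reshape(-1).to(
        dtype=torch.long, device=input_ids.device
    )

    outputs = model.generate(
        input_ids=input_ids,
        max_length=4096,
        do_sample=True,
        temperature=0.5,
        top_p=0.9,
        pad_token_id=1,
        multimodal_generation_mode="unrestricted",
    )
    return processor.batch_decode(outputs, skip_special_tokens=False)[0]


# One record per line, e.g. {"prompt": "... <image>", "image": "path/to/img.png"}
with open("inputs.jsonl") as fin, open("outputs.jsonl", "w") as fout:
    for line in fin:
        record = json.loads(line)
        result = run_one(record["prompt"], record["image"])
        fout.write(json.dumps({"prompt": record["prompt"], "output": result}) + "\n")
```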

## License

This project is licensed under the **MIT License**.
It also complies with the licenses of referenced third-party projects and dependencies, including the **Chameleon Research License**.

## Citation

```bibtex
@misc{cheng2026omnir1unifiedgenerativeparadigm,
      title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning},
      author={Dongjie Cheng and Yongqi Li and Zhixin Ma and Hongru Cai and Yupeng Hu and Wenjie Wang and Liqiang Nie and Wenjie Li},
      year={2026},
      eprint={2601.09536},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.09536},
}
```