---
library_name: transformers
tags:
- multimodal
- reasoning
- sft
- rl
datasets:
- multimodal-reasoning-lab/Zebra-CoT
- ModalityDance/Omni-Bench
base_model:
- GAIR/Anole-7b-v0.1
pipeline_tag: any-to-any
---

# Omni-R1

[Paper (arXiv)](https://arxiv.org/abs/2601.09536)
[Code (GitHub)](https://github.com/ModalityDance/Omni-R1)
[Omni-Bench (Hugging Face dataset)](https://huggingface.co/datasets/ModalityDance/Omni-Bench)

## Overview

**Omni-R1** is trained with multimodal interleaved supervision. It first applies **PeSFT** for stable functional image generation, then **PeRPO** for RL refinement on unified tasks, enabling interleaved multimodal reasoning trajectories.

## Usage

```python
import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

# 1) Import & load
model_id = "ModalityDance/Omni-R1"  # or "ModalityDance/Omni-R1-Zero"
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# 2) Prepare a single input (prompt contains <image>)
prompt = "What is the smiling man in the image wearing? <image>"
image = Image.open("image.png").convert("RGB")

inputs = processor(
    prompt,
    images=[image],
    padding=False,
    return_for_text_completion=True,
    return_tensors="pt",
).to(model.device)

# --- minimal image token preprocessing: replace the <image> placeholder with discrete image tokens ---
input_ids = inputs["input_ids"].long()
pixel_values = inputs["pixel_values"].to(model.dtype)  # cast to the model's dtype (bfloat16) before VQ encoding

placeholder_id = processor.tokenizer.encode("<image>", add_special_tokens=False)[0]
image_tokens = model.get_image_tokens(pixel_values)  # discrete VQ token ids, shape [1, image_seq_len]

mask = (input_ids == placeholder_id)
input_ids = input_ids.clone()
input_ids[mask] = image_tokens.reshape(-1).to(dtype=torch.long, device=input_ids.device)

# 3) Call the model
outputs = model.generate(
    input_ids=input_ids,
    max_length=4096,
    do_sample=True,
    temperature=0.5,
    top_p=0.9,
    pad_token_id=1,
    multimodal_generation_mode="unrestricted",  # no constraint on output modality: text and image tokens may be interleaved
)

# 4) Get results
text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
print(text)
```
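
The decoded string above contains the prompt (including its expanded image tokens) followed by the model's continuation. A minimal follow-up sketch, using only standard `transformers`/PyTorch slicing (nothing Omni-R1-specific), to keep just the newly generated tokens:

```python
# Slice off the prompt portion so only newly generated tokens are decoded.
generated_ids = outputs[:, input_ids.shape[1]:]
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(generated_text)
```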

For full scripts (batch JSONL inference, interleaved decoding, and vLLM-based evaluation), please refer to the official GitHub repository:
https://github.com/ModalityDance/Omni-R1
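
For quick local experiments before adopting those scripts, here is a minimal sketch of batch inference over a JSONL file, reusing the `model` and `processor` loaded above. The field names (`"prompt"`, `"image"`) and the output layout are illustrative assumptions, not the repository's official format:

```python
import json

import torch
from PIL import Image


def run_one(prompt: str, image_path: str) -> str:
    """Run a single prompt/image pair through the same steps as the snippet above."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        prompt,
        images=[image],
        padding=False,
        return_for_text_completion=True,
        return_tensors="pt",
    ).to(model.device)

    input_ids = inputs["input_ids"].long()
    pixel_values = inputs["pixel_values"].to(model.dtype)

    # Replace the <image> placeholder tokens with discrete image tokens.
    placeholder_id = processor.tokenizer.encode("<image>", add_special_tokens=False)[0]
    image_tokens = model.get_image_tokens(pixel_values)
    input_ids = input_ids.clone()
    input_ids[input_ids == placeholder_id] = image_tokens.reshape(-1).to(
        dtype=torch.long, device=input_ids.device
    )

    outputs = model.generate(
        input_ids=input_ids,
        max_length=4096,
        do_sample=True,
        temperature=0.5,
        top_p=0.9,
        pad_token_id=1,
        multimodal_generation_mode="unrestricted",
    )
    return processor.batch_decode(outputs, skip_special_tokens=False)[0]


# One record per line, e.g. {"prompt": "... <image>", "image": "path/to/img.png"}
with open("inputs.jsonl") as fin, open("outputs.jsonl", "w") as fout:
    for line in fin:
        record = json.loads(line)
        result = run_one(record["prompt"], record["image"])
        fout.write(json.dumps({"prompt": record["prompt"], "output": result}) + "\n")
```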

## License

This project is licensed under the **MIT License**.
It also complies with the licenses of referenced third-party projects and dependencies, including the **Chameleon Research License**.

## Citation

```bibtex
@misc{cheng2026omnir1unifiedgenerativeparadigm,
      title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning},
      author={Dongjie Cheng and Yongqi Li and Zhixin Ma and Hongru Cai and Yupeng Hu and Wenjie Wang and Liqiang Nie and Wenjie Li},
      year={2026},
      eprint={2601.09536},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.09536},
}
```