Mull-Tokens: Modality-Agnostic Latent Thinking
Paper: [arXiv:2512.10941](https://arxiv.org/abs/2512.10941)
These are the models for the paper "Mull-Tokens: Modality-Agnostic Latent Thinking".
[Paper] | [Project Page] | [Code]
Mull-Tokens are latent tokens that can be pre-trained to hold intermediate information in either the image or text modality, letting the model think its way toward the correct answer. Across four challenging spatial reasoning benchmarks, Mull-Tokens achieve a +3% average improvement over the strongest baseline, and up to +16% on reasoning-heavy splits.
| Model | Description |
|---|---|
| array/Qwen2.5-VL-Mull | Mull-Tokens with multimodal warm-up |
| array/Qwen2.5-VL-MullGRPO | Mull-Tokens + GRPO reinforcement learning |
Example usage with latent thinking tokens:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
# Choose model: "array/Qwen2.5-VL-Mull" or "array/Qwen2.5-VL-MullGRPO"
MODEL_ID = "array/Qwen2.5-VL-Mull"
NUM_LATENTS = 20
# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
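# Note: attn_implementation="flash_attention_2" requires the flash-attn
# package to be installed; if it is not available, "sdpa" is a drop-in
# alternative supported by transformers.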
processor = AutoProcessor.from_pretrained(MODEL_ID)
# Prepare your question
image_path = "path/to/your/image.jpg"
question = "If you stand at the X marked point and turn left, will the table be to your left or right? Please choose between the following answer choices: A. left. B. right. "
question_type = "multiple choice"
QUESTION_TEMPLATE_LATENT = (
    "{Question}\n"
    "Please think about this question deeply. "
    "It's encouraged to include self-reflection or verification in the reasoning process. "
    "Provide your final answer between the <answer> </answer> tags."
)
TYPE_TEMPLATE = {
    "multiple choice": " Please provide only the single option letter (e.g., A, B, C, D, etc.) within the <answer> </answer> tags.",
    "numerical": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags.",
    "OCR": " Please transcribe text from the image/video clearly and provide your text answer within the <answer> </answer> tags.",
    "free-form": " Please provide your text answer within the <answer> </answer> tags.",
    "regression": " Please provide the numerical value (e.g., 42 or 3.14) within the <answer> </answer> tags.",
}
prompt = QUESTION_TEMPLATE_LATENT.format(Question=question) + TYPE_TEMPLATE[question_type]
# Build messages with latent thinking tokens
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    },
    # IMPORTANT: Mull-Tokens require latent thinking tokens in the assistant
    # turn before answer generation
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "<think>" + "<|latent_pad|>" * NUM_LATENTS + "</think>\n",
            }
        ],
    },
]
# Process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
# Strip only the trailing end-of-turn token so the model keeps generating
# inside the pre-filled assistant turn (a bare .replace() would also delete
# the user turn's end token and corrupt the chat format)
if text.endswith("<|im_end|>\n"):
    text = text[: -len("<|im_end|>\n")]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)
# Generate response
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
    )
# Decode output (skip input tokens)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
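Since the prompt asks the model to place its final answer between `<answer> </answer>` tags, you will usually want to parse it out of the decoded response. A minimal sketch (the helper name and the fallback behavior here are our own convention, not part of the released code):

```python
import re

def extract_answer(text: str) -> str:
    """Return the content of the first <answer> ... </answer> span."""
    match = re.search(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    # Fall back to the raw response if the tags are missing
    return match.group(1).strip() if match else text.strip()

print(extract_answer(response))  # e.g. "A" for the multiple-choice example above
```

The same inference snippet works for array/Qwen2.5-VL-MullGRPO; only MODEL_ID needs to change.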
Citation:

```bibtex
@misc{ray2025mulltokensmodalityagnosticlatentthinking,
  title={Mull-Tokens: Modality-Agnostic Latent Thinking},
  author={Arijit Ray and Ahmed Abdelkader and Chengzhi Mao and Bryan A. Plummer and Kate Saenko and Ranjay Krishna and Leonidas Guibas and Wen-Sheng Chu},
  year={2025},
  eprint={2512.10941},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.10941},
}
```
Base model: [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)