UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing
Abstract
UniReason integrates text-to-image generation and image editing through a dual reasoning paradigm that enhances planning with world knowledge and uses editing for visual refinement, achieving superior performance on reasoning-intensive benchmarks.
Unified multimodal models often struggle with complex synthesis tasks that demand deep reasoning, and typically treat text-to-image generation and image editing as isolated capabilities rather than interconnected reasoning steps. To address this, we propose UniReason, a unified framework that harmonizes these two tasks through a dual reasoning paradigm. We formulate generation as world knowledge-enhanced planning that injects implicit constraints, and leverage editing capabilities for fine-grained visual refinement, correcting visual errors via self-reflection. This approach unifies generation and editing within a shared representation, mirroring the human cognitive process of planning followed by refinement. We support this framework by systematically constructing a large-scale reasoning-centric dataset (~300k samples) covering five major knowledge domains (e.g., cultural commonsense and physics) for planning, alongside an agent-generated corpus for visual self-correction. Extensive experiments demonstrate that UniReason achieves advanced performance on reasoning-intensive benchmarks such as WISE, KrisBench, and UniREditBench, while maintaining superior general synthesis capabilities.
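The abstract describes a plan-then-refine loop: knowledge-enhanced planning, initial generation, self-reflection, and editing-based correction. The sketch below illustrates one way such a pipeline could be orchestrated; it is a minimal sketch only, and all component functions (plan_with_world_knowledge, generate_image, reflect_on_image, edit_image) are hypothetical placeholders rather than the paper's actual API.

```python
# Minimal sketch of the dual reasoning loop described in the abstract.
# All component functions are hypothetical placeholders standing in for the
# unified model's capabilities; they are NOT the paper's actual interface.

def plan_with_world_knowledge(prompt: str) -> str:
    """Expand the user prompt into an explicit plan that spells out
    implicit constraints (e.g., physical or cultural commonsense)."""
    raise NotImplementedError  # would invoke the unified model's reasoning

def generate_image(plan: str):
    """Text-to-image generation conditioned on the reasoned plan."""
    raise NotImplementedError

def reflect_on_image(image, plan: str) -> list[str]:
    """Self-reflection: compare the image against the plan and return
    a list of visual errors (empty if the image is consistent)."""
    raise NotImplementedError

def edit_image(image, correction: str):
    """Fine-grained editing applied to fix one identified error."""
    raise NotImplementedError

def unified_generate(prompt: str, max_refinements: int = 3):
    """Plan -> generate -> reflect -> edit, mirroring the paradigm in the abstract."""
    plan = plan_with_world_knowledge(prompt)      # knowledge-enhanced planning
    image = generate_image(plan)                  # initial synthesis
    for _ in range(max_refinements):
        errors = reflect_on_image(image, plan)    # self-reflection
        if not errors:
            break
        image = edit_image(image, errors[0])      # editing as visual refinement
    return image
```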
Community
The following related papers were recommended by the Semantic Scholar API:
- Unified Thinker: A General Reasoning Modular Core for Image Generation (2026)
- ReViSE: Towards Reason-Informed Video Editing in Unified Models with Self-Reflective Learning (2025)
- Loom: Diffusion-Transformer for Interleaved Generation (2025)
- Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing (2026)
- Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders (2026)
- CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation (2026)
- ThinkGen: Generalized Thinking for Visual Generation (2025)