Multimodal
iVideoGPT: Interactive VideoGPTs are Scalable World Models
Paper
• 2405.15223
• Published
• 17
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision
Models
Paper
• 2405.15574
• Published
• 55
An Introduction to Vision-Language Modeling
Paper
• 2405.17247
• Published
• 90
Matryoshka Multimodal Models
Paper
• 2405.17430
• Published
• 34
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
Paper
• 2405.18669
• Published
• 12
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Paper
• 2405.20340
• Published
• 20
Parrot: Multilingual Visual Instruction Tuning
Paper
• 2406.02539
• Published
• 36
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with
LLM
Paper
• 2406.02884
• Published
• 18
What If We Recaption Billions of Web Images with LLaMA-3?
Paper
• 2406.08478
• Published
• 43
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio
Understanding in Video-LLMs
Paper
• 2406.07476
• Published
• 36
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation
in Videos
Paper
• 2406.08407
• Published
• 28
AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and
Video Generation
Paper
• 2406.07686
• Published
• 17
OpenVLA: An Open-Source Vision-Language-Action Model
Paper
• 2406.09246
• Published
• 43
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal
Language Models
Paper
• 2406.09403
• Published
• 23
Explore the Limits of Omni-modal Pretraining at Scale
Paper
• 2406.09412
• Published
• 11
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Paper
• 2406.09406
• Published
• 15
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via
Chart-to-Code Generation
Paper
• 2406.09961
• Published
• 55
Needle In A Multimodal Haystack
Paper
• 2406.07230
• Published
• 54
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images
Interleaved with Text
Paper
• 2406.08418
• Published
• 32
mDPO: Conditional Preference Optimization for Multimodal Large Language
Models
Paper
• 2406.11839
• Published
• 40
LLaNA: Large Language and NeRF Assistant
Paper
• 2406.11840
• Published
• 18
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
Paper
• 2406.14544
• Published
• 35
PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal
Documents
Paper
• 2406.13923
• Published
• 25
Improving Visual Commonsense in Language Models via Multiple Image
Generation
Paper
• 2406.13621
• Published
• 13
Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical
Report
Paper
• 2406.11403
• Published
• 4
Towards Fast Multilingual LLM Inference: Speculative Decoding and
Specialized Drafters
Paper
• 2406.16758
• Published
• 20
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper
• 2406.16860
• Published
• 63
Long Context Transfer from Language to Vision
Paper
• 2406.16852
• Published
• 33
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Paper
• 2406.15704
• Published
• 6
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into
Multimodal LLMs at Scale
Paper
• 2406.19280
• Published
• 63
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity
Paper
• 2406.17720
• Published
• 8
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables
Open-World Instruction Following Agents
Paper
• 2407.00114
• Published
• 13
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper
• 2407.02477
• Published
• 24
InternLM-XComposer-2.5: A Versatile Large Vision Language Model
Supporting Long-Contextual Input and Output
Paper
• 2407.03320
• Published
• 94
TokenPacker: Efficient Visual Projector for Multimodal LLM
Paper
• 2407.02392
• Published
• 23
Unveiling Encoder-Free Vision-Language Models
Paper
• 2406.11832
• Published
• 54
RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language
Models
Paper
• 2407.05131
• Published
• 26
ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild
Paper
• 2407.04172
• Published
• 25
Stark: Social Long-Term Multi-Modal Conversation with Persona
Commonsense Knowledge
Paper
• 2407.03958
• Published
• 21
HEMM: Holistic Evaluation of Multimodal Foundation Models
Paper
• 2407.03418
• Published
• 12
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for
Interleaved Image-Text Generation
Paper
• 2407.06135
• Published
• 22
Vision language models are blind
Paper
• 2407.06581
• Published
• 85
Video-STaR: Self-Training Enables Video Instruction Tuning with Any
Supervision
Paper
• 2407.06189
• Published
• 27
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large
Multimodal Models
Paper
• 2407.07895
• Published
• 42
PaliGemma: A versatile 3B VLM for transfer
Paper
• 2407.07726
• Published
• 72
FIRE: A Dataset for Feedback Integration and Refinement Evaluation of
Multimodal Models
Paper
• 2407.11522
• Published
• 9
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality
Models
Paper
• 2407.11691
• Published
• 16
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
Paper
• 2407.11895
• Published
• 7
Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model
Co-development
Paper
• 2407.11784
• Published
• 4
E5-V: Universal Embeddings with Multimodal Large Language Models
Paper
• 2407.12580
• Published
• 42
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
Paper
• 2407.12772
• Published
• 35
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Paper
• 2407.12679
• Published
• 8
EVLM: An Efficient Vision-Language Model for Visual Understanding
Paper
• 2407.14177
• Published
• 45
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language
Models
Paper
• 2407.15841
• Published
• 40
VideoGameBunny: Towards vision assistants for video games
Paper
• 2407.15295
• Published
• 23
MIBench: Evaluating Multimodal Large Language Models over Multiple
Images
Paper
• 2407.15272
• Published
• 10
Visual Haystacks: Answering Harder Questions About Sets of Images
Paper
• 2407.13766
• Published
• 2
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal
Large Language Model
Paper
• 2407.16198
• Published
• 13
VILA²: VILA Augmented VILA
Paper
• 2407.17453
• Published
• 41
Efficient Inference of Vision Instruction-Following Models with Elastic
Cache
Paper
• 2407.18121
• Published
• 17
Wolf: Captioning Everything with a World Summarization Framework
Paper
• 2407.18908
• Published
• 32
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware
Experts
Paper
• 2407.21770
• Published
• 22
OmniParser for Pure Vision Based GUI Agent
Paper
• 2408.00203
• Published
• 24
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper
• 2408.01800
• Published
• 92
Language Model Can Listen While Speaking
Paper
• 2408.02622
• Published
• 40
ExoViP: Step-by-step Verification and Exploration with Exoskeleton
Modules for Compositional Visual Reasoning
Paper
• 2408.02210
• Published
• 9
MMIU: Multimodal Multi-image Understanding for Evaluating Large
Vision-Language Models
Paper
• 2408.02718
• Published
• 62
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Paper
• 2408.05211
• Published
• 50
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation
Agents
Paper
• 2408.06327
• Published
• 17
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper
• 2408.08872
• Published
• 101
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper
• 2408.10188
• Published
• 52
Segment Anything with Multiple Modalities
Paper
• 2408.09085
• Published
• 22
Transfusion: Predict the Next Token and Diffuse Images with One
Multi-Modal Model
Paper
• 2408.11039
• Published
• 63
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
Paper
• 2408.11817
• Published
• 9
Show-o: One Single Transformer to Unify Multimodal Understanding and
Generation
Paper
• 2408.12528
• Published
• 51
SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for
Large-scale Vision-Language Models
Paper
• 2408.12114
• Published
• 15
SEA: Supervised Embedding Alignment for Token-Level Visual-Textual
Integration in MLLMs
Paper
• 2408.11813
• Published
• 12
Building and better understanding vision-language models: insights and
future directions
Paper
• 2408.12637
• Published
• 133
CogVLM2: Visual Language Models for Image and Video Understanding
Paper
• 2408.16500
• Published
• 57
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Paper
• 2408.15881
• Published
• 21
Law of Vision Representation in MLLMs
Paper
• 2408.16357
• Published
• 95
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal
Models in Multi-View Urban Scenarios
Paper
• 2408.17267
• Published
• 23
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language
Models for Trait Discovery from Biological Images
Paper
• 2408.16176
• Published
• 8
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Paper
• 2408.16725
• Published
• 53
VideoLLaMB: Long-context Video Understanding with Recurrent Memory
Bridges
Paper
• 2409.01071
• Published
• 27
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via
Hybrid Architecture
Paper
• 2409.02889
• Published
• 54
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding
Benchmark
Paper
• 2409.02813
• Published
• 33
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page
Document Understanding
Paper
• 2409.03420
• Published
• 26
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Paper
• 2409.05840
• Published
• 49
POINTS: Improving Your Vision-language Model with Affordable Strategies
Paper
• 2409.04828
• Published
• 24
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Paper
• 2409.06666
• Published
• 60
Guiding Vision-Language Model Selection for Visual Question-Answering
Across Tasks, Domains, and Knowledge Types
Paper
• 2409.09269
• Published
• 8
NVLM: Open Frontier-Class Multimodal LLMs
Paper
• 2409.11402
• Published
• 74
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at
Any Resolution
Paper
• 2409.12191
• Published
• 78
Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning
Paper
• 2409.12001
• Published
• 5
MonoFormer: One Transformer for Both Diffusion and Autoregression
Paper
• 2409.16280
• Published
• 18
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art
Multimodal Models
Paper
• 2409.17146
• Published
• 121
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid
Emotions
Paper
• 2409.18042
• Published
• 39
Emu3: Next-Token Prediction is All You Need
Paper
• 2409.18869
• Published
• 97
MIO: A Foundation Model on Multimodal Tokens
Paper
• 2409.17692
• Published
• 53
UniMuMo: Unified Text, Music and Motion Generation
Paper
• 2410.04534
• Published
• 19
NL-Eye: Abductive NLI for Images
Paper
• 2410.02613
• Published
• 23
Paper
• 2410.07073
• Published
• 69
Personalized Visual Instruction Tuning
Paper
• 2410.07113
• Published
• 70
Deciphering Cross-Modal Alignment in Large Vision-Language Models with
Modality Integration Rate
Paper
• 2410.07167
• Published
• 39
Aria: An Open Multimodal Native Mixture-of-Experts Model
Paper
• 2410.05993
• Published
• 111
Multimodal Situational Safety
Paper
• 2410.06172
• Published
• 12
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal
Foundation Models
Paper
• 2410.02740
• Published
• 54
Video Instruction Tuning With Synthetic Data
Paper
• 2410.02713
• Published
• 41
LLaVA-Critic: Learning to Evaluate Multimodal Models
Paper
• 2410.02712
• Published
• 37
Distilling an End-to-End Voice Assistant Without Instruction Training
Data
Paper
• 2410.02678
• Published
• 23
MLLM as Retriever: Interactively Learning Multimodal Retrieval for
Embodied Agents
Paper
• 2410.03450
• Published
• 36
Baichuan-Omni Technical Report
Paper
• 2410.08565
• Published
• 87
From Generalist to Specialist: Adapting Vision Language Models via
Task-Specific Visual Instruction Tuning
Paper
• 2410.06456
• Published
• 37
LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large
Multimodal Models
Paper
• 2410.09732
• Published
• 54
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large
Vision-Language Models
Paper
• 2410.10139
• Published
• 51
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
Paper
• 2410.10563
• Published
• 37
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality
Documents
Paper
• 2410.10594
• Published
• 29
TemporalBench: Benchmarking Fine-grained Temporal Understanding for
Multimodal Video Models
Paper
• 2410.10818
• Published
• 16
TVBench: Redesigning Video-Language Evaluation
Paper
• 2410.07752
• Published
• 6
The Curse of Multi-Modalities: Evaluating Hallucinations of Large
Multimodal Models across Language, Visual, and Audio
Paper
• 2410.12787
• Published
• 30
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language
Models
Paper
• 2410.13085
• Published
• 24
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding
and Generation
Paper
• 2410.13848
• Published
• 35
MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures
Paper
• 2410.13754
• Published
• 75
WorldCuisines: A Massive-Scale Benchmark for Multilingual and
Multicultural Visual Question Answering on Global Cuisines
Paper
• 2410.12705
• Published
• 32
Remember, Retrieve and Generate: Understanding Infinite Visual Concepts
as Your Personalized Assistant
Paper
• 2410.13360
• Published
• 9
γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large
Language Models
Paper
• 2410.13859
• Published
• 8
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial
Samples
Paper
• 2410.14669
• Published
• 39
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex
Capabilities
Paper
• 2410.11190
• Published
• 22
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Paper
• 2410.13861
• Published
• 56
Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
Paper
• 2410.16153
• Published
• 44
Improve Vision Language Model Chain-of-thought Reasoning
Paper
• 2410.16198
• Published
• 26
Mitigating Object Hallucination via Concentric Causal Attention
Paper
• 2410.15926
• Published
• 18
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large
Vision-Language Models
Paper
• 2410.17637
• Published
• 35
Can Knowledge Editing Really Correct Hallucinations?
Paper
• 2410.16251
• Published
• 55
Unbounded: A Generative Infinite Game of Character Life Simulation
Paper
• 2410.18975
• Published
• 37
WAFFLE: Multi-Modal Model for Automated Front-End Development
Paper
• 2410.18362
• Published
• 13
ROCKET-1: Master Open-World Interaction with Visual-Temporal Context
Prompting
Paper
• 2410.17856
• Published
• 52
Infinity-MM: Scaling Multimodal Performance with Large-Scale and
High-Quality Instruction Data
Paper
• 2410.18558
• Published
• 18
Paper
• 2410.21276
• Published
• 87
Vision Search Assistant: Empower Vision-Language Models as Multimodal
Search Engines
Paper
• 2410.21220
• Published
• 11
VideoWebArena: Evaluating Long Context Multimodal Agents with Video
Understanding Web Tasks
Paper
• 2410.19100
• Published
• 6
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper
• 2410.23218
• Published
• 49
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal
Foundation Models
Paper
• 2410.23266
• Published
• 20
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical
Reasoning Robustness of Vision Language Models
Paper
• 2411.00836
• Published
• 15
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
Paper
• 2411.02327
• Published
• 11
Mixture-of-Transformers: A Sparse and Scalable Architecture for
Multi-Modal Foundation Models
Paper
• 2411.04996
• Published
• 50
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in
Videos
Paper
• 2411.04923
• Published
• 23
Analyzing The Language of Visual Tokens
Paper
• 2411.05001
• Published
• 24
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding
And A Retrieval-Aware Tuning Framework
Paper
• 2411.06176
• Published
• 45
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper
• 2411.10440
• Published
• 129
Generative World Explorer
Paper
• 2411.11844
• Published
• 77
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large
Language Models on Mobile Devices
Paper
• 2411.10640
• Published
• 46
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of
Experts
Paper
• 2411.10669
• Published
• 10
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal
Models in Video Analysis through User Simulation
Paper
• 2411.13281
• Published
• 20
Enhancing the Reasoning Ability of Multimodal Large Language Models via
Mixed Preference Optimization
Paper
• 2411.10442
• Published
• 87
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large
Language Models
Paper
• 2411.14432
• Published
• 25
Large Multi-modal Models Can Interpret Features in Large Multi-modal
Models
Paper
• 2411.14982
• Published
• 19
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A
Comprehensive Multimodal Dataset Towards General Medical AI
Paper
• 2411.14522
• Published
• 38
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper
• 2411.17465
• Published
• 89
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
Paper
• 2411.15296
• Published
• 21
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for
Training-Free Acceleration
Paper
• 2411.17686
• Published
• 19
VLRewardBench: A Challenging Benchmark for Vision-Language Generative
Reward Models
Paper
• 2411.17451
• Published
• 11
FINECAPTION: Compositional Image Captioning Focusing on Wherever You
Want at Any Granularity
Paper
• 2411.15411
• Published
• 8
ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
Paper
• 2411.18363
• Published
• 10
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video
Comprehension with Video-Text Duet Interaction Format
Paper
• 2411.17991
• Published
• 5
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Paper
• 2411.18203
• Published
• 40
ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting
Paper
• 2411.17176
• Published
• 24
On Domain-Specific Post-Training for Multimodal Large Language Models
Paper
• 2411.19930
• Published
• 31
X-Prompt: Towards Universal In-Context Image Generation in
Auto-Regressive Vision Language Foundation Models
Paper
• 2412.01824
• Published
• 64
PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
Paper
• 2412.01800
• Published
• 6
OmniCreator: Self-Supervised Unified Generation with Universal Editing
Paper
• 2412.02114
• Published
• 14
PaliGemma 2: A Family of Versatile VLMs for Transfer
Paper
• 2412.03555
• Published
• 133
VideoICL: Confidence-based Iterative In-context Learning for
Out-of-Distribution Video Understanding
Paper
• 2412.02186
• Published
• 23
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual
Prompt Instruction Tuning
Paper
• 2412.03565
• Published
• 10
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Paper
• 2412.04467
• Published
• 117
Florence-VL: Enhancing Vision-Language Models with Generative Vision
Encoder and Depth-Breadth Fusion
Paper
• 2412.04424
• Published
• 62
NVILA: Efficient Frontier Visual Language Models
Paper
• 2412.04468
• Published
• 60
Personalized Multimodal Large Language Models: A Survey
Paper
• 2412.02142
• Published
• 13
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
Paper
• 2412.01169
• Published
• 13
p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
Paper
• 2412.04449
• Published
• 7
CompCap: Improving Multimodal Large Language Models with Composite
Captions
Paper
• 2412.05243
• Published
• 20
Maya: An Instruction Finetuned Multilingual Multimodal Model
Paper
• 2412.07112
• Published
• 28
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
Paper
• 2412.06673
• Published
• 11
POINTS1.5: Building a Vision-Language Model towards Real World
Applications
Paper
• 2412.08443
• Published
• 38
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for
Long-term Streaming Video and Audio Interactions
Paper
• 2412.09596
• Published
• 97
Multimodal Latent Language Modeling with Next-Token Diffusion
Paper
• 2412.08635
• Published
• 49
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
Paper
• 2412.09501
• Published
• 48
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper
• 2412.10360
• Published
• 147
SynerGen-VL: Towards Synergistic Image Understanding and Generation with
Vision Experts and Token Folding
Paper
• 2412.09604
• Published
• 38
VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal
Retrieval-Augmented Generation
Paper
• 2412.10704
• Published
• 16
Thinking in Space: How Multimodal Large Language Models See, Remember,
and Recall Spaces
Paper
• 2412.14171
• Published
• 24
Progressive Multimodal Reasoning via Active Retrieval
Paper
• 2412.14835
• Published
• 73
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
Paper
• 2412.14475
• Published
• 57
Diving into Self-Evolving Training for Multimodal Reasoning
Paper
• 2412.17451
• Published
• 42
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via
Collective Monte Carlo Tree Search
Paper
• 2412.18319
• Published
• 39
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Paper
• 2412.18619
• Published
• 60
Task Preference Optimization: Improving Multimodal Large Language Models
with Vision Task Alignment
Paper
• 2412.19326
• Published
• 18
2.5 Years in Class: A Multimodal Textbook for Vision-Language
Pretraining
Paper
• 2501.00958
• Published
• 109
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Paper
• 2501.01957
• Published
• 47
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
Paper
• 2501.01904
• Published
• 33
Dispider: Enabling Video LLMs with Active Real-Time Interaction via
Disentangled Perception, Decision, and Reaction
Paper
• 2501.03218
• Published
• 35
Cosmos World Foundation Model Platform for Physical AI
Paper
• 2501.03575
• Published
• 82
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One
Vision Token
Paper
• 2501.03895
• Published
• 52
OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment
across Language with Real-time Self-Aware Emotional Speech Synthesis
Paper
• 2501.04561
• Published
• 17
Are VLMs Ready for Autonomous Driving? An Empirical Study from the
Reliability, Data, and Metric Perspectives
Paper
• 2501.04003
• Published
• 27
VideoRAG: Retrieval-Augmented Generation over Video Corpus
Paper
• 2501.05874
• Published
• 75
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Paper
• 2501.06186
• Published
• 65
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video
Understanding?
Paper
• 2501.05510
• Published
• 44
ReFocus: Visual Editing as a Chain of Thought for Structured Image
Understanding
Paper
• 2501.05452
• Published
• 15
Infecting Generative AI With Viruses
Paper
• 2501.05542
• Published
• 13
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in
Multimodal Large Language Models
Paper
• 2501.05767
• Published
• 29
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
Paper
• 2501.06282
• Published
• 53
Tarsier2: Advancing Large Vision-Language Models from Detailed Video
Description to Comprehensive Video Understanding
Paper
• 2501.07888
• Published
• 15
Do generative video models learn physical principles from watching
videos?
Paper
• 2501.09038
• Published
• 34
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward
Model
Paper
• 2501.12368
• Published
• 45
MSTS: A Multimodal Safety Test Suite for Vision-Language Models
Paper
• 2501.10057
• Published
• 10
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video
Understanding
Paper
• 2501.13106
• Published
• 90
Temporal Preference Optimization for Long-Form Video Understanding
Paper
• 2501.13919
• Published
• 23
Baichuan-Omni-1.5 Technical Report
Paper
• 2501.15368
• Published
• 60
Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with
Modality-Aware Sparsity
Paper
• 2501.16295
• Published
• 8
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal
Understanding
Paper
• 2502.01341
• Published
• 39
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive
Modality Alignment
Paper
• 2502.04328
• Published
• 29
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
Paper
• 2502.06788
• Published
• 13
Scaling Pre-training to One Hundred Billion Data for Vision Language
Models
Paper
• 2502.07617
• Published
• 29
Magma: A Foundation Model for Multimodal AI Agents
Paper
• 2502.13130
• Published
• 58
Qwen2.5-VL Technical Report
Paper
• 2502.13923
• Published
• 213
Slamming: Training a Speech Language Model on One GPU in a Day
Paper
• 2502.15814
• Published
• 69
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Paper
• 2502.18411
• Published
• 74
MLLMs Know Where to Look: Training-free Perception of Small Visual
Details with Multimodal LLMs
Paper
• 2502.17422
• Published
• 7
Introducing Visual Perception Token into Multimodal Large Language Model
Paper
• 2502.17425
• Published
• 16
Token-Efficient Long Video Understanding for Multimodal LLMs
Paper
• 2503.04130
• Published
• 96
Unified Reward Model for Multimodal Understanding and Generation
Paper
• 2503.05236
• Published
• 123
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by
Imitating Human Annotator Trajectories
Paper
• 2503.08625
• Published
• 27
OmniMamba: Efficient and Unified Multimodal Understanding and Generation
via State Space Models
Paper
• 2503.08686
• Published
• 19
Aligning Multimodal LLM with Human Preference: A Survey
Paper
• 2503.14504
• Published
• 26
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
Paper
• 2503.13111
• Published
• 7
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play
Visual Games with Keyboards and Mouse
Paper
• 2503.16365
• Published
• 41
Judge Anything: MLLM as a Judge Across Any Modality
Paper
• 2503.17489
• Published
• 23
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Paper
• 2503.18931
• Published
• 30
Video-R1: Reinforcing Video Reasoning in MLLMs
Paper
• 2503.21776
• Published
• 79
PAVE: Patching and Adapting Video Large Language Models
Paper
• 2503.19794
• Published
• 3
SmolVLM: Redefining small and efficient multimodal models
Paper
• 2504.05299
• Published
• 205
Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
Paper
• 2504.05599
• Published
• 85
OmniCaptioner: One Captioner to Rule Them All
Paper
• 2504.07089
• Published
• 20
Paper
• 2504.07491
• Published
• 137
MM-IFEngine: Towards Multimodal Instruction Following
Paper
• 2504.07957
• Published
• 35
Scaling Laws for Native Multimodal Models
Paper
• 2504.07951
• Published
• 30
InternVL3: Exploring Advanced Training and Test-Time Recipes for
Open-Source Multimodal Models
Paper
• 2504.10479
• Published
• 306
FUSION: Fully Integration of Vision-Language Representations for Deep
Cross-Modal Understanding
Paper
• 2504.09925
• Published
• 39
Mavors: Multi-granularity Video Representation for Multimodal Large
Language Model
Paper
• 2504.10068
• Published
• 30
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
Paper
• 2504.10465
• Published
• 27
The Scalability of Simplicity: Empirical Analysis of Vision-Language
Learning with a Single Transformer
Paper
• 2504.10462
• Published
• 15
Eagle 2.5: Boosting Long-Context Post-Training for Frontier
Vision-Language Models
Paper
• 2504.15271
• Published
• 67
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
Paper
• 2504.16030
• Published
• 36
X-Fusion: Introducing New Modality to Frozen Large Language Models
Paper
• 2504.20996
• Published
• 13
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive
Streaming Speech Synthesis
Paper
• 2505.02625
• Published
• 23
Ming-Lite-Uni: Advancements in Unified Architecture for Natural
Multimodal Interaction
Paper
• 2505.02471
• Published
• 15
Unified Multimodal Understanding and Generation Models: Advances,
Challenges, and Opportunities
Paper
• 2505.02567
• Published
• 80
Seed1.5-VL Technical Report
Paper
• 2505.07062
• Published
• 155
Aya Vision: Advancing the Frontier of Multilingual Multimodality
Paper
• 2505.08751
• Published
• 13
MMaDA: Multimodal Large Diffusion Language Models
Paper
• 2505.15809
• Published
• 98
Jodi: Unification of Visual Generation and Understanding via Joint
Modeling
Paper
• 2505.19084
• Published
• 20
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System
Collaboration
Paper
• 2505.20256
• Published
• 19
Muddit: Liberating Generation Beyond Text-to-Image with a Unified
Discrete Diffusion Model
Paper
• 2505.23606
• Published
• 14
UniWorld: High-Resolution Semantic Encoders for Unified Visual
Understanding and Generation
Paper
• 2506.03147
• Published
• 58
Is Extending Modality The Right Path Towards Omni-Modality?
Paper
• 2506.01872
• Published
• 24
Stream-Omni: Simultaneous Multimodal Interactions with Large
Language-Vision-Speech Model
Paper
• 2506.13642
• Published
• 27
Show-o2: Improved Native Unified Multimodal Models
Paper
• 2506.15564
• Published
• 29
OmniGen2: Exploration to Advanced Multimodal Generation
Paper
• 2506.18871
• Published
• 78
Paper
• 2506.23044
• Published
• 61
Kwai Keye-VL Technical Report
Paper
• 2507.01949
• Published
• 131
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and
Future Frontiers
Paper
• 2506.23918
• Published
• 90
VisionThink: Smart and Efficient Vision Language Model via Reinforcement
Learning
Paper
• 2507.13348
• Published
• 79
Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal
Large Language Models
Paper
• 2507.12566
• Published
• 15
Pixels, Patterns, but No Poetry: To See The World like Humans
Paper
• 2507.16863
• Published
• 69
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World
Shorts
Paper
• 2507.20939
• Published
• 57
Step-3 is Large yet Affordable: Model-system Co-design for
Cost-effective Decoding
Paper
• 2507.19427
• Published
• 21
Phi-Ground Tech Report: Advancing Perception in GUI Grounding
Paper
• 2507.23779
• Published
• 45
Multimodal Referring Segmentation: A Survey
Paper
• 2508.00265
• Published
• 9
VeOmni: Scaling Any Modality Model Training with Model-Centric
Distributed Recipe Zoo
Paper
• 2508.02317
• Published
• 22
A Glimpse to Compress: Dynamic Visual Token Pruning for Large
Vision-Language Models
Paper
• 2508.01548
• Published
• 14
Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with
Patch-level CLIP Latents
Paper
• 2508.05954
• Published
• 6
Paper
• 2508.11737
• Published
• 112
Intern-S1: A Scientific Multimodal Foundation Model
Paper
• 2508.15763
• Published
• 269
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility,
Reasoning, and Efficiency
Paper
• 2508.18265
• Published
• 214
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Paper
• 2508.18264
• Published
• 25
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs
via Bi-Mode Annealing and Reinforce Learning
Paper
• 2508.21113
• Published
• 110
Kwai Keye-VL 1.5 Technical Report
Paper
• 2509.01563
• Published
• 38
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid
Vision Tokenizer
Paper
• 2509.16197
• Published
• 58
Qwen3-Omni Technical Report
Paper
• 2509.17765
• Published
• 149
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and
Training Recipe
Paper
• 2509.18154
• Published
• 55
How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven
Perspective
Paper
• 2509.18905
• Published
• 30
Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully
Open MLLMs
Paper
• 2510.13795
• Published
• 59
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn
Dialogue
Paper
• 2510.13747
• Published
• 30
ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution
Paper
• 2510.12793
• Published
• 4
InternSVG: Towards Unified SVG Tasks with Multimodal Large Language
Models
Paper
• 2510.11341
• Published
• 35
AndesVL Technical Report: An Efficient Mobile-side Multimodal Large
Language Model
Paper
• 2510.11496
• Published
• 5
Better Together: Leveraging Unpaired Multimodal Data for Stronger
Unimodal Models
Paper
• 2510.08492
• Published
• 10
DeepSeek-OCR: Contexts Optical Compression
Paper
• 2510.18234
• Published
• 92
Glyph: Scaling Context Windows via Visual-Text Compression
Paper
• 2510.17800
• Published
• 68
LongCat-Flash-Omni Technical Report
Paper
• 2511.00279
• Published
• 26
Emu3.5: Native Multimodal Models are World Learners
Paper
• 2510.26583
• Published
• 111
Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal
Perception and Generation
Paper
• 2510.24821
• Published
• 41
Paper
• 2511.05491
• Published
• 52
DeepEyesV2: Toward Agentic Multimodal Model
Paper
• 2511.05271
• Published
• 45
OneThinker: All-in-one Reasoning Model for Image and Video
Paper
• 2512.03043
• Published
• 33
Kling-Omni Technical Report
Paper
• 2512.16776
• Published
• 170
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Paper
• 2601.10611
• Published
• 29