Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models Paper • 2505.24164 • Published May 30
UltraVideo: High-Quality UHD Video Dataset with Comprehensive Captions Paper • 2506.13691 • Published Jun 16 • 2
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology Paper • 2507.07999 • Published Jul 10 • 47
Towards Semantic Equivalence of Tokenization in Multimodal LLM Paper • 2406.05127 • Published Jun 7, 2024
So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection Paper • 2505.18660 • Published May 24 • 1
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model Paper • 2505.23606 • Published May 29 • 15
Conditional Panoramic Image Generation via Masked Autoregressive Modeling Paper • 2505.16862 • Published May 22
MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query Paper • 2506.03144 • Published Jun 3 • 3
BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation Paper • 2505.12620 • Published May 19
CyberV: Cybernetics for Test-time Scaling in Video Understanding Paper • 2506.07971 • Published Jun 9 • 4
DiffDecompose: Layer-Wise Decomposition of Alpha-Composited Images via Diffusion Transformers Paper • 2505.21541 • Published May 24 • 7
On Path to Multimodal Generalist: General-Level and General-Bench Paper • 2505.04620 • Published May 7 • 83