SAILViT: Towards Robust and Generalizable Visual Backbones for MLLMs via Gradual Feature Refinement Paper • 2507.01643 • Published Jul 2, 2025 • 1
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology Paper • 2507.07999 • Published 26 days ago • 46
Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment Paper • 2405.17871 • Published May 28, 2024 • 1
World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering Paper • 2409.20424 • Published Sep 30, 2024
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer Paper • 2504.10462 • Published Apr 14, 2025 • 15
Unveiling the Tapestry of Consistency in Large Vision-Language Models Paper • 2405.14156 • Published May 23, 2024