HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices Paper • 2512.14052 • Published 17 days ago • 39
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling Paper • 2512.14614 • Published 16 days ago • 67
DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling Paper • 2512.03000 • Published 30 days ago • 36
4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer Paper • 2512.05060 • Published 28 days ago • 18
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos Paper • 2512.01707 • Published Dec 1, 2025 • 7
Accelerating Streaming Video Large Language Models via Hierarchical Token Compression Paper • 2512.00891 • Published Nov 30, 2025 • 14
CaptionQA: Is Your Caption as Useful as the Image Itself? Paper • 2511.21025 • Published Nov 26, 2025 • 27
VisPlay: Self-Evolving Vision-Language Models from Images Paper • 2511.15661 • Published Nov 19, 2025 • 42
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning Paper • 2512.02425 • Published about 1 month ago • 23
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling Paper • 2511.20785 • Published Nov 25, 2025 • 181
view article Article Hugging Face to sell open-source robots thanks to Pollen Robotics acquisition 🤖 +1 Apr 14, 2025 • 48
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions Paper • 2409.18042 • Published Sep 26, 2024 • 39