Welcome GPT OSS, the new open-source model family from OpenAI! Article • By reach-vb and 11 others • 10 days ago • 453
SmolLM3: smol, multilingual, long-context reasoner Article • By loubnabnl and 22 others • Jul 8 • 625
TimeScope: How Long Can Your Video Large Multimodal Model Go? Article • By orrzohar and 3 others • 23 days ago • 36
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens Paper • 2506.17218 • Published Jun 20 • 27
Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance Paper • 2502.06145 • Published Feb 10 • 17
nanoVLM: The simplest repository to train your VLM in pure PyTorch Article • By ariG23498 and 6 others • May 21 • 204
Vision Language Models (Better, Faster, Stronger) Article • By merve and 4 others • May 12 • 503
Cohere on Hugging Face Inference Providers 🔥 Article • By burtenshaw and 6 others • Apr 16 • 131
SmolVLM2: Bringing Video Understanding to Every Device Article • By orrzohar and 6 others • Feb 20 • 293
SmolVLM Grows Smaller – Introducing the 250M & 500M Models! Article • By andito and 2 others • Jan 23 • 182
Announcing NVIDIA Cosmos World Foundation Models Article • By mingyuliutw and 1 other • Jan 7 • 26
Apollo: An Exploration of Video Understanding in Large Multimodal Models Paper • 2412.10360 • Published Dec 13, 2024 • 147
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding Paper • 2410.17434 • Published Oct 22, 2024 • 30
Docmatix - a huge dataset for Document Visual Question Answering Article • By andito and 1 other • Jul 18, 2024 • 76
Scaling robotics datasets with video encoding Article • By aliberts and 2 others • Aug 27, 2024 • 40
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22, 2024 • 132