Article: Welcome Gemma 3: Google's all new multimodal, multilingual, long context open LLM • by ariG23498 and 3 others • Mar 12 • 447
Post: The new Qwen2-VL models seem to perform quite well at object detection. You can prompt them to respond with bounding boxes in a 1000 x 1000 pixel reference frame and scale those boxes back to the original image size. You can try it out with my Space maxiw/Qwen2-VL-Detection • 6 replies • 👍 14 • 👀 5 • 🤗 1
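For reference, here is a minimal sketch of the rescaling step described in the post, not code taken from the Space itself. It assumes the model returns boxes as [x_min, y_min, x_max, y_max] in the 1000 x 1000 reference frame; the exact output format depends on your prompt, so check maxiw/Qwen2-VL-Detection for a working prompt.

```python
# Minimal sketch (assumption: boxes come back as [x_min, y_min, x_max, y_max]
# in a 1000 x 1000 reference frame, as described in the post above).

def scale_boxes(boxes, image_width, image_height, ref_size=1000):
    """Map boxes from a ref_size x ref_size frame to original image pixel coordinates."""
    scaled = []
    for x_min, y_min, x_max, y_max in boxes:
        scaled.append([
            x_min / ref_size * image_width,
            y_min / ref_size * image_height,
            x_max / ref_size * image_width,
            y_max / ref_size * image_height,
        ])
    return scaled

# Example: one detected box for a 1920 x 1080 image.
print(scale_boxes([[250, 100, 750, 900]], 1920, 1080))
# [[480.0, 108.0, 1440.0, 972.0]]
```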
Article: Welcome the NVIDIA Llama Nemotron Nano VLM to Hugging Face Hub • by nvidia and 11 others • Jun 27 • 27
Article: Vision Language Models (Better, Faster, Stronger) • by merve and 4 others • May 12 • 492
Article: Gemma 3n fully available in the open-source ecosystem! • by ariG23498 and 7 others • Jun 26 • 113
Article: Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth • by mlabonne • Jul 29, 2024 • 352
Space: MMLU-Pro Leaderboard 🥇 • More advanced and challenging multi-task evaluation • 216
Collection: Describe Anything • Multimodal Large Language Models for Detailed Localized Image and Video Captioning • 7 items • Updated 13 days ago • 54
Paper: SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion • 2503.11576 • Published Mar 14 • 114