Smolvencoder

non-profit

AI & ML interests

None defined yet.

Recent Activity

anditoย 
posted an update 14 days ago
view post
Post
2708
Many VLMs claim to process hours of video. But can they follow the story?๐Ÿค”
Today, we introduce TimeScope: The benchmark that separates true temporal understanding from marketing hype. Let's see how much VLMs really understand!โณ

We test three skills that matter for real-world use:
๐Ÿ”Ž Localized Retrieval: Find a specific action.
๐Ÿงฉ Information Synthesis: Piece together scattered clues.
๐Ÿƒ Fine-Grained Perception: Analyze detailed motion (e.g., count how many times a person swings an axe).

The results are in, and they're revealing. Only Gemini 2.5 pro handles 1-hour-long videos.
Performance drops sharply with duration, proving that long video understanding is still challenging. We've found the breaking pointsโ€”now the community can start fixing them.๐Ÿ“ˆ

Want to learn more? TimeScope is 100% open-source. Benchmark your model and help us build the next generation of video AI.

๐Ÿ“– Blog:
https://huggingface.co/blog/timescope-video-lmm-benchmark
๐Ÿ‘ฉโ€๐Ÿ’ป Leaderboard & Demo: Apollo-LMMs/TimeScope
๐Ÿ“Š Dataset: Apollo-LMMs/TimeScope
โš™๏ธ Eval Code: https://github.com/EvolvingLMMs-Lab/lmms-eval
anditoย 
posted an update about 1 month ago
view post
Post
3973
๐Ÿง ๐Ÿ‘๏ธ Can AI visualize solutions?

Humans often solve visual problems by sketching ideas in our minds. What if Vision-Language Models (VLMs) could do something similar, not by generating full images, but by using internal โ€œmental sketchesโ€?

Thatโ€™s the idea behind Mirage, a new framework that empowers VLMs to reason using latent visual tokens. Instead of just thinking in words, Mirage mixes in abstract visual representations that help the model solve complex tasks.

These aren't photorealistic images. They're compact, internal representations optimized purely to support reasoning.

๐Ÿ”ง Mirage is trained in two phases:

1) Grounding: It learns to produce latent tokens anchored in real images.
2) Refinement: The model drops the images and learns to generate visual tokens on its own.

๐Ÿ“ˆ And yes, it works!
On challenging benchmarks like Visual Spatial Planning, Jigsaw puzzles, and Spatial Attention Tasks, Mirage clearly outperforms GPT-4o and other strong baselines.
Smart sketches > empty words.

By mimicking the way humans visualize solutions, Mirage gives AI a new kind of imagination, one thatโ€™s faster, more efficient, and more human-like.
Kudos to the teams at UMass Amherst and MIT behind this exciting work.
Check the paper: Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (2506.17218)
ยท
anditoย 
posted an update 5 months ago
view post
Post
2982
Extremely bullish on @CohereForAI 's Aya Vision (8B & 32B) - new SOTA open-weight VLMs

- 8B wins up to 81% of the time in its class, better than Gemini Flash
- 32B beats Llama 3.2 90B!
- Covers 23 languages, excels in image captioning, VQA & more
- Integrated on transformers from Day 0!

Efficient multimodal models are here to stay!!๐Ÿ”ฅ
Check out their blog! https://huggingface.co/blog/aya-vision
anditoย 
posted an update 6 months ago
view post
Post
1663
๐—œ๐—ป๐˜๐—ฟ๐—ผ๐—ฑ๐˜‚๐—ฐ๐—ถ๐—ป๐—ด ๐˜๐—ต๐—ฒ ๐˜„๐—ผ๐—ฟ๐—น๐—ฑ'๐˜€ ๐˜€๐—บ๐—ฎ๐—น๐—น๐—ฒ๐˜€๐˜ ๐˜ƒ๐—ถ๐˜€๐—ถ๐—ผ๐—ป ๐—น๐—ฎ๐—ป๐—ด๐˜‚๐—ฎ๐—ด๐—ฒ ๐—บ๐—ผ๐—ฑ๐—ฒ๐—น!

Weโ€™re thrilled to share ๐—ฆ๐—บ๐—ผ๐—น๐—ฉ๐—Ÿ๐—  (256M & 500M)โ€”the smallest Visual Language Models ever built. Think: running on <1GB of GPU memoryโ€”you can fine-tune it on your laptop and run it on your toaster!

Why Itโ€™s Game-Changing:
- ๐—ข๐˜‚๐˜๐—ฝ๐—ฒ๐—ฟ๐—ณ๐—ผ๐—ฟ๐—บ๐˜€ ๐—Ÿ๐—ฎ๐—ฟ๐—ด๐—ฒ๐—ฟ ๐— ๐—ผ๐—ฑ๐—ฒ๐—น๐˜€: Even the 256M model surpasses our SOTA 80B-parameter model from just 17 months ago. Over 300x reduction!
๐— ๐—ถ๐—ด๐—ต๐˜๐˜† ๐—˜๐—ณ๐—ณ๐—ถ๐—ฐ๐—ถ๐—ฒ๐—ป๐—ฐ๐˜†: The 256M version delivers 80% of our 2.2B modelโ€™s performance, and the 500M version hits 90%
๐—Ÿ๐—ถ๐—ด๐—ต๐˜๐—ป๐—ถ๐—ป๐—ด-๐—™๐—ฎ๐˜€๐˜ ๐—ฆ๐—ฒ๐—ฎ๐—ฟ๐—ฐ๐—ต: SmolVLM integrates with ColiPali for state-of-the-art retrieval speedsโ€”on par with models 10x bigger. That means cheaper, faster indexing and real-world impact.

Whatโ€™s New Under the Hood:
- ๐—ก๐—ฒ๐˜„ ๐—ฉ๐—ถ๐˜€๐—ถ๐—ผ๐—ป ๐—˜๐—ป๐—ฐ๐—ผ๐—ฑ๐—ฒ๐—ฟ: Smaller overall size (400M -> 93M), but with higher resolution.
- ๐—›๐—ถ๐—ด๐—ต๐—ฒ๐—ฟ ๐—ฃ๐—ถ๐˜…๐—ฒ๐—น๐˜€/๐—ง๐—ผ๐—ธ๐—ฒ๐—ป: 4096 vs. 1820โ€”more efficient image processing.
- ๐—ฆ๐—บ๐—ฎ๐—ฟ๐˜ ๐—ง๐—ผ๐—ธ๐—ฒ๐—ป๐—ถ๐˜‡๐—ฎ๐˜๐—ถ๐—ผ๐—ป: Faster training and a performance boost.

Check our blog: https://huggingface.co/blog/smolervlm
The models: HuggingFaceTB/smolvlm-256m-and-500m-6791fafc5bb0ab8acc960fb0
The demo: HuggingFaceTB/SmolVLM-256M-Demo
  • 1 reply
ยท
anditoย 
posted an update 8 months ago
view post
Post
1987
SmolVLM speeding locally on a laptop thanks to mlx-vlm and
@Gradio ! Try it with two lines:
pip install git+https://github.com/andimarafioti/mlx-vlm.git@stream-generate-fix
python -m mlx_vlm.chat_ui --model mlx-community/SmolVLM-Instruct-8bit

Gotta love the MLX community! Big thanks to @pcuenq and @prince_canuma !
anditoย 
posted an update 8 months ago
view post
Post
3409
Let's go! We are releasing SmolVLM, a smol 2B VLM built for on-device inference that outperforms all models at similar GPU RAM usage and tokens throughputs.

- SmolVLM generates tokens 7.5 to 16 times faster than Qwen2-VL! ๐Ÿคฏ
- Other models at this size crash a laptop, but SmolVLM comfortably generates 17 tokens/sec on a macbook! ๐Ÿš€
- SmolVLM can be fine-tuned on a Google collab! Or process millions of documents with a consumer GPU!
- SmolVLM even outperforms larger models in video benchmarks, despite not even being trained on videos!

Check out more!
Demo: HuggingFaceTB/SmolVLM
Blog: https://huggingface.co/blog/smolvlm
Model: HuggingFaceTB/SmolVLM-Instruct
Fine-tuning script: https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
anditoย 
posted an update 11 months ago
view post
Post
1100
Hugging face presents FineVideo ๐ŸŽฅ! Unlocking the next generation of Video understanding ๐Ÿš€

๐Ÿคฏ3400 hours of annotated Creative Common videos with rich character descriptions, scene splits, mood, and content descriptions per scene as well as QA pairs.
๐Ÿ”ฅ
@mfarre processed over 2M videos of Youtube-CC to make this incredibly powerful selection.

Very psyched to fine-tune idefics on this dataset. โšก๏ธ
Explore the videos: HuggingFaceFV/FineVideo-Explorer
anditoย 
posted an update 11 months ago
view post
Post
1653
๐Ÿš€ Introducing Hugging Face's Multilingual Speech-to-Speech! ๐ŸŽค
๐Ÿ’ฌOur modular, cross-platform pipeline to run GPT4o-like experiences on device can now seamlessly switch languages mid-conversation with an imperceptible 100ms delay.

๐ŸŒŸ Building on an amazing early reception with 2600 stars on GitHub ๐ŸŒŸ
๐Ÿš€ We are expanding the library to support multiple languages
๐Ÿ”ฅ Try it out with a flag: --language fr
๐Ÿคฏ Or don't set the flag and let the system detect the language

๐Ÿ’ก What feature should we add next?
  • 1 reply
ยท