SmolvencoderOrg (Smolvencoder)

paultltc

authored 4 papers 15 days ago

andito

authored a paper 3 months ago

FineVision: Open Data Is All You Need

Paper • 2510.17269 • Published Oct 20, 2025 • 73

andito

posted an update 3 months ago

Post

1940

Finally, our new paper is out! "𝗙𝗶𝗻𝗲𝗩𝗶𝘀𝗶𝗼𝗻: 𝗢𝗽𝗲𝗻 𝗗𝗮𝘁𝗮 𝗜𝘀 𝗔𝗹𝗹 𝗬𝗼𝘂 𝗡𝗲𝗲𝗱"! 🥳
FineVision: Open Data Is All You Need (2510.17269)

If you've ever trained a VLM, you know this problem: nobody shares their data mixtures. It's a black box, making replicating SOTA work impossible.
We wanted to change that.

FineVision unifies 200 sources into 24 million samples. With 17.3 million images and 9.5 billion answer tokens, it's the largest open resource of its kind.

In the paper, we share how we built it:
🔍 finding and cleaning data at scale
🧹 removing excessive duplicates across sources
🤗 decontaminating against 66 public benchmarks

My favorite part is Figure 6 (in the video!). It's our visual diversity analysis. It shows that FineVision isn't just bigger; it's more balanced and conceptually richer than other open datasets.
NVIDIA's Eagle 2 paper highlighted just how critical this visual diversity is, and our results confirm it: models trained on FineVision consistently outperform those trained on any other open dataset on 11 benchmarks!

🎉 To celebrate the paper, I’m also releasing a concatenated and shuffled version of the full dataset! 👉HuggingFaceM4/FineVision_full_shuffled

It’s ready to stream, so you can start training your own models right away:

from datasets import load_dataset
d = load_dataset("HuggingFaceM4/FineVision_full_shuffled", split="train", streaming=True)
print(next(iter(d)))

A big shoutout to the first authors: Luis Wiedmann and Orr Zohar. They are rockstars!

manu

authored 2 papers 3 months ago

EuroLLM-9B: Technical Report

Paper • 2506.04079 • Published Jun 4, 2025 • 1

ModernVBERT: Towards Smaller Visual Document Retrievers

Paper • 2510.01149 • Published Oct 1, 2025 • 30

andito

posted an update 6 months ago

Post

3054

Many VLMs claim to process hours of video. But can they follow the story?🤔
Today, we introduce TimeScope: The benchmark that separates true temporal understanding from marketing hype. Let's see how much VLMs really understand!⏳

We test three skills that matter for real-world use:
🔎 Localized Retrieval: Find a specific action.
🧩 Information Synthesis: Piece together scattered clues.
🏃 Fine-Grained Perception: Analyze detailed motion (e.g., count how many times a person swings an axe).

The results are in, and they're revealing. Only Gemini 2.5 pro handles 1-hour-long videos.
Performance drops sharply with duration, proving that long video understanding is still challenging. We've found the breaking points—now the community can start fixing them.📈

Want to learn more? TimeScope is 100% open-source. Benchmark your model and help us build the next generation of video AI.

📖 Blog:
https://huggingface.co/blog/timescope-video-lmm-benchmark
👩‍💻 Leaderboard & Demo: Apollo-LMMs/TimeScope
📊 Dataset: Apollo-LMMs/TimeScope
⚙️ Eval Code: https://github.com/EvolvingLMMs-Lab/lmms-eval

manu

authored a paper 6 months ago

Should We Still Pretrain Encoders with Masked Language Modeling?

Paper • 2507.00994 • Published Jul 1, 2025 • 80

andito

posted an update 6 months ago

Post

4068

🧠👁️ Can AI visualize solutions?

Humans often solve visual problems by sketching ideas in our minds. What if Vision-Language Models (VLMs) could do something similar, not by generating full images, but by using internal “mental sketches”?

That’s the idea behind Mirage, a new framework that empowers VLMs to reason using latent visual tokens. Instead of just thinking in words, Mirage mixes in abstract visual representations that help the model solve complex tasks.

These aren't photorealistic images. They're compact, internal representations optimized purely to support reasoning.

🔧 Mirage is trained in two phases:

1) Grounding: It learns to produce latent tokens anchored in real images.
2) Refinement: The model drops the images and learns to generate visual tokens on its own.

📈 And yes, it works!
On challenging benchmarks like Visual Spatial Planning, Jigsaw puzzles, and Spatial Attention Tasks, Mirage clearly outperforms GPT-4o and other strong baselines.
Smart sketches > empty words.

By mimicking the way humans visualize solutions, Mirage gives AI a new kind of imagination, one that’s faster, more efficient, and more human-like.
Kudos to the teams at UMass Amherst and MIT behind this exciting work.
Check the paper: Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (2506.17218)

4 replies

·

andito

authored a paper 7 months ago

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Paper • 2506.01844 • Published Jun 2, 2025 • 148

manu

authored a paper 7 months ago

ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval

Paper • 2505.17166 • Published May 22, 2025

andito

authored a paper 9 months ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7, 2025 • 203

andito

authored a paper 10 months ago

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

Paper • 2503.11576 • Published Mar 14, 2025 • 130

manu

authored 2 papers 10 months ago

EuroBERT: Scaling Multilingual Encoders for European Languages

Paper • 2503.05500 • Published Mar 7, 2025 • 80

MMTEB: Massive Multilingual Text Embedding Benchmark

Paper • 2502.13595 • Published Feb 19, 2025 • 43

andito

posted an update 10 months ago

Post

2991

Extremely bullish on @CohereForAI 's Aya Vision (8B & 32B) - new SOTA open-weight VLMs

- 8B wins up to 81% of the time in its class, better than Gemini Flash
- 32B beats Llama 3.2 90B!
- Covers 23 languages, excels in image captioning, VQA & more
- Integrated on transformers from Day 0!

Efficient multimodal models are here to stay!!🔥
Check out their blog! https://huggingface.co/blog/aya-vision

andito

authored a paper 11 months ago

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Paper • 2502.02737 • Published Feb 4, 2025 • 253

andito

posted an update 12 months ago

Post

1668

𝗜𝗻𝘁𝗿𝗼𝗱𝘂𝗰𝗶𝗻𝗴 𝘁𝗵𝗲 𝘄𝗼𝗿𝗹𝗱'𝘀 𝘀𝗺𝗮𝗹𝗹𝗲𝘀𝘁 𝘃𝗶𝘀𝗶𝗼𝗻 𝗹𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗺𝗼𝗱𝗲𝗹!

We’re thrilled to share 𝗦𝗺𝗼𝗹𝗩𝗟𝗠 (256M & 500M)—the smallest Visual Language Models ever built. Think: running on <1GB of GPU memory—you can fine-tune it on your laptop and run it on your toaster!

Why It’s Game-Changing:
- 𝗢𝘂𝘁𝗽𝗲𝗿𝗳𝗼𝗿𝗺𝘀 𝗟𝗮𝗿𝗴𝗲𝗿 𝗠𝗼𝗱𝗲𝗹𝘀: Even the 256M model surpasses our SOTA 80B-parameter model from just 17 months ago. Over 300x reduction!
𝗠𝗶𝗴𝗵𝘁𝘆 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆: The 256M version delivers 80% of our 2.2B model’s performance, and the 500M version hits 90%
𝗟𝗶𝗴𝗵𝘁𝗻𝗶𝗻𝗴-𝗙𝗮𝘀𝘁 𝗦𝗲𝗮𝗿𝗰𝗵: SmolVLM integrates with ColiPali for state-of-the-art retrieval speeds—on par with models 10x bigger. That means cheaper, faster indexing and real-world impact.

What’s New Under the Hood:
- 𝗡𝗲𝘄 𝗩𝗶𝘀𝗶𝗼𝗻 𝗘𝗻𝗰𝗼𝗱𝗲𝗿: Smaller overall size (400M -> 93M), but with higher resolution.
- 𝗛𝗶𝗴𝗵𝗲𝗿 𝗣𝗶𝘅𝗲𝗹𝘀/𝗧𝗼𝗸𝗲𝗻: 4096 vs. 1820—more efficient image processing.
- 𝗦𝗺𝗮𝗿𝘁 𝗧𝗼𝗸𝗲𝗻𝗶𝘇𝗮𝘁𝗶𝗼𝗻: Faster training and a performance boost.

Check our blog: https://huggingface.co/blog/smolervlm
The models: HuggingFaceTB/smolvlm-256m-and-500m-6791fafc5bb0ab8acc960fb0
The demo: HuggingFaceTB/SmolVLM-256M-Demo

1 reply

·

Smolvencoder

AI & ML interests

Recent Activity

LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ

Enhancing Inflation Nowcasting with LLM: Sentiment Analysis on News

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

ModernVBERT: Towards Smaller Visual Document Retrievers

FineVision: Open Data Is All You Need

EuroLLM-9B: Technical Report

ModernVBERT: Towards Smaller Visual Document Retrievers

Should We Still Pretrain Encoders with Masked Language Modeling?

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval

SmolVLM: Redefining small and efficient multimodal models

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

EuroBERT: Scaling Multilingual Encoders for European Languages

MMTEB: Massive Multilingual Text Embedding Benchmark

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

AI & ML interests

Recent Activity

Team members 3

SmolvencoderOrg's activity