Docling

https://github.com/docling-project

docling-project

Activity Feed

Recent Activity

PeterWJStaar

published a model 3 days ago

ds4sd/CodeFormulaV2

0.3B • Updated 3 days ago • 16 • 1

MatteoOmenetti

updated a collection 3 days ago

Docling

Collection

5 items • Updated 3 days ago • 6

MatteoOmenetti

updated a model 3 days ago

ds4sd/CodeFormulaV2

0.3B • Updated 3 days ago • 16 • 1

MatteoOmenetti

updated a dataset 14 days ago

ds4sd/SynthFormulaNet

Viewer • Updated 14 days ago • 6.45M • 814 • 11

PeterWJStaar

published 3 datasets 14 days ago

auerchristoph

in ds4sd/docling-models 21 days ago

Is the layout model coming back?

#19 opened 22 days ago by hegghammer

andito

posted an update 22 days ago

Post

2750

Many VLMs claim to process hours of video. But can they follow the story?🤔
Today, we introduce TimeScope: The benchmark that separates true temporal understanding from marketing hype. Let's see how much VLMs really understand!⏳

We test three skills that matter for real-world use:
🔎 Localized Retrieval: Find a specific action.
🧩 Information Synthesis: Piece together scattered clues.
🏃 Fine-Grained Perception: Analyze detailed motion (e.g., count how many times a person swings an axe).

The results are in, and they're revealing. Only Gemini 2.5 Pro handles 1-hour-long videos.
Performance drops sharply with duration, showing that long-video understanding is still challenging. We've found the breaking points; now the community can start fixing them.📈

Want to learn more? TimeScope is 100% open-source. Benchmark your model and help us build the next generation of video AI.

📖 Blog:
https://huggingface.co/blog/timescope-video-lmm-benchmark
👩‍💻 Leaderboard & Demo: Apollo-LMMs/TimeScope
📊 Dataset: Apollo-LMMs/TimeScope
⚙️ Eval Code: https://github.com/EvolvingLMMs-Lab/lmms-eval

nlivathinos

in ds4sd/docling-models 22 days ago

Is the layout model coming back?

#19 opened 22 days ago by hegghammer

auerchristoph

in ds4sd/docling-models 22 days ago

Delete layout model from shared model repo

#18 opened 22 days ago by auerchristoph

Update README.md

#14 opened 4 months ago by aki-008

Delete layout model from shared model repo

#17 opened 22 days ago by auerchristoph

MatteoOmenetti

updated a dataset 22 days ago

HuggingFaceM4/DoclingMatix

Viewer • Updated 14 days ago • 1.27M • 5.55k • 32

MatteoOmenetti

updated a dataset 29 days ago

ds4sd/SynthCodeNet

Viewer • Updated 29 days ago • 9.33M • 1.68k • 6

MatteoOmenetti

updated a dataset about 1 month ago

ds4sd/SynthChartNet

Viewer • Updated about 1 month ago • 1.98M • 1.24k • 8

andito

posted an update about 1 month ago

Post

3980

🧠👁️ Can AI visualize solutions?

Humans often solve visual problems by sketching ideas in their minds. What if Vision-Language Models (VLMs) could do something similar, not by generating full images, but by using internal “mental sketches”?

That’s the idea behind Mirage, a new framework that empowers VLMs to reason using latent visual tokens. Instead of just thinking in words, Mirage mixes in abstract visual representations that help the model solve complex tasks.

These aren't photorealistic images. They're compact, internal representations optimized purely to support reasoning.

🔧 Mirage is trained in two phases:

1) Grounding: It learns to produce latent tokens anchored in real images.
2) Refinement: The model drops the images and learns to generate visual tokens on its own.

📈 And yes, it works!
On challenging benchmarks like Visual Spatial Planning, Jigsaw puzzles, and Spatial Attention Tasks, Mirage clearly outperforms GPT-4o and other strong baselines.
Smart sketches > empty words.

By mimicking the way humans visualize solutions, Mirage gives AI a new kind of imagination, one that’s faster, more efficient, and more human-like.
Kudos to the teams at UMass Amherst and MIT behind this exciting work.
Check the paper: Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (2506.17218)