AI & ML interests

We research diffusion models, LLMs, and other areas of ML.

Recent Activity

prithivMLmods 
posted an update 1 day ago
Try Liquid AI's all-new multimodal models: LFM2-VL-1.6B & LFM2-VL-450M! The demo notebooks come with a Gradio UI and ReportLab support, and both models run on a T4 GPU!

↗ LFM2-VL-1.6B-LiquidAI : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/LFM2-VL-1.6B-LiquidAI/LFM2-VL-1.6B_ReportLab.ipynb

↗ LFM2-VL-450M-LiquidAI : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/LFM2-VL-450M-LiquidAI/LFM2-VL-450M_ReportLab.ipynb

.
.
.
To know more about it, visit the multimodal outpost notebooks !!
prithivMLmods 
posted an update 5 days ago
On the verge of releasing Poseidon-Reasoning-5M, a dataset built to excel in general thought processes, mathematics, and science across a diverse mixture of domains, I’m also dropping the Gargantua-R1-Compact dataset, a collection of over six million high-quality reasoning QA pair traces. 🤗🚀

✦ Gargantua-R1-Compact : prithivMLmods/Gargantua-R1-Compact

from datasets import load_dataset

dataset = load_dataset("prithivMLmods/Gargantua-R1-Compact", split="train")

Additionally, I’m adding the mini version of Gargantua — the Gargantua-R1-Wee : prithivMLmods/Gargantua-R1-Wee

from datasets import load_dataset

dataset = load_dataset("prithivMLmods/Gargantua-R1-Wee", split="train")

The composition spans 73.93% core mathematical reasoning involving problems, proofs, and computational challenges, 12.11% across diverse scientific domains such as physics, chemistry, biology, and interdisciplinary topics, 11.35% in competitive coding covering algorithms and data structures, 1.37% in academic science focusing on research-level methodology, 0.95% in creative and analytical reasoning through logic puzzles and problem-solving tasks, 0.25% in specialized technical areas like MLOps, LLMs, diffusion models, and CUDA, and 0.06% involving data from graphs and charts converted into structured JSON formats. Designed with both rich contextual depth and formal structural clarity, Gargantua-R1-Compact is an optimal resource for advancing research in symbolic reasoning, interpretability, and high-precision question answering in mathematical domains.
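As a rough illustration of what these shares mean in absolute terms, the sketch below converts the listed percentages into approximate example counts. The total of 6,000,000 is a hypothetical round figure (the post only says "over six million"), and the short domain labels are paraphrased from the text above.

```python
# Domain shares (percent) as listed in the post.
COMPOSITION = {
    "core mathematical reasoning": 73.93,
    "diverse scientific domains": 12.11,
    "competitive coding": 11.35,
    "academic science": 1.37,
    "creative and analytical reasoning": 0.95,
    "specialized technical areas": 0.25,
    "graphs and charts as JSON": 0.06,
}

# Hypothetical round total; the post only states "over six million".
ASSUMED_TOTAL = 6_000_000


def estimate_counts(shares, total):
    """Convert percentage shares into approximate example counts."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


counts = estimate_counts(COMPOSITION, ASSUMED_TOTAL)
share_sum = round(sum(COMPOSITION.values()), 2)

for name, count in counts.items():
    print(f"{name:35s} ~{count:,}")
print(f"sum of listed shares: {share_sum}%")
```

The listed shares add up to 100.02%, which is consistent with each figure having been rounded to two decimals, so the estimated counts should be read as approximate.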

✦ Collection : prithivMLmods/gargantua-r1-mod-6896bfd7834e82b89ad2b38b


To know more about it, visit the dataset card of the respective dataset. !!
prithivMLmods 
posted an update 6 days ago
I've added the demo of the openbmb/MiniCPM-V-4 model to the Hugging Face Space:
prithivMLmods/Multimodal-VLM-Thinking

✨ MiniCPM-V 4.0 is the latest efficient model in the MiniCPM-V series. The model is built based on SigLIP2-400M and MiniCPM4-3B, with a total of 4.1B parameters. It inherits the strong single-image, multi-image, and video understanding performance of MiniCPM-V 2.6 with largely improved efficiency.

✨ With only 4.1B parameters, MiniCPM-V 4.0 achieves an average score of 69.0 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. This performance surpasses GPT-4.1-mini-20250414, MiniCPM-V 2.6 (8.1B parameters, OpenCompass 65.2), and Qwen2.5-VL-3B-Instruct (3.8B parameters, OpenCompass 64.5). It also shows good performance in multi-image and video understanding.

The community GPU grant was given by Hugging Face — special thanks to them. 🤗🚀

To know more about it, visit the model card of the respective model. !!
prithivMLmods 
posted an update 10 days ago
Qwen Image – The Latest Image Generation Model🔥

Below are some samples generated using the Qwen Image Diffusion Model. Qwen-Image, a 20B MMDiT model for next-generation text-to-image generation, preserves typographic details, layout coherence, and contextual harmony with stunning accuracy. It is especially strong at creating stunning graphic posters with native text. The model is now open-source. [ 𝚀𝚠𝚎𝚗-𝙸𝚖𝚊𝚐𝚎 : Qwen/Qwen-Image ]

⤷ Try the Qwen Image demo here: prithivMLmods/Qwen-Image-Diffusion

⤷ Qwen-Image Technical Report : Qwen-Image Technical Report (2508.02324)
⤷ Qwen Image [GitHub] : https://github.com/QwenLM/Qwen-Image

Even more impressively, it demonstrates a strong ability to understand images. The model supports a wide range of vision-related tasks such as object detection, semantic segmentation, depth and edge (Canny) estimation, novel view synthesis, and image super-resolution. While each task is technically distinct, they can all be viewed as advanced forms of intelligent image editing driven by deep visual understanding. Collectively, these capabilities position Qwen-Image as more than just a tool for generating appealing visuals; it serves as a versatile foundation model for intelligent visual creation and transformation, seamlessly blending language, layout, and imagery.

Qwen-Image uses a dual-stream MMDiT architecture with a frozen Qwen2.5-VL, VAE encoder, RMSNorm for QK-Norm, LayerNorm elsewhere, and a custom MSRoPE scheme for joint image-text positional encoding.
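For readers unfamiliar with the term, QK-Norm means normalizing the query and key vectors before the attention dot product, and RMSNorm is one way to do that. The snippet below is a minimal pure-Python sketch of RMSNorm on a single vector; it is not Qwen-Image's code, and the `eps` and `gain` values are illustrative assumptions.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root mean square, then apply a per-element gain."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]


# Toy query vector; in QK-Norm the same operation is applied to keys too.
query = [3.0, 4.0]
normed = rms_norm(query, gain=[1.0, 1.0], eps=0.0)
print(normed)
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so with unit gain the output simply has a root mean square of 1.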

.
.
.
To know more about it, visit the model card of the respective model. !!
prithivMLmods 
posted an update 13 days ago
Introducing Camel-Doc-OCR-080125(v2), a document content-structure retrieval VLM designed for content extraction and summarization. This is the second model in the Camel Doc OCR VLM series, following Camel-Doc-OCR-062825(v1). The new version fixes formal table reconstruction issues in both English (en) and Chinese (zh), improving performance on long-context inference.🤗🐪

⤷ Camel-Doc-OCR(v2) : prithivMLmods/Camel-Doc-OCR-080125
⤷ Camel-Doc-OCR(v1) : prithivMLmods/Camel-Doc-OCR-062825
⤷ Demo : prithivMLmods/core-OCR

Multimodal Model Collections and Spaces:

➝ Camel-Doc-OCR : prithivMLmods/camel-doc-ocr-080125-688c0c61c5dba648756f31f8
➝ Vision-Language (VLr) : prithivMLmods/vision-language-for-reasoning-vlr-6889b3f45917352b5e3a6f7a
➝ Multimodal Spaces : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
➝ Multimodal VLMs : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027

.
.
.
To know more about it, visit the model card of the respective model. !!
prithivMLmods 
posted an update 15 days ago
Excited to bring the explicitly grounded experimental reasoning model, Lumian-VLR-7B-Thinking, built on top of Qwen2.5-VL, featuring reasoning-aware trajectories with enhanced spatial perception. Along with this, we’ve also added a demo for the model while bringing some of the latest and most interesting models available on the hub to make full use of the remaining resources.

✨ Multimodal-VLM-Thinking : prithivMLmods/Multimodal-VLM-Thinking
✨ Multimodal-VLM-OCR : https://huggingface.co/spaces/prithivMLmods/Multimodal-VLM-OCR

✦ Models used in these spaces:

✨ Lumian-VLR-7B-Thinking : prithivMLmods/Lumian-VLR-7B-Thinking
✨ Enesidaon-VLR-7B-no-Thinking : prithivMLmods/Enesidaon-VLR-7B-no-Thinking
✨ GLM-4.1V-9B-Thinking : zai-org/GLM-4.1V-9B-Thinking
✨ DREX-062225-exp : prithivMLmods/DREX-062225-exp & more ...

✦ Multimodal Model Collections and Spaces:

✨ Vision-Language (VLr) : prithivMLmods/vision-language-for-reasoning-vlr-6889b3f45917352b5e3a6f7a
✨ Multimodal Spaces : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
✨ Multimodal VLMs : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027

.
.
.
To know more about it, visit the model card of the respective model. !!
prithivMLmods 
posted an update 18 days ago
Explore OCR, Captioning, and Visual Understanding with Cutting-Edge Models on Hugging Face. 🤗🧪

I’ve put together a collection of Google Colab notebooks to experiment with some of the most exciting models available on the Hugging Face Hub focused on OCR, image captioning, and visual understanding tasks. [Image-to-Text] / [Image-Text-to-Text]

> 📖 OCR-ReportLab-Notebooks : prithivMLmods/OCR-ReportLab-Notebooks

These notebooks are built for quick prototyping and run on free T4 GPUs, making them perfect for experimentation, testing ideas, or just exploring what’s possible with modern vision-language models.

Note: The experimental notebooks are compiled with models that fit within the T4 GPU (free-tier) limits. More models along with their notebooks will be added over time.
prithivMLmods 
posted an update 21 days ago
Excited to introduce the new experimental model "Qwen2.5-VL-7B-Abliterated-Caption-it", which is performing exceptionally well on image captioning tasks. This variant is specifically tailored for Abliterated Captioning and Uncensored Image Captioning. It is designed to generate highly detailed and descriptive captions across a broad range of visual categories including images with complex, sensitive, or nuanced content while handling varying aspect ratios and resolutions.🧪🤗

✨ Try the demo here : https://huggingface.co/spaces/prithivMLmods/Qwen2.5-VL
✨ Qwen2.5-VL-7B-Abliterated-Caption-it : prithivMLmods/Qwen2.5-VL-7B-Abliterated-Caption-it
✨ Multimodal VLMs : prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
✨ Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

.
.
.
To know more about it, visit the model card of the respective model. !!
prithivMLmods 
posted an update 22 days ago
olmOCR [Allen AI] just got an upgrade! 📈🧑‍🍳

allenai/olmOCR-7B-0725 is fine-tuned with allenai/olmOCR-mix-0225 on top of Qwen/Qwen2.5-VL-7B-Instruct, pushing the boundaries of OCR technology. It takes a single document image as input, with the longest side resized to 1288 pixels. It is a high-quality, openly available approach to parsing PDFs and other complex documents with optical character recognition.
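To make the "longest side resized to 1288 pixels" step concrete, here is a small sketch that computes the target dimensions while keeping the aspect ratio. The function name, the rounding rule, and the choice to also upscale smaller images are assumptions for illustration; the post does not describe how the olmOCR pipeline handles those details.

```python
def fit_longest_side(width, height, longest=1288):
    """Return (new_width, new_height) so the longer side equals `longest`."""
    max_side = max(width, height)
    new_width = max(1, round(width * longest / max_side))
    new_height = max(1, round(height * longest / max_side))
    return new_width, new_height


# A landscape scan and a portrait scan (sizes are made up).
print(fit_longest_side(2576, 1000))
print(fit_longest_side(600, 1200))
```

The resulting tuple can be passed to an image library's resize call before handing the page image to the model.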

Try the demo here: prithivMLmods/Multimodal-OCR

✨ Model: allenai/olmOCR-7B-0725
✨ Model [fp8]: allenai/olmOCR-7B-0725-FP8
✨ Multimodal Implementations Space Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

.
.
.
To know more about it, visit the model card of the respective model. !!
AtAndDev 
posted an update 23 days ago
Qwen 3 Coder is a personal attack to k2, and I love it.
It achieves near SOTA on LCB while not having reasoning.
Finally people are understanding that reasoning isn't necessary for high benches...

Qwen ftw!

DECENTRALIZE DECENTRALIZE DECENTRALIZE
prithivMLmods 
posted an update 25 days ago
Upgraded the step-by-step notebook for fine-tuning SigLIP2 on domain-specific image classification tasks. The notebook supports both datasets with predefined train/test splits and those with only a train split, making it suitable for low-resource, custom, and real-world classification scenarios. 📢👉

➺ FineTuning-SigLIP2-Notebook : prithivMLmods/FineTuning-SigLIP2-Notebook

➺ GitHub : https://github.com/PRITHIVSAKTHIUR/FineTuning-SigLIP-2

➺ In the first scenario, datasets include predefined train and test splits, enabling conventional supervised learning and generalization evaluation : prithivMLmods/FineTuning-SigLIP2-Notebook (.ipynb)

➺ In the second scenario, only a training split is available; in such cases, the training set is either partially reserved for validation or reused entirely for evaluation : prithivMLmods/FineTuning-SigLIP2-Notebook (.ipynb)

This flexibility supports experimentation in constrained or domain-specific settings, where standard test annotations may not exist.
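The second scenario, where part of the training split is reserved for validation, can be sketched as follows. This is a generic hold-out split in plain Python, not the notebook's actual code; `holdout_split`, the 20% fraction, the seed, and the toy file names are all hypothetical. The `datasets` library offers a comparable `Dataset.train_test_split` method for real Hub datasets.

```python
import random


def holdout_split(examples, val_fraction=0.1, seed=42):
    """Reserve a random fraction of a train-only dataset for validation."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_val = max(1, int(len(examples) * val_fraction))
    val_idx = set(indices[:n_val])
    train = [ex for i, ex in enumerate(examples) if i not in val_idx]
    val = [ex for i, ex in enumerate(examples) if i in val_idx]
    return train, val


# Toy (image path, label) pairs standing in for a train-only dataset.
samples = [(f"img_{i}.png", i % 3) for i in range(20)]
train_set, val_set = holdout_split(samples, val_fraction=0.2)
print(len(train_set), len(val_set))
```

Reusing the whole training set for evaluation, the other option mentioned above, needs no split at all but only measures fit to the training data, not generalization.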
prithivMLmods 
posted an update 27 days ago
Dropping the general-purpose reasoning dataset Poseidon-Reasoning-5M, which supports general thought processes, math, and science — featuring a diverse mixture of domains 🌊 : prithivMLmods/Poseidon-Reasoning-5M

from datasets import load_dataset

dataset = load_dataset("prithivMLmods/Poseidon-Reasoning-5M", split="data")

The compact version is as follows — Poseidon-Reasoning-Mini-300K : prithivMLmods/Poseidon-Reasoning-Mini-300K


from datasets import load_dataset

dataset = load_dataset("prithivMLmods/Poseidon-Reasoning-Mini-300K", split="train")


Collection : prithivMLmods/poseidon-reasoning-6879ca98e118b307c781a9ba
prithivMLmods 
posted an update about 1 month ago
Open Omega Ω (Forge, Atom, Explora):
A Fusion of Math, Science, and Coding 🧪🤗

Datasets :
⌯⌲ Open-Omega-Forge-1M [Mathematics, Coding, and Science]: prithivMLmods/Open-Omega-Forge-1M
⌯⌲ Open-Omega-Atom-1.5M [Mathematics and Science]: prithivMLmods/Open-Omega-Atom-1.5M
⌯⌲ Open-Omega-Explora-2.5M [Forge + Atom]: prithivMLmods/Open-Omega-Explora-2.5M
⌯⌲ Others [Subordinate portion] - Curated and blended modular dataset.

Models :
> Omega-Qwen3-Atom-8B : prithivMLmods/Omega-Qwen3-Atom-8B
> Omega-Qwen2.5-Coder-3B : prithivMLmods/Omega-Qwen2.5-Coder-3B

Dataset Collection: prithivMLmods/open-omega-a-fusion-of-math-science-and-coding-68756c37769fa39c4055cc0e

.
.
.
For more information, refer to the dataset card(s).

prithivMLmods 
posted an update about 1 month ago
Excited to bring the new models that are performing exceptionally well in document OCR, image captioning, and visual understanding tasks. Megalodon-OCR and Perseus-Doc-VL have both demonstrated significant improvements across key areas. You can explore live demos on Hugging Face Spaces to compare their performance with other top-tier models available on the hub. 🤗📄

Models & Spaces :
> Megalodon-OCR (3B) : prithivMLmods/Megalodon-OCR-Sync-0713
> Perseus-Doc-vl (7B): prithivMLmods/Perseus-Doc-vl-0712
> Doc-VLMs-OCR : https://huggingface.co/spaces/prithivMLmods/Multimodal-VLM-OCR
> core-OCR : prithivMLmods/core-OCR


Datasets Caption Mix :
> Corvus-OCR-Caption-Mix : prithivMLmods/Corvus-OCR-Caption-Mix
> Corvus-OCR-Caption-Mini-Mix : prithivMLmods/Corvus-OCR-Caption-Mini-Mix

Collections :
> Corvus OCR Caption Mix: prithivMLmods/corvus-ocr-caption-mix-687349bfaceffbd10976f0cc
> Captioning / OCR / DocTable : prithivMLmods/captioning-ocr-doctable-687382e1da822008bb5c06f2

GitHub :
> OCR-ReportLab : https://github.com/PRITHIVSAKTHIUR/OCR-ReportLab/blob/main/Megalodon-OCR-Sync-0713-ColabNotebook/Megalodon_OCR_Sync_0713_ReportLab.ipynb

Other Spaces :
> Multimodal-OCR : prithivMLmods/Multimodal-OCR
> Multimodal-VLMs : https://huggingface.co/spaces/prithivMLmods/Multimodal-OCR-Outpost
> Multimodal-OCR2 : prithivMLmods/Multimodal-OCR2
> Florence-2-Image-Caption : prithivMLmods/Florence-2-Image-Caption
> VisionScope-R2 : prithivMLmods/VisionScope-R2
> DocScope-R1 : prithivMLmods/DocScope-R1

.
.
.
To know more about it, visit the model card of the respective model. !!
prithivMLmods 
posted an update about 1 month ago
Demo of OCR & Math QA using multi-capable VLMs like MonkeyOCR-pro-1.2B, R1-One-Vision, Visionary-R1, Vision-Matters-7B, and ViGaL-7B, all running together with support for both image and video inference. 🪐

✦ GitHub : https://github.com/PRITHIVSAKTHIUR/OCR-ReportLab

✦ Models :
⤷ Visionary R1 : maifoundations/Visionary-R1
⤷ MonkeyOCR [1.2B] : echo840/MonkeyOCR-pro-1.2B
⤷ ViGaL 7B : yunfeixie/ViGaL-7B
⤷ Lh41-1042-Magellanic-7B-0711 : prithivMLmods/Lh41-1042-Magellanic-7B-0711
⤷ Vision Matters 7B : Yuting6/Vision-Matters-7B
⤷ WR30a-Deep-7B-0711 : prithivMLmods/WR30a-Deep-7B-0711

✦ MonkeyOCR-pro-1.2B Colab T4 Demo [ notebook ]
⤷ MonkeyOCR-pro-1.2B-ReportLab : https://github.com/PRITHIVSAKTHIUR/OCR-ReportLab/blob/main/MonkeyOCR-0709/MonkeyOCR-pro-1.2B-ReportLab.ipynb


The community GPU grant was given by Hugging Face — special thanks to them.🤗🚀

.
.
.
To know more about it, visit the model card of the respective model. !!
zamal 
posted an update about 1 month ago
Hey all
Finally it's happening. DeepGit lite is back now, running on CPU-only devices. Just smartly search across GitHub, spin up conversational agents in the background, and have grounded conversations with repositories.
Try it out now!!!! zamal/DeepGit
prithivMLmods 
posted an update about 1 month ago
Multimodal OCR with ReportLab? On Colab T4? (Nanonets OCR, Monkey OCR, OCRFlux 3B, Typhoon OCR 3B?) .. Yeah, it’s possible. I’ve made dedicated Colab notebooks to experiment with these models (all built on top of Qwen2.5 VL). 🤗🚀

Download notebooks here :

✦︎ NanonetsOCR : https://colab.research.google.com/drive/1VvA-amvSVxGdWgIsh4_by6KWOtEs_Iqp
✦︎ MonkeyOCR : https://colab.research.google.com/drive/1vPCojbmlXjDFUt06FJ1tjgnj_zWK4mUo
✦︎ OCRFluxOCR : https://colab.research.google.com/drive/1TDoCXzWdF2hxVLbISqW6DjXAzOyI7pzf
✦︎ TyphoonOCR : https://colab.research.google.com/drive/1_59zvLNnn1kvbiSFxzA1WiqhpbW8RKbz

🜲 GitHub : https://github.com/PRITHIVSAKTHIUR/OCR-ReportLab-Notebooks

What does it do?

1. Performs OCR on the input image
2. Generates a DOCX or PDF file with the input image and the extracted text

.
.
.
To know more about it, visit the model card of the respective model. !!
prithivMLmods 
posted an update about 1 month ago
A bunch of comparable demos for multimodal VLMs (which excel in OCR, cinematography understanding, spatial reasoning, etc.) are now up on the Hub 🤗, covering the most recent models up to Jun'25.

✦ Demo Spaces —

> [Nanonets-OCR-s, MonkeyOCR, Typhoon-OCR-7B, SmolDocling] : prithivMLmods/Multimodal-OCR2
> [GLM-4.1v, docscopeOCR-7B, MonkeyOCR, coreOCR-7B] : prithivMLmods/core-OCR
> [Camel-Doc-OCR, ViLaSR-7B, OCRFlux-3B, ShotVL-7B] : https://huggingface.co/spaces/prithivMLmods/Multimodal-VLM-OCR
> [SkyCaptioner-V1, SpaceThinker-3B, coreOCR-7B, SpaceOm-3B] : prithivMLmods/VisionScope-R2
> [RolmOCR-7B, Qwen2-VL-OCR-2B, Aya-Vision-8B, Nanonets-OCR-s] : prithivMLmods/Multimodal-OCR
> [DREX-062225-7B, Typhoon-OCR-3B, olmOCR-7B-0225, VIREX-062225-7B] : prithivMLmods/Multimodal-VLM-Thinking
> [Cosmos-Reason1-7B, docscopeOCR-7B, Captioner-7B, visionOCR-3B] : prithivMLmods/DocScope-R1

✦ Space Collection : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

.
.
.
To know more about it, visit the model card of the respective model. !!