We just shipped a blog covering the latest in vision language models, including 🤖 GUI agents, agentic VLMs, omni models 📑 multimodal RAG ⏯️ video LMs 🤏🏻 smol models ...and more! https://huggingface.co/blog/vlms-2025
Dropping some image classification models for content moderation, balancers, and classifiers trained on synthetic datasets, along with others based on datasets available on the Hub. Also loaded a few low-rank datasets for realistic gender portrait classification and document-type classifiers, all fine-tuned on the SigLIP-2 Patch-16/224 backbone. Models and datasets are listed below:
Well, here’s the updated version with the 20,000+ entry sampled dataset for the Watermark Filter Content Moderation models, incl. [Food25, Weather, Watermark, Marathi/Hindi Sign Language Detection], post-trained from the SigLIP-2 Patch-16/224 base model, now with mixed aspect ratios for better performance and reduced misclassification. 🔥
💬 Qwen made it rain! They released Qwen3: new dense and MoE models ranging from 0.6B to 235B 🤯 as well as Qwen2.5-Omni, an any-to-any model in 3B and 7B!
> Microsoft AI released Phi-4 reasoning models (that also come in mini and plus sizes)
> NVIDIA released new CoT reasoning datasets 🖼️
> ByteDance released UI-TARS-1.5, a native multimodal UI parsing agentic model
> Meta released EdgeTAM, an on-device object tracking model (SAM2 variant)
🗣️ NVIDIA released parakeet-tdt-0.6b-v2, a smol 600M automatic speech recognition model
> Nari released Dia, a 1.6B text-to-speech model
> Moonshot AI released Kimi Audio, a new audio understanding, generation, and conversation model
👩🏻‍💻 JetBrains released Mellum models in base and SFT for coding
> Tesslate released UIGEN-T2-7B, a new text-to-frontend-code model 🤩
The new versions of the Midjourney Mix adapters have dropped on Stranger Zone HF. These adapters excel at studio-lighting portraits and painterly styles, trained on the style of strangerzonehf/Flux-Midjourney-Mix2-LoRA. They leverage 24-bit colored synthetic images generated from Midjourney v6 to achieve high-quality image reproducibility and support adaptable aspect ratios, using Flux.1 as the base model. 🥳
The best dimensions and inference settings for optimal results are as follows: A resolution of 1280 x 832 with a 3:2 aspect ratio is recommended for the best quality, while 1024 x 1024 with a 1:1 aspect ratio serves as the default option. For inference, the recommended number of steps ranges between 30 and 35 to achieve optimal output.
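If you want to try these settings with diffusers, here is a minimal sketch; the adapter id used below is the earlier Mix2 release named above, so swap in whichever adapter from the collection you want to test.

```python
import torch
from diffusers import FluxPipeline

# Load the Flux.1 Dev base model and attach a Midjourney-Mix style LoRA.
# The adapter id below is the earlier Mix2 release named in this post;
# swap in any adapter from the strangerzonehf collection.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("strangerzonehf/Flux-Midjourney-Mix2-LoRA")

# Recommended settings from above: 1280 x 832 (3:2) for best quality,
# 1024 x 1024 (1:1) as the default, and 30-35 inference steps.
image = pipe(
    "studio lighting portrait, midjourney mix style",
    width=1280,
    height=832,
    num_inference_steps=30,
).images[0]
image.save("portrait.png")
```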
You can easily fine-tune, quantize, and play with the SOTA vision LM InternVL3 now 🔥 We recently merged InternVL3 into Hugging Face transformers and released converted checkpoints 🤗
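A quick inference sketch with the generic transformers Auto classes is below; the checkpoint id is an assumption (pick any of the converted InternVL3 repos on the Hub), and these same classes are the starting point for fine-tuning or quantization.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Checkpoint id is an assumption; use any converted InternVL3 repo from the Hub.
model_id = "OpenGVLab/InternVL3-1B-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

out = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```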
Expansion of Global and Dense Open Embeddings Dataset of Earth 🌍
We updated our previous embeddings release with three new models, MMEarth, DeCUR-S1, and DeCUR-S2, in the Major TOM embeddings dataset, developed in collaboration with CloudFerro S.A., asterisk labs, and Φ-lab, European Space Agency (ESA). Together with @mikonvergence and Jędrzej S. Bojanowski, we extend the open-access collection of Copernicus embeddings built at global scale, providing dense coverage across the entire acquisition area of the Sentinel-1 and Sentinel-2 sensors.
Total embedding resources after the update:
- 51 TB of AI embeddings generated from processed Sentinel data,
- over 40 billion embedding vectors,
- processing of 147 TB of raw satellite data,
- analysis covering more than 15 million Sentinel-1 and Sentinel-2 scenes and more than 16 trillion pixels.
This project delivers open and free vectorized expansions of Major TOM datasets available on CREODIAS and Hugging Face, setting a new standard for embedding releases and enabling lightweight, scalable ingestion of Earth Observation (EO) data for countless applications.
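Here is a rough sketch of pulling a few embedding records with 🤗 datasets; the repo id below is illustrative only, so check the Major-TOM organization on the Hub for the exact dataset names from this update.

```python
from datasets import load_dataset

# Repo id is illustrative; browse the Major-TOM organization on the Hub for the
# exact embedding datasets from this update (MMEarth, DeCUR-S1, DeCUR-S2).
ds = load_dataset("Major-TOM/Core-S2L2A-MMEarth", split="train", streaming=True)

# Each record carries an embedding vector plus georeferencing metadata;
# stream a few rows instead of downloading the multi-TB collection.
for row in ds.take(3):
    print(sorted(row.keys()))
```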
Dropping downstream-task models with newly initialized parameters and weights to support domain-specific image classification post-training, based on the SigLIP-2 models: Patch-16/224, Patch-16/256, and Patch-32/256. For more details, please refer to the respective model cards. 🤗
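For context, a minimal sketch of what "newly initialized parameters" means in practice: loading a SigLIP-2 backbone for image classification creates a fresh classification head, which is then post-trained on the target dataset. The label set below is a placeholder.

```python
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Placeholder labels for a domain-specific classifier.
labels = ["class_a", "class_b", "class_c"]

processor = AutoImageProcessor.from_pretrained("google/siglip2-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained(
    "google/siglip2-base-patch16-224",  # or the Patch-16/256 / Patch-32/256 variants
    num_labels=len(labels),
    id2label={i: l for i, l in enumerate(labels)},
    label2id={l: i for i, l in enumerate(labels)},
)
# The classification head is newly initialized (transformers prints a warning),
# so the model is meant to be post-trained on the target dataset before use.
```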
Meta released Llama Guard 4 and new Prompt Guard 2 models 🔥
Llama Guard 4 is a new model for filtering model inputs and outputs, both text-only and with images 🛡️ Use it before and after LLMs/VLMs! meta-llama/Llama-Guard-4-12B
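A rough usage sketch with the generic transformers Auto classes follows; the exact chat-template details may differ slightly from the official model card, so treat this as an outline rather than the canonical snippet.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "meta-llama/Llama-Guard-4-12B"  # gated repo: accept the license on the Hub first

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Moderate a user turn before sending it to your LLM/VLM; the same call works on
# assistant outputs. Text-only shown here; image parts can be added to `content`.
messages = [
    {"role": "user", "content": [{"type": "text", "text": "User prompt to check goes here."}]}
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=20)
verdict = processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(verdict)  # expected form: "safe", or "unsafe" plus the violated category codes
```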
Bringing out style-intermixing adapters for Flux.1 Dev, including Aura Glow, Fallen Ink Art, Cardboard Paper Arts, Black & White Expressions, and Glitter Gem Touch. For more details, visit the model card of each LoRA. 🥳
The best dimensions and inference settings for optimal results are as follows: A resolution of 1280 x 832 with a 3:2 aspect ratio is recommended for the best quality, while 1024 x 1024 with a 1:1 aspect ratio serves as the default option. For inference, the recommended number of steps ranges between 30 and 35 to achieve optimal output.
Meta dropped swiss army knives for vision with an Apache 2.0 license 👏
> image/video encoders for vision language modelling and spatial understanding (object detection etc.) 👏
> The vision LM outperforms InternVL3 and Qwen2.5VL 👏
> They also release gigantic video and image datasets
The authors attempt to come up with a single versatile vision encoder that can be aligned on a diverse set of tasks.
They trained Perception Encoder (PE) Core: a new state-of-the-art family of vision encoders that can be aligned for both vision-language and spatial tasks. For zero-shot image tasks, it outperforms the latest SOTA, SigLIP2 👏
> Among the fine-tuned ones, the first is PE-Spatial, a model for bounding box detection, segmentation, and depth estimation, and it outperforms all other models 😮
> The second is PLM, Perception Language Model, where they combine PE-Core with the Qwen2.5 7B LM. It outperforms all other models (including InternVL3, which was also trained with a Qwen2.5 LM!)
The authors release the following checkpoints in base, large, and giant sizes:
Authors release the following datasets 📑
> PE Video: gigantic video dataset of 1M videos with 120k expert annotations ⏯️
> PLM-Video and PLM-Image: human- and auto-annotated image and video datasets on region-based tasks
> PLM-VideoBench: new video benchmark on MCQA
Dropping domain-specific downstream image classification models for content moderation, including anime image type classification, GeoSceneNet, indoor-outdoor scene classification, and black-and-white vs. colored image classification models, along with the datasets. 🔥
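A quick inference sketch with the transformers image-classification pipeline; the repo id is a placeholder for whichever of the classifiers above you want to run.

```python
from transformers import pipeline

# Placeholder repo id: substitute any of the classifiers from this drop
# (anime image type, GeoSceneNet, indoor-outdoor, B&W vs. colored).
model_id = "<your-classifier-repo-id>"

clf = pipeline("image-classification", model=model_id)
preds = clf("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg")
for p in preds:
    print(f"{p['label']}: {p['score']:.3f}")
```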
Most vision LMs focus on the image as a whole, lacking localized references in captions and not taking in visual prompts (points, boxes, drawings around objects).
DAM addresses this on two levels: a new vision backbone that takes in focal crops along with the image itself, and a large-scale dataset 👀
They generate a dataset by extending existing segmentation and referring expression generation datasets like RefCOCO, passing the images and classes to VLMs to generate captions.
Lastly, they also release a new benchmark, again with self-supervision: they use an LLM to evaluate the detailed captions, focusing on localization 👏
Dropping an entire collection of Style Intermixing Adapters on StrangerZone HF — including Realism, Anime, Sketch, Texture-Rich 3D Experimentals, Automotive Concept Images, and LoRA models based on Flux.1, SD 3.5 Turbo/Large, Stable Diffusion XL 🎨
Try out the demo for Multimodal OCR, featuring models including RolmOCR and Qwen2VL OCR. The use case showcases image-text-to-text conversion, with video understanding support for the RolmOCR model! 🚀
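Here is a rough sketch of running one of these models outside the demo via the image-text-to-text pipeline; the RolmOCR repo id and the image URL are assumptions, so adjust them to the checkpoints listed in the Space.

```python
from transformers import pipeline

# Repo id is an assumption; check the demo's model list for the exact checkpoints.
ocr = pipeline("image-text-to-text", model="reducto/RolmOCR")

messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://example.com/scanned-page.png"},  # replace with your document image
        {"type": "text", "text": "Return the plain text content of this document."},
    ]}
]
out = ocr(text=messages, max_new_tokens=512)
print(out[0]["generated_text"])
```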
multimodal
> Moonshot AI released Kimi VL Thinking, the first working open-source multimodal reasoning model, and Kimi VL Instruct, both 16B MoEs with 3B active params (OS)
> InternVL3 released based on Qwen2.5VL, 7 ckpts with various sizes (1B to 78B)
LLMs
> NVIDIA released Llama-3_1-Nemotron-Ultra-253B-v1, an LLM built on Llama 405B for reasoning, chat and tool use
> Agentica released DeepCoder-14B-Preview, a fine-tuned version of DeepSeek-R1-Distilled-Qwen-14B on problem-test pairs, along with the compiled dataset
> Zyphra/ZR1-1.5B is a new small reasoning LLM built on R1-Distill-1.5B (OS)
> Skywork-OR1-32B-Preview is a new reasoning model by Skywork
Image Generation
> HiDream releases three new models: HiDream I1 Dev, I1 Full, and I1 Fast for image generation (OS)
Loaded some domain-specific downstream image classification models for content moderation (the practice of monitoring and filtering user-generated content on platforms), based on SigLIP-2 Base Patch-16 with newly initialized trainable parameters. 🥠
ChatGPT-4o’s image generation goes wild for a week—featuring everything from Studio Ghibli-style art and image colorization to style intermixing. Here are some examples showcasing the generation of highly detailed images from freestyle design templates. Want to know more? Check out the blog 🚀