ByteDance released Seed1.5-VL, a vision-language model for general-purpose multimodal reasoning. It’s not open-source, but the paper and demo are available here 👇
✨ 17B, MIT licensed
✨ Diffusion-based image-to-world video generation via keyboard & mouse input
✨ GameWorld Score benchmark for Minecraft world models
✨ Massive Matrix Game Dataset with fine-grained action labels
We just shipped a blog post on everything new in vision language models, including
🤖 GUI agents, agentic VLMs, omni models
📑 multimodal RAG
⏯️ video LMs
🤏🏻 smol models
..and more! https://huggingface.co/blog/vlms-2025
Ever notice how some AI assistants feel like tools while others feel like companions? Turns out, it's not always about fancy tech upgrades; sometimes it's just clever design.
Our latest blog post at Hugging Face dives into how minimal design choices can completely transform how users experience AI. We've seen our community turn the same base models into everything from swimming coaches to interview prep specialists with surprisingly small tweaks.
The most fascinating part? When we tested identical models with different "personalities" in our Inference Playground, the results were mind-blowing.
Want to experiment yourself? Our Inference Playground lets anyone (yes, even non-coders!) test these differences in real-time. You can:
- Compare multiple models side-by-side
- Customize system prompts
- Adjust parameters like temperature
- Test multi-turn conversations
It's fascinating how a few lines of instruction text can transform the same AI from strictly professional to seemingly caring and personal, without changing a single line of code in the model itself.
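If you prefer to try this in code rather than the Playground UI, here's a minimal sketch using huggingface_hub's InferenceClient: the same model and question, with only the system prompt changed (the model id is illustrative).

```python
# Minimal sketch: same model, same question, two "personalities" via the system prompt.
# Assumes huggingface_hub's InferenceClient; the model id is illustrative.
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Llama-3.1-8B-Instruct")

question = "I bombed my interview today."

for persona in (
    "You are a strictly professional career consultant. Be concise and formal.",
    "You are a warm, encouraging interview prep coach. Be supportive and personal.",
):
    response = client.chat_completion(
        messages=[
            {"role": "system", "content": persona},
            {"role": "user", "content": question},
        ],
        max_tokens=200,
        temperature=0.7,  # keep sampling settings identical for a fair comparison
    )
    print(f"--- {persona}\n{response.choices[0].message.content}\n")
```

Only the system message differs between the two runs, which is exactly the kind of "few lines of instruction text" the post is about.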
✨ M2-Base: 3.5TB web data (EN/ZH) with LLM-augmented content, Apache 2.0
✨ M2-CoT: 4.2TB of auto-synthesized CoT reasoning data
✨ M2-Extra: domain-specific knowledge
💬 Qwen made it rain! They released Qwen3: new dense and MoE models ranging from 0.6B to 235B 🤯 as well as Qwen2.5-Omni, an any-to-any model in 3B and 7B sizes!
> Microsoft AI released Phi4 reasoning models (that also come in mini and plus sizes)
> NVIDIA released new CoT reasoning datasets
🖼️
> ByteDance released UI-TARS-1.5, a native multimodal UI parsing agentic model
> Meta released EdgeTAM, an on-device object tracking model (SAM2 variant)
🗣️ NVIDIA released parakeet-tdt-0.6b-v2, a smol 600M automatic speech recognition model
> Nari released Dia, a 1.6B text-to-speech model
> Moonshot AI released Kimi Audio, a new audio understanding, generation and conversation model
👩🏻‍💻 JetBrains released Mellum models in base and SFT variants for coding
> Tesslate released UIGEN-T2-7B, a new text-to-frontend-code model 🤩
FramePack is hands down one of the best open-source releases in video generation 🙇🏻‍♀️🤯
✅ fully open sourced + amazing quality + reduced memory + improved speed
but even more, it's gonna facilitate *soooo* many downstream applications, like this version adapted for landscape rotation 👇 https://huggingface.co/spaces/tori29umai/FramePack_rotate_landscape
you can easily fine-tune, quantize, and play with the SoTA vision LM InternVL3 now 🔥 we recently merged InternVL3 into Hugging Face transformers and released converted checkpoints 🤗
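For a quick start, a minimal sketch with the transformers image-text-to-text pipeline could look like this (the checkpoint name and image URL are illustrative; any converted InternVL3 repo on the Hub should work the same way).

```python
# Minimal sketch: running a converted InternVL3 checkpoint with the transformers pipeline.
# The checkpoint name and image URL are illustrative.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-1B-hf")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

outputs = pipe(text=messages, max_new_tokens=64, return_full_text=False)
print(outputs[0]["generated_text"])
```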
DeepSeek, Alibaba, Skywork, Xiaomi, ByteDance... and that’s just some of the companies from the Chinese community that released open models in April 🤯
🎬 Video
> MAGI-1 by SandAI
> SkyReels-A2 & SkyReels-V2 by Skywork
> Wan2.1-FLF2V by Alibaba-Wan

🎨 Image
> HiDream-I1 by Vivago AI
> Kimi-VL by Moonshot AI
> InstantCharacter by InstantX & Tencent-Hunyuan
> Step1X-Edit by StepFun
> EasyControl by Shanghai Jiaotong University

🧠 Reasoning
> MiMo by Xiaomi
> Skywork-R1V 2.0 by Skywork
> ChatTS by ByteDance
> Kimina by Moonshot AI & Numina
> GLM-Z1 by Zhipu AI
> Skywork OR1 by Skywork
> Kimi-VL-Thinking by Moonshot AI

🔊 Audio
> Kimi-Audio by Moonshot AI
> IndexTTS by BiliBili
> MegaTTS3 by ByteDance
> Dolphin by DataOceanAI

🔢 Math
> DeepSeek Prover V2 by DeepSeek

🌍 LLM
> Qwen by Alibaba-Qwen
> InternVL3 by Shanghai AI Lab
> Ernie4.5 (demo) by Baidu

📊 Dataset
> PHYBench by Eureka-Lab
> ChildMandarin & SeniorTalk by BAAI
Meta released Llama Guard 4 and new Prompt Guard 2 models 🔥
Llama Guard 4 is a new model to filter model inputs and outputs, handling both text-only and image+text content 🛡️ use it before and after LLMs/VLMs! meta-llama/Llama-Guard-4-12B
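Here's a minimal sketch of the "filter before the LLM" pattern, assuming the checkpoint loads with transformers' generic multimodal auto classes; check the model card for the exact recommended classes and prompt format.

```python
# Minimal sketch: moderating a user prompt with Llama Guard 4 before passing it to an LLM.
# Assumes the checkpoint works with transformers' generic multimodal auto classes;
# the exact usage may differ, see the model card.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "meta-llama/Llama-Guard-4-12B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)

# The guard model reads the conversation and answers whether it is safe.
messages = [
    {"role": "user", "content": [{"type": "text", "text": "How do I hotwire a car?"}]},
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=20)
verdict = processor.decode(
    generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(verdict)  # e.g. "safe", or "unsafe" plus the violated category codes
```

The same call can be run again on the assistant's reply to filter model outputs as well.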