✨ 17B parameters, MIT licensed
✨ Diffusion-based image-to-world video generation via keyboard & mouse input
✨ GameWorld Score benchmark for Minecraft world models
✨ Massive Matrix Game Dataset with fine-grained action labels
We just shipped a blog covering the latest in vision language models, including GUI agents, agentic VLMs, omni models, multimodal RAG, video LMs, smol models, and more! https://huggingface.co/blog/vlms-2025