Next-Embedding Prediction Makes Strong Vision Learners Paper • 2512.16922 • Published 7 days ago • 77
Towards Scalable Pre-training of Visual Tokenizers for Generation Paper • 2512.13687 • Published 10 days ago • 93
From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images Paper • 2511.22805 • Published 28 days ago • 3
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer Paper • 2511.22699 • Published 28 days ago • 212
REASONEDIT: Towards Reasoning-Enhanced Image Editing Models Paper • 2511.22625 • Published 28 days ago • 46
Kimi Linear: An Expressive, Efficient Attention Architecture Paper • 2510.26692 • Published Oct 30 • 119
The End of Manual Decoding: Towards Truly End-to-End Language Models Paper • 2510.26697 • Published Oct 30 • 116
Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models Paper • 2510.08492 • Published Oct 9 • 8
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training Paper • 2509.26625 • Published Sep 30 • 43
Residual Off-Policy RL for Finetuning Behavior Cloning Policies Paper • 2509.19301 • Published Sep 23 • 18
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer Paper • 2509.16197 • Published Sep 19 • 56