arxiv:2603.12793

Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

Published on Mar 13
· Submitted by PengDa on Mar 16
#2 Paper of the day
Authors: Yichen Zhang, Da Peng, Zonghao Guo, Zijian Zhang, Xuesong Yang, Tong Sun, Shichu Sun, Yidan Zhang, Yanghao Li, Haiyan Zhao, Wang Xu, Qi Shi, Yangang Sun, Chi Chen, Shuo Wang, Yukun Yan, Xu Han, Qiang Ma, Wei Ke, Liang Wang, Zhiyuan Liu, Maosong Sun
Abstract

Cheers is a unified multimodal model that decouples visual details from semantic representations using a vision tokenizer, LLM-based Transformer, and cascaded flow matching head to achieve efficient joint optimization for both visual understanding and generation tasks.

AI-generated summary

A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making it non-trivial to jointly optimize within a shared feature space. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Cheers also achieves 4x token compression, enabling more efficient high-resolution image encoding and generation. Notably, Cheers outperforms Tar-1.5B on the popular benchmarks GenEval and MMBench, while requiring only 20% of the training cost, indicating effective and efficient (i.e., 4x token compression) unified multimodal modeling. We will release all code and data for future research.
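The gated detail-residual idea in component (iii) can be illustrated with a minimal NumPy sketch. All shapes and the gating projection `W_gate` here are hypothetical placeholders, not the authors' implementation: the key point is that a gate computed from the semantic features alone decides, per token and channel, how much patch-level detail gets injected back into the decoded semantics.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical dimensions: 64 semantic tokens with 256-dim features.
num_tokens, dim = 64, 256
semantic = rng.standard_normal((num_tokens, dim))        # decoded semantic canvas
detail = rng.standard_normal((num_tokens, dim))          # detail residuals from the vision tokenizer
W_gate = rng.standard_normal((dim, dim)) / np.sqrt(dim)  # illustrative gating projection

# Semantically gated residual injection: the gate depends only on the
# semantic features, so detail is attenuated where semantics do not call for it.
gate = sigmoid(semantic @ W_gate)   # per-token, per-channel gate in (0, 1)
refined = semantic + gate * detail  # refine high-frequency content

print(refined.shape)  # (64, 256)
```

Since the gate is strictly inside (0, 1), the refinement is always a damped blend rather than a hard switch, which is one plausible reason gating could stay stable under noisy semantics.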

Community

Paper author Paper submitter


For any questions or collaborations, feel free to contact us : )

📧 guozonghao96@outlook.com    |    📧 yichen0zhang@gmail.com    |    📧 MetaPDa@gmail.com   

If you find Cheers useful, please cite the Cheers technical report using this BibTeX.

@article{zhang2026cheers,
  title={Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation},
  author={Yichen Zhang and Da Peng and Zonghao Guo and Zijian Zhang and Xuesong Yang and Tong Sun and Shichu Sun and Yidan Zhang and Yanghao Li and Haiyan Zhao and Wang Xu and Qi Shi and Yangang Sun and Chi Chen and Shuo Wang and Yukun Yan and Xu Han and Qiang Ma and Wei Ke and Liang Wang and Zhiyuan Liu and Maosong Sun},
  journal={arXiv preprint arXiv:2603.12793},
  year={2026}
}

the cascaded flow matching head, which first builds a low-res semantic canvas and then injects semantically gated detail residuals from the vision tokenizer, is the part that really stands out to me. the 4x token compression for the vision tokens plus the semantic gating feels like a practical win, but i'm curious how robust that gating stays when the semantic tokens get noisy or when fine textures matter most. btw, the arxivlens walkthrough of the two-stage generation flow helped me parse the method details: https://arxivlens.com/PaperView/Details/cheers-decoupling-patch-details-from-semantic-representations-enables-unified-multimodal-comprehension-and-generation-2877-4250787b. a quick ablation removing the cfm head would be a nice sanity check to confirm where the bulk of the gains actually come from.
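On the 4x token compression point: the paper text above does not spell out the compression mechanism, but one common way vision tokenizers achieve a fixed 4x reduction is to merge each 2x2 neighborhood of patch tokens into a single token (pixel-unshuffle style). The sketch below is purely illustrative of that idea, with made-up grid sizes, not Cheers' actual tokenizer.

```python
import numpy as np

def compress_tokens_2x2(tokens, grid_h, grid_w):
    """Merge each 2x2 neighborhood of patch tokens into one token by
    concatenating features: 4x fewer tokens, 4x wider features."""
    num, dim = tokens.shape
    assert num == grid_h * grid_w and grid_h % 2 == 0 and grid_w % 2 == 0
    x = tokens.reshape(grid_h // 2, 2, grid_w // 2, 2, dim)
    x = x.transpose(0, 2, 1, 3, 4)  # group each 2x2 spatial block together
    return x.reshape((grid_h // 2) * (grid_w // 2), 4 * dim)

# Illustrative 16x16 patch grid with 8-dim features -> 256 tokens in, 64 out.
tokens = np.arange(16 * 16 * 8, dtype=np.float32).reshape(256, 8)
compressed = compress_tokens_2x2(tokens, 16, 16)
print(compressed.shape)  # (64, 32)
```

Under this reading, the LLM conditions on a quarter as many visual tokens, which is where the efficiency gain for high-resolution encoding would come from; the tokenizer would still need a learned projection on top of the raw concatenation.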


Models citing this paper 1

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 1