X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains
Abstract
Recent proprietary models (e.g., o3) have begun to demonstrate strong multimodal reasoning capabilities. Yet most existing open-source research concentrates on training text-only reasoning models, with evaluations limited mainly to mathematical and general-domain tasks. It therefore remains unclear how to effectively extend reasoning capabilities beyond text inputs and general domains. This paper explores a fundamental research question: Is reasoning generalizable across modalities and domains? Our findings support an affirmative answer: general-domain, text-based post-training can enable strong, generalizable reasoning. Leveraging this finding, we introduce X-Reasoner, a vision-language model post-trained solely on general-domain text for generalizable reasoning, using a two-stage approach: an initial supervised fine-tuning phase with distilled long chains of thought, followed by reinforcement learning with verifiable rewards. Experiments show that X-Reasoner successfully transfers its reasoning capabilities to both multimodal and out-of-domain settings, outperforming existing state-of-the-art models trained with in-domain and multimodal data across various general and medical benchmarks (Figure 1). We also find that X-Reasoner's performance in specialized domains can be further enhanced through continued training on domain-specific, text-only data. Building on this, we introduce X-Reasoner-Med, a medical-specialized variant that achieves a new state of the art on numerous text-only and multimodal medical benchmarks.
Community
🧠 X-Reasoner — a 7B vision-language model post-trained for reasoning purely on general-domain text, without any images or domain-specific data.
X-Reasoner achieves the state of the art 🏆 on challenging multimodal tasks (e.g., 43.0 on MMMU-Pro) and medical benchmarks (e.g., 45.7 on the NEJM Image Challenge).
🧵 Most open-source work on reasoning models focuses on text inputs and general domains. But real-world reasoning often spans multiple modalities (like vision+text) and specialized domains (like healthcare). We ask:
👉 Can reasoning be made generalizable with only text-based post-training?
Key idea → A two-stage recipe:
🔹 SFT on text-only general-domain long CoTs
🔹 RL with verifiable rewards on text-only math questions (reward sketch after this list)
No images, no domain-specific data—just general text.
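For the RL stage, "verifiable rewards" means the final answer can be checked by a rule rather than a reward model. Here is a minimal Python sketch of such a reward for math questions; the function name and the `\boxed{}` extraction are illustrative assumptions, not the paper's exact implementation.

```python
import re

def verifiable_math_reward(completion: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 iff the model's final boxed answer
    matches the reference answer exactly. Hypothetical sketch; the
    paper's actual extraction/matching rules may differ."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if not boxed:
        return 0.0  # no parseable final answer -> no reward
    return 1.0 if boxed[-1].strip() == gold_answer.strip() else 0.0

# Example: a chain of thought ending in \boxed{42} earns reward 1.0
assert verifiable_math_reward(r"... so the result is \boxed{42}.", "42") == 1.0
```

Because the reward is computed by a rule, the RL stage needs no learned reward model and cannot be fooled by plausible-sounding but wrong reasoning.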
This recipe powers X-Reasoner, a 7B-scale vision-language model. Despite being trained only on general-domain text, it:
✅ Transfers to multimodal tasks (e.g., MathVista, MMMU-Pro)
✅ Outperforms 7B SOTA models trained with multimodal supervision
✅ Excels in unseen domains like medicine
💡 Why it works
🔑 Math as an anchor: RL on math yields reasoning chains that generalize better than domain-specific RL alone.
🔑 Forced-exit token prevents “infinite thinking,” boosting reliability (decoding sketch below).
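A minimal sketch of how such a forced exit could work at decoding time, assuming the model wraps its reasoning in a `</think>`-style delimiter; the delimiter, budget, and function below are assumptions for illustration, not the paper's exact mechanism.

```python
def generate_with_forced_exit(step_fn, prompt_ids, eos_id, exit_ids,
                              max_think_tokens=4096, max_new_tokens=8192):
    """Greedy decoding with a budget on 'thinking' tokens.

    step_fn(ids) -> next token id (e.g., argmax over the model's logits).
    exit_ids is the tokenized exit delimiter (e.g., "</think>"). If the
    model has not closed its reasoning within max_think_tokens, we inject
    exit_ids ourselves so generation must move on to the final answer.
    """
    ids = list(prompt_ids)
    thinking, think_len = True, 0
    while len(ids) - len(prompt_ids) < max_new_tokens:
        tok = step_fn(ids)
        ids.append(tok)
        if tok == eos_id:
            break
        if thinking:
            think_len += 1
            if ids[-len(exit_ids):] == exit_ids:
                thinking = False          # model exited on its own
            elif think_len >= max_think_tokens:
                ids.extend(exit_ids)      # forced exit: end the thought
                thinking = False
    return ids
```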
Ablation ☑️: Remove every example solvable from text alone, and the gains persist. The model is truly reading the image, not gaming the benchmark.
🩺 We then add a dash of medical text → X-Reasoner-Med. No images needed, just additional MedQA SFT + RL, and we set a new 7B SOTA on MedQA, OmniMedVQA, MMMU-Health, MedXpertQA-MM, and the NEJM Image Challenge.
🔬 TL;DR:
General-domain text-based reasoning is more powerful than we thought.
With X-Reasoner, we show that high-quality reasoning models can be trained without costly multimodal or domain-specific supervision, and can still outperform models that use it.
📌 Paper: https://arxiv.org/abs/2505.03981
🔗 Models: https://github.com/microsoft/x-reasoner (coming soon)
📊 Benchmarks: MMMU, MathVista, MedQA, NEJM, and more
🤖 Model size: 7B
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models (2025)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 (2025)
- OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement (2025)
- Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models (2025)
- VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning (2025)
- NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation (2025)
- Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1) (2025)