X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains
Abstract
Recent proprietary models (e.g., o3) have begun to demonstrate strong multimodal reasoning capabilities. Yet most existing open-source research concentrates on training text-only reasoning models, with evaluations limited mainly to mathematical and general-domain tasks. It therefore remains unclear how to effectively extend reasoning capabilities beyond text inputs and general domains. This paper explores a fundamental research question: Is reasoning generalizable across modalities and domains? Our findings support an affirmative answer: general-domain, text-based post-training can enable strong, generalizable reasoning. Leveraging this finding, we introduce X-Reasoner, a vision-language model post-trained solely on general-domain text for generalizable reasoning, using a two-stage approach: an initial supervised fine-tuning phase with distilled long chains of thought, followed by reinforcement learning with verifiable rewards. Experiments show that X-Reasoner successfully transfers its reasoning capabilities to both multimodal and out-of-domain settings, outperforming existing state-of-the-art models trained with in-domain and multimodal data across various general and medical benchmarks (Figure 1). We also find that X-Reasoner's performance in specialized domains can be further enhanced through continued training on domain-specific, text-only data. Building on this, we introduce X-Reasoner-Med, a medical-specialized variant that achieves a new state of the art on numerous text-only and multimodal medical benchmarks.
Community
🧠 X-Reasoner — a 7B vision-language model post-trained for reasoning purely on general-domain text, without any images or domain-specific data.
X-Reasoner achieves the state of the art 🏆 on challenging multimodal tasks (e.g., 43.0 on MMMU-Pro) and medical benchmarks (e.g., 45.7 on the NEJM Image Challenge).
🧵 Most open-source work on reasoning models focuses on text inputs and general domains. But real-world reasoning often spans multiple modalities (like vision+text) and specialized domains (like healthcare). We ask:
👉 Can reasoning be made generalizable with only text-based post-training?
Key idea → A two-stage recipe:
🔹 SFT on text-only general-domain long CoTs
🔹 RL with verifiable rewards on text-only math questions (reward sketch after this list)
No images, no domain-specific data—just general text.
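For the RL stage, "verifiable rewards" means the final answer can be checked by a rule rather than a reward model. Here is a minimal Python sketch of such a reward for math questions; the function name and the `\boxed{}` extraction are illustrative assumptions, not the paper's exact implementation.

```python
import re

def verifiable_math_reward(completion: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 iff the model's final boxed answer
    matches the reference answer exactly. Hypothetical sketch; the
    paper's actual extraction/matching rules may differ."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if not boxed:
        return 0.0  # no parseable final answer -> no reward
    return 1.0 if boxed[-1].strip() == gold_answer.strip() else 0.0

# Example: a chain of thought ending in \boxed{42} earns reward 1.0
assert verifiable_math_reward(r"... so the result is \boxed{42}.", "42") == 1.0
```

Because the reward is computed by a rule, the RL stage needs no learned reward model and cannot be fooled by plausible-sounding but wrong reasoning.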
This recipe powers X-Reasoner, a 7B-scale vision-language model. Despite being trained only on general-domain text, it:
✅ Transfers to multimodal tasks (e.g., MathVista, MMMU-Pro)
✅ Outperforms 7B SOTA models trained with multimodal supervision
✅ Excels in unseen domains like medicine
💡 Why it works
🔑 Math as an anchor: RL on math yields reasoning chains that generalize better than domain-specific RL alone.
🔑 Forced-exit token prevents “infinite thinking,” boosting reliability (decoding sketch below).
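A minimal sketch of how such a forced exit could work at decoding time, assuming the model wraps its reasoning in a `</think>`-style delimiter; the delimiter, budget, and function below are assumptions for illustration, not the paper's exact mechanism.

```python
def generate_with_forced_exit(step_fn, prompt_ids, eos_id, exit_ids,
                              max_think_tokens=4096, max_new_tokens=8192):
    """Greedy decoding with a budget on 'thinking' tokens.

    step_fn(ids) -> next token id (e.g., argmax over the model's logits).
    exit_ids is the tokenized exit delimiter (e.g., "</think>"). If the
    model has not closed its reasoning within max_think_tokens, we inject
    exit_ids ourselves so generation must move on to the final answer.
    """
    ids = list(prompt_ids)
    thinking, think_len = True, 0
    while len(ids) - len(prompt_ids) < max_new_tokens:
        tok = step_fn(ids)
        ids.append(tok)
        if tok == eos_id:
            break
        if thinking:
            think_len += 1
            if ids[-len(exit_ids):] == exit_ids:
                thinking = False          # model exited on its own
            elif think_len >= max_think_tokens:
                ids.extend(exit_ids)      # forced exit: end the thought
                thinking = False
    return ids
```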
Ablation ☑️: Remove every example solvable from text alone, and the gains persist. The model is truly reading the image, not gaming the benchmark.
🩺 We then add a dash of medical text → X-Reasoner-Med. No images needed, just additional MedQA SFT + RL, and we set a new 7B SOTA on MedQA, OmniMedVQA, MMMU-Health, MedXpertQA-MM, and the NEJM Image Challenge.
🔬 TL;DR:
General-domain text-based reasoning is more powerful than we thought.
With X-Reasoner, we show that high-quality reasoning models can be trained without costly multimodal or domain-specific supervision, and can still outperform models that use it.
📌 Paper: https://arxiv.org/abs/2505.03981
🔗 Models: https://github.com/microsoft/x-reasoner (coming soon)
📊 Benchmarks: MMMU, MathVista, MedQA, NEJM, and more
🤖 Model size: 7B
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models (2025)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 (2025)
- OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement (2025)
- Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models (2025)
- VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning (2025)
- NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation (2025)
- Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1) (2025)