On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training
Abstract
Supervised fine-tuning and reinforcement learning in large language model post-training cannot be decoupled without performance degradation: applying either method after the other degrades the objective the first one optimized.
Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. Modern reasoning models have widely adopted the practice of alternating SFT and RL training. However, there is no theoretical account of whether they can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases SFT loss under SFT optimality, and (2) RL-then-SFT coupling: SFT lowers the reward achieved by RL. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated without loss of prior performance in the post-training pipeline.
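For concreteness, the two objectives the abstract contrasts can be written in their standard forms. This is a sketch of the usual formulations, not the paper's exact notation: $\pi_\theta$ denotes the model policy, $(x, y^*)$ a prompt and expert response from the SFT dataset $\mathcal{D}$, and $r(x, y)$ the reward from a preference model or rule-based verifier.

```latex
% SFT objective: minimize token-level cross-entropy against expert responses
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y^*) \sim \mathcal{D}}
    \Big[ \sum_{t} \log \pi_\theta\big(y^*_t \mid x,\, y^*_{<t}\big) \Big]

% RL objective: maximize expected reward of responses sampled from the policy
\mathcal{J}_{\mathrm{RL}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
    \big[ r(x, y) \big]
```

The paper's two coupling results can then be read as statements about these quantities: running RL from an SFT-optimal point increases $\mathcal{L}_{\mathrm{SFT}}$, and running SFT after RL decreases $\mathcal{J}_{\mathrm{RL}}$.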
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SkillFactory: Self-Distillation For Learning Cognitive Behaviors (2025)
- Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models (2025)
- Learning Dynamics in RL Post-Training for Language Models (2026)
- Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning (2025)
- Puzzle Curriculum GRPO for Vision-Centric Reasoning (2025)
- Bootstrapped Mixed Rewards for RL Post-Training: Injecting Canonical Action Order (2025)
- Rethinking Expert Trajectory Utilization in LLM Post-training (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend