GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Abstract
Multi-reward reinforcement learning suffers from reward normalization collapse in GRPO, which GDPO addresses by decoupling reward normalization for improved training stability and performance across reasoning tasks.
As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in multi-reward settings without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct combinations of rollout rewards causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint-adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
Community
GDPO is a drop-in replacement for GRPO in verl and TRL; only minor code changes are needed.
We release a slurm-free, easy-to-run implementation supporting multiple RL frameworks (verl / TRL / NeMo-RL) so you can quickly validate GDPO on tool-calling and math reasoning tasks.
Each run can be completed in ~1 hour on 8×A100s, or ~2.5 hours on a single A100.
Switching from GRPO to GDPO is easy.
Try it yourself: https://github.com/NVlabs/GDPO
Really cool paper!
I've created a podcast that explains the key concepts:
https://researchpod-share.vercel.app/episode/c83f1820-279a-4cc0-afe1-b927a0c20ec8
I enjoyed listening to the AI paper podcast!
When you RL models for real-world use, you care about more than one thing: accuracy, conciseness, alignment, faithfulness, etc.
But most RL pipelines still compress all of that into one scalar advantage in the loss function, and a lot of preference signal gets washed out.
We're introducing GDPO, a simple fix that lets you express multi-dimensional preferences with a single advantage. Key idea: swap the order of reward normalization and aggregation (toy sketch below).
Works out of the box as a GRPO add-on; code is provided for veRL, TRL, and NeMo-RL.
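To make the order swap concrete, here is a toy sketch under simplified assumptions (the function names and the use of a plain weighted sum are illustrative, not the released verl/TRL/NeMo-RL integration): GRPO aggregates each rollout's rewards first and then normalizes the totals within the group, while the decoupled variant normalizes each reward across the group first and then aggregates.

```python
import numpy as np

def group_norm(x, eps=1e-6):
    """Z-normalize a vector of per-rollout values within one prompt group."""
    return (x - x.mean()) / (x.std() + eps)

# rewards: shape (num_rollouts, num_rewards), one row per rollout in the group
def grpo_advantages(rewards, weights):
    # aggregate first, then normalize the scalar totals
    return group_norm(rewards @ weights)

def gdpo_advantages(rewards, weights):
    # normalize each reward column first, then aggregate
    normed = np.stack([group_norm(rewards[:, j]) for j in range(rewards.shape[1])], axis=1)
    return normed @ weights
```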
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization (2025)
- Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization (2025)
- Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning (2025)
- Multi-Agent Collaborative Reward Design for Enhancing Reasoning in Reinforcement Learning (2025)
- Leash: Adaptive Length Penalty and Reward Shaping for Efficient Large Reasoning Model (2025)
- ADHint: Adaptive Hints with Difficulty Priors for Reinforcement Learning (2025)
- AMIR-GRPO: Inducing Implicit Preference Signals into GRPO (2026)
Core Problem Identified
The paper reveals that GRPO suffers from reward signal collapse in multi-reward settings. When a rollout's rewards are summed and then normalized together within a group, distinct reward combinations can collapse into identical advantage values, losing critical distinctions that should guide learning.
Figure 2: GRPO maps different reward combinations into only two distinct advantage groups, whereas GDPO normalizes each reward independently and retains three distinct groups of advantage values.
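A small numeric example of the collapse (the reward values and helper code below are illustrative, not taken from the paper): with a hard correctness reward and an easy format reward, the rollouts (correct, bad format) and (wrong, good format) have the same reward sum, so group-normalizing the sums gives them identical advantages, whereas normalizing each reward separately keeps them apart.

```python
import numpy as np

eps = 1e-6
# five rollouts for one prompt: (correctness, format) rewards
rewards = np.array([[1, 0],   # correct, bad format
                    [0, 1],   # wrong, good format
                    [1, 1],
                    [0, 1],
                    [0, 1]], dtype=float)

# GRPO-style: sum rewards per rollout, then z-normalize the sums
totals = rewards.sum(axis=1)
grpo_adv = (totals - totals.mean()) / (totals.std() + eps)
print(grpo_adv.round(2))   # rollouts 0 and 1 receive identical advantages

# Decoupled: z-normalize each reward column, then sum
normed = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)
gdpo_adv = normed.sum(axis=1)
print(gdpo_adv.round(2))   # rollouts 0 and 1 now receive different advantages
```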
Proposed Solution: GDPO
GDPO (Group reward-Decoupled Normalization Policy Optimization) solves this by:
- Decoupled normalization: Normalizing each reward separately within groups
- Batch-wise advantage normalization: Keeping advantages in a stable numerical range as the number of rewards grows (see the sketch below)
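A minimal sketch of these two steps, assuming a batch containing several prompt groups and a simple weighted sum as the aggregation rule (shapes, names, and details are assumptions; see the official repo for the actual implementation):

```python
import numpy as np

def gdpo_advantages(rewards, group_ids, weights, eps=1e-6):
    """
    rewards:   (num_rollouts, num_rewards) raw rewards for all rollouts in the batch
    group_ids: (num_rollouts,) prompt-group index of each rollout
    weights:   (num_rewards,) preference weight per reward
    """
    adv = np.zeros(len(rewards))
    # 1) Decoupled normalization: z-normalize each reward within its prompt group,
    #    then aggregate the normalized rewards per rollout
    for g in np.unique(group_ids):
        idx = group_ids == g
        r = rewards[idx]
        normed = (r - r.mean(axis=0)) / (r.std(axis=0) + eps)
        adv[idx] = normed @ weights
    # 2) Batch-wise advantage normalization: keep advantages in a stable
    #    numerical range even as the number of rewards grows
    return (adv - adv.mean()) / (adv.std() + eps)
```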
Figure 3: GDPO consistently preserves substantially more distinct advantage groups than GRPO as rollouts or rewards increase.
Experimental Results
1. Tool Calling Task
- Models: Qwen2.5-Instruct (1.5B and 3B)
- Rewards: Format correctness + Tool-calling accuracy
- Results: GDPO achieves ~5% improvement on Live/non-Live tasks and 2.7% overall accuracy improvement
Figure 4: GDPO consistently converges to higher correctness and format rewards, while GRPO w/o std matches correctness but fails on format reward.
2. Mathematical Reasoning Task
- Models: DeepSeek-R1-1.5B, DeepSeek-R1-7B, Qwen3-4B-Instruct
- Rewards: Accuracy + Length constraint
- Results: GDPO yields up to 6.3% higher accuracy on AIME while maintaining better length constraints
Figure 5: GRPO's correctness declines after ~400 steps while GDPO continues improving. GDPO also maintains better length constraint adherence.
3. Coding Reasoning Task
- Model: DeepSeek-R1-7B
- Rewards: Code pass rate + Length constraint + Bug ratio
- Results: GDPO achieves better balance across all three objectives, maintaining pass rates while reducing length violations and bugs
Key Insights on Reward Priority
The paper also reveals important findings about incorporating human preferences:
- Weight adjustment alone is insufficient: When objectives differ in difficulty, simple weight adjustments may not achieve intended prioritization
- Conditioned rewards are more effective: Conditioning easier rewards on harder ones (e.g., the length reward is only granted if the answer is correct) provides better control (sketched below)
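A minimal sketch of the conditioned-reward idea (the reward functions below are hypothetical, not the paper's exact formulation): the easier length reward is only granted when the harder correctness objective is already satisfied, so the model cannot trade accuracy for brevity.

```python
def correctness_reward(answer, reference):
    # hard objective: 1 if the final answer matches the reference, else 0
    return float(answer == reference)

def length_reward(response, max_tokens):
    # easy objective: 1 if the response respects the length budget, else 0
    return float(len(response.split()) <= max_tokens)

def conditioned_length_reward(response, answer, reference, max_tokens):
    # condition the easy reward on the hard one: length only counts
    # when the answer is correct
    if correctness_reward(answer, reference) == 1.0:
        return length_reward(response, max_tokens)
    return 0.0
```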
Figure 6: Conditioned length rewards lead to more predictable behavior when adjusting weights compared to unconditioned rewards.
Summary of Main Advantages
| Aspect | GRPO | GDPO |
|---|---|---|
| Training Stability | Prone to collapse after 400+ steps | Stable convergence throughout training |
| Signal Preservation | Collapses distinct advantage groups | Preserves fine-grained differences |
| Multi-reward Optimization | Suboptimal convergence | Consistently superior across 2-3 rewards |
| Priority Control | Limited effectiveness | More faithful to intended preferences |
GDPO establishes itself as a superior foundation for multi-reward reinforcement learning in language models, providing better training stability, more accurate optimization, and stronger alignment with diverse human preferences.