GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Abstract
Multi-reward reinforcement learning suffers from reward normalization collapse in GRPO, which GDPO addresses by decoupling reward normalization for improved training stability and performance across reasoning tasks.
As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in multi-reward settings without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct combinations of rollout rewards causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint-adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
Community
GDPO is a drop-in replacement for GRPO in verl and TRL; only minor code changes are needed.
We release a slurm-free, easy-to-run implementation supporting multiple RL frameworks (verl / TRL / NeMo-RL) so you can quickly validate GDPO on tool-calling and math reasoning tasks.
Each run can be completed in ~1 hour on 8×A100s, or ~2.5 hours on a single A100.
Switching from GRPO to GDPO is easy.
Try it yourself: https://github.com/NVlabs/GDPO
Really cool paper!
I've created a podcast that explains the key concepts:
https://researchpod-share.vercel.app/episode/c83f1820-279a-4cc0-afe1-b927a0c20ec8
I enjoyed listening to the AI paper podcast!
When you RL models for real-world use, you care about more than one thing: accuracy, conciseness, alignment, faithfulness, etc.
But most RL pipelines still compress all of that into one scalar advantage in the loss function, and a lot of preference signal gets washed out.
We're introducing GDPO, a simple fix that lets you express multi-dimensional preferences with a single advantage. Key idea: swap the order of reward normalization and aggregation (toy sketch below).
Works out of the box as a GRPO add-on; code is provided for veRL, TRL, and NeMo-RL.
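To make the order swap concrete, here is a toy sketch under simplified assumptions (the function names and the use of a plain weighted sum are illustrative, not the released verl/TRL/NeMo-RL integration): GRPO aggregates each rollout's rewards first and then normalizes the totals within the group, while the decoupled variant normalizes each reward across the group first and then aggregates.

```python
import numpy as np

def group_norm(x, eps=1e-6):
    """Z-normalize a vector of per-rollout values within one prompt group."""
    return (x - x.mean()) / (x.std() + eps)

# rewards: shape (num_rollouts, num_rewards), one row per rollout in the group
def grpo_advantages(rewards, weights):
    # aggregate first, then normalize the scalar totals
    return group_norm(rewards @ weights)

def gdpo_advantages(rewards, weights):
    # normalize each reward column first, then aggregate
    normed = np.stack([group_norm(rewards[:, j]) for j in range(rewards.shape[1])], axis=1)
    return normed @ weights
```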
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization (2025)
- Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization (2025)
- Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning (2025)
- Multi-Agent Collaborative Reward Design for Enhancing Reasoning in Reinforcement Learning (2025)
- Leash: Adaptive Length Penalty and Reward Shaping for Efficient Large Reasoning Model (2025)
- ADHint: Adaptive Hints with Difficulty Priors for Reinforcement Learning (2025)
- AMIR-GRPO: Inducing Implicit Preference Signals into GRPO (2026)
Core Problem Identified
The paper reveals that GRPO suffers from reward signal collapse in multi-reward settings. When a rollout's rewards are summed and then normalized together within a group, distinct reward combinations can collapse into identical advantage values, losing critical distinctions that should guide learning.
Figure 2: GRPO maps different reward combinations into only two distinct advantage groups, whereas GDPO normalizes each reward independently and retains three distinct groups of advantage values.
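A small numeric example of the collapse (the reward values and helper code below are illustrative, not taken from the paper): with a hard correctness reward and an easy format reward, the rollouts (correct, bad format) and (wrong, good format) have the same reward sum, so group-normalizing the sums gives them identical advantages, whereas normalizing each reward separately keeps them apart.

```python
import numpy as np

eps = 1e-6
# five rollouts for one prompt: (correctness, format) rewards
rewards = np.array([[1, 0],   # correct, bad format
                    [0, 1],   # wrong, good format
                    [1, 1],
                    [0, 1],
                    [0, 1]], dtype=float)

# GRPO-style: sum rewards per rollout, then z-normalize the sums
totals = rewards.sum(axis=1)
grpo_adv = (totals - totals.mean()) / (totals.std() + eps)
print(grpo_adv.round(2))   # rollouts 0 and 1 receive identical advantages

# Decoupled: z-normalize each reward column, then sum
normed = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + eps)
gdpo_adv = normed.sum(axis=1)
print(gdpo_adv.round(2))   # rollouts 0 and 1 now receive different advantages
```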
Proposed Solution: GDPO
GDPO (Group reward-Decoupled Normalization Policy Optimization) solves this by:
- Decoupled normalization: Normalizing each reward separately within groups
- Batch-wise advantage normalization: Keeping advantages in a stable numerical range as the number of rewards grows (see the sketch below)
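A minimal sketch of these two steps, assuming a batch containing several prompt groups and a simple weighted sum as the aggregation rule (shapes, names, and details are assumptions; see the official repo for the actual implementation):

```python
import numpy as np

def gdpo_advantages(rewards, group_ids, weights, eps=1e-6):
    """
    rewards:   (num_rollouts, num_rewards) raw rewards for all rollouts in the batch
    group_ids: (num_rollouts,) prompt-group index of each rollout
    weights:   (num_rewards,) preference weight per reward
    """
    adv = np.zeros(len(rewards))
    # 1) Decoupled normalization: z-normalize each reward within its prompt group,
    #    then aggregate the normalized rewards per rollout
    for g in np.unique(group_ids):
        idx = group_ids == g
        r = rewards[idx]
        normed = (r - r.mean(axis=0)) / (r.std(axis=0) + eps)
        adv[idx] = normed @ weights
    # 2) Batch-wise advantage normalization: keep advantages in a stable
    #    numerical range even as the number of rewards grows
    return (adv - adv.mean()) / (adv.std() + eps)
```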
Figure 3: GDPO consistently preserves substantially more distinct advantage groups than GRPO as rollouts or rewards increase.
Experimental Results
1. Tool Calling Task
- Models: Qwen2.5-Instruct (1.5B and 3B)
- Rewards: Format correctness + Tool-calling accuracy
- Results: GDPO achieves ~5% improvement on Live/non-Live tasks and 2.7% overall accuracy improvement
Figure 4: GDPO consistently converges to higher correctness and format rewards, while GRPO w/o std matches correctness but fails on format reward.
2. Mathematical Reasoning Task
- Models: DeepSeek-R1-1.5B, DeepSeek-R1-7B, Qwen3-4B-Instruct
- Rewards: Accuracy + Length constraint
- Results: GDPO yields up to 6.3% higher accuracy on AIME while maintaining better length constraints
Figure 5: GRPO's correctness declines after ~400 steps while GDPO continues improving. GDPO also maintains better length constraint adherence.
3. Coding Reasoning Task
- Model: DeepSeek-R1-7B
- Rewards: Code pass rate + Length constraint + Bug ratio
- Results: GDPO achieves better balance across all three objectives, maintaining pass rates while reducing length violations and bugs
Key Insights on Reward Priority
The paper also reveals important findings about incorporating human preferences:
- Weight adjustment alone is insufficient: When objectives differ in difficulty, simple weight adjustments may not achieve intended prioritization
- Conditioned rewards are more effective: Conditioning easier rewards on harder ones (e.g., the length reward is only granted if the answer is correct) provides better control (sketched below)
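A minimal sketch of the conditioned-reward idea (the reward functions below are hypothetical, not the paper's exact formulation): the easier length reward is only granted when the harder correctness objective is already satisfied, so the model cannot trade accuracy for brevity.

```python
def correctness_reward(answer, reference):
    # hard objective: 1 if the final answer matches the reference, else 0
    return float(answer == reference)

def length_reward(response, max_tokens):
    # easy objective: 1 if the response respects the length budget, else 0
    return float(len(response.split()) <= max_tokens)

def conditioned_length_reward(response, answer, reference, max_tokens):
    # condition the easy reward on the hard one: length only counts
    # when the answer is correct
    if correctness_reward(answer, reference) == 1.0:
        return length_reward(response, max_tokens)
    return 0.0
```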
Figure 6: Conditioned length rewards lead to more predictable behavior when adjusting weights compared to unconditioned rewards.
Summary of Main Advantages
| Aspect | GRPO | GDPO |
|---|---|---|
| Training Stability | Prone to collapse after 400+ steps | Stable convergence throughout training |
| Signal Preservation | Collapses distinct advantage groups | Preserves fine-grained differences |
| Multi-reward Optimization | Suboptimal convergence | Consistently superior across 2-3 rewards |
| Priority Control | Limited effectiveness | More faithful to intended preferences |
GDPO establishes itself as a superior foundation for multi-reward reinforcement learning in language models, providing better training stability, more accurate optimization, and stronger alignment with diverse human preferences.