chenwuml committed on
Commit 7846dca · 1 Parent(s): 2cf8efe

initial commit

Files changed (1)
  1. README.md +17 -4
README.md CHANGED
@@ -213,7 +213,7 @@ CodeFu employs a 2-stage curriculum learning approach:
 **Stage 2 (Linear without public scores)** - The system shifts to linear rewards based solely on the private test case pass ratio (without raising it to the power of 1.5). Because hard problems are substantially more difficult and high pass ratios are hard to achieve, the linear structure ensures that incremental progress is rewarded proportionally. This approach removes public test case feedback and encourages robust problem-solving strategies that generalize better to unseen scenarios.
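
To make the two reward shapes concrete, here is a minimal Python sketch, assuming the reward is computed from test case pass ratios in [0, 1]; how Stage 1 mixes public and private scores is a hypothetical choice for illustration, not something stated in this README.

```python
# Minimal sketch of the 2-stage curriculum reward described above.
# The Stage 1 public/private mix is an illustrative assumption.
from typing import Optional


def curriculum_reward(private_pass_ratio: float,
                      public_pass_ratio: Optional[float] = None,
                      stage: int = 2) -> float:
    if stage == 1:
        # Stage 1 (hypothetical mix): fold in public test feedback and
        # raise the ratio to the power 1.5.
        ratio = (private_pass_ratio if public_pass_ratio is None
                 else 0.5 * (private_pass_ratio + public_pass_ratio))
        return ratio ** 1.5
    # Stage 2: linear in the private pass ratio only, so partial progress
    # on hard problems is rewarded proportionally.
    return private_pass_ratio


# Example: passing 40% of private tests in Stage 2 earns reward 0.4.
assert curriculum_reward(0.4, stage=2) == 0.4
```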

 ### Data Selection ###
- Training data is sourced from CodeForces problems within the [DeepMind CodeContest](https://huggingface.co/datasets/deepmind/code_contests) dataset, chosen for their reliable CF rating system. Easy problems (~1000 samples, CF rating 800-1000) are used in Stage 1 for basic algorithmic reasoning, while relatively Hard problems (~4000 samples, CF rating 1100-2200) are used in Stages 2 for intermediate to advanced challenges. Both the *Easy* and *Hard* datasets were trained for approximately 2 epochs.
+ Training data is sourced from CodeForces problems within the [DeepMind CodeContest](https://huggingface.co/datasets/deepmind/code_contests) dataset, chosen for their reliable CF rating system. Easy problems (CF rating 800-1000) are used in Stage 1 for basic algorithmic reasoning, while relatively Hard problems (CF rating 1100-2200) are used in Stage 2 for intermediate to advanced challenges. Both the *Easy* and *Hard* datasets were trained for approximately 2 epochs.
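
For illustration, a short sketch of this Easy/Hard split with the Hugging Face `datasets` library; the `cf_rating` field follows the deepmind/code_contests schema, and everything beyond the rating ranges quoted above is an assumption rather than the exact CodeFu data pipeline.

```python
# Sketch of the Easy/Hard split by CF rating; not the exact CodeFu pipeline.
from datasets import load_dataset

train = load_dataset("deepmind/code_contests", split="train")

# Stage 1: Easy problems for basic algorithmic reasoning.
easy = train.filter(lambda ex: 800 <= ex["cf_rating"] <= 1000)

# Stage 2: relatively Hard problems for intermediate to advanced challenges.
hard = train.filter(lambda ex: 1100 <= ex["cf_rating"] <= 2200)

print(f"easy: {len(easy)}  hard: {len(hard)}")
```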


 ### Training Stability ###
@@ -229,13 +229,13 @@ Despite attempts to mitigate this through increased batch sizes (up to 416) as s

 *Figure 3 - Mean response length and reward plummet despite the much larger batch size*

- We eventually resolved this issue by using a true off-policy PPO learning configuration. This is achieved by reducing the mini-batch size to at least 4× smaller than the global batch size, resulting in multiple clipped "mini-" updates as per the [original PPO paper](https://arxiv.org/abs/1707.06347). This approach has since stabilized response length and prevented reward collapse (Figure 4), allowing us to pass multiple epochs on the dataset.
+ We eventually resolved this issue by using a true off-policy PPO learning configuration. This is achieved by making the mini-batch size at least 4× smaller than the global batch size, so that each rollout batch is consumed in multiple clipped "mini" updates, as in the original PPO algorithm [1]. This approach has since stabilized response length and prevented reward collapse (Figure 4), allowing us to train for multiple epochs on the dataset.
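
To make the mini-batch arrangement concrete, here is a minimal PyTorch sketch of one pass over a rollout batch split into four clipped mini-updates; the `policy.log_prob` interface, tensor layout, and hyperparameters are illustrative assumptions, not the CodeFu training code.

```python
# Sketch: split one rollout batch into smaller mini-batches and apply the
# clipped PPO objective to each. Later mini-batches are mildly off-policy
# w.r.t. the rollout policy; the clip bounds how far each step can move.
import torch


def ppo_minibatch_updates(policy, optimizer, rollout,
                          global_batch_size=256, num_minibatches=4,
                          clip_eps=0.2):
    mb_size = global_batch_size // num_minibatches  # e.g. 256 -> 64
    perm = torch.randperm(global_batch_size)
    for start in range(0, global_batch_size, mb_size):
        idx = perm[start:start + mb_size]
        old_logp = rollout["logp"][idx]        # log-probs under rollout policy
        adv = rollout["advantages"][idx]

        # Assumed policy API: current log-probs for the same actions.
        new_logp = policy.log_prob(rollout["obs"][idx], rollout["actions"][idx])

        ratio = torch.exp(new_logp - old_logp)
        unclipped = ratio * adv
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
        loss = -torch.min(unclipped, clipped).mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```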

 ![Mean Reward](resp_len_bsz_256_fix.png)

 *Figure 4 - Mean response length and reward do not collapse during true off-policy learning*

- We are preparing a paper to provide details on solving the training stability issue, along with an overview of related interesting papers (such as [DAPO](https://arxiv.org/abs/2503.14476), [OPO](https://arxiv.org/abs/2505.23585), [Dr.GRPO](https://arxiv.org/pdf/2503.20783), [GSPO](https://arxiv.org/abs/2507.18071), etc.) on policy stability, efficiency, and optimization for training reasoning models and agents.
+ A detailed paper is in preparation that will describe our training stability solutions and review related work on policy optimization for reasoning models, including recent methods such as DAPO [2], OPO [3], Dr.GRPO [4], and GSPO [5].


 ## Citation
@@ -250,4 +250,17 @@ CodeFu is developed by the **AWS WWSO Prototyping** Team. If you find CodeFu hel
 publisher={Hugging Face},
 url={https://huggingface.co/aws-prototyping/codefu-7b-v0.1},
 version={0.1}
- }
+ }
+ ```
+
+ ## References
+ [1] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
+
+ [2] Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., ... & Wang, M. (2025). DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv preprint arXiv:2503.14476.
+
+ [3] Hao, Y., Dong, L., Wu, X., Huang, S., Chi, Z., & Wei, F. (2025). On-Policy RL with Optimal Reward Baseline. arXiv preprint arXiv:2505.23585.
+
+ [4] Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., ... & Lin, M. (2025). Understanding R1-Zero-Like Training: A Critical Perspective. arXiv preprint arXiv:2503.20783.
+
+ [5] Zheng, C., Liu, S., Li, M., Chen, X. H., Yu, B., Gao, C., ... & Lin, J. (2025). Group Sequence Policy Optimization. arXiv preprint arXiv:2507.18071.
+