chenwuml committed on
Commit
1c7860f
·
1 Parent(s): 188283a

initial commit

Files changed (1)
  1. README.md +34 -3
README.md CHANGED
@@ -194,7 +194,7 @@ We extended the [TinyZero](https://github.com/Jiayi-Pan/TinyZero) code repositor
  CodeFu employs a 2-stage curriculum learning approach:


- | Stage | Data | Max resp token | Batch sz | Mini batch sz | # of Rollouts | Reward | Focus | # of nodes |
  |-------|------|-------|------------|--------------|---------|--------|-------|------------|
  | **1** | *easy* problems | 28K | 8 | 8 | 5 | Exp smooth w. public scores | Basic algorithmic reasoning | 1 |
  | **2** | *hard* problems | 20K | 256 | 32 | 8 | Linear w/o public scores | Quality and Robustness | 4 |
@@ -202,9 +202,40 @@ CodeFu employs a 2-stage curriculum learning approach:

  ### Reward ###

- **Stages 1 (Exponential smooth with public scores)** - The reward system evaluates solutions using both public and private test cases, with penalties for non-executable code (-1), compilation failures (0), or EXCEEDS_TIME_LIMIT (0) during execution. For successful runs, the reward equals the test case pass ratio raised to the power of 1.5. This exponential smoothing amplifies differences between partially and highly correct solutions, making it particularly effective for easier problems where it provides clearer learning signals and encourages focus on correctness and completeness.

- **Stages 2 (Linear without public scores)** - The system shifts to linear rewards based solely on private test case pass ratio (without raising to the power of 1.5). Since hard problems are much harder and high pass ratios are difficult to achieve, the linear structure ensures that incremental progress is proportionally rewarded. This approach removes public test case feedback and encourages robust problem-solving strategies that generalize better to unseen scenarios.

  ### Data Selection ###
  Training data is sourced from Codeforces problems within the [DeepMind CodeContests](https://huggingface.co/datasets/deepmind/code_contests) dataset, chosen for their reliable CF rating system. Easy problems (CF rating 800-1000) are used in Stage 1 for basic algorithmic reasoning, while relatively hard problems (CF rating 1100-2200) are used in Stage 2 for intermediate to advanced challenges. The model was trained for approximately 2 epochs on each of the *Easy* and *Hard* datasets.
 
  CodeFu employs a 2-stage curriculum learning approach:


+ | Stage | Data | Max resp token | Batch size | Mini batch size | # of Rollouts | Reward | Focus | # of nodes |
  |-------|------|-------|------------|--------------|---------|--------|-------|------------|
  | **1** | *easy* problems | 28K | 8 | 8 | 5 | Exp smooth w. public scores | Basic algorithmic reasoning | 1 |
  | **2** | *hard* problems | 20K | 256 | 32 | 8 | Linear w/o public scores | Quality and Robustness | 4 |
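
The two rows of the table can also be read as a small configuration object. The following is a minimal sketch, not the project's actual config: the class `StageConfig`, its field names, and the helper `rollouts_per_step` are hypothetical, "28K"/"20K" are read as 28,000/20,000 tokens, and "# of Rollouts" is assumed to mean samples generated per prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """One row of the curriculum table (field names are hypothetical)."""
    data: str
    max_response_tokens: int  # assumption: "28K" means 28,000 tokens
    batch_size: int
    mini_batch_size: int
    num_rollouts: int         # assumption: rollouts sampled per prompt
    reward: str
    num_nodes: int


STAGES = {
    1: StageConfig("easy", 28_000, 8, 8, 5, "exp_smooth_with_public", 1),
    2: StageConfig("hard", 20_000, 256, 32, 8, "linear_without_public", 4),
}


def rollouts_per_step(cfg: StageConfig) -> int:
    """Responses generated per training batch, if each prompt is sampled num_rollouts times."""
    return cfg.batch_size * cfg.num_rollouts


for stage, cfg in STAGES.items():
    print(f"stage {stage}: {rollouts_per_step(cfg)} rollouts per batch on {cfg.num_nodes} node(s)")
```

Under these assumptions, Stage 2 generates far more responses per batch than Stage 1, which is consistent with the move from 1 node to 4.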
 

  ### Reward ###

+ **Stage 1 (Exponential smooth with public scores)** - Uses exponential smoothing and both public/private test cases for clearer learning signals on easier problems.

+ **Stage 2 (Linear without public scores)** - Shifts to linear rewards using only private test cases to encourage robust problem-solving on harder problems.
+
+ Here is the pseudocode for the reward calculation across both training stages:
+ ```python
+ def compute_reward(code_output, public_tests, private_tests, stage):
+     # Handle execution failures (same for both stages)
+     if not is_executable(code_output):
+         return -1
+
+     if compilation_failed(code_output) or exceeds_time_limit(code_output):
+         return 0
+
+     # Stage-specific reward calculation for successful execution
+     if stage == 1:
+         # Exponential smoothing with public + private tests
+         passed_public = count_passed(code_output, public_tests)
+         passed_private = count_passed(code_output, private_tests)
+         total_tests = len(public_tests) + len(private_tests)
+         passed_tests = passed_public + passed_private
+
+         pass_ratio = passed_tests / total_tests
+         reward = pass_ratio ** 1.5
+
+     elif stage == 2:
+         # Linear reward with private tests only
+         passed_private = count_passed(code_output, private_tests)
+         total_private = len(private_tests)
+
+         reward = passed_private / total_private
+
+     return reward
+ ```
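
To make the difference between the two reward shapes concrete, here is a small runnable sketch that covers only the successful-execution branch of the pseudocode above. The function name `shaped_reward` and its integer arguments are hypothetical. The error handling for an empty test set or an unknown stage is an addition the pseudocode does not specify.

```python
def shaped_reward(passed: int, total: int, stage: int) -> float:
    """Reward for a solution that compiled and ran within the time limit."""
    if total <= 0:
        # Assumption: an empty test set is treated as an error rather than a reward of 0.
        raise ValueError("total must be positive")
    ratio = passed / total
    if stage == 1:
        return ratio ** 1.5  # Stage 1: pass ratio raised to the power of 1.5
    if stage == 2:
        return ratio         # Stage 2: linear in the pass ratio
    raise ValueError(f"unknown stage: {stage}")


# Compare both shapes for the same pass ratios.
for passed in (1, 2, 3, 4):
    r1 = shaped_reward(passed, 4, stage=1)
    r2 = shaped_reward(passed, 4, stage=2)
    print(f"{passed}/4 passed: stage 1 = {r1:.3f}, stage 2 = {r2:.3f}")
```

For any pass ratio strictly between 0 and 1, the Stage 1 reward is lower than the linear one (for example, a ratio of 0.25 yields 0.125), so partial solutions earn proportionally less until they approach full correctness.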

  ### Data Selection ###
  Training data is sourced from Codeforces problems within the [DeepMind CodeContests](https://huggingface.co/datasets/deepmind/code_contests) dataset, chosen for their reliable CF rating system. Easy problems (CF rating 800-1000) are used in Stage 1 for basic algorithmic reasoning, while relatively hard problems (CF rating 1100-2200) are used in Stage 2 for intermediate to advanced challenges. The model was trained for approximately 2 epochs on each of the *Easy* and *Hard* datasets.
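
The rating-based split described above could be sketched as follows. This is an illustration under stated assumptions, not the project's preprocessing code: the function `assign_stage` and the toy records are hypothetical, each problem is assumed to carry an integer `cf_rating` field, and both rating ranges are treated as inclusive.

```python
from typing import Optional

EASY_RANGE = (800, 1000)   # Stage 1 ratings, per the text above
HARD_RANGE = (1100, 2200)  # Stage 2 ratings, per the text above


def assign_stage(cf_rating: int) -> Optional[int]:
    """Return the curriculum stage for a rating, or None if it falls outside both ranges."""
    if EASY_RANGE[0] <= cf_rating <= EASY_RANGE[1]:
        return 1
    if HARD_RANGE[0] <= cf_rating <= HARD_RANGE[1]:
        return 2
    return None


# Toy records standing in for dataset rows; names and ratings are made up.
problems = [
    {"name": "problem_a", "cf_rating": 800},
    {"name": "problem_b", "cf_rating": 1500},
    {"name": "problem_c", "cf_rating": 2600},
]

by_stage = {1: [], 2: []}
for problem in problems:
    stage = assign_stage(problem["cf_rating"])
    if stage is not None:
        by_stage[stage].append(problem["name"])

print(by_stage)
```

Problems whose rating falls outside both ranges are skipped in this sketch; the text does not say how such problems were handled.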