initial commit
README.md
CHANGED
@@ -194,7 +194,7 @@ We extended the [TinyZero](https://github.com/Jiayi-Pan/TinyZero) code repositor
|
|
194 |
CodeFu employs a 2-stage curriculum learning approach:
|
195 |
|
196 |
|
197 |
-
| Stage | Data | Max resp token | Batch
|
198 |
|-------|------|-------|------------|--------------|---------|--------|-------|------------|
|
199 |
| **1** | *easy* problems | 28K | 8 | 8 | 5 | Exp smooth w. public scores | Basic algorithmic reasoning | 1 |
|
200 |
| **2** | *hard* problems | 20K | 256 | 32 | 8 | Linear w/o public scores | Quality and Robustness | 4 |
|
@@ -202,9 +202,40 @@ CodeFu employs a 2-stage curriculum learning approach:
|
|
202 |
|
203 |
### Reward ###
|
204 |
|
205 |
-
**
|
206 |
|
207 |
-
**
|
208 |
|
209 |
### Data Selection ###
|
210 |
Training data is sourced from CodeForces problems within the [DeepMind CodeContests](https://huggingface.co/datasets/deepmind/code_contests) dataset, chosen for their reliable CF rating system. Easy problems (CF rating 800-1000) are used in Stage 1 for basic algorithmic reasoning, while relatively hard problems (CF rating 1100-2200) are used in Stage 2 for intermediate to advanced challenges. The model was trained for approximately 2 epochs on each of the *Easy* and *Hard* datasets.
|
|
|
194 |
CodeFu employs a 2-stage curriculum learning approach:
|
195 |
|
196 |
|
197 |
+
| Stage | Data | Max response tokens | Batch size | Mini batch size | # of rollouts | Reward | Focus | # of nodes |
|-------|------|---------------------|------------|-----------------|---------------|--------|-------|------------|
| **1** | *easy* problems | 28K | 8 | 8 | 5 | Exp smooth w. public scores | Basic algorithmic reasoning | 1 |
| **2** | *hard* problems | 20K | 256 | 32 | 8 | Linear w/o public scores | Quality and Robustness | 4 |
|
|
202 |
|
203 |
### Reward ###
|
204 |
|
205 |
+
**Stage 1 (Exponential smoothing with public scores)** - Computes the pass ratio over both public and private test cases and smooths it with an exponent (`pass_ratio ** 1.5` in the pseudocode), giving clearer learning signals on easier problems.
|
206 |
|
207 |
+
**Stage 2 (Linear without public scores)** - Shifts to linear rewards using only private test cases to encourage robust problem-solving on harder problems.
|
208 |
+
|
209 |
+
Here is the pseudocode for the reward calculation across both training stages:
|
210 |
+
```python
def compute_reward(code_output, public_tests, private_tests, stage):
    # Handle execution failures (same for both stages)
    if not is_executable(code_output):
        return -1

    if compilation_failed(code_output) or exceeds_time_limit(code_output):
        return 0

    # Stage-specific reward calculation for successful execution
    if stage == 1:
        # Exponential smoothing with public + private tests
        passed_public = count_passed(code_output, public_tests)
        passed_private = count_passed(code_output, private_tests)
        total_tests = len(public_tests) + len(private_tests)
        passed_tests = passed_public + passed_private

        pass_ratio = passed_tests / total_tests
        reward = pass_ratio ** 1.5

    elif stage == 2:
        # Linear reward with private tests only
        passed_private = count_passed(code_output, private_tests)
        total_private = len(private_tests)

        reward = passed_private / total_private

    return reward
```
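To make the difference between the two reward shapes concrete, the following is a minimal runnable sketch of the successful-execution branch only. The function names `shaped_reward` and `stage_reward` are hypothetical and not taken from the CodeFu code base, and the sketch assumes test outcomes are already available as pass counts rather than being computed by running code.

```python
def shaped_reward(passed: int, total: int, exponent: float = 1.0) -> float:
    """Map a pass count to a reward in [0, 1] as (passed / total) ** exponent."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return (passed / total) ** exponent


def stage_reward(stage: int, passed_public: int, n_public: int,
                 passed_private: int, n_private: int) -> float:
    """Reward for a solution that compiled and ran within the time limit."""
    if stage == 1:
        # Stage 1: public and private tests combined, exponent 1.5.
        return shaped_reward(passed_public + passed_private,
                             n_public + n_private, exponent=1.5)
    if stage == 2:
        # Stage 2: private tests only, linear in the pass ratio.
        return shaped_reward(passed_private, n_private)
    raise ValueError(f"unknown stage: {stage}")


if __name__ == "__main__":
    # Half of the tests pass in both cases.
    print(stage_reward(1, 1, 2, 1, 2))  # exponent 1.5 applied to 0.5
    print(stage_reward(2, 0, 0, 1, 2))  # linear: 0.5
```

With half of the tests passing, the Stage 1 shape yields roughly 0.354 while the linear Stage 2 shape yields 0.5, so the exponent discounts partial solutions more strongly. Both shapes give 1.0 for a fully correct solution.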
|
239 |
|
240 |
### Data Selection ###
|
241 |
Training data is sourced from CodeForces problems within the [DeepMind CodeContests](https://huggingface.co/datasets/deepmind/code_contests) dataset, chosen for their reliable CF rating system. Easy problems (CF rating 800-1000) are used in Stage 1 for basic algorithmic reasoning, while relatively hard problems (CF rating 1100-2200) are used in Stage 2 for intermediate to advanced challenges. The model was trained for approximately 2 epochs on each of the *Easy* and *Hard* datasets.
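The rating-based split described above could be sketched as follows. This is an illustration, not the project's actual preprocessing: it assumes each problem is a plain dict carrying its CodeForces rating under a `cf_rating` key, treats both rating ranges as inclusive, and uses the hypothetical helper name `split_by_rating`.

```python
EASY_RANGE = (800, 1000)   # Stage 1, assumed inclusive on both ends
HARD_RANGE = (1100, 2200)  # Stage 2, assumed inclusive on both ends


def split_by_rating(problems):
    """Partition problem records into (easy, hard) lists by CF rating.

    Records with a missing rating or a rating outside both ranges are dropped.
    """
    easy, hard = [], []
    for problem in problems:
        rating = problem.get("cf_rating")
        if rating is None:
            continue
        if EASY_RANGE[0] <= rating <= EASY_RANGE[1]:
            easy.append(problem)
        elif HARD_RANGE[0] <= rating <= HARD_RANGE[1]:
            hard.append(problem)
    return easy, hard


if __name__ == "__main__":
    # Mock records for illustration only; not real dataset entries.
    sample = [
        {"name": "a", "cf_rating": 800},
        {"name": "b", "cf_rating": 1500},
        {"name": "c", "cf_rating": 2600},
        {"name": "d"},
    ]
    easy, hard = split_by_rating(sample)
    print([p["name"] for p in easy], [p["name"] for p in hard])
```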
|