aws-prototyping
/

codefu-7b-v0.1

chenwuml committed on Aug 5

Commit

1c7860f

1 Parent(s): 188283a

initial commit

Files changed (1)

README.md +34 -3

@@ -194,7 +194,7 @@ We extended the [TinyZero](https://github.com/Jiayi-Pan/TinyZero) code repositor
 CodeFu employs a 2-stage curriculum learning approach:
-| Stage | Data | Max resp token | Batch sz | Mini batch sz | # of Rollouts | Reward | Focus | # of nodes |
+| Stage | Data | Max resp token | Batch size | Mini batch size | # of Rollouts | Reward | Focus | # of nodes |
 |-------|------|-------|------------|--------------|---------|--------|-------|------------|
 | **1** | *easy* problems | 28K | 8 | 8 | 5 | Exp smooth w. public scores | Basic algorithmic reasoning | 1 |
 | **2** | *hard* problems | 20K | 256 | 32 | 8 | Linear w/o public scores | Quality and Robustness | 4 |
@@ -202,9 +202,40 @@ CodeFu employs a 2-stage curriculum learning approach:
 ### Reward ###
-**Stages 1 (Exponential smooth with public scores)** - The reward system evaluates solutions using both public and private test cases, with penalties for non-executable code (-1), compilation failures (0), or  EXCEEDS_TIME_LIMIT (0) during execution. For successful runs, the reward equals the test case pass ratio raised to the power of 1.5. This exponential smoothing amplifies differences between partially and highly correct solutions, making it particularly effective for easier problems where it provides clearer learning signals and encourages focus on correctness and completeness.
-**Stages 2 (Linear without public scores)** - The system shifts to linear rewards based solely on private test case pass ratio (without raising to the power of 1.5). Since hard problems are much harder and high pass ratios are difficult to achieve, the linear structure ensures that incremental progress is proportionally rewarded. This approach removes public test case feedback and encourages robust problem-solving strategies that generalize better to unseen scenarios.
+**Stage 1 (Exponential smooth with public scores)** - Uses exponential smoothing and both public/private test cases for clearer learning signals on easier problems.
+**Stage 2 (Linear without public scores)** - Shifts to linear rewards using only private test cases to encourage robust problem-solving on harder problems.
+Here is the pseudocode for the reward calculation across both training stages:
+```python
+def compute_reward(code_output, public_tests, private_tests, stage):
+    # Handle execution failures (same for both stages)
+    if not is_executable(code_output):
+        return -1
+    if compilation_failed(code_output) or exceeds_time_limit(code_output):
+        return 0
+    # Stage-specific reward calculation for successful execution
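+    if stage == 1:
+        # Exponential smoothing with public + private tests
+        passed_public = count_passed(code_output, public_tests)
+        passed_private = count_passed(code_output, private_tests)
+        total_tests = len(public_tests) + len(private_tests)
+        passed_tests = passed_public + passed_private
+        # Assumption: an empty test suite scores 0 (avoids ZeroDivisionError)
+        pass_ratio = passed_tests / total_tests if total_tests else 0.0
+        reward = pass_ratio ** 1.5
+    elif stage == 2:
+        # Linear reward with private tests only
+        passed_private = count_passed(code_output, private_tests)
+        total_private = len(private_tests)
+        # Assumption: an empty private suite scores 0 (avoids ZeroDivisionError)
+        reward = passed_private / total_private if total_private else 0.0
+    else:
+        # Any other stage value would otherwise leave `reward` unbound
+        raise ValueError(f"unknown stage: {stage}")
+    return reward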
+```
 ### Data Selection ###
 Training data is sourced from CodeForces problems within the [DeepMind CodeContest](https://huggingface.co/datasets/deepmind/code_contests) dataset, chosen for their reliable CF rating system. Easy problems (CF rating 800-1000) are used in Stage 1 for basic algorithmic reasoning, while relatively hard problems (CF rating 1100-2200) are used in Stage 2 for intermediate to advanced challenges. Both the *Easy* and *Hard* datasets were trained for approximately 2 epochs.
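
As a reading aid for the diff above, here is a minimal, self-contained sketch of how the curriculum could route a problem to a stage by CF rating and then shape the reward from a pass ratio. It is not the repository's training code. The names `stage_for_rating`, `shaped_reward`, `EASY_RANGE`, and `HARD_RANGE` are hypothetical, the rating bounds are assumed to be inclusive, and the failure penalties (-1 and 0) are assumed to be handled before this point.

```python
from typing import Optional

# CF rating ranges quoted in the Data Selection paragraph.
# Assumption: both bounds are inclusive.
EASY_RANGE = (800, 1000)
HARD_RANGE = (1100, 2200)


def stage_for_rating(cf_rating: int) -> Optional[int]:
    """Return the curriculum stage (1 or 2) for a rating, or None if unused."""
    if EASY_RANGE[0] <= cf_rating <= EASY_RANGE[1]:
        return 1
    if HARD_RANGE[0] <= cf_rating <= HARD_RANGE[1]:
        return 2
    return None  # e.g. ratings between the two ranges, or above 2200


def shaped_reward(passed: int, total: int, stage: int) -> float:
    """Reward for a solution that already compiled and ran within limits.

    `passed` and `total` count public + private tests in stage 1 and
    private tests only in stage 2; the caller picks the right counts.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must lie in [0, total]")
    ratio = passed / total
    if stage == 1:
        return ratio ** 1.5  # "exponential smoothing" from the README
    if stage == 2:
        return ratio  # linear reward
    raise ValueError(f"unknown stage: {stage}")


if __name__ == "__main__":
    # Compare the two shapes on the same pass counts out of 10 tests.
    for passed in (0, 5, 9, 10):
        s1 = shaped_reward(passed, 10, stage=1)
        s2 = shaped_reward(passed, 10, stage=2)
        print(f"passed={passed:2d}  stage1={s1:.3f}  stage2={s2:.3f}")
```

With half the tests passing, the stage 1 shape yields about 0.354 while the stage 2 shape yields 0.5, which illustrates why the linear form gives proportionally more credit for partial progress on hard problems.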