Space: Dishaaa25/meta-rl-dsa-solver

kaustubhg73 committed on Apr 25

Commit 5b695bd

1 Parent(s): 267d60a

v4

Files changed (12)

README.md +165 -111
client.py +13 -3
env/adapt_env.py +301 -67
env/generator.py +990 -131
env/test_cases.py +1 -3
models.py +11 -3
openenv.yaml +3 -31
scripts/test_env.py +42 -15
server/app.py +71 -15
training/plot_results.py +139 -0
training/train_grpo.py +436 -60
verifier/metrics.py +55 -16

README.md CHANGED

@@ -8,196 +8,250 @@ tags:
   - openenv
   - reinforcement-learning
   - code-generation
 ---
-# ADAPT DSA Tutor OpenEnv
-ADAPT, the Adversarial DSA Tutor, is an OpenEnv-compliant RLVR environment for training code-generation agents on small DSA tasks. The agent receives a problem prompt, examples, and visible tests, then submits Python code. The environment runs the code against visible and hidden tests and returns reward, pass-rate metrics, execution status, and feedback.
-This repo includes the environment, verifier helpers, a baseline inference runner, and a GRPO training entrypoint so the full submission flow can be exercised from one codebase.
-## Why This Environment
-The hackathon asks for OpenEnv environments that can improve LLM behavior through verifiable interaction. ADAPT targets a simple but useful skill loop:
 ```text
-agent writes code -> environment executes it -> hidden tests and reward signals score it -> trainer improves the agent
 ```
-The differentiator is curriculum-ready DSA practice: each episode carries a problem id and difficulty tier so training can track per-tier success instead of only aggregate reward.
-## OpenEnv Interface
-The environment uses the latest OpenEnv API shape:
-- `AdaptEnvironment(Environment[AdaptAction, AdaptObservation, AdaptState])`
-- `reset()` returns a typed observation.
-- `step(action)` accepts an `AdaptAction` with a Python `code` string.
-- `state` exposes episode id, step count, current problem id, difficulty, and recent metrics.
-`openenv.yaml` points to:
-```yaml
-app: server.app:app
-port: 7860
 ```
-## Action
 ```python
-{
-    "code": "n = int(input())\nprint(n * 2)"
-}
 ```
-## Observation
-Reset and step observations include:
-- problem statement
-- input format
-- constraints
-- examples
-- visible tests
-- problem id
-- difficulty tier
-- feedback
-- pass rate, visible pass rate, and hidden pass rate
-- syntax/runtime/timeout status
-- reward components
-Hidden test inputs and expected outputs are never returned in observations.
-## Reward
-Reward is clipped to `[0.0, 1.0]` and combines multiple environment-level signals:
-- correctness from visible and hidden pass rate
-- syntax validity
-- clean execution
-- output format compliance
-- timeout penalty
-- runtime error penalty
-- static safety rejection for dangerous imports such as `os`, `subprocess`, `socket`, `pathlib`, and `shutil`
-If `verifier.verifier.verify(code, test_cases)` exists, the environment can use it as an optional reward augmentation. If the verifier is absent, the environment still works using executor-derived reward.
-## Local Setup
-Use Python `3.10+`.
-```powershell
-cd C:\Users\kaust\PycharmProjects\meta-rl-dsa-solver
-python -m venv .venv
-.\.venv\Scripts\pip install -e .
-```
-For this local machine, the existing checked-out OpenEnv repo can also be used during development:
-```powershell
-$env:PYTHONPATH="C:\Users\kaust\PycharmProjects\OpenEnv\src;$PWD"
-```
-## Smoke Tests
-Run the local smoke test:
-```powershell
-python test.py
 ```
-Check syntax:
-```powershell
-python -m py_compile models.py env\adapt_env.py env\executor.py env\test_cases.py server\app.py
 ```
-Start the OpenEnv server:
-```powershell
-uvicorn server.app:app --host 0.0.0.0 --port 7860
-```
-Useful endpoints:
-- `GET /`
-- `GET /health`
-- `GET /metadata`
-- `GET /tasks`
-- `GET /schema`
-- `POST /reset`
-- `POST /step`
-- `GET /state`
-- `POST /mcp`
-Example step request:
 ```powershell
-curl -X POST http://localhost:7860/step -H "Content-Type: application/json" -d "{\"action\":{\"code\":\"n=int(input())\nprint(n*2)\"}}"
 ```
-You can also send the raw action body:
 ```powershell
-curl -X POST http://localhost:7860/step -H "Content-Type: application/json" -d "{\"code\":\"n=int(input())\nprint(n*2)\"}"
 ```
-Validate with OpenEnv once dependencies are installed:
 ```powershell
-openenv validate .
 ```
-Run the verifier smoke test:
 ```powershell
-python scripts\test_verifier.py
 ```
-Run the environment smoke test:
 ```powershell
-python scripts\test_env.py
 ```
-Run the baseline model loop:
 ```powershell
-$env:HF_TOKEN="..."
-$env:API_BASE_URL="https://router.huggingface.co/v1"
-$env:MODEL_NAME="openai/gpt-oss-120b"
-python inference.py
 ```
-Run GRPO training:
 ```powershell
-python training\train_grpo.py --output-dir outputs_v2 --bf16
 ```
-## Hugging Face Spaces
-This repo is Docker Space ready:
 ```powershell
 openenv push --repo-id <your-hf-username>/adapt-dsa-tutor
 ```
-Before final submission, add:
-- live Hugging Face Space link
-- training reward/loss plots from Disha's run
-- before/after code example showing a problem the model failed before training and solved after training
-- mini-blog or short video link
-## Current Problem Bank
-The environment includes a lightweight curated bank:
-- `easy_double`
-- `easy_sum_two`
-- `medium_maximum`
-- `medium_count_even`
-- `hard_reverse_words`
-This is intentionally small for submission-minimum stability. Later work can expand it to 30-50 tiered problems without changing the OpenEnv API.

   - openenv
   - reinforcement-learning
   - code-generation
+  - llm-training
 ---
+# ADAPT: Adversarial DSA Programming Tutor
+LLMs are getting better at one-shot code generation, but they still struggle with the thing real engineers do all day: read feedback, debug, and repair. ADAPT closes that gap by turning algorithm practice into a self-repair RL environment where the model must improve over multiple attempts instead of guessing once.
+## Why ADAPT exists
+Most code-generation benchmarks test whether a model can land the answer immediately. They do not test whether the model can recover from partial failure, use examples productively, or adapt as the task distribution changes.
+ADAPT is built to stress exactly those capabilities:
+- adaptive difficulty across easy, medium, and hard DSA families
+- visible examples plus hidden evaluation tests
+- multi-step repair with feedback between attempts
+- reward-aware problem generation that shifts toward the most educational families
+## Architecture
 ```text
++------------+     +-----------+     +----------+     +-----------+
+| Generator  | --> | Problem   | --> | Solver   | --> | Execution |
++------------+     +-----------+     +----------+     +-----------+
+      ^                                                        |
+      |                                                        v
+      +------------- Curriculum <- Reward <- Verification -----+
 ```
+## What the agent sees, does, and gets rewarded for
+The agent sees a plain-English programming problem, the stdin format, constraints, and two worked examples. It writes Python code that reads from stdin and prints to stdout.
+The environment executes that code on 10 tests per problem:
+- 2 visible tests shown as examples
+- 8 hidden tests used for the real pass-rate reward
+After each attempt, the environment returns:
+- hidden pass rate
+- visible pass rate
+- execution status such as `completed`, `wrong_answer`, `runtime_error`, or `timeout`
+- a compact list of which tests failed
+- enough context to try again on the same problem
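The visible/hidden split described above can be sketched as follows. This is a minimal illustration, not the environment's executor: the `CaseResult` record and its `visible` and `passed` fields are hypothetical names chosen for the example, and the real result dictionaries may be shaped differently.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    # Hypothetical record; the real per-test results may use different keys.
    visible: bool
    passed: bool


def pass_rates(results: list[CaseResult]) -> tuple[float, float]:
    """Return (visible_pass_rate, hidden_pass_rate) for one attempt."""

    def rate(group: list[CaseResult]) -> float:
        # An empty group counts as 0.0 rather than raising ZeroDivisionError.
        return sum(r.passed for r in group) / len(group) if group else 0.0

    visible = [r for r in results if r.visible]
    hidden = [r for r in results if not r.visible]
    return rate(visible), rate(hidden)


# 2 visible tests (1 passing) and 8 hidden tests (2 passing).
results = [CaseResult(True, True), CaseResult(True, False)]
results += [CaseResult(False, i < 2) for i in range(8)]
visible_rate, hidden_rate = pass_rates(results)
```

Keeping the two rates separate is what lets the reward depend only on hidden tests while the feedback can still quote visible failures in full.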
+## Multi-step repair loop
+Each episode allows up to 3 attempts on the same problem.
+1. Attempt 1: the agent submits a first solution.
+2. Feedback: ADAPT reports the current execution status, hidden pass rate, visible pass rate, and which visible/hidden tests failed.
+3. Attempt 2 or 3: the agent repairs its code using that feedback.
+4. The episode ends early if all hidden tests pass.
+Concrete example:
+```text
+Problem family: running_total
+Attempt 1 code:
+print(sum(nums))
+Feedback:
+Attempt 1/3
+Previous attempt status: ready
+Current execution status: wrong_answer
+Hidden pass rate: 0.25
+Visible pass rate: 0.50
+Failed tests:
+- Visible test #2: wrong_answer (expected=5 3 10, got=10)
+- Hidden test #1: wrong_answer
+- Hidden test #4: wrong_answer
+Attempt 2 code:
+running = 0
+out = []
+for x in nums:
+    running += x
+    out.append(str(running))
+print(" ".join(out))
 ```
+That repair loop is the core novelty of ADAPT: the model is rewarded for debugging, not just for lucky first drafts.
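The control flow of one repair episode can be sketched as a small driver loop. This is a hedged illustration under stated assumptions: `solve` and `evaluate` are hypothetical callables standing in for the agent and the environment, and only the 3-attempt limit and the early stop on a perfect hidden pass rate come from the README.

```python
from typing import Callable

MAX_ATTEMPTS = 3  # the 3-attempt limit described in the README


def run_episode(
    solve: Callable[[str], str],
    evaluate: Callable[[str], tuple[float, str]],
    max_attempts: int = MAX_ATTEMPTS,
) -> list[float]:
    """Run one repair episode and return the hidden pass rate per attempt.

    `solve` maps the latest feedback to new code; `evaluate` maps code to
    (hidden_pass_rate, feedback). Both are hypothetical stand-ins.
    """
    feedback = "ready"
    rates: list[float] = []
    for _ in range(max_attempts):
        code = solve(feedback)
        hidden_pass_rate, feedback = evaluate(code)
        rates.append(hidden_pass_rate)
        if hidden_pass_rate == 1.0:  # all hidden tests pass: stop early
            break
    return rates


# Toy stand-ins: the "agent" repairs its code once it has seen a failure.
def toy_solve(feedback: str) -> str:
    return "fixed" if "wrong_answer" in feedback else "buggy"


def toy_evaluate(code: str) -> tuple[float, str]:
    return (1.0, "completed") if code == "fixed" else (0.25, "wrong_answer")


rates = run_episode(toy_solve, toy_evaluate)
```

In the toy run the first attempt fails with a 0.25 hidden pass rate and the second attempt solves the problem, so the loop exits before the third attempt.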
+## Reward function
+ADAPT uses a clean reward signal driven by hidden correctness:
 ```python
+reward = hidden_pass_rate * step_discount
 ```
+Where:
+- `step_discount = 1.00` on attempt 1
+- `step_discount = 0.85` on attempt 2
+- `step_discount = 0.70` on attempt 3
+Additional shaping for the repair loop:
+- if a failed non-terminal attempt improves hidden pass rate, reward = `0.1 * delta_pass_rate`
+- if the final attempt still fails, reward = `0.0`
+- timeouts and syntax errors always get `0.0`
+Examples:
+- attempt 1 solves all 8 hidden tests: reward = `1.0`
+- attempt 2 solves all 8 hidden tests: reward = `0.85`
+- attempt 2 still fails but improves hidden pass rate from `0.25` to `0.50`: reward = `0.1 * 0.25 = 0.025`
+- attempt 3 still fails: reward = `0.0`
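One way to read the rules above as code is sketched below. It is an interpretation, not the contents of `verifier/metrics.py`: the function name `shaped_reward`, the treatment of a non-improving failed attempt as `0.0`, and the assumption that the discounted formula applies only when every hidden test passes are all assumptions made for this example.

```python
STEP_DISCOUNTS = {1: 1.00, 2: 0.85, 3: 0.70}
MAX_ATTEMPTS = 3
ZERO_REWARD_STATUSES = {"timeout", "syntax_error"}


def shaped_reward(
    hidden_pass_rate: float,
    previous_pass_rate: float,
    attempt: int,
    execution_status: str = "completed",
) -> float:
    """Hypothetical reading of the README reward rules for one attempt."""
    if execution_status in ZERO_REWARD_STATUSES:
        return 0.0  # timeouts and syntax errors always score zero
    if hidden_pass_rate >= 1.0:
        # Solved: full correctness scaled by the per-attempt discount.
        return hidden_pass_rate * STEP_DISCOUNTS[attempt]
    if attempt >= MAX_ATTEMPTS:
        return 0.0  # final attempt still failing
    # Failed, non-terminal attempt: small bonus for progress only.
    delta = hidden_pass_rate - previous_pass_rate
    return 0.1 * delta if delta > 0 else 0.0
```

Under this reading, `shaped_reward(1.0, 0.25, 2)` gives the discounted `0.85`, while a failing second attempt that moves from `0.25` to `0.50` earns only the small progress bonus.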
+## Problem families
+ADAPT now covers 20 algorithmic families instead of a tiny fixed bank:
+- Easy: `sum_even_numbers`, `range_span`, `count_vowels`, `max_consecutive_ones`, `fizzbuzz_variant`, `running_total`
+- Medium: `count_local_peaks`, `longest_non_decreasing_run`, `two_sum_count`, `max_subarray_sum`, `group_anagrams_count`, `balanced_brackets`, `matrix_diagonal_sum`
+- Hard: `smallest_most_frequent`, `reverse_words`, `longest_common_subsequence`, `word_ladder_steps`, `merge_intervals`, `min_coins`, `rotate_matrix_90`
+Every family has:
+- its own randomized case generator
+- 2 visible example tests
+- 8 hidden evaluation tests
+- a reference solver that auto-generates expected outputs
+## Self-improving curriculum
+ADAPT uses one curriculum authority in training: the `CurriculumManager` inside `training/train_grpo.py`.
+- promote threshold: `0.70`
+- demote threshold: `0.30`
+- moving-average window: `10` episodes
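The thresholds above suggest a moving-average tier controller along the following lines. This is a minimal sketch, not the `CurriculumManager` in `training/train_grpo.py`: the class name `TierCurriculum`, the inclusive comparisons, waiting for a full window, and clearing the window after a tier change are assumptions. The tier bounds 1 to 3 mirror `min_difficulty` and `max_difficulty` in `env/adapt_env.py`.

```python
from collections import deque


class TierCurriculum:
    """Hypothetical stand-in for a promote/demote curriculum controller."""

    def __init__(
        self,
        promote: float = 0.70,
        demote: float = 0.30,
        window: int = 10,
        min_tier: int = 1,
        max_tier: int = 3,
    ) -> None:
        self.promote = promote
        self.demote = demote
        self.min_tier = min_tier
        self.max_tier = max_tier
        self.tier = min_tier
        self.recent: deque[float] = deque(maxlen=window)

    def record(self, pass_rate: float) -> int:
        """Record one episode's pass rate and return the (possibly new) tier."""
        self.recent.append(pass_rate)
        if len(self.recent) < self.recent.maxlen:
            return self.tier  # assumption: decide only on a full window
        average = sum(self.recent) / len(self.recent)
        if average >= self.promote and self.tier < self.max_tier:
            self.tier += 1
            self.recent.clear()  # assumption: start fresh on the new tier
        elif average <= self.demote and self.tier > self.min_tier:
            self.tier -= 1
            self.recent.clear()
        return self.tier


curriculum = TierCurriculum()
tiers_up = [curriculum.record(0.8) for _ in range(10)]
tiers_down = [curriculum.record(0.1) for _ in range(10)]
```

Clearing the window after a move avoids promoting or demoting twice on the same stretch of episodes; a controller that keeps the window would react faster but could oscillate.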
+On top of that, the generator tracks `family_productivity`, an EMA of how educational each family is:
+```text
+family_productivity[family] = 0.9 * old + 0.1 * generator_reward
 ```
+Families that produce pass rates near the learning sweet spot, around `0.5`, become more likely to be sampled via a softmax distribution. This creates a closed loop:
+```text
+productive families -> more samples -> better learning signal -> updated family productivity
 ```
+That makes ADAPT more than a static benchmark. The environment actively searches for the problems that teach the model the most.
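The productivity update and softmax sampling described in this section can be sketched as below. The EMA coefficients come from the README formula; the function names and the `temperature` parameter are hypothetical, and the generator in `env/generator.py` may weight families differently.

```python
import math


def update_productivity(old: float, generator_reward: float) -> float:
    """EMA update from the README: 0.9 * old + 0.1 * generator_reward."""
    return 0.9 * old + 0.1 * generator_reward


def sampling_weights(
    productivity: dict[str, float], temperature: float = 1.0
) -> dict[str, float]:
    """Softmax over family productivity (temperature is an assumed knob)."""
    peak = max(productivity.values())
    # Subtracting the peak keeps exp() numerically stable.
    exps = {
        family: math.exp((value - peak) / temperature)
        for family, value in productivity.items()
    }
    total = sum(exps.values())
    return {family: value / total for family, value in exps.items()}


productivity = {"running_total": 0.0, "min_coins": 0.0}
productivity["running_total"] = update_productivity(
    productivity["running_total"], 1.0
)
weights = sampling_weights(productivity)
```

A family could then be drawn with `random.choices(list(weights), weights=list(weights.values()))`. A lower temperature concentrates sampling on the most productive families, while a higher one keeps exploration closer to uniform.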
+## Results
+[INSERT: reward curve plot]
+[INSERT: baseline vs trained table]
+Recommended artifacts to include here:
+- reward curve from `training/reward_curve.csv`
+- `reward_curve.png`
+- `pass_rate_by_difficulty.png`
+- `family_productivity.png`
+- one before/after repair example from baseline vs trained evaluation
+## How to run
+### 1. Install dependencies
+```powershell
+cd C:\Users\kaust\PycharmProjects\meta-rl-dsa-solver
+python -m venv .venv
+.\.venv\Scripts\pip install -e .
+```
+For training and plotting, also install your training extras:
 ```powershell
+.\.venv\Scripts\pip install trl unsloth matplotlib wandb
 ```
+### 2. Start the OpenEnv server
 ```powershell
+python server\app.py
 ```
+### 3. Reset an environment session
 ```powershell
+curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d "{\"difficulty\":\"easy\"}"
 ```
+The response includes a `session_id`. Reuse it for `step` and `state`.
+### 4. Submit code to `/step`
 ```powershell
+curl -X POST http://localhost:7860/step -H "Content-Type: application/json" -d "{\"session_id\":\"<SESSION_ID>\",\"code\":\"n=int(input())\nnums=list(map(int,input().split()))\nprint(sum(x for x in nums if x % 2 == 0))\"}"
 ```
+### 5. Inspect current state
 ```powershell
+curl "http://localhost:7860/state?session_id=<SESSION_ID>"
 ```
+### 6. Run training
 ```powershell
+python training\train_grpo.py --generator-mode reward_aware --baseline-eval --output-dir outputs_v3
 ```
+### 7. Plot the training curves
 ```powershell
+python training\plot_results.py outputs_v3\reward_curve.csv
 ```
+## Hugging Face Space
+This repo is designed to be hosted as an OpenEnv FastAPI Space.
 ```powershell
 openenv push --repo-id <your-hf-username>/adapt-dsa-tutor
 ```
+## Submission checklist
+- OpenEnv environment with `Environment`, `reset`, `step`, and `state`
+- valid `openenv.yaml`
+- Hugging Face Space deployment
+- GRPO training script with Unsloth + TRL
+- reward and pass-rate plots from a real run
+- baseline vs trained evaluation summary
+- Colab notebook link for reproducibility
+## Links
+- HuggingFace Space URL: [HuggingFace Space URL]
+- Colab Training Notebook: [Colab Training Notebook]
+- HF Blog Post: [HF Blog Post]
+- YouTube Demo: [YouTube Demo]

client.py CHANGED

@@ -11,6 +11,7 @@ class AdaptEnvClient:
     def __init__(self, base_url: str = "http://localhost:7860") -> None:
         self.base_url = base_url.rstrip("/")
         self._client = httpx.Client(base_url=self.base_url, timeout=30.0)
     def close(self) -> None:
         self._client.close()
@@ -18,15 +19,24 @@ class AdaptEnvClient:
     def reset(self, **params: Any) -> dict[str, Any]:
         response = self._client.post("/reset", json=params)
         response.raise_for_status()
-        return response.json()
     def step(self, code: str) -> dict[str, Any]:
-        response = self._client.post("/step", json=AdaptAction(code=code).model_dump())
         response.raise_for_status()
         return response.json()
     def state(self) -> dict[str, Any]:
-        response = self._client.get("/state")
         response.raise_for_status()
         return response.json()

     def __init__(self, base_url: str = "http://localhost:7860") -> None:
         self.base_url = base_url.rstrip("/")
         self._client = httpx.Client(base_url=self.base_url, timeout=30.0)
+        self.session_id: str | None = None
     def close(self) -> None:
         self._client.close()
     def reset(self, **params: Any) -> dict[str, Any]:
         response = self._client.post("/reset", json=params)
         response.raise_for_status()
+        payload = response.json()
+        self.session_id = payload.get("session_id")
+        return payload
     def step(self, code: str) -> dict[str, Any]:
+        if not self.session_id:
+            raise RuntimeError("Call reset() before step() so the client has a session_id.")
+        response = self._client.post(
+            "/step",
+            json=AdaptAction(session_id=self.session_id, code=code).model_dump(),
+        )
         response.raise_for_status()
         return response.json()
     def state(self) -> dict[str, Any]:
+        if not self.session_id:
+            raise RuntimeError("Call reset() before state() so the client has a session_id.")
+        response = self._client.get("/state", params={"session_id": self.session_id})
         response.raise_for_status()
         return response.json()
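The session handshake this change introduces (reset first, then reuse the returned `session_id`) can be exercised without a running server. The sketch below is a stdlib-only illustration, not `AdaptEnvClient` itself: `SessionFlow`, the `Transport` alias, and `fake_post` are hypothetical, and the fake response only mimics the `session_id` field that the diff reads from `/reset`.

```python
from typing import Any, Callable, Optional

# Hypothetical transport: (path, json_body) -> decoded JSON response.
Transport = Callable[[str, dict[str, Any]], dict[str, Any]]


class SessionFlow:
    """Mirrors the reset-then-step contract with an injectable transport."""

    def __init__(self, post: Transport) -> None:
        self._post = post
        self.session_id: Optional[str] = None

    def reset(self, **params: Any) -> dict[str, Any]:
        payload = self._post("/reset", params)
        self.session_id = payload.get("session_id")
        return payload

    def step(self, code: str) -> dict[str, Any]:
        if not self.session_id:
            raise RuntimeError("Call reset() before step().")
        return self._post("/step", {"session_id": self.session_id, "code": code})


def fake_post(path: str, body: dict[str, Any]) -> dict[str, Any]:
    # Stand-in for the HTTP server; echoes the step body back.
    if path == "/reset":
        return {"session_id": "demo-session"}
    return {"echo": body}


flow = SessionFlow(fake_post)
try:
    flow.step("print(1)")
    guarded = False
except RuntimeError:
    guarded = True  # step() before reset() is rejected

flow.reset(difficulty="easy")
reply = flow.step("print(1)")
```

Failing fast on a missing `session_id` keeps the error on the client side instead of surfacing as an opaque HTTP error from `/step`.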

env/adapt_env.py CHANGED

@@ -6,6 +6,7 @@ from uuid import uuid4
 from env.generator import DIFFICULTY_LABELS, GeneratorAgent, generator_reward, validate_problem
 from models import AdaptAction, AdaptObservation, AdaptState
 try:
     from openenv.core.env_server.interfaces import Environment
@@ -22,6 +23,7 @@ except ImportError:
 FORBIDDEN_IMPORTS = {"os", "pathlib", "shutil", "socket", "subprocess"}
 class AdaptEnvironment(Environment[AdaptAction, AdaptObservation, AdaptState]):
@@ -31,14 +33,16 @@ class AdaptEnvironment(Environment[AdaptAction, AdaptObservation, AdaptState]):
         self,
         generator: GeneratorAgent | None = None,
         generator_mode: str = "heuristic",
     ) -> None:
         super().__init__()
         self.generator = generator or GeneratorAgent()
         self.generator_mode = generator_mode
         self.problem: dict[str, Any] = {}
-        self.test_cases: list[dict[str, str]] = []
         self.last_results: list[dict[str, Any]] = []
-        self.max_history = 20
         self.min_difficulty = 1
         self.max_difficulty = 3
         self.difficulty: int = 1
@@ -49,7 +53,17 @@ class AdaptEnvironment(Environment[AdaptAction, AdaptObservation, AdaptState]):
             "problem_signatures": [],
             "episode_index": 0,
         }
-        self._state = AdaptState(episode_id=str(uuid4()), step_count=0, generator_mode=self.generator_mode)
     def reset(
         self,
@@ -59,35 +73,52 @@ class AdaptEnvironment(Environment[AdaptAction, AdaptObservation, AdaptState]):
         difficulty: str | None = None,
         generated_problem: dict[str, Any] | None = None,
         generator_mode: str | None = None,
         **_: Any,
     ) -> AdaptObservation:
         del seed
         if generator_mode is not None:
             self.generator_mode = generator_mode
         if difficulty is not None:
             self.difficulty = self._difficulty_to_tier(difficulty)
-        elif self.history["recent_pass_rates"]:
-            self.difficulty = self._recommend_next_difficulty()
         self.problem = self._load_problem(
             generated_problem=generated_problem,
             problem_id=problem_id,
         )
         self.test_cases = [dict(test_case) for test_case in self.problem["test_cases"]]
         self.last_results = []
         self._state = AdaptState(
             episode_id=episode_id or str(uuid4()),
             step_count=0,
             problem_id=self.problem["problem_id"],
             problem_type=self.problem.get("problem_type", ""),
             difficulty=self.problem.get("difficulty_label", self._tier_to_difficulty(self.difficulty)),
             generator_mode=self.generator_mode,
             generated_problem=self._public_problem_view(),
         )
         return self._build_observation(
             reward=0.0,
             done=False,
-            feedback="Submit Python code that reads stdin and prints the required answer.",
             execution_status="ready",
         )
@@ -98,59 +129,130 @@ class AdaptEnvironment(Environment[AdaptAction, AdaptObservation, AdaptState]):
         **_: Any,
     ) -> AdaptObservation:
         del timeout_s
         if not self.problem:
-            self.reset()
         self._state.step_count += 1
         syntax_ok, syntax_error = self._check_syntax(action.code)
         if not syntax_ok:
             observation = self._build_observation(
                 reward=0.0,
-                done=True,
-                feedback=f"Syntax error: {syntax_error}",
                 syntax_valid=False,
                 execution_status="syntax_error",
-                reward_components={"correctness": 0.0, "format": 0.0},
             )
-            self._finalize_episode(observation)
             return observation
         safety_ok, safety_error = self._check_safety(action.code)
         if not safety_ok:
             observation = self._build_observation(
                 reward=0.0,
-                done=True,
-                feedback=safety_error,
                 syntax_valid=True,
                 execution_status="safety_violation",
-                reward_components={"correctness": 0.0, "format": 0.0},
             )
-            self._finalize_episode(observation)
             return observation
-        reward, metadata = self._verify_submission(action.code)
         self.last_results = list(metadata.get("results", []))
         observation = self._build_observation(
             reward=reward,
-            done=True,
-            feedback=str(metadata.get("feedback", "Evaluation complete.")),
-            pass_rate=float(metadata.get("pass_rate", 0.0)),
-            visible_pass_rate=0.0,
-            hidden_pass_rate=float(metadata.get("pass_rate", 0.0)),
             syntax_valid=True,
-            execution_status=str(metadata.get("execution_status", "completed")),
             timeout_count=int(metadata.get("timeout_count", 0)),
             runtime_error_count=int(metadata.get("runtime_error_count", 0)),
             invalid_output_count=int(metadata.get("invalid_output_count", 0)),
             wrong_answer_count=int(metadata.get("wrong_answer_count", 0)),
             format_compliance=float(metadata.get("format_compliance", 0.0)),
-            reward_components={
-                key: round(float(value), 4)
-                for key, value in dict(metadata.get("reward_components", {})).items()
-            },
             generator_reward_signal=float(metadata.get("generator_reward", 0.0)),
         )
-        self._finalize_episode(observation)
         return observation
     @property
@@ -177,23 +279,26 @@ class AdaptEnvironment(Environment[AdaptAction, AdaptObservation, AdaptState]):
     ) -> AdaptObservation:
         public_problem = self._public_problem_view()
         return AdaptObservation(
             problem_id=self.problem.get("problem_id", ""),
             problem_type=self.problem.get("problem_type", ""),
             difficulty=self.problem.get("difficulty_label", self._tier_to_difficulty(self.difficulty)),
             problem=public_problem.get("problem", ""),
             input_format=public_problem.get("input_format", ""),
             constraints=public_problem.get("constraints", ""),
             feedback=feedback,
-            pass_rate=pass_rate,
-            visible_pass_rate=visible_pass_rate,
-            hidden_pass_rate=hidden_pass_rate,
             syntax_valid=syntax_valid,
             execution_status=execution_status,
             timeout_count=timeout_count,
             runtime_error_count=runtime_error_count,
             invalid_output_count=invalid_output_count,
             wrong_answer_count=wrong_answer_count,
-            format_compliance=format_compliance,
             reward_components=reward_components or {},
             generator_reward_signal=round(float(generator_reward_signal), 4),
             reward=round(max(0.0, min(1.0, reward)), 4),
@@ -204,15 +309,22 @@ class AdaptEnvironment(Environment[AdaptAction, AdaptObservation, AdaptState]):
         self,
         generated_problem: dict[str, Any] | None,
         problem_id: str | None,
     ) -> dict[str, Any]:
-        candidate = generated_problem or self.generator.generate(
             self.difficulty,
             self.history,
             problem_id=problem_id,
         )
         if validate_problem(candidate):
             return candidate
-        fallback = self.generator.generate(self.difficulty, self.history, problem_id=problem_id)
         if not validate_problem(fallback):
             raise ValueError("Generator produced an invalid problem twice in a row.")
         return fallback
@@ -221,53 +333,155 @@ class AdaptEnvironment(Environment[AdaptAction, AdaptObservation, AdaptState]):
         try:
             from verifier.verifier import verify
         except ImportError as exc:
-            return 0.0, {"feedback": f"Verifier unavailable: {exc}", "execution_status": "verifier_error"}
         try:
             reward, metadata = verify(code, self.test_cases)
         except Exception as exc:
-            return 0.0, {"feedback": f"Verifier crashed: {exc}", "execution_status": "verifier_error"}
         metadata = dict(metadata or {})
         diversity_bonus = self._diversity_bonus(self.problem.get("problem_type", ""))
         validity_bonus = float(self.problem.get("validity_bonus", 0.0))
         metadata["generator_reward"] = generator_reward(
-            float(metadata.get("pass_rate", 0.0)),
             diversity_bonus=diversity_bonus,
             validity_bonus=validity_bonus,
         )
         return float(reward), metadata
-    def _finalize_episode(self, observation: AdaptObservation) -> None:
-        self._update_history(observation.pass_rate, observation.generator_reward_signal)
-        self._record_metrics(observation)
-    def _update_history(self, pass_rate: float, generator_signal: float) -> None:
-        self.history["recent_pass_rates"].append(round(float(pass_rate), 4))
-        self.history["problem_types"].append(self.problem.get("problem_type", ""))
-        self.history["problem_signatures"].append(self.problem.get("problem_id", ""))
-        self.history["generator_rewards"].append(round(float(generator_signal), 4))
-        self.history["episode_index"] = int(self.history.get("episode_index", 0)) + 1
-        for key in ("recent_pass_rates", "problem_types", "problem_signatures", "generator_rewards"):
-            values = self.history[key]
-            if len(values) > self.max_history:
-                del values[:-self.max_history]
     def _record_metrics(self, observation: AdaptObservation) -> None:
         self._state.last_reward = float(observation.reward or 0.0)
         self._state.last_pass_rate = observation.pass_rate
         self._state.last_feedback = observation.feedback
         self._state.generator_reward_signal = observation.generator_reward_signal
-        self._state.history = {
-            "recent_pass_rates": list(self.history["recent_pass_rates"]),
-            "problem_types": list(self.history["problem_types"]),
-            "generator_rewards": list(self.history["generator_rewards"]),
-        }
         self._state.recent_metrics = {
             "difficulty_tier": self.difficulty,
             "difficulty_label": self.problem.get("difficulty_label", self._tier_to_difficulty(self.difficulty)),
-            "history_size": len(self.history["recent_pass_rates"]),
             "pass_rate": observation.pass_rate,
             "execution_status": observation.execution_status,
             "timeout_count": observation.timeout_count,
@@ -278,27 +492,47 @@ class AdaptEnvironment(Environment[AdaptAction, AdaptObservation, AdaptState]):
             "reward_components": dict(observation.reward_components),
         }
-    def _recommend_next_difficulty(self) -> int:
-        recent = [float(value) for value in self.history["recent_pass_rates"][-5:]]
-        if not recent:
-            return self.difficulty
-        moving_average = sum(recent) / len(recent)
-        if moving_average > 0.75:
-            return min(self.max_difficulty, self.difficulty + 1)
-        if moving_average < 0.25:
-            return max(self.min_difficulty, self.difficulty - 1)
-        return self.difficulty
     def _public_problem_view(self) -> dict[str, str]:
         visible = dict(self.problem.get("visible_problem", {}))
         return {
-            "problem": visible.get("problem", self.problem.get("problem", "")),
             "input_format": visible.get("input_format", self.problem.get("input_format", "")),
             "constraints": visible.get("constraints", self.problem.get("constraints", "")),
         }
     def _diversity_bonus(self, problem_type: str) -> float:
-        recent_types = list(self.history.get("problem_types", [])[-4:])
         if not recent_types:
             return 0.1
         if problem_type in recent_types:

 from env.generator import DIFFICULTY_LABELS, GeneratorAgent, generator_reward, validate_problem
 from models import AdaptAction, AdaptObservation, AdaptState
+from verifier.metrics import compute_reward
 try:
     from openenv.core.env_server.interfaces import Environment
 FORBIDDEN_IMPORTS = {"os", "pathlib", "shutil", "socket", "subprocess"}
+MAX_STEPS_PER_EPISODE = 3
 class AdaptEnvironment(Environment[AdaptAction, AdaptObservation, AdaptState]):
         self,
         generator: GeneratorAgent | None = None,
         generator_mode: str = "heuristic",
+        session_id: str | None = None,
     ) -> None:
         super().__init__()
         self.generator = generator or GeneratorAgent()
         self.generator_mode = generator_mode
+        self.session_id = session_id or str(uuid4())
         self.problem: dict[str, Any] = {}
+        self.test_cases: list[dict[str, Any]] = []
         self.last_results: list[dict[str, Any]] = []
+        self.max_history = 50
         self.min_difficulty = 1
         self.max_difficulty = 3
         self.difficulty: int = 1
             "problem_signatures": [],
             "episode_index": 0,
         }
+        self.attempt_history: list[dict[str, Any]] = []
+        self.previous_execution_status = "ready"
+        self.episode_done = False
+        self._state = AdaptState(
+            session_id=self.session_id,
+            episode_id=str(uuid4()),
+            step_count=0,
+            generator_mode=self.generator_mode,
+            max_steps=MAX_STEPS_PER_EPISODE,
+            history={"attempts": []},
+        )
     def reset(
         self,
         difficulty: str | None = None,
         generated_problem: dict[str, Any] | None = None,
         generator_mode: str | None = None,
+        session_id: str | None = None,
+        family_weights: dict[str, float] | None = None,
         **_: Any,
     ) -> AdaptObservation:
         del seed
+        if session_id:
+            self.session_id = session_id
         if generator_mode is not None:
             self.generator_mode = generator_mode
         if difficulty is not None:
             self.difficulty = self._difficulty_to_tier(difficulty)
+        elif generated_problem is not None:
+            generated_label = str(generated_problem.get("difficulty_label", "")).strip().lower()
+            if generated_label:
+                self.difficulty = self._difficulty_to_tier(generated_label)
         self.problem = self._load_problem(
             generated_problem=generated_problem,
             problem_id=problem_id,
+            family_weights=family_weights,
         )
         self.test_cases = [dict(test_case) for test_case in self.problem["test_cases"]]
         self.last_results = []
+        self.attempt_history = []
+        self.previous_execution_status = "ready"
+        self.episode_done = False
         self._state = AdaptState(
+            session_id=self.session_id,
             episode_id=episode_id or str(uuid4()),
             step_count=0,
             problem_id=self.problem["problem_id"],
             problem_type=self.problem.get("problem_type", ""),
             difficulty=self.problem.get("difficulty_label", self._tier_to_difficulty(self.difficulty)),
             generator_mode=self.generator_mode,
+            max_steps=MAX_STEPS_PER_EPISODE,
             generated_problem=self._public_problem_view(),
+            history={"attempts": []},
         )
         return self._build_observation(
             reward=0.0,
             done=False,
+            feedback=(
+                "You have up to 3 attempts. Submit Python code that reads stdin and prints the required answer. "
+                "Use the examples to infer the expected behavior."
+            ),
             execution_status="ready",
         )
         **_: Any,
     ) -> AdaptObservation:
         del timeout_s
         if not self.problem:
+            self.reset(session_id=action.session_id or self.session_id)
+        if self.episode_done:
+            return self._build_observation(
+                reward=float(self._state.last_reward or 0.0),
+                done=True,
+                feedback="This episode is finished. Call reset() to start a new problem.",
+                pass_rate=float(self._state.last_pass_rate or 0.0),
+                visible_pass_rate=float(self._state.recent_metrics.get("visible_pass_rate", 0.0)),
+                hidden_pass_rate=float(self._state.last_pass_rate or 0.0),
+                syntax_valid=self._state.last_execution_status != "syntax_error",
+                execution_status=self._state.last_execution_status or "completed",
+                timeout_count=int(self._state.recent_metrics.get("timeout_count", 0)),
+                runtime_error_count=int(self._state.recent_metrics.get("runtime_error_count", 0)),
+                invalid_output_count=int(self._state.recent_metrics.get("invalid_output_count", 0)),
+                wrong_answer_count=int(self._state.recent_metrics.get("wrong_answer_count", 0)),
+                format_compliance=float(self._state.recent_metrics.get("format_compliance", 0.0)),
+                reward_components=dict(self._state.recent_metrics.get("reward_components", {})),
+                generator_reward_signal=float(self._state.generator_reward_signal or 0.0),
+            )
         self._state.step_count += 1
+        attempt_number = self._state.step_count
+        previous_status = self.previous_execution_status
+        previous_pass_rate = float(self._state.last_pass_rate or 0.0)
         syntax_ok, syntax_error = self._check_syntax(action.code)
         if not syntax_ok:
+            done = attempt_number >= MAX_STEPS_PER_EPISODE
             observation = self._build_observation(
                 reward=0.0,
+                done=done,
+                feedback=self._format_static_feedback(
+                    attempt_number=attempt_number,
+                    previous_status=previous_status,
+                    execution_status="syntax_error",
+                    details=f"Syntax error: {syntax_error}",
+                ),
                 syntax_valid=False,
                 execution_status="syntax_error",
+                reward_components={
+                    "correctness": 0.0,
+                    "step_discount": 1.0 if attempt_number == 1 else (0.85 if attempt_number == 2 else 0.70),
+                    "progress_delta": 0.0,
+                },
             )
+            self.last_results = []
+            self.previous_execution_status = observation.execution_status
+            self._record_metrics(observation)
+            if done:
+                self._finalize_episode(observation)
             return observation
         safety_ok, safety_error = self._check_safety(action.code)
         if not safety_ok:
+            done = attempt_number >= MAX_STEPS_PER_EPISODE
             observation = self._build_observation(
                 reward=0.0,
+                done=done,
+                feedback=self._format_static_feedback(
+                    attempt_number=attempt_number,
+                    previous_status=previous_status,
+                    execution_status="safety_violation",
+                    details=safety_error,
+                ),
                 syntax_valid=True,
                 execution_status="safety_violation",
+                reward_components={
+                    "correctness": 0.0,
+                    "step_discount": 1.0 if attempt_number == 1 else (0.85 if attempt_number == 2 else 0.70),
+                    "progress_delta": 0.0,
+                },
             )
+            self.last_results = []
+            self.previous_execution_status = observation.execution_status
+            self._record_metrics(observation)
+            if done:
+                self._finalize_episode(observation)
             return observation
+        _, metadata = self._verify_submission(action.code)
         self.last_results = list(metadata.get("results", []))
+        hidden_pass_rate = float(metadata.get("hidden_pass_rate", metadata.get("pass_rate", 0.0)))
+        visible_pass_rate = float(metadata.get("visible_pass_rate", 0.0))
+        execution_status = str(metadata.get("execution_status", "completed"))
+        done = hidden_pass_rate == 1.0 or attempt_number >= MAX_STEPS_PER_EPISODE
+        reward, reward_components = self._shape_reward(
+            pass_rate=hidden_pass_rate,
+            step_number=attempt_number,
+            execution_status=execution_status,
+            previous_pass_rate=previous_pass_rate,
+            done=done,
+        )
+        feedback = self._format_feedback(
+            results=self.last_results,
+            attempt_number=attempt_number,
+            previous_status=previous_status,
+            execution_status=execution_status,
+            hidden_pass_rate=hidden_pass_rate,
+            visible_pass_rate=visible_pass_rate,
+        )
         observation = self._build_observation(
             reward=reward,
+            done=done,
+            feedback=feedback,
+            pass_rate=hidden_pass_rate,
+            visible_pass_rate=visible_pass_rate,
+            hidden_pass_rate=hidden_pass_rate,
             syntax_valid=True,
+            execution_status=execution_status,
             timeout_count=int(metadata.get("timeout_count", 0)),
             runtime_error_count=int(metadata.get("runtime_error_count", 0)),
             invalid_output_count=int(metadata.get("invalid_output_count", 0)),
             wrong_answer_count=int(metadata.get("wrong_answer_count", 0)),
             format_compliance=float(metadata.get("format_compliance", 0.0)),
+            reward_components=reward_components,
             generator_reward_signal=float(metadata.get("generator_reward", 0.0)),
         )
+        self.previous_execution_status = observation.execution_status
+        self._record_metrics(observation)
+        if done:
+            self._finalize_episode(observation)
         return observation
     @property
     ) -> AdaptObservation:
         public_problem = self._public_problem_view()
         return AdaptObservation(
+            session_id=self.session_id,
             problem_id=self.problem.get("problem_id", ""),
             problem_type=self.problem.get("problem_type", ""),
             difficulty=self.problem.get("difficulty_label", self._tier_to_difficulty(self.difficulty)),
+            attempt_number=self._state.step_count,
+            max_steps=MAX_STEPS_PER_EPISODE,
             problem=public_problem.get("problem", ""),
             input_format=public_problem.get("input_format", ""),
             constraints=public_problem.get("constraints", ""),
             feedback=feedback,
+            pass_rate=round(float(pass_rate), 4),
+            visible_pass_rate=round(float(visible_pass_rate), 4),
+            hidden_pass_rate=round(float(hidden_pass_rate), 4),
             syntax_valid=syntax_valid,
             execution_status=execution_status,
             timeout_count=timeout_count,
             runtime_error_count=runtime_error_count,
             invalid_output_count=invalid_output_count,
             wrong_answer_count=wrong_answer_count,
+            format_compliance=round(float(format_compliance), 4),
             reward_components=reward_components or {},
             generator_reward_signal=round(float(generator_reward_signal), 4),
             reward=round(max(0.0, min(1.0, reward)), 4),
         self,
         generated_problem: dict[str, Any] | None,
         problem_id: str | None,
+        family_weights: dict[str, float] | None,
     ) -> dict[str, Any]:
+        candidate = generated_problem or self.generator.generate_problem(
             self.difficulty,
             self.history,
             problem_id=problem_id,
+            family_weights=family_weights,
         )
         if validate_problem(candidate):
             return candidate
+        fallback = self.generator.generate_problem(
+            self.difficulty,
+            self.history,
+            problem_id=problem_id,
+            family_weights=family_weights,
+        )
         if not validate_problem(fallback):
             raise ValueError("Generator produced an invalid problem twice in a row.")
         return fallback
         try:
             from verifier.verifier import verify
         except ImportError as exc:
+            return 0.0, {
+                "feedback": f"Verifier unavailable: {exc}",
+                "execution_status": "verifier_error",
+                "results": [],
+            }
         try:
             reward, metadata = verify(code, self.test_cases)
         except Exception as exc:
+            return 0.0, {
+                "feedback": f"Verifier crashed: {exc}",
+                "execution_status": "verifier_error",
+                "results": [],
+            }
         metadata = dict(metadata or {})
         diversity_bonus = self._diversity_bonus(self.problem.get("problem_type", ""))
         validity_bonus = float(self.problem.get("validity_bonus", 0.0))
+        hidden_pass_rate = float(metadata.get("hidden_pass_rate", metadata.get("pass_rate", 0.0)))
         metadata["generator_reward"] = generator_reward(
+            hidden_pass_rate,
             diversity_bonus=diversity_bonus,
             validity_bonus=validity_bonus,
         )
         return float(reward), metadata
+    def _shape_reward(
+        self,
+        pass_rate: float,
+        step_number: int,
+        execution_status: str,
+        previous_pass_rate: float,
+        done: bool,
+    ) -> tuple[float, dict[str, float]]:
+        step_discount = 1.0 if step_number == 1 else (0.85 if step_number == 2 else 0.70)
+        progress_delta = max(0.0, float(pass_rate) - float(previous_pass_rate))
+        if execution_status in {"timeout", "syntax_error", "safety_violation"}:
+            reward = 0.0
+        elif pass_rate == 1.0:
+            reward = compute_reward(
+                pass_rate=pass_rate,
+                step_number=step_number,
+                execution_status=execution_status,
+                format_compliance=0.0,
+            )
+        elif done:
+            reward = 0.0
+        else:
+            reward = round(0.1 * progress_delta, 4)
+        return reward, {
+            "correctness": round(float(pass_rate), 4),
+            "step_discount": round(step_discount, 4),
+            "progress_delta": round(progress_delta, 4),
+            "reward": round(float(reward), 4),
+        }
+    def _format_feedback(
+        self,
+        results: list[dict[str, Any]],
+        attempt_number: int,
+        previous_status: str,
+        execution_status: str,
+        hidden_pass_rate: float,
+        visible_pass_rate: float,
+    ) -> str:
+        lines = [
+            f"Attempt {attempt_number}/{MAX_STEPS_PER_EPISODE}.",
+            f"Previous attempt status: {previous_status}.",
+            f"Current execution status: {execution_status}.",
+            f"Hidden pass rate: {hidden_pass_rate:.2f}. Visible pass rate: {visible_pass_rate:.2f}.",
+        ]
+        failed_tests = self._summarize_failed_tests(results)
+        if failed_tests:
+            lines.append("Failed tests:")
+            lines.extend(failed_tests)
+        elif hidden_pass_rate == 1.0:
+            lines.append("All hidden tests passed.")
+        else:
+            lines.append("No failing test details were available.")
+        return "\n".join(lines)
+    def _format_static_feedback(
+        self,
+        attempt_number: int,
+        previous_status: str,
+        execution_status: str,
+        details: str,
+    ) -> str:
+        return "\n".join(
+            [
+                f"Attempt {attempt_number}/{MAX_STEPS_PER_EPISODE}.",
+                f"Previous attempt status: {previous_status}.",
+                f"Current execution status: {execution_status}.",
+                details,
+            ]
+        )
+    def _summarize_failed_tests(self, results: list[dict[str, Any]]) -> list[str]:
+        summaries: list[str] = []
+        for result in results:
+            if result.get("passed", False):
+                continue
+            visibility = str(result.get("visibility", "hidden"))
+            label = f"{visibility.title()} test #{int(result.get('index', 0)) + 1}"
+            status = str(result.get("status", "unknown"))
+            if visibility == "visible":
+                actual = str(result.get("stdout", "")).strip()
+                expected = str(result.get("expected", "")).strip()
+                details = []
+                if expected:
+                    details.append(f"expected={expected}")
+                if actual:
+                    details.append(f"got={actual}")
+                if result.get("stderr"):
+                    details.append("stderr_present")
+                if details:
+                    summaries.append(f"- {label}: {status} ({', '.join(details)})")
+                else:
+                    summaries.append(f"- {label}: {status}")
+            else:
+                summaries.append(f"- {label}: {status}")
+        return summaries
     def _record_metrics(self, observation: AdaptObservation) -> None:
+        attempt_record = {
+            "attempt_number": observation.attempt_number,
+            "reward": float(observation.reward or 0.0),
+            "pass_rate": float(observation.pass_rate),
+            "visible_pass_rate": float(observation.visible_pass_rate),
+            "execution_status": observation.execution_status,
+            "feedback": observation.feedback,
+            "done": bool(observation.done),
+        }
+        self.attempt_history.append(attempt_record)
         self._state.last_reward = float(observation.reward or 0.0)
         self._state.last_pass_rate = observation.pass_rate
         self._state.last_feedback = observation.feedback
+        self._state.last_execution_status = observation.execution_status
         self._state.generator_reward_signal = observation.generator_reward_signal
+        self._state.history = {"attempts": list(self.attempt_history)}
+        self._state.generated_problem = self._public_problem_view()
         self._state.recent_metrics = {
             "difficulty_tier": self.difficulty,
             "difficulty_label": self.problem.get("difficulty_label", self._tier_to_difficulty(self.difficulty)),
+            "visible_pass_rate": observation.visible_pass_rate,
             "pass_rate": observation.pass_rate,
             "execution_status": observation.execution_status,
             "timeout_count": observation.timeout_count,
             "reward_components": dict(observation.reward_components),
         }
+    def _finalize_episode(self, observation: AdaptObservation) -> None:
+        self.episode_done = True
+        self._update_history(observation.pass_rate, observation.generator_reward_signal)
+    def _update_history(self, pass_rate: float, generator_signal: float) -> None:
+        self.history["recent_pass_rates"].append(round(float(pass_rate), 4))
+        self.history["problem_types"].append(self.problem.get("problem_type", ""))
+        self.history["problem_signatures"].append(self.problem.get("problem_id", ""))
+        self.history["generator_rewards"].append(round(float(generator_signal), 4))
+        self.history["episode_index"] = int(self.history.get("episode_index", 0)) + 1
+        for key in ("recent_pass_rates", "problem_types", "problem_signatures", "generator_rewards"):
+            values = self.history[key]
+            if len(values) > self.max_history:
+                del values[:-self.max_history]
     def _public_problem_view(self) -> dict[str, str]:
         visible = dict(self.problem.get("visible_problem", {}))
+        base_problem = visible.get("problem", self.problem.get("problem", ""))
+        examples = self._format_examples()
+        if examples:
+            base_problem = f"{base_problem}\n\nExamples:\n{examples}"
         return {
+            "problem": base_problem,
             "input_format": visible.get("input_format", self.problem.get("input_format", "")),
             "constraints": visible.get("constraints", self.problem.get("constraints", "")),
         }
+    def _format_examples(self) -> str:
+        visible_cases = [test_case for test_case in self.test_cases if test_case.get("is_visible", False)]
+        if not visible_cases:
+            return ""
+        chunks = []
+        for test_case in visible_cases:
+            chunks.append(
+                f"Input:\n{test_case['input']}Expected Output:\n{test_case['output']}\n"
+            )
+        return "\n".join(chunks).rstrip()
     def _diversity_bonus(self, problem_type: str) -> float:
+        recent_types = list(self.history.get("problem_types", [])[-6:])
         if not recent_types:
             return 0.1
         if problem_type in recent_types:

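For readers of the `env/adapt_env.py` changes above, the attempt-based reward shaping in `_shape_reward` can be sketched as a standalone function. This is a minimal illustration, not the committed code: `shape_reward`, `step_discount`, and `ZERO_REWARD_STATUSES` are hypothetical names, and because `compute_reward` (from `verifier/metrics.py`) is not shown in this part of the diff, the full-solve branch is assumed to return just the step discount.

```python
# Statuses that zero out the reward regardless of pass rate, as in the diff.
ZERO_REWARD_STATUSES = {"timeout", "syntax_error", "safety_violation"}


def step_discount(step_number: int) -> float:
    """Discount schedule from the diff: 1.0, then 0.85, then 0.70."""
    if step_number == 1:
        return 1.0
    return 0.85 if step_number == 2 else 0.70


def shape_reward(
    pass_rate: float,
    step_number: int,
    execution_status: str,
    previous_pass_rate: float,
    done: bool,
) -> float:
    """Mirror the branch order of _shape_reward under the stated assumption."""
    # Only improvements over the previous attempt count as progress.
    progress_delta = max(0.0, pass_rate - previous_pass_rate)
    if execution_status in ZERO_REWARD_STATUSES:
        return 0.0
    if pass_rate == 1.0:
        # ASSUMPTION: stand-in for compute_reward(...), whose body is not
        # visible here; the real function may weight other terms.
        return step_discount(step_number)
    if done:
        # Final attempt without a full solve earns nothing.
        return 0.0
    # Intermediate attempts earn a small bonus for partial progress.
    return round(0.1 * progress_delta, 4)
```

Under this sketch, a partial improvement on a non-final attempt yields only a small shaped reward, while a full solve on a later attempt is discounted relative to a first-attempt solve.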
env/generator.py CHANGED

@@ -1,13 +1,15 @@
 from __future__ import annotations
 import hashlib
-import math
 import random
 from dataclasses import dataclass
 from typing import Any, Callable
-VISIBLE_TEST_COUNT = 0
-MIN_TEST_CASES = 5
 @dataclass(frozen=True)
@@ -17,9 +19,9 @@ class ProblemTemplate:
     title: str
     input_format: str
     constraints: str
-    statement_builder: Callable[[dict[str, Any]], str]
     solver: Callable[[str], str]
-    case_builder: Callable[[random.Random, float], list[str]]
 def generator_reward(
@@ -54,14 +56,15 @@ def validate_problem(problem_dict: dict[str, Any]) -> bool:
         return False
     test_cases = problem_dict.get("test_cases")
-    if not isinstance(test_cases, list) or len(test_cases) < MIN_TEST_CASES:
         return False
     seen_inputs: set[str] = set()
     distinct_outputs: set[str] = set()
     visible_count = 0
-    for test_case in test_cases:
         if not isinstance(test_case, dict):
             return False
@@ -76,12 +79,26 @@ def validate_problem(problem_dict: dict[str, Any]) -> bool:
             return False
         seen_inputs.add(raw_input)
         distinct_outputs.add(raw_output.strip())
-        visible_count += 1 if is_visible else 0
-    if visible_count != VISIBLE_TEST_COUNT:
         return False
-    if len(distinct_outputs) < max(3, len(test_cases) // 3):
         return False
     return True
@@ -94,46 +111,46 @@ class GeneratorAgent:
         self.deterministic = deterministic
         self.templates = _build_templates()
-    def generate(
         self,
         difficulty_level: int | float | str,
         history: dict[str, Any] | None,
         problem_id: str | None = None,
     ) -> dict[str, Any]:
         history = history or {}
         target_tier = _difficulty_to_tier(difficulty_level)
-        adjusted_tier = self._adjust_tier(target_tier, history)
-        rng = self._rng_for(adjusted_tier, history, problem_id)
-        template = self._choose_template(adjusted_tier, history, rng, forced_problem_type=problem_id)
-        for attempt in range(10):
-            params = {
-                "window": 3 + adjusted_tier,
-                "modulus": 10 + 5 * adjusted_tier,
-                "max_n": 8 + adjusted_tier * 4,
-                "attempt": attempt,
-            }
-            raw_cases = template.case_builder(rng, 0.2 + adjusted_tier * 0.25)
             test_cases = [
                 {
                     "input": case_input,
                     "output": template.solver(case_input),
-                    "is_visible": False,
                 }
-                for case_input in raw_cases
             ]
             signature = self._problem_signature(template.problem_type, test_cases)
             problem = {
                 "problem_id": f"{template.problem_type}_{signature[:8]}",
                 "problem_type": template.problem_type,
-                "difficulty": round(self._tier_to_scalar(adjusted_tier), 4),
-                "difficulty_label": DIFFICULTY_LABELS[adjusted_tier],
-                "problem": template.statement_builder(params),
                 "input_format": template.input_format,
                 "constraints": template.constraints,
                 "test_cases": test_cases,
                 "visible_problem": {
-                    "problem": template.statement_builder(params),
                     "input_format": template.input_format,
                     "constraints": template.constraints,
                 },
@@ -145,17 +162,19 @@ class GeneratorAgent:
         raise ValueError(f"Unable to generate a valid problem for template {template.problem_type}")
-    def _adjust_tier(self, target_tier: int, history: dict[str, Any]) -> int:
-        recent_pass_rates = [float(value) for value in history.get("recent_pass_rates", [])[-5:]]
-        if not recent_pass_rates:
-            return target_tier
-        moving_average = sum(recent_pass_rates) / len(recent_pass_rates)
-        if moving_average > 0.8:
-            return min(3, target_tier + 1)
-        if moving_average < 0.2:
-            return max(1, target_tier - 1)
-        return target_tier
     def _choose_template(
         self,
@@ -163,6 +182,7 @@ class GeneratorAgent:
         history: dict[str, Any],
         rng: random.Random,
         forced_problem_type: str | None = None,
     ) -> ProblemTemplate:
         eligible = [template for template in self.templates if template.difficulty_tier == tier]
         if not eligible:
@@ -176,27 +196,37 @@ class GeneratorAgent:
                 if template.problem_type == forced_problem_type:
                     return template
-        recent_types = list(history.get("problem_types", [])[-4:])
-        weighted: list[tuple[float, ProblemTemplate]] = []
         for template in eligible:
-            repetition_penalty = 0.35 if template.problem_type in recent_types else 0.0
-            jitter = rng.random() * 0.2
-            weighted.append((1.0 - repetition_penalty + jitter, template))
-        weighted.sort(key=lambda item: item[0], reverse=True)
-        return weighted[0][1]
     def _rng_for(
         self,
         tier: int,
         history: dict[str, Any],
         problem_id: str | None,
     ) -> random.Random:
         seed_material = {
             "tier": tier,
             "problem_id": problem_id or "",
             "pass_rates": [round(float(value), 4) for value in history.get("recent_pass_rates", [])[-8:]],
             "problem_types": list(history.get("problem_types", [])[-8:]),
             "episode_index": int(history.get("episode_index", 0)),
         }
         digest = hashlib.sha256(repr(seed_material).encode("utf-8")).hexdigest()
         return random.Random(int(digest[:16], 16))
@@ -246,7 +276,7 @@ def _build_templates() -> list[ProblemTemplate]:
             title="Sum Even Numbers",
             input_format="The first line contains n. The second line contains n space-separated integers.",
             constraints="1 <= n <= 12; -100 <= values[i] <= 100",
-            statement_builder=lambda _: (
                 "Given a list of integers, print the sum of the numbers that are even. "
                 "If no number is even, print 0."
             ),
@@ -259,19 +289,71 @@ def _build_templates() -> list[ProblemTemplate]:
             title="Range Span",
             input_format="The first line contains n. The second line contains n space-separated integers.",
             constraints="2 <= n <= 12; -100 <= values[i] <= 100",
-            statement_builder=lambda _: (
                 "Given a list of integers, print the difference between the maximum and minimum value."
             ),
             solver=_solve_range_span,
             case_builder=_build_range_span_cases,
         ),
         ProblemTemplate(
             problem_type="count_local_peaks",
             difficulty_tier=2,
             title="Count Local Peaks",
             input_format="The first line contains n. The second line contains n space-separated integers.",
-            constraints="3 <= n <= 14; -100 <= values[i] <= 100",
-            statement_builder=lambda _: (
                 "Count how many indices i are local peaks, meaning values[i] is strictly greater than both "
                 "values[i-1] and values[i+1]. The first and last element can never be peaks."
             ),
@@ -283,20 +365,81 @@ def _build_templates() -> list[ProblemTemplate]:
             difficulty_tier=2,
             title="Longest Non-Decreasing Run",
             input_format="The first line contains n. The second line contains n space-separated integers.",
-            constraints="1 <= n <= 16; -100 <= values[i] <= 100",
-            statement_builder=lambda _: (
                 "Find the length of the longest contiguous subarray whose values are non-decreasing."
             ),
             solver=_solve_longest_non_decreasing_run,
             case_builder=_build_run_cases,
         ),
         ProblemTemplate(
             problem_type="smallest_most_frequent",
             difficulty_tier=3,
             title="Smallest Most Frequent",
             input_format="The first line contains n. The second line contains n space-separated integers.",
-            constraints="1 <= n <= 18; -30 <= values[i] <= 30",
-            statement_builder=lambda _: (
                 "Print the value that appears most often in the array. If several values have the same highest "
                 "frequency, print the smallest of them."
             ),
@@ -308,79 +451,343 @@ def _build_templates() -> list[ProblemTemplate]:
             difficulty_tier=3,
             title="Reverse Words",
             input_format="A single line containing one or more words separated by spaces.",
-            constraints="1 <= line length <= 80",
-            statement_builder=lambda _: (
                 "Read a line of text and print the words in reverse order. Multiple spaces in the input should "
                 "be treated as a single separator."
             ),
             solver=_solve_reverse_words,
             case_builder=_build_reverse_word_cases,
         ),
     ]
-def _build_sum_even_cases(rng: random.Random, difficulty_scalar: float) -> list[str]:
-    size = 5 + math.ceil(difficulty_scalar * 5)
-    cases = set()
-    while len(cases) < 6:
-        numbers = [rng.randint(-25, 25) for _ in range(size + rng.randint(0, 3))]
-        if all(number % 2 for number in numbers):
-            numbers[0] = 0
-        cases.add(_array_case(numbers))
-    return list(cases)
-def _build_range_span_cases(rng: random.Random, difficulty_scalar: float) -> list[str]:
-    size = 4 + math.ceil(difficulty_scalar * 6)
-    cases = set()
-    while len(cases) < 6:
-        numbers = [rng.randint(-40, 40) for _ in range(size + rng.randint(0, 3))]
-        if len(set(numbers)) == 1:
-            numbers[-1] += 3
-        cases.add(_array_case(numbers))
-    return list(cases)
-def _build_peak_cases(rng: random.Random, difficulty_scalar: float) -> list[str]:
-    size = 5 + math.ceil(difficulty_scalar * 6)
-    cases = set()
-    while len(cases) < 6:
-        numbers = []
-        current = rng.randint(-10, 10)
-        for index in range(size + rng.randint(0, 4)):
-            delta = rng.randint(-6, 6)
-            if index % 2 == 1:
-                delta = abs(delta) + 1
-            current += delta
-            numbers.append(current)
-        numbers[0] -= 5
-        numbers[-1] -= 5
-        cases.add(_array_case(numbers))
-    return list(cases)
-def _build_run_cases(rng: random.Random, difficulty_scalar: float) -> list[str]:
-    size = 6 + math.ceil(difficulty_scalar * 6)
-    cases = set()
-    while len(cases) < 6:
-        numbers = [rng.randint(-20, 20)]
-        for _ in range(size + rng.randint(0, 4) - 1):
-            numbers.append(numbers[-1] + rng.randint(-5, 5))
-        cases.add(_array_case(numbers))
-    return list(cases)
-def _build_frequency_cases(rng: random.Random, difficulty_scalar: float) -> list[str]:
-    size = 8 + math.ceil(difficulty_scalar * 6)
-    cases = set()
-    while len(cases) < 6:
-        numbers = [rng.randint(-6, 6) for _ in range(size + rng.randint(0, 5))]
-        numbers.extend([rng.choice(numbers), rng.choice(numbers)])
-        cases.add(_array_case(numbers))
-    return list(cases)
-def _build_reverse_word_cases(rng: random.Random, difficulty_scalar: float) -> list[str]:
     vocabulary = [
         "graph",
         "queue",
@@ -395,21 +802,170 @@ def _build_reverse_word_cases(rng: random.Random, difficulty_scalar: float) -> l
         "node",
         "edge",
     ]
-    word_count = 4 + math.ceil(difficulty_scalar * 4)
-    cases = set()
-    while len(cases) < 6:
-        words = [rng.choice(vocabulary) for _ in range(word_count + rng.randint(0, 2))]
-        spacer = " " * rng.randint(1, 3)
-        prefix = " " * rng.randint(0, 2)
-        suffix = " " * rng.randint(0, 2)
-        cases.add(f"{prefix}{spacer.join(words)}{suffix}\n")
-    return list(cases)
 def _array_case(numbers: list[int]) -> str:
     return f"{len(numbers)}\n{' '.join(str(number) for number in numbers)}\n"
 def _solve_sum_even_numbers(stdin: str) -> str:
     _, numbers = _parse_int_array(stdin)
     return str(sum(number for number in numbers if number % 2 == 0))
@@ -420,6 +976,47 @@ def _solve_range_span(stdin: str) -> str:
     return str(max(numbers) - min(numbers))
 def _solve_count_local_peaks(stdin: str) -> str:
     _, numbers = _parse_int_array(stdin)
     peaks = 0
@@ -442,11 +1039,60 @@ def _solve_longest_non_decreasing_run(stdin: str) -> str:
     return str(best)
 def _solve_smallest_most_frequent(stdin: str) -> str:
     _, numbers = _parse_int_array(stdin)
-    counts: dict[int, int] = {}
-    for number in numbers:
-        counts[number] = counts.get(number, 0) + 1
     best_count = max(counts.values())
     best_value = min(number for number, count in counts.items() if count == best_count)
     return str(best_value)
@@ -457,6 +1103,76 @@ def _solve_reverse_words(stdin: str) -> str:
     return " ".join(reversed(words))
 def _parse_int_array(stdin: str) -> tuple[int, list[int]]:
     lines = [line.strip() for line in stdin.strip().splitlines() if line.strip()]
     n = int(lines[0])
@@ -464,3 +1180,146 @@ def _parse_int_array(stdin: str) -> tuple[int, list[int]]:
     if len(numbers) != n:
         raise ValueError(f"Expected {n} integers, received {len(numbers)}")
     return n, numbers

 from __future__ import annotations
 import hashlib
 import random
+from collections import Counter, deque
 from dataclasses import dataclass
 from typing import Any, Callable
+VISIBLE_TEST_COUNT = 2
+HIDDEN_TEST_COUNT = 8
+TOTAL_TEST_CASES = VISIBLE_TEST_COUNT + HIDDEN_TEST_COUNT
+MIN_TEST_CASES = TOTAL_TEST_CASES
 @dataclass(frozen=True)
     title: str
     input_format: str
     constraints: str
+    statement_builder: Callable[[], str]
     solver: Callable[[str], str]
+    case_builder: Callable[[random.Random], list[str]]
 def generator_reward(
         return False
     test_cases = problem_dict.get("test_cases")
+    if not isinstance(test_cases, list) or len(test_cases) != TOTAL_TEST_CASES:
         return False
     seen_inputs: set[str] = set()
     distinct_outputs: set[str] = set()
     visible_count = 0
+    hidden_count = 0
+    for index, test_case in enumerate(test_cases):
         if not isinstance(test_case, dict):
             return False
             return False
         seen_inputs.add(raw_input)
         distinct_outputs.add(raw_output.strip())
+        if index < VISIBLE_TEST_COUNT and not is_visible:
+            return False
+        if index >= VISIBLE_TEST_COUNT and is_visible:
+            return False
+        if is_visible:
+            visible_count += 1
+        else:
+            hidden_count += 1
+    if visible_count != VISIBLE_TEST_COUNT or hidden_count != HIDDEN_TEST_COUNT:
         return False
+    normalized_outputs = {output.strip().lower() for output in distinct_outputs}
+    min_output_diversity = 2 if normalized_outputs.issubset({"yes", "no", "true", "false", "0", "1"}) else max(
+        3,
+        len(test_cases) // 3,
+    )
+    if len(distinct_outputs) < min_output_diversity:
         return False
     return True
         self.deterministic = deterministic
         self.templates = _build_templates()
+    def generate_problem(
         self,
         difficulty_level: int | float | str,
         history: dict[str, Any] | None,
         problem_id: str | None = None,
+        family_weights: dict[str, float] | None = None,
     ) -> dict[str, Any]:
         history = history or {}
         target_tier = _difficulty_to_tier(difficulty_level)
+        rng = self._rng_for(target_tier, history, problem_id, family_weights or {})
+        template = self._choose_template(
+            target_tier,
+            history,
+            rng,
+            forced_problem_type=problem_id,
+            family_weights=family_weights or {},
+        )
+        for _ in range(20):
+            raw_cases = template.case_builder(rng)
             test_cases = [
                 {
                     "input": case_input,
                     "output": template.solver(case_input),
+                    "is_visible": index < VISIBLE_TEST_COUNT,
                 }
+                for index, case_input in enumerate(raw_cases)
             ]
             signature = self._problem_signature(template.problem_type, test_cases)
             problem = {
                 "problem_id": f"{template.problem_type}_{signature[:8]}",
                 "problem_type": template.problem_type,
+                "difficulty": round(self._tier_to_scalar(target_tier), 4),
+                "difficulty_label": DIFFICULTY_LABELS[target_tier],
+                "problem": template.statement_builder(),
                 "input_format": template.input_format,
                 "constraints": template.constraints,
                 "test_cases": test_cases,
                 "visible_problem": {
+                    "problem": template.statement_builder(),
                     "input_format": template.input_format,
                     "constraints": template.constraints,
                 },
         raise ValueError(f"Unable to generate a valid problem for template {template.problem_type}")
+    def generate(
+        self,
+        difficulty_level: int | float | str,
+        history: dict[str, Any] | None,
+        problem_id: str | None = None,
+        family_weights: dict[str, float] | None = None,
+    ) -> dict[str, Any]:
+        return self.generate_problem(
+            difficulty_level=difficulty_level,
+            history=history,
+            problem_id=problem_id,
+            family_weights=family_weights,
+        )
     def _choose_template(
         self,
         history: dict[str, Any],
         rng: random.Random,
         forced_problem_type: str | None = None,
+        family_weights: dict[str, float] | None = None,
     ) -> ProblemTemplate:
         eligible = [template for template in self.templates if template.difficulty_tier == tier]
         if not eligible:
                 if template.problem_type == forced_problem_type:
                     return template
+        recent_types = list(history.get("problem_types", [])[-6:])
+        weights: list[float] = []
         for template in eligible:
+            base_weight = float((family_weights or {}).get(template.problem_type, 1.0))
+            base_weight = max(base_weight, 1e-6)
+            if template.problem_type in recent_types:
+                base_weight *= 0.35
+            weights.append(base_weight)
+        return rng.choices(eligible, weights=weights, k=1)[0]
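Note on the hunk above: template selection is now a weighted draw in which recently used families are down-weighted by a factor of 0.35. A minimal sketch of the same idea, assuming plain string family names instead of `ProblemTemplate` objects; `pick_family` is a hypothetical name that does not appear in the repository.

```python
import random


def pick_family(
    families: list[str],
    family_weights: dict[str, float],
    recent: list[str],
    rng: random.Random,
) -> str:
    """Hypothetical stand-in for the weighting loop in _choose_template."""
    weights = []
    for family in families:
        # Missing families default to 1.0; the floor keeps every weight positive.
        weight = max(float(family_weights.get(family, 1.0)), 1e-6)
        if family in recent:
            weight *= 0.35  # same recency penalty as in the diff
        weights.append(weight)
    return rng.choices(families, weights=weights, k=1)[0]


rng = random.Random(0)
families = ["two_sum_count", "balanced_brackets"]
draws = [pick_family(families, {}, ["two_sum_count"], rng) for _ in range(2000)]
share_recent = draws.count("two_sum_count") / len(draws)
```

With two equally weighted families and one of them recent, the recent family is expected to be drawn about 0.35 / 1.35 of the time, roughly 26%.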
     def _rng_for(
         self,
         tier: int,
         history: dict[str, Any],
         problem_id: str | None,
+        family_weights: dict[str, float],
     ) -> random.Random:
+        if not self.deterministic:
+            return random.Random()
         seed_material = {
             "tier": tier,
             "problem_id": problem_id or "",
             "pass_rates": [round(float(value), 4) for value in history.get("recent_pass_rates", [])[-8:]],
             "problem_types": list(history.get("problem_types", [])[-8:]),
             "episode_index": int(history.get("episode_index", 0)),
+            "family_weights": {
+                key: round(float(value), 4)
+                for key, value in sorted(family_weights.items())
+            },
         }
         digest = hashlib.sha256(repr(seed_material).encode("utf-8")).hexdigest()
         return random.Random(int(digest[:16], 16))
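Note on the hunk above: in deterministic mode the RNG seed is derived by hashing the `repr` of the seed material. A minimal sketch of that derivation, assuming a simplified seed dict; `seeded_rng` is a hypothetical name.

```python
import hashlib
import random


def seeded_rng(seed_material: dict) -> random.Random:
    """Derive a reproducible RNG from a dict, mirroring the hashing step in _rng_for."""
    digest = hashlib.sha256(repr(seed_material).encode("utf-8")).hexdigest()
    return random.Random(int(digest[:16], 16))


material = {"tier": 2, "problem_id": "", "episode_index": 7}
first = seeded_rng(material)
second = seeded_rng(dict(material))  # equal content, same key order
same_stream = [first.random() for _ in range(5)] == [second.random() for _ in range(5)]

# Changing any field yields a different seed and therefore a different stream.
other = seeded_rng({**material, "episode_index": 8})
```

`repr` of a dict follows key insertion order, which is presumably why the diff builds `family_weights` from `sorted(family_weights.items())` before hashing.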
             title="Sum Even Numbers",
             input_format="The first line contains n. The second line contains n space-separated integers.",
             constraints="1 <= n <= 12; -100 <= values[i] <= 100",
+            statement_builder=lambda: (
                 "Given a list of integers, print the sum of the numbers that are even. "
                 "If no number is even, print 0."
             ),
             title="Range Span",
             input_format="The first line contains n. The second line contains n space-separated integers.",
             constraints="2 <= n <= 12; -100 <= values[i] <= 100",
+            statement_builder=lambda: (
                 "Given a list of integers, print the difference between the maximum and minimum value."
             ),
             solver=_solve_range_span,
             case_builder=_build_range_span_cases,
         ),
+        ProblemTemplate(
+            problem_type="count_vowels",
+            difficulty_tier=1,
+            title="Count Vowels",
+            input_format="A single line containing lowercase or uppercase letters and spaces.",
+            constraints="1 <= line length <= 80",
+            statement_builder=lambda: (
+                "Count how many vowels appear in the input line. Treat a, e, i, o, u as vowels "
+                "and ignore case."
+            ),
+            solver=_solve_count_vowels,
+            case_builder=_build_count_vowels_cases,
+        ),
+        ProblemTemplate(
+            problem_type="max_consecutive_ones",
+            difficulty_tier=1,
+            title="Max Consecutive Ones",
+            input_format="A single line containing a binary string.",
+            constraints="1 <= string length <= 40",
+            statement_builder=lambda: (
+                "Print the length of the longest contiguous block of '1' characters in the binary string."
+            ),
+            solver=_solve_max_consecutive_ones,
+            case_builder=_build_max_consecutive_ones_cases,
+        ),
+        ProblemTemplate(
+            problem_type="fizzbuzz_variant",
+            difficulty_tier=1,
+            title="FizzBuzz Variant",
+            input_format="The first line contains n a b. The second line contains label_a and label_b.",
+            constraints="1 <= n <= 25; 2 <= a, b <= 9; labels contain only letters",
+            statement_builder=lambda: (
+                "For each integer from 1 to n, print label_a if the number is divisible only by a, "
+                "label_b if it is divisible only by b, and the concatenation label_a+label_b if it is divisible "
+                "by both. Otherwise print the number itself. Output all tokens on one line separated by spaces."
+            ),
+            solver=_solve_fizzbuzz_variant,
+            case_builder=_build_fizzbuzz_variant_cases,
+        ),
+        ProblemTemplate(
+            problem_type="running_total",
+            difficulty_tier=1,
+            title="Running Total",
+            input_format="The first line contains n. The second line contains n space-separated integers.",
+            constraints="1 <= n <= 14; -50 <= values[i] <= 50",
+            statement_builder=lambda: (
+                "Print the running total after each element of the array. Output the cumulative sums on one line "
+                "separated by spaces."
+            ),
+            solver=_solve_running_total,
+            case_builder=_build_running_total_cases,
+        ),
         ProblemTemplate(
             problem_type="count_local_peaks",
             difficulty_tier=2,
             title="Count Local Peaks",
             input_format="The first line contains n. The second line contains n space-separated integers.",
+            constraints="3 <= n <= 16; -100 <= values[i] <= 100",
+            statement_builder=lambda: (
                 "Count how many indices i are local peaks, meaning values[i] is strictly greater than both "
                 "values[i-1] and values[i+1]. The first and last element can never be peaks."
             ),
             difficulty_tier=2,
             title="Longest Non-Decreasing Run",
             input_format="The first line contains n. The second line contains n space-separated integers.",
+            constraints="1 <= n <= 18; -100 <= values[i] <= 100",
+            statement_builder=lambda: (
                 "Find the length of the longest contiguous subarray whose values are non-decreasing."
             ),
             solver=_solve_longest_non_decreasing_run,
             case_builder=_build_run_cases,
         ),
+        ProblemTemplate(
+            problem_type="two_sum_count",
+            difficulty_tier=2,
+            title="Two Sum Count",
+            input_format="The first line contains n and target. The second line contains n space-separated integers.",
+            constraints="2 <= n <= 16; -50 <= values[i] <= 50",
+            statement_builder=lambda: (
+                "Count how many index pairs (i, j) with i < j have values[i] + values[j] equal to target."
+            ),
+            solver=_solve_two_sum_count,
+            case_builder=_build_two_sum_count_cases,
+        ),
+        ProblemTemplate(
+            problem_type="max_subarray_sum",
+            difficulty_tier=2,
+            title="Maximum Subarray Sum",
+            input_format="The first line contains n. The second line contains n space-separated integers.",
+            constraints="1 <= n <= 18; -50 <= values[i] <= 50",
+            statement_builder=lambda: (
+                "Print the maximum possible sum of a contiguous subarray."
+            ),
+            solver=_solve_max_subarray_sum,
+            case_builder=_build_max_subarray_sum_cases,
+        ),
+        ProblemTemplate(
+            problem_type="group_anagrams_count",
+            difficulty_tier=2,
+            title="Group Anagrams Count",
+            input_format="The first line contains n. The second line contains n space-separated lowercase words.",
+            constraints="1 <= n <= 12; each word length is between 1 and 8",
+            statement_builder=lambda: (
+                "Group words that are anagrams of each other. Print the number of distinct anagram groups."
+            ),
+            solver=_solve_group_anagrams_count,
+            case_builder=_build_group_anagrams_cases,
+        ),
+        ProblemTemplate(
+            problem_type="balanced_brackets",
+            difficulty_tier=2,
+            title="Balanced Brackets",
+            input_format="A single line containing only the characters ()[]{}.",
+            constraints="1 <= line length <= 50",
+            statement_builder=lambda: (
+                "Print YES if the bracket string is balanced and NO otherwise."
+            ),
+            solver=_solve_balanced_brackets,
+            case_builder=_build_balanced_brackets_cases,
+        ),
+        ProblemTemplate(
+            problem_type="matrix_diagonal_sum",
+            difficulty_tier=2,
+            title="Matrix Diagonal Sum",
+            input_format="The first line contains n. The next n lines each contain n space-separated integers.",
+            constraints="2 <= n <= 6; -20 <= matrix[i][j] <= 20",
+            statement_builder=lambda: (
+                "For the square matrix, print the sum of the primary diagonal and secondary diagonal. "
+                "If n is odd, count the center element only once."
+            ),
+            solver=_solve_matrix_diagonal_sum,
+            case_builder=_build_matrix_diagonal_sum_cases,
+        ),
         ProblemTemplate(
             problem_type="smallest_most_frequent",
             difficulty_tier=3,
             title="Smallest Most Frequent",
             input_format="The first line contains n. The second line contains n space-separated integers.",
+            constraints="1 <= n <= 20; -30 <= values[i] <= 30",
+            statement_builder=lambda: (
                 "Print the value that appears most often in the array. If several values have the same highest "
                 "frequency, print the smallest of them."
             ),
             difficulty_tier=3,
             title="Reverse Words",
             input_format="A single line containing one or more words separated by spaces.",
+            constraints="1 <= line length <= 120",
+            statement_builder=lambda: (
                 "Read a line of text and print the words in reverse order. Multiple spaces in the input should "
                 "be treated as a single separator."
             ),
             solver=_solve_reverse_words,
             case_builder=_build_reverse_word_cases,
         ),
+        ProblemTemplate(
+            problem_type="longest_common_subsequence",
+            difficulty_tier=3,
+            title="Longest Common Subsequence",
+            input_format="The first line contains string s. The second line contains string t.",
+            constraints="1 <= len(s), len(t) <= 18; strings contain lowercase letters",
+            statement_builder=lambda: (
+                "Print the length of the longest common subsequence of the two strings."
+            ),
+            solver=_solve_longest_common_subsequence,
+            case_builder=_build_lcs_cases,
+        ),
+        ProblemTemplate(
+            problem_type="word_ladder_steps",
+            difficulty_tier=3,
+            title="Word Ladder Steps",
+            input_format="The first line contains start and target. The second line contains n. The third line contains n space-separated words.",
+            constraints="All words have the same length between 3 and 5; 1 <= n <= 14",
+            statement_builder=lambda: (
+                "You may change one character at a time. Every intermediate word and the target word must appear "
+                "in the given word list. Print the minimum number of single-character changes needed to transform "
+                "start into target, or -1 if it is impossible."
+            ),
+            solver=_solve_word_ladder_steps,
+            case_builder=_build_word_ladder_cases,
+        ),
+        ProblemTemplate(
+            problem_type="merge_intervals",
+            difficulty_tier=3,
+            title="Merge Intervals",
+            input_format="The first line contains n. The next n lines each contain start and end.",
+            constraints="1 <= n <= 12; -20 <= start <= end <= 30",
+            statement_builder=lambda: (
+                "Merge all overlapping intervals and print how many intervals remain after merging."
+            ),
+            solver=_solve_merge_intervals,
+            case_builder=_build_merge_intervals_cases,
+        ),
+        ProblemTemplate(
+            problem_type="min_coins",
+            difficulty_tier=3,
+            title="Minimum Coins",
+            input_format="The first line contains n and target. The second line contains n distinct positive coin values.",
+            constraints="1 <= n <= 8; 1 <= target <= 40; 1 <= coin values <= 20",
+            statement_builder=lambda: (
+                "Print the minimum number of coins needed to make exactly target using unlimited copies of the given "
+                "coin values. Print -1 if it is impossible."
+            ),
+            solver=_solve_min_coins,
+            case_builder=_build_min_coins_cases,
+        ),
+        ProblemTemplate(
+            problem_type="rotate_matrix_90",
+            difficulty_tier=3,
+            title="Rotate Matrix 90 Degrees",
+            input_format="The first line contains n. The next n lines each contain n space-separated integers.",
+            constraints="2 <= n <= 5; -20 <= matrix[i][j] <= 20",
+            statement_builder=lambda: (
+                "Rotate the square matrix 90 degrees clockwise and print the rotated matrix flattened in row-major "
+                "order on one line separated by spaces."
+            ),
+            solver=_solve_rotate_matrix_90,
+            case_builder=_build_rotate_matrix_cases,
+        ),
+    ]
+def _build_sum_even_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        _array_case([2, 3, 4]),
+        _array_case([1, 3, 5, 7]),
+        _array_case([0, -2, 5, 8]),
+    ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _sum_even_hidden_case)
+def _sum_even_hidden_case(rng: random.Random) -> str:
+    length = rng.randint(5, 12)
+    numbers = [rng.randint(-50, 50) for _ in range(length)]
+    if all(number % 2 for number in numbers):
+        numbers[rng.randrange(length)] = rng.choice([-8, -2, 0, 6, 14])
+    return _array_case(numbers)
+def _build_range_span_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        _array_case([1, 4, 9]),
+        _array_case([-2, -2, -2, 1]),
+        _array_case([8, 3]),
+    ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _range_span_hidden_case)
+def _range_span_hidden_case(rng: random.Random) -> str:
+    length = rng.randint(4, 12)
+    numbers = [rng.randint(-60, 60) for _ in range(length)]
+    if len(set(numbers)) == 1:
+        numbers[-1] += 5
+    return _array_case(numbers)
+def _build_count_vowels_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        "hello world\n",
+        "sky\n",
+        "AEIOU\n",
+    ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _count_vowels_hidden_case)
+def _count_vowels_hidden_case(rng: random.Random) -> str:
+    word_bank = [
+        "algorithm",
+        "queue",
+        "stack",
+        "binary",
+        "graph",
+        "open env",
+        "unit test",
+        "dynamic programming",
+        "vowel heavy area",
+        "crypt rhythm",
+    ]
+    parts = [rng.choice(word_bank) for _ in range(rng.randint(1, 3))]
+    text = " ".join(parts)
+    return f"{text[:80]}\n"
+def _build_max_consecutive_ones_cases(rng: random.Random) -> list[str]:
+    visible_pool = ["1101110\n", "00000\n", "1\n"]
+    return _cases_from_pool_and_factory(rng, visible_pool, _max_consecutive_ones_hidden_case)
+def _max_consecutive_ones_hidden_case(rng: random.Random) -> str:
+    length = rng.randint(8, 40)
+    chars = [rng.choice(["0", "1"]) for _ in range(length)]
+    if "1" not in chars:
+        start = rng.randint(0, max(0, length - 3))
+        run_length = rng.randint(1, min(5, length - start))
+        for index in range(start, start + run_length):
+            chars[index] = "1"
+    return f"{''.join(chars)}\n"
+def _build_fizzbuzz_variant_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        _fizzbuzz_case(8, 3, 5, "Fizz", "Buzz"),
+        _fizzbuzz_case(6, 2, 4, "Hop", "Pop"),
+        _fizzbuzz_case(10, 2, 3, "Up", "Go"),
+    ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _fizzbuzz_hidden_case)
+def _fizzbuzz_hidden_case(rng: random.Random) -> str:
+    labels = [
+        ("Fizz", "Buzz"),
+        ("Ping", "Pong"),
+        ("Hop", "Skip"),
+        ("Alpha", "Beta"),
+        ("Red", "Blue"),
+    ]
+    label_a, label_b = rng.choice(labels)
+    a = rng.randint(2, 6)
+    b = rng.randint(2, 6)
+    while b == a:
+        b = rng.randint(2, 6)
+    n = rng.randint(10, 25)
+    return _fizzbuzz_case(n, a, b, label_a, label_b)
+def _build_running_total_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        _array_case([1, 2, 3, 4]),
+        _array_case([5, -2, 7]),
+        _array_case([0, 0, 1]),
+    ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _running_total_hidden_case)
+def _running_total_hidden_case(rng: random.Random) -> str:
+    length = rng.randint(5, 14)
+    numbers = [rng.randint(-20, 20) for _ in range(length)]
+    return _array_case(numbers)
+def _build_peak_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        _array_case([1, 3, 2, 4, 1]),
+        _array_case([5, 4, 3, 2, 1]),
+        _array_case([2, 5, 1, 5, 2]),
+    ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _peak_hidden_case)
+def _peak_hidden_case(rng: random.Random) -> str:
+    length = rng.randint(6, 16)
+    numbers = [rng.randint(-20, 20)]
+    for _ in range(length - 1):
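+        # Clamp the random walk so values stay within the stated -100..100 constraint.
+        numbers.append(max(-100, min(100, numbers[-1] + rng.randint(-8, 8))))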
+    return _array_case(numbers)
+def _build_run_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        _array_case([1, 2, 2, 1, 3]),
+        _array_case([5, 4, 3, 2]),
+        _array_case([1, 1, 1, 1]),
+    ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _run_hidden_case)
+def _run_hidden_case(rng: random.Random) -> str:
+    length = rng.randint(6, 18)
+    numbers = [rng.randint(-20, 20)]
+    for _ in range(length - 1):
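+        # Clamp the random walk so values stay within the stated -100..100 constraint.
+        numbers.append(max(-100, min(100, numbers[-1] + rng.randint(-6, 6))))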
+    return _array_case(numbers)
+def _build_two_sum_count_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        _target_array_case(5, [1, 2, 3, 4]),
+        _target_array_case(2, [1, 1, 1, 1]),
+        _target_array_case(0, [-1, 1, 2, -2]),
+    ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _two_sum_hidden_case)
+def _two_sum_hidden_case(rng: random.Random) -> str:
+    length = rng.randint(5, 16)
+    numbers = [rng.randint(-12, 12) for _ in range(length)]
+    target = rng.randint(-10, 10)
+    return _target_array_case(target, numbers)
+def _build_max_subarray_sum_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        _array_case([1, -2, 3, 4, -1]),
+        _array_case([-5, -1, -8]),
+        _array_case([2, -1, 2, 3, 4, -5]),
     ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _max_subarray_hidden_case)
+def _max_subarray_hidden_case(rng: random.Random) -> str:
+    length = rng.randint(6, 18)
+    numbers = [rng.randint(-20, 20) for _ in range(length)]
+    return _array_case(numbers)
+def _build_group_anagrams_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        _word_list_case(["eat", "tea", "tan", "ate", "nat", "bat"]),
+        _word_list_case(["abc", "bca", "cab", "foo"]),
+        _word_list_case(["a", "b", "ab", "ba"]),
+    ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _group_anagrams_hidden_case)
+def _group_anagrams_hidden_case(rng: random.Random) -> str:
+    base_words = ["stone", "tones", "notes", "silent", "listen", "enlist", "rat", "tar", "art"]
+    words: list[str] = []
+    for _ in range(rng.randint(4, 10)):
+        word = rng.choice(base_words)
+        if rng.random() < 0.4:
+            shuffled = list(word)
+            rng.shuffle(shuffled)
+            words.append("".join(shuffled))
+        else:
+            words.append(word)
+    return _word_list_case(words)
+def _build_balanced_brackets_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        "([]{})\n",
+        "([)]\n",
+        "{[()]}\n",
+    ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _balanced_brackets_hidden_case)
+def _balanced_brackets_hidden_case(rng: random.Random) -> str:
+    if rng.random() < 0.5:
+        return f"{_make_balanced_brackets(rng, rng.randint(3, 10))}\n"
+    return f"{_make_unbalanced_brackets(rng, rng.randint(3, 10))}\n"
+def _build_matrix_diagonal_sum_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        _matrix_case([[1, 2], [3, 4]]),
+        _matrix_case([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
+        _matrix_case([[2, 0, 2], [1, 5, 1], [2, 0, 2]]),
+    ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _matrix_diagonal_hidden_case)
+def _matrix_diagonal_hidden_case(rng: random.Random) -> str:
+    size = rng.randint(3, 6)
+    matrix = [[rng.randint(-9, 9) for _ in range(size)] for _ in range(size)]
+    return _matrix_case(matrix)
+def _build_frequency_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        _array_case([1, 2, 2, 3, 3, 3]),
+        _array_case([4, 4, 1, 1]),
+        _array_case([-1, -1, -2, -2, -2, 3]),
+    ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _frequency_hidden_case)
+def _frequency_hidden_case(rng: random.Random) -> str:
+    length = rng.randint(8, 18)  # two duplicates are appended below, so n stays <= 20
+    numbers = [rng.randint(-8, 8) for _ in range(length)]
+    numbers.extend([rng.choice(numbers), rng.choice(numbers)])
+    return _array_case(numbers)
+def _build_reverse_word_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        "hello world here\n",
+        "  graph   search tree \n",
+        "one\n",
+    ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _reverse_words_hidden_case)
+def _reverse_words_hidden_case(rng: random.Random) -> str:
     vocabulary = [
         "graph",
         "queue",
         "node",
         "edge",
     ]
+    words = [rng.choice(vocabulary) for _ in range(rng.randint(4, 9))]
+    spacer = " " * rng.randint(1, 3)
+    prefix = " " * rng.randint(0, 2)
+    suffix = " " * rng.randint(0, 2)
+    return f"{prefix}{spacer.join(words)}{suffix}\n"
+def _build_lcs_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        _two_line_case("abcde", "ace"),
+        _two_line_case("abc", "abc"),
+        _two_line_case("abc", "def"),
+    ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _lcs_hidden_case)
+def _lcs_hidden_case(rng: random.Random) -> str:
+    alphabet = "abcdxyz"
+    left = "".join(rng.choice(alphabet) for _ in range(rng.randint(6, 14)))
+    right = "".join(rng.choice(alphabet) for _ in range(rng.randint(6, 14)))
+    return _two_line_case(left, right)
+def _build_word_ladder_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        _word_ladder_case("hit", "cog", ["hot", "dot", "dog", "lot", "log", "cog"]),
+        _word_ladder_case("same", "same", ["same", "lame", "came"]),
+        _word_ladder_case("cold", "warm", ["cord", "card", "ward", "sold"]),
+    ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _word_ladder_hidden_case)
+def _word_ladder_hidden_case(rng: random.Random) -> str:
+    length = rng.randint(3, 5)
+    if rng.random() < 0.7:
+        path_length = rng.randint(2, 5)
+        path = _build_word_ladder_path(rng, length, path_length)
+        extras = _build_word_ladder_extras(rng, length, rng.randint(2, 7), set(path))
+        words = path[1:] + extras
+        rng.shuffle(words)
+        return _word_ladder_case(path[0], path[-1], words)
+    start = _random_word(rng, length)
+    target = _random_word(rng, length)
+    while target == start:
+        target = _random_word(rng, length)
+    extras = _build_word_ladder_extras(rng, length, rng.randint(4, 10), {start, target})
+    return _word_ladder_case(start, target, extras)
+def _build_merge_intervals_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        _interval_case([(1, 3), (2, 4), (6, 8)]),
+        _interval_case([(1, 2), (3, 4), (5, 6)]),
+        _interval_case([(0, 5), (2, 3), (4, 10)]),
+    ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _merge_intervals_hidden_case)
+def _merge_intervals_hidden_case(rng: random.Random) -> str:
+    intervals = []
+    for _ in range(rng.randint(4, 12)):
+        start = rng.randint(-10, 20)
+        end = start + rng.randint(0, 8)
+        intervals.append((start, end))
+    return _interval_case(intervals)
+def _build_min_coins_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        _coin_case([1, 3, 4], 6),
+        _coin_case([2, 5], 3),
+        _coin_case([2, 5, 7], 14),
+    ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _min_coins_hidden_case)
+def _min_coins_hidden_case(rng: random.Random) -> str:
+    coin_count = rng.randint(2, 6)
+    coins = sorted({rng.randint(1, 10) for _ in range(coin_count + 2)})
+    coins = coins[:coin_count]
+    target = rng.randint(5, 40)
+    return _coin_case(coins, target)
+def _build_rotate_matrix_cases(rng: random.Random) -> list[str]:
+    visible_pool = [
+        _matrix_case([[1, 2], [3, 4]]),
+        _matrix_case([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
+        _matrix_case([[5, 1], [0, -1]]),
+    ]
+    return _cases_from_pool_and_factory(rng, visible_pool, _rotate_matrix_hidden_case)
+def _rotate_matrix_hidden_case(rng: random.Random) -> str:
+    size = rng.randint(2, 5)
+    matrix = [[rng.randint(-9, 9) for _ in range(size)] for _ in range(size)]
+    return _matrix_case(matrix)
+def _cases_from_pool_and_factory(
+    rng: random.Random,
+    visible_pool: list[str],
+    hidden_factory: Callable[[random.Random], str],
+) -> list[str]:
+    cases: list[str] = []
+    seen: set[str] = set()
+    for case_input in rng.sample(visible_pool, k=VISIBLE_TEST_COUNT):
+        cases.append(case_input)
+        seen.add(case_input)
+    attempts = 0
+    while len(cases) < TOTAL_TEST_CASES:
+        candidate = hidden_factory(rng)
+        attempts += 1
+        if candidate in seen:
+            if attempts > 200:
+                raise ValueError("Unable to generate unique test cases.")
+            continue
+        seen.add(candidate)
+        cases.append(candidate)
+    return cases
 def _array_case(numbers: list[int]) -> str:
     return f"{len(numbers)}\n{' '.join(str(number) for number in numbers)}\n"
+def _target_array_case(target: int, numbers: list[int]) -> str:
+    return f"{len(numbers)} {target}\n{' '.join(str(number) for number in numbers)}\n"
+def _word_list_case(words: list[str]) -> str:
+    return f"{len(words)}\n{' '.join(words)}\n"
+def _matrix_case(matrix: list[list[int]]) -> str:
+    rows = [" ".join(str(value) for value in row) for row in matrix]
+    return f"{len(matrix)}\n" + "\n".join(rows) + "\n"
+def _two_line_case(first: str, second: str) -> str:
+    return f"{first}\n{second}\n"
+def _interval_case(intervals: list[tuple[int, int]]) -> str:
+    rows = [f"{start} {end}" for start, end in intervals]
+    return f"{len(intervals)}\n" + "\n".join(rows) + "\n"
+def _coin_case(coins: list[int], target: int) -> str:
+    return f"{len(coins)} {target}\n{' '.join(str(coin) for coin in coins)}\n"
+def _fizzbuzz_case(n: int, a: int, b: int, label_a: str, label_b: str) -> str:
+    return f"{n} {a} {b}\n{label_a} {label_b}\n"
+def _word_ladder_case(start: str, target: str, words: list[str]) -> str:
+    return f"{start} {target}\n{len(words)}\n{' '.join(words)}\n"
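Note on the helpers above: each `_*_case` function defines the stdin format a submitted solution must read. A minimal round-trip sketch for the array format follows. `array_case` copies `_array_case` from the diff, while `parse_array_case` is a hypothetical inverse; the repository's own `_parse_int_array` is not shown in this view and may behave differently.

```python
def array_case(numbers: list[int]) -> str:
    """Same format as _array_case in the diff: a count line, then a values line."""
    return f"{len(numbers)}\n{' '.join(str(number) for number in numbers)}\n"


def parse_array_case(stdin: str) -> tuple[int, list[int]]:
    """Hypothetical inverse of array_case, with a length consistency check."""
    lines = stdin.strip().splitlines()
    count = int(lines[0])
    numbers = [int(token) for token in lines[1].split()] if len(lines) > 1 else []
    if count != len(numbers):
        raise ValueError("declared length does not match the number of values")
    return count, numbers


encoded = array_case([2, 3, 4])
count, numbers = parse_array_case(encoded)
```

Here `encoded` is `"3\n2 3 4\n"`, and parsing it returns the original count and values.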
 def _solve_sum_even_numbers(stdin: str) -> str:
     _, numbers = _parse_int_array(stdin)
     return str(sum(number for number in numbers if number % 2 == 0))
     return str(max(numbers) - min(numbers))
+def _solve_count_vowels(stdin: str) -> str:
+    text = stdin.rstrip("\n")
+    return str(sum(1 for char in text.lower() if char in "aeiou"))
+def _solve_max_consecutive_ones(stdin: str) -> str:
+    binary = stdin.strip()
+    best = 0
+    current = 0
+    for char in binary:
+        if char == "1":
+            current += 1
+            best = max(best, current)
+        else:
+            current = 0
+    return str(best)
+def _solve_fizzbuzz_variant(stdin: str) -> str:
+    (n, a, b), (label_a, label_b) = _parse_fizzbuzz(stdin)
+    output = []
+    for value in range(1, n + 1):
+        token = ""
+        if value % a == 0:
+            token += label_a
+        if value % b == 0:
+            token += label_b
+        output.append(token or str(value))
+    return " ".join(output)
+def _solve_running_total(stdin: str) -> str:
+    _, numbers = _parse_int_array(stdin)
+    total = 0
+    running = []
+    for number in numbers:
+        total += number
+        running.append(str(total))
+    return " ".join(running)
 def _solve_count_local_peaks(stdin: str) -> str:
     _, numbers = _parse_int_array(stdin)
     peaks = 0
     return str(best)
+def _solve_two_sum_count(stdin: str) -> str:
+    _, target, numbers = _parse_target_array(stdin)
+    counts: Counter[int] = Counter()
+    pairs = 0
+    for number in numbers:
+        pairs += counts[target - number]
+        counts[number] += 1
+    return str(pairs)
+def _solve_max_subarray_sum(stdin: str) -> str:
+    _, numbers = _parse_int_array(stdin)
+    best = numbers[0]
+    current = numbers[0]
+    for number in numbers[1:]:
+        current = max(number, current + number)
+        best = max(best, current)
+    return str(best)
+def _solve_group_anagrams_count(stdin: str) -> str:
+    _, words = _parse_word_list(stdin)
+    groups = {"".join(sorted(word)) for word in words}
+    return str(len(groups))
+def _solve_balanced_brackets(stdin: str) -> str:
+    text = stdin.strip()
+    pairs = {")": "(", "]": "[", "}": "{"}
+    stack: list[str] = []
+    for char in text:
+        if char in "([{":
+            stack.append(char)
+        elif char in pairs:
+            if not stack or stack.pop() != pairs[char]:
+                return "NO"
+    return "YES" if not stack else "NO"
+def _solve_matrix_diagonal_sum(stdin: str) -> str:
+    _, matrix = _parse_matrix(stdin)
+    total = 0
+    size = len(matrix)
+    for index in range(size):
+        total += matrix[index][index]
+        mirrored = size - 1 - index
+        if mirrored != index:
+            total += matrix[index][mirrored]
+    return str(total)
 def _solve_smallest_most_frequent(stdin: str) -> str:
     _, numbers = _parse_int_array(stdin)
+    counts = Counter(numbers)
     best_count = max(counts.values())
     best_value = min(number for number, count in counts.items() if count == best_count)
     return str(best_value)
     return " ".join(reversed(words))
+def _solve_longest_common_subsequence(stdin: str) -> str:
+    left, right = _parse_two_strings(stdin)
+    dp = [[0] * (len(right) + 1) for _ in range(len(left) + 1)]
+    for i in range(1, len(left) + 1):
+        for j in range(1, len(right) + 1):
+            if left[i - 1] == right[j - 1]:
+                dp[i][j] = dp[i - 1][j - 1] + 1
+            else:
+                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
+    return str(dp[-1][-1])
+def _solve_word_ladder_steps(stdin: str) -> str:
+    start, target, words = _parse_word_ladder(stdin)
+    if start == target:
+        return "0"
+    word_set = set(words)
+    if target not in word_set:
+        return "-1"
+    queue: deque[tuple[str, int]] = deque([(start, 0)])
+    visited = {start}
+    alphabet = "abcdefghijklmnopqrstuvwxyz"
+    while queue:
+        current, steps = queue.popleft()
+        for index in range(len(current)):
+            for letter in alphabet:
+                if letter == current[index]:
+                    continue
+                candidate = current[:index] + letter + current[index + 1 :]
+                if candidate == target:
+                    return str(steps + 1)
+                if candidate in word_set and candidate not in visited:
+                    visited.add(candidate)
+                    queue.append((candidate, steps + 1))
+    return "-1"
+def _solve_merge_intervals(stdin: str) -> str:
+    intervals = _parse_intervals(stdin)
+    ordered = sorted(intervals)
+    merged: list[list[int]] = []
+    for start, end in ordered:
+        if not merged or start > merged[-1][1]:
+            merged.append([start, end])
+        else:
+            merged[-1][1] = max(merged[-1][1], end)
+    return str(len(merged))
+def _solve_min_coins(stdin: str) -> str:
+    _, target, coins = _parse_coin_problem(stdin)
+    best = [target + 1] * (target + 1)
+    best[0] = 0
+    for value in range(1, target + 1):
+        for coin in coins:
+            if coin <= value:
+                best[value] = min(best[value], best[value - coin] + 1)
+    return str(best[target] if best[target] <= target else -1)
+def _solve_rotate_matrix_90(stdin: str) -> str:
+    _, matrix = _parse_matrix(stdin)
+    size = len(matrix)
+    rotated = [[matrix[size - 1 - row][col] for row in range(size)] for col in range(size)]
+    flattened = [str(value) for row in rotated for value in row]
+    return " ".join(flattened)
 def _parse_int_array(stdin: str) -> tuple[int, list[int]]:
     lines = [line.strip() for line in stdin.strip().splitlines() if line.strip()]
     n = int(lines[0])
     if len(numbers) != n:
         raise ValueError(f"Expected {n} integers, received {len(numbers)}")
     return n, numbers
+def _parse_target_array(stdin: str) -> tuple[int, int, list[int]]:
+    lines = [line.strip() for line in stdin.strip().splitlines() if line.strip()]
+    n, target = map(int, lines[0].split())
+    numbers = [int(part) for part in lines[1].split()]
+    if len(numbers) != n:
+        raise ValueError(f"Expected {n} integers, received {len(numbers)}")
+    return n, target, numbers
+def _parse_word_list(stdin: str) -> tuple[int, list[str]]:
+    lines = [line.strip() for line in stdin.strip().splitlines() if line.strip()]
+    n = int(lines[0])
+    words = lines[1].split()
+    if len(words) != n:
+        raise ValueError(f"Expected {n} words, received {len(words)}")
+    return n, words
+def _parse_matrix(stdin: str) -> tuple[int, list[list[int]]]:
+    lines = [line.strip() for line in stdin.strip().splitlines() if line.strip()]
+    n = int(lines[0])
+    matrix = [[int(part) for part in line.split()] for line in lines[1 : n + 1]]
+    if len(matrix) != n or any(len(row) != n for row in matrix):
+        raise ValueError("Matrix dimensions do not match n.")
+    return n, matrix
+def _parse_two_strings(stdin: str) -> tuple[str, str]:
+    lines = stdin.strip().splitlines()
+    if len(lines) < 2:
+        raise ValueError("Expected two lines of text.")
+    return lines[0].strip(), lines[1].strip()
+def _parse_fizzbuzz(stdin: str) -> tuple[tuple[int, int, int], tuple[str, str]]:
+    lines = [line.strip() for line in stdin.strip().splitlines() if line.strip()]
+    n, a, b = map(int, lines[0].split())
+    label_a, label_b = lines[1].split()
+    return (n, a, b), (label_a, label_b)
+def _parse_word_ladder(stdin: str) -> tuple[str, str, list[str]]:
+    lines = [line.strip() for line in stdin.strip().splitlines() if line.strip()]
+    start, target = lines[0].split()
+    n = int(lines[1])
+    words = lines[2].split()
+    if len(words) != n:
+        raise ValueError(f"Expected {n} words, received {len(words)}")
+    return start, target, words
+def _parse_intervals(stdin: str) -> list[tuple[int, int]]:
+    lines = [line.strip() for line in stdin.strip().splitlines() if line.strip()]
+    n = int(lines[0])
+    intervals = [tuple(map(int, line.split())) for line in lines[1 : n + 1]]
+    if len(intervals) != n:
+        raise ValueError("Interval count does not match n.")
+    return [(start, end) for start, end in intervals]
+def _parse_coin_problem(stdin: str) -> tuple[int, int, list[int]]:
+    lines = [line.strip() for line in stdin.strip().splitlines() if line.strip()]
+    n, target = map(int, lines[0].split())
+    coins = [int(part) for part in lines[1].split()]
+    if len(coins) != n:
+        raise ValueError(f"Expected {n} coins, received {len(coins)}")
+    return n, target, coins
+def _make_balanced_brackets(rng: random.Random, pairs: int) -> str:
+    opens = ["(", "[", "{"]
+    closing = {"(": ")", "[": "]", "{": "}"}
+    stack: list[str] = []
+    output: list[str] = []
+    for _ in range(pairs * 2):
+        can_open = len(stack) < pairs and (not stack or rng.random() < 0.6)
+        if can_open:
+            token = rng.choice(opens)
+            stack.append(token)
+            output.append(token)
+        else:
+            output.append(closing[stack.pop()])
+    while stack:
+        output.append(closing[stack.pop()])
+    return "".join(output)
+def _make_unbalanced_brackets(rng: random.Random, pairs: int) -> str:
+    text = list(_make_balanced_brackets(rng, pairs))
+    if not text:
+        return "("
+    mode = rng.choice(["swap", "drop", "flip"])
+    if mode == "swap" and len(text) >= 2:
+        index = rng.randrange(len(text) - 1)
+        text[index], text[index + 1] = text[index + 1], text[index]
+    elif mode == "drop":
+        del text[rng.randrange(len(text))]
+    else:
+        replacements = ["(", ")", "[", "]", "{", "}"]
+        text[rng.randrange(len(text))] = rng.choice(replacements)
+    return "".join(text)
+def _random_word(rng: random.Random, length: int) -> str:
+    alphabet = "abcdefghijklmnopqrstuvwxyz"
+    return "".join(rng.choice(alphabet) for _ in range(length))
+def _build_word_ladder_path(rng: random.Random, length: int, steps: int) -> list[str]:
+    alphabet = "abcdefghijklmnopqrstuvwxyz"
+    current = _random_word(rng, length)
+    path = [current]
+    used = {current}
+    while len(path) < steps + 1:
+        chars = list(path[-1])
+        index = rng.randrange(length)
+        replacement = rng.choice(alphabet.replace(chars[index], ""))
+        chars[index] = replacement
+        candidate = "".join(chars)
+        if candidate in used:
+            continue
+        used.add(candidate)
+        path.append(candidate)
+    return path
+def _build_word_ladder_extras(
+    rng: random.Random,
+    length: int,
+    count: int,
+    disallowed: set[str],
+) -> list[str]:
+    words: list[str] = []
+    seen = set(disallowed)
+    while len(words) < count:
+        candidate = _random_word(rng, length)
+        if candidate in seen:
+            continue
+        seen.add(candidate)
+        words.append(candidate)
+    return words
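
Review note: the coin-change solver added above can be sanity-checked in isolation. The sketch below is a standalone re-implementation of the same recurrence and the same two-line stdin layout (`n target` on the first line, the coin values on the second). The name `min_coins_from_stdin` is hypothetical and exists only for this illustration; it is not part of the commit.

```python
def min_coins_from_stdin(stdin: str) -> str:
    """Standalone sketch of the coin-change DP (hypothetical helper, not in the repo)."""
    lines = [line.strip() for line in stdin.strip().splitlines() if line.strip()]
    n, target = map(int, lines[0].split())
    coins = [int(part) for part in lines[1].split()]
    if len(coins) != n:
        raise ValueError(f"Expected {n} coins, received {len(coins)}")
    # best[v] is the fewest coins summing to v; target + 1 stands for "unreachable".
    best = [target + 1] * (target + 1)
    best[0] = 0
    for value in range(1, target + 1):
        for coin in coins:
            if coin <= value:
                best[value] = min(best[value], best[value - coin] + 1)
    return str(best[target] if best[target] <= target else -1)


print(min_coins_from_stdin("3 11\n1 2 5\n"))  # 3  (5 + 5 + 1)
print(min_coins_from_stdin("1 3\n2\n"))       # -1 (3 cannot be built from 2s)
```

The `target + 1` sentinel works because, with positive coin values, no reachable amount ever needs more than `target` coins.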

env/test_cases.py CHANGED

@@ -2,9 +2,7 @@ from __future__ import annotations
 from typing import Any
-from env.generator import DIFFICULTY_LABELS, GeneratorAgent
-VISIBLE_TEST_COUNT = 0
 def load_problem_bank() -> list[dict[str, Any]]:

 from typing import Any
+from env.generator import DIFFICULTY_LABELS, GeneratorAgent, VISIBLE_TEST_COUNT
 def load_problem_bank() -> list[dict[str, Any]]:

models.py CHANGED

@@ -18,17 +18,22 @@ except ImportError:
         episode_id: str = ""
         step_count: int = 0
-from pydantic import Field
 class AdaptAction(Action):
     code: str = Field(..., min_length=1, description="Python code to execute.")
 class AdaptObservation(Observation):
     problem_id: str = Field(default="", description="Current problem identifier.")
     problem_type: str = Field(default="", description="Current generated problem family.")
     difficulty: str = Field(default="", description="Current curriculum difficulty tier.")
     problem: str = Field(default="", description="Problem statement shown to the agent.")
     input_format: str = Field(default="", description="Expected stdin format.")
     constraints: str = Field(default="", description="Problem constraints.")
@@ -48,14 +53,17 @@ class AdaptObservation(Observation):
 class AdaptState(State):
     problem_id: str = Field(default="")
     problem_type: str = Field(default="")
     difficulty: str = Field(default="")
     generator_mode: str = Field(default="heuristic")
-    generated_problem: dict[str, str] = Field(default_factory=dict)
     last_reward: float = Field(default=0.0)
     last_pass_rate: float = Field(default=0.0, ge=0.0, le=1.0)
     last_feedback: str = Field(default="")
     generator_reward_signal: float = Field(default=0.0)
     history: dict[str, Any] = Field(default_factory=dict)
     recent_metrics: dict[str, Any] = Field(default_factory=dict)

         episode_id: str = ""
         step_count: int = 0
 class AdaptAction(Action):
+    session_id: str = Field(
+        default="",
+        description="Environment session id for server-routed calls.",
+    )
     code: str = Field(..., min_length=1, description="Python code to execute.")
 class AdaptObservation(Observation):
+    session_id: str = Field(default="", description="Session id for the active environment instance.")
     problem_id: str = Field(default="", description="Current problem identifier.")
     problem_type: str = Field(default="", description="Current generated problem family.")
     difficulty: str = Field(default="", description="Current curriculum difficulty tier.")
+    attempt_number: int = Field(default=0, ge=0, description="1-indexed attempt number within the episode.")
+    max_steps: int = Field(default=3, ge=1, description="Maximum attempts allowed for the episode.")
     problem: str = Field(default="", description="Problem statement shown to the agent.")
     input_format: str = Field(default="", description="Expected stdin format.")
     constraints: str = Field(default="", description="Problem constraints.")
 class AdaptState(State):
+    session_id: str = Field(default="")
     problem_id: str = Field(default="")
     problem_type: str = Field(default="")
     difficulty: str = Field(default="")
     generator_mode: str = Field(default="heuristic")
+    max_steps: int = Field(default=3, ge=1)
+    generated_problem: dict[str, Any] = Field(default_factory=dict)
     last_reward: float = Field(default=0.0)
     last_pass_rate: float = Field(default=0.0, ge=0.0, le=1.0)
     last_feedback: str = Field(default="")
+    last_execution_status: str = Field(default="ready")
     generator_reward_signal: float = Field(default=0.0)
     history: dict[str, Any] = Field(default_factory=dict)
     recent_metrics: dict[str, Any] = Field(default_factory=dict)

openenv.yaml CHANGED

@@ -1,35 +1,7 @@
 spec_version: 1
-name: adapt_dsa_tutor
-type: space
 runtime: fastapi
 app: server.app:app
 port: 7860
-description: "ADAPT: an adversarial DSA tutor environment for RLVR code generation with hidden tests, tiered problems, and anti-hacking reward signals."
-version: "0.2.0"
-observation_space:
-  type: dict
-  description: "Problem prompt, examples, visible tests, difficulty metadata, reward, pass rates, execution status, and feedback."
-action_space:
-  type: dict
-  description: "AdaptAction with a Python code string submitted for stdin/stdout evaluation."
-reward_range: [0.0, 1.0]
-tasks:
-  - name: easy_double
-    description: "Easy arithmetic stdin/stdout problem."
-    difficulty: easy
-  - name: easy_sum_two
-    description: "Easy two-integer arithmetic problem."
-    difficulty: easy
-  - name: medium_maximum
-    description: "Medium array scanning problem."
-    difficulty: medium
-  - name: medium_count_even
-    description: "Medium counting problem over a list."
-    difficulty: medium
-  - name: hard_reverse_words
-    description: "Harder string normalization and ordering problem."
-    difficulty: hard

 spec_version: 1
+name: adapt-dsa-tutor
+version: "0.3.0"
 runtime: fastapi
 app: server.app:app
 port: 7860
+description: "Adversarial DSA Programming Tutor - RL environment for training LLMs to solve algorithmic problems through adaptive curriculum and self-repair"

scripts/test_env.py CHANGED

@@ -7,7 +7,7 @@ ROOT = Path(__file__).resolve().parents[1]
 if str(ROOT) not in sys.path:
     sys.path.insert(0, str(ROOT))
-from env.adapt_env import AdaptEnvironment
 from env.generator import GeneratorAgent
 from models import AdaptAction
@@ -23,10 +23,12 @@ def main() -> None:
     env = AdaptEnvironment(generator=GeneratorAgent())
     observation = env.reset(problem_id="sum_even_numbers", difficulty="easy")
     assert observation.problem
     assert observation.input_format
     assert observation.constraints
     assert observation.problem_type == "sum_even_numbers"
     assert observation.execution_status == "ready"
     assert_hidden_tests_are_not_exposed(observation.model_dump())
     correct = env.step(
@@ -39,11 +41,13 @@ def main() -> None:
         )
     )
     print(correct)
-    assert correct.reward > 0.8, correct.model_dump()
     assert correct.pass_rate == 1.0
     assert correct.execution_status == "completed"
-    wrong = env.step(
         AdaptAction(
             code=(
                 "n=int(input())\n"
@@ -52,41 +56,64 @@ def main() -> None:
             )
         )
     )
-    print(wrong)
-    assert 0.0 <= float(wrong.reward) < 1.0
-    assert wrong.execution_status in {"wrong_answer", "completed"}
-    assert wrong.pass_rate < 1.0
-    invalid_output = env.step(
         AdaptAction(
             code=(
                 "n=int(input())\n"
-                "input()\n"
-                "print()"
             )
         )
     )
-    print(invalid_output)
-    assert invalid_output.invalid_output_count > 0
-    assert invalid_output.execution_status == "invalid_output_format"
     syntax = env.step(AdaptAction(code="def broken(:\n    pass"))
     print(syntax)
     assert syntax.reward == 0.0
     assert syntax.execution_status == "syntax_error"
     timeout = env.step(AdaptAction(code="while True:\n    pass"))
     print(timeout)
     assert timeout.timeout_count > 0
     assert timeout.execution_status == "timeout"
     unsafe = env.step(AdaptAction(code="import os\nprint(os.listdir('.'))"))
     print(unsafe)
     assert unsafe.reward == 0.0
     assert unsafe.execution_status == "safety_violation"
-    assert env.state.step_count == 6
-    assert env.state.history["recent_pass_rates"]
     assert_hidden_tests_are_not_exposed(timeout.model_dump())
     print("ADAPT OpenEnv smoke tests passed")

 if str(ROOT) not in sys.path:
     sys.path.insert(0, str(ROOT))
+from env.adapt_env import AdaptEnvironment, MAX_STEPS_PER_EPISODE
 from env.generator import GeneratorAgent
 from models import AdaptAction
     env = AdaptEnvironment(generator=GeneratorAgent())
     observation = env.reset(problem_id="sum_even_numbers", difficulty="easy")
     assert observation.problem
+    assert "Examples:" in observation.problem
     assert observation.input_format
     assert observation.constraints
     assert observation.problem_type == "sum_even_numbers"
     assert observation.execution_status == "ready"
+    assert observation.max_steps == MAX_STEPS_PER_EPISODE
     assert_hidden_tests_are_not_exposed(observation.model_dump())
     correct = env.step(
         )
     )
     print(correct)
+    assert correct.reward == 1.0, correct.model_dump()
     assert correct.pass_rate == 1.0
     assert correct.execution_status == "completed"
+    assert correct.done is True
+    observation = env.reset(problem_id="running_total", difficulty="easy")
+    repair_1 = env.step(
         AdaptAction(
             code=(
                 "n=int(input())\n"
             )
         )
     )
+    print(repair_1)
+    assert repair_1.done is False
+    assert repair_1.execution_status in {"wrong_answer", "runtime_error", "invalid_output_format"}
+    assert "Previous attempt status: ready" in repair_1.feedback
+    repair_2 = env.step(
         AdaptAction(
             code=(
                 "n=int(input())\n"
+                "nums=list(map(int,input().split()))\n"
+                "running=0\n"
+                "out=[]\n"
+                "for x in nums:\n"
+                "    running += x\n"
+                "    out.append(str(running))\n"
+                "print(' '.join(out))"
             )
         )
     )
+    print(repair_2)
+    assert repair_2.done is True
+    assert repair_2.pass_rate == 1.0
+    assert repair_2.reward == 0.85
+    assert "Previous attempt status:" in repair_2.feedback
+    observation = env.reset(problem_id="sum_even_numbers", difficulty="easy")
     syntax = env.step(AdaptAction(code="def broken(:\n    pass"))
     print(syntax)
     assert syntax.reward == 0.0
+    assert syntax.done is False
     assert syntax.execution_status == "syntax_error"
+    runtime = env.step(
+        AdaptAction(
+            code=(
+                "n=int(input())\n"
+                "nums=list(map(int,input().split()))\n"
+                "print(nums[n])"
+            )
+        )
+    )
+    print(runtime)
+    assert runtime.execution_status == "runtime_error"
     timeout = env.step(AdaptAction(code="while True:\n    pass"))
     print(timeout)
     assert timeout.timeout_count > 0
     assert timeout.execution_status == "timeout"
+    assert timeout.done is True
+    observation = env.reset(problem_id="sum_even_numbers", difficulty="easy")
     unsafe = env.step(AdaptAction(code="import os\nprint(os.listdir('.'))"))
     print(unsafe)
     assert unsafe.reward == 0.0
     assert unsafe.execution_status == "safety_violation"
+    assert unsafe.done is False
+    assert env.state.history["attempts"]
     assert_hidden_tests_are_not_exposed(timeout.model_dump())
     print("ADAPT OpenEnv smoke tests passed")

server/app.py CHANGED

@@ -1,10 +1,12 @@
 from __future__ import annotations
 import argparse
 from typing import Any
 import uvicorn
-from fastapi import Body, FastAPI, HTTPException, Request
 from fastapi.responses import RedirectResponse, Response
 from pydantic import BaseModel
@@ -12,11 +14,15 @@ from env.adapt_env import AdaptEnvironment
 from env.test_cases import load_problem_bank
 from models import AdaptAction, AdaptObservation, AdaptState
-ENV_NAME = "adapt_dsa_tutor"
 ENV_DESCRIPTION = (
-    "RL environment for DSA code generation with hidden tests, tiered problems, "
-    "and verifier-aware reward shaping."
 )
 TASKS = [
     {
         "name": problem["problem_id"],
@@ -26,11 +32,11 @@ TASKS = [
     for problem in load_problem_bank()
 ]
-app = FastAPI(title="ADAPT DSA Tutor OpenEnv", version="0.2.0")
-ENV = AdaptEnvironment()
 class ResetRequest(BaseModel):
     seed: int | None = None
     episode_id: str | None = None
     problem_id: str | None = None
@@ -41,16 +47,47 @@ def _metadata() -> dict[str, Any]:
     return {
         "name": ENV_NAME,
         "description": ENV_DESCRIPTION,
-        "version": "0.2.0",
         "tasks": TASKS,
         "mode": "simulation",
     }
 @app.get("/")
 def root() -> dict[str, Any]:
     payload = _metadata()
     payload["status"] = "ok"
     return payload
@@ -70,22 +107,26 @@ def favicon() -> Response:
 @app.get("/health")
-def health() -> dict[str, str]:
-    return {"status": "healthy"}
 @app.get("/metadata")
 def metadata() -> dict[str, Any]:
     return _metadata()
 @app.get("/tasks")
 def list_tasks() -> dict[str, Any]:
     return {"tasks": TASKS}
 @app.get("/schema")
 def schema() -> dict[str, Any]:
     return {
         "action": AdaptAction.model_json_schema(),
         "observation": AdaptObservation.model_json_schema(),
@@ -95,6 +136,7 @@ def schema() -> dict[str, Any]:
 @app.post("/mcp")
 def mcp(payload: dict[str, Any] = Body(default_factory=dict)) -> dict[str, Any]:
     return {
         "jsonrpc": "2.0",
         "id": payload.get("id"),
@@ -107,8 +149,14 @@ def mcp(payload: dict[str, Any] = Body(default_factory=dict)) -> dict[str, Any]:
 @app.post("/reset")
 def reset(request: ResetRequest | None = None) -> dict[str, Any]:
     effective_request = request or ResetRequest()
-    observation = ENV.reset(
         seed=effective_request.seed,
         episode_id=effective_request.episode_id,
         problem_id=effective_request.problem_id,
@@ -119,6 +167,7 @@ def reset(request: ResetRequest | None = None) -> dict[str, Any]:
 @app.post("/step")
 async def step(request: Request) -> dict[str, Any]:
     payload = await request.json()
     if not isinstance(payload, dict):
         raise HTTPException(status_code=422, detail="Request body must be a JSON object.")
@@ -129,24 +178,31 @@ async def step(request: Request) -> dict[str, Any]:
     except Exception as exc:
         raise HTTPException(status_code=422, detail=f"Invalid action payload: {exc}") from exc
-    observation = ENV.step(effective_action)
     return {
         "observation": observation.model_dump(),
         "reward": float(observation.reward),
         "done": bool(observation.done),
         "info": {
             "feedback": observation.feedback,
             "pass_rate": observation.pass_rate,
             "execution_status": observation.execution_status,
         },
     }
 @app.get("/state")
-def state() -> dict[str, Any]:
-    if not ENV.problem:
-        ENV.reset()
-    return ENV.state.model_dump()
 def main(host: str | None = None, port: int | None = None) -> None:

 from __future__ import annotations
 import argparse
+from datetime import datetime, timedelta, timezone
 from typing import Any
+from uuid import uuid4
 import uvicorn
+from fastapi import Body, FastAPI, HTTPException, Query, Request
 from fastapi.responses import RedirectResponse, Response
 from pydantic import BaseModel
 from env.test_cases import load_problem_bank
 from models import AdaptAction, AdaptObservation, AdaptState
+ENV_NAME = "adapt-dsa-tutor"
 ENV_DESCRIPTION = (
+    "Adversarial DSA Programming Tutor - RL environment for training LLMs to solve "
+    "algorithmic problems through adaptive curriculum and self-repair."
 )
+ENV_VERSION = "0.3.0"
+SESSION_TTL = timedelta(minutes=30)
+SESSIONS: dict[str, AdaptEnvironment] = {}
+SESSION_LAST_ACCESSED: dict[str, datetime] = {}
 TASKS = [
     {
         "name": problem["problem_id"],
     for problem in load_problem_bank()
 ]
+app = FastAPI(title="ADAPT DSA Tutor OpenEnv", version=ENV_VERSION)
 class ResetRequest(BaseModel):
+    session_id: str | None = None
     seed: int | None = None
     episode_id: str | None = None
     problem_id: str | None = None
     return {
         "name": ENV_NAME,
         "description": ENV_DESCRIPTION,
+        "version": ENV_VERSION,
         "tasks": TASKS,
         "mode": "simulation",
     }
+def _utc_now() -> datetime:
+    return datetime.now(timezone.utc)
+def _cleanup_sessions() -> None:
+    now = _utc_now()
+    expired = [
+        session_id
+        for session_id, last_seen in SESSION_LAST_ACCESSED.items()
+        if now - last_seen > SESSION_TTL
+    ]
+    for session_id in expired:
+        SESSIONS.pop(session_id, None)
+        SESSION_LAST_ACCESSED.pop(session_id, None)
+def _touch_session(session_id: str) -> None:
+    SESSION_LAST_ACCESSED[session_id] = _utc_now()
+def _require_session(session_id: str) -> AdaptEnvironment:
+    _cleanup_sessions()
+    env = SESSIONS.get(session_id)
+    if env is None:
+        raise HTTPException(status_code=404, detail=f"Unknown or expired session_id: {session_id}")
+    _touch_session(session_id)
+    return env
 @app.get("/")
 def root() -> dict[str, Any]:
+    _cleanup_sessions()
     payload = _metadata()
     payload["status"] = "ok"
+    payload["active_sessions"] = len(SESSIONS)
     return payload
 @app.get("/health")
+def health() -> dict[str, Any]:
+    _cleanup_sessions()
+    return {"status": "healthy", "active_sessions": len(SESSIONS)}
 @app.get("/metadata")
 def metadata() -> dict[str, Any]:
+    _cleanup_sessions()
     return _metadata()
 @app.get("/tasks")
 def list_tasks() -> dict[str, Any]:
+    _cleanup_sessions()
     return {"tasks": TASKS}
 @app.get("/schema")
 def schema() -> dict[str, Any]:
+    _cleanup_sessions()
     return {
         "action": AdaptAction.model_json_schema(),
         "observation": AdaptObservation.model_json_schema(),
 @app.post("/mcp")
 def mcp(payload: dict[str, Any] = Body(default_factory=dict)) -> dict[str, Any]:
+    _cleanup_sessions()
     return {
         "jsonrpc": "2.0",
         "id": payload.get("id"),
 @app.post("/reset")
 def reset(request: ResetRequest | None = None) -> dict[str, Any]:
+    _cleanup_sessions()
     effective_request = request or ResetRequest()
+    session_id = effective_request.session_id or str(uuid4())
+    env = AdaptEnvironment(session_id=session_id)
+    SESSIONS[session_id] = env
+    _touch_session(session_id)
+    observation = env.reset(
+        session_id=session_id,
         seed=effective_request.seed,
         episode_id=effective_request.episode_id,
         problem_id=effective_request.problem_id,
 @app.post("/step")
 async def step(request: Request) -> dict[str, Any]:
+    _cleanup_sessions()
     payload = await request.json()
     if not isinstance(payload, dict):
         raise HTTPException(status_code=422, detail="Request body must be a JSON object.")
     except Exception as exc:
         raise HTTPException(status_code=422, detail=f"Invalid action payload: {exc}") from exc
+    if not effective_action.session_id:
+        raise HTTPException(status_code=422, detail="`session_id` is required in the /step request body.")
+    env = _require_session(effective_action.session_id)
+    observation = env.step(effective_action)
     return {
         "observation": observation.model_dump(),
         "reward": float(observation.reward),
         "done": bool(observation.done),
         "info": {
+            "session_id": observation.session_id,
             "feedback": observation.feedback,
             "pass_rate": observation.pass_rate,
+            "visible_pass_rate": observation.visible_pass_rate,
             "execution_status": observation.execution_status,
         },
     }
 @app.get("/state")
+def state(session_id: str = Query(..., description="Session id returned from /reset.")) -> dict[str, Any]:
+    env = _require_session(session_id)
+    if not env.problem:
+        env.reset(session_id=session_id)
+    return env.state.model_dump()
 def main(host: str | None = None, port: int | None = None) -> None:
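
Review note: the session handling introduced in this file (`SESSIONS`, `SESSION_LAST_ACCESSED`, `_cleanup_sessions`, `_require_session`) follows a last-access TTL pattern. The sketch below reproduces that pattern with only the standard library so the expiry behavior can be checked without running the server. The `SessionRegistry` class and its explicit `now` parameter are assumptions made for testability; the commit itself uses module-level dicts and `datetime.now(timezone.utc)`.

```python
from datetime import datetime, timedelta, timezone


class SessionRegistry:
    """Hypothetical wrapper around the last-access TTL pattern (not in the repo)."""

    def __init__(self, ttl: timedelta) -> None:
        self.ttl = ttl
        self.sessions: dict[str, object] = {}
        self.last_accessed: dict[str, datetime] = {}

    def put(self, session_id: str, env: object, now: datetime) -> None:
        self.sessions[session_id] = env
        self.last_accessed[session_id] = now

    def cleanup(self, now: datetime) -> None:
        # Collect ids first so the dicts are not mutated while iterating.
        expired = [sid for sid, seen in self.last_accessed.items() if now - seen > self.ttl]
        for sid in expired:
            self.sessions.pop(sid, None)
            self.last_accessed.pop(sid, None)

    def require(self, session_id: str, now: datetime) -> object:
        self.cleanup(now)
        env = self.sessions.get(session_id)
        if env is None:
            raise KeyError(f"Unknown or expired session_id: {session_id}")
        self.last_accessed[session_id] = now  # a successful lookup refreshes the TTL
        return env


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
registry = SessionRegistry(ttl=timedelta(minutes=30))
registry.put("a", "env-a", now=start)
registry.put("b", "env-b", now=start)
registry.require("a", now=start + timedelta(minutes=20))  # refreshes "a" only
registry.cleanup(now=start + timedelta(minutes=40))
print(sorted(registry.sessions))  # ['a']
```

Passing `now` explicitly keeps the expiry logic deterministic in tests. Because FastAPI runs plain `def` endpoints in a thread pool, guarding the two dicts with a lock may also be worth considering.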

training/plot_results.py ADDED

@@ -0,0 +1,139 @@

+from __future__ import annotations
+import argparse
+import csv
+from collections import defaultdict
+from pathlib import Path
+from typing import Any
+def read_rows(csv_path: Path) -> list[dict[str, Any]]:
+    with csv_path.open("r", encoding="utf-8", newline="") as handle:
+        reader = csv.DictReader(handle)
+        rows: list[dict[str, Any]] = []
+        for row in reader:
+            parsed: dict[str, Any] = {}
+            for key, value in row.items():
+                if value is None:
+                    parsed[key] = value
+                    continue
+                value = value.strip()
+                if value == "":
+                    parsed[key] = value
+                    continue
+                try:
+                    parsed[key] = float(value) if "." in value else int(value)
+                except ValueError:
+                    parsed[key] = value
+            rows.append(parsed)
+        return rows
+def rolling_mean(values: list[float], window: int) -> list[float]:
+    output: list[float] = []
+    for index in range(len(values)):
+        start = max(0, index - window + 1)
+        chunk = values[start : index + 1]
+        output.append(sum(chunk) / len(chunk))
+    return output
+def plot_reward_curve(rows: list[dict[str, Any]], output_dir: Path) -> None:
+    import matplotlib.pyplot as plt
+    train_rows = [row for row in rows if row.get("phase") == "train"]
+    steps = [int(row["step"]) for row in train_rows]
+    rewards = [float(row["episode_reward"]) for row in train_rows]
+    reward_smooth = rolling_mean(rewards, window=20)
+    plt.figure(figsize=(10, 5))
+    plt.plot(steps, rewards, alpha=0.25, label="Episode reward")
+    plt.plot(steps, reward_smooth, linewidth=2, label="20-step moving average")
+    plt.xlabel("Training step")
+    plt.ylabel("Reward")
+    plt.title("ADAPT Training Reward Curve")
+    plt.legend()
+    plt.tight_layout()
+    plt.savefig(output_dir / "reward_curve.png", dpi=200)
+    plt.close()
+def plot_pass_rate_by_difficulty(rows: list[dict[str, Any]], output_dir: Path) -> None:
+    import matplotlib.pyplot as plt
+    train_rows = [row for row in rows if row.get("phase") == "train"]
+    grouped: dict[str, list[tuple[int, float]]] = defaultdict(list)
+    for row in train_rows:
+        grouped[str(row["difficulty_tier"])].append((int(row["step"]), float(row["pass_rate"])))
+    plt.figure(figsize=(10, 5))
+    for difficulty in ("easy", "medium", "hard"):
+        points = grouped.get(difficulty, [])
+        if not points:
+            continue
+        steps = [step for step, _ in points]
+        values = [value for _, value in points]
+        smooth = rolling_mean(values, window=10)
+        plt.plot(steps, smooth, linewidth=2, label=difficulty.title())
+    plt.xlabel("Training step")
+    plt.ylabel("Pass rate")
+    plt.title("Pass Rate by Difficulty Tier")
+    plt.legend()
+    plt.tight_layout()
+    plt.savefig(output_dir / "pass_rate_by_difficulty.png", dpi=200)
+    plt.close()
+def plot_family_productivity(rows: list[dict[str, Any]], output_dir: Path) -> None:
+    import matplotlib.pyplot as plt
+    train_rows = [row for row in rows if row.get("phase") == "train"]
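+    # Guard against CSVs with no "train" rows (e.g. eval-only logs) before indexing.
+    if not train_rows:
+        return
+    productivity_columns = [key for key in train_rows[0].keys() if str(key).startswith("family_productivity__")]
+    if not productivity_columns:
+        return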
+    ranked_columns = sorted(
+        productivity_columns,
+        key=lambda column: float(train_rows[-1].get(column, 0.0)),
+        reverse=True,
+    )[:8]
+    plt.figure(figsize=(11, 6))
+    steps = [int(row["step"]) for row in train_rows]
+    for column in ranked_columns:
+        family = column.split("__", 1)[1]
+        values = [float(row.get(column, 0.0)) for row in train_rows]
+        plt.plot(steps, values, linewidth=2, label=family)
+    plt.xlabel("Training step")
+    plt.ylabel("Family productivity EMA")
+    plt.title("Reward-Aware Family Productivity Over Training")
+    plt.legend(loc="upper left", fontsize=8)
+    plt.tight_layout()
+    plt.savefig(output_dir / "family_productivity.png", dpi=200)
+    plt.close()
+def main(argv: list[str] | None = None) -> None:
+    parser = argparse.ArgumentParser(description="Plot ADAPT reward and curriculum artifacts from reward_curve.csv.")
+    parser.add_argument("csv_path", help="Path to reward_curve.csv")
+    parser.add_argument("--output-dir", default=None, help="Directory for PNG outputs. Defaults to the CSV directory.")
+    args = parser.parse_args(argv)
+    csv_path = Path(args.csv_path)
+    output_dir = Path(args.output_dir) if args.output_dir else csv_path.parent
+    output_dir.mkdir(parents=True, exist_ok=True)
+    rows = read_rows(csv_path)
+    if not rows:
+        raise RuntimeError(f"No rows found in {csv_path}")
+    plot_reward_curve(rows, output_dir)
+    plot_pass_rate_by_difficulty(rows, output_dir)
+    plot_family_productivity(rows, output_dir)
+    print(f"Saved plots to {output_dir}")
+if __name__ == "__main__":
+    main()
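
For reference, the trailing moving average that smooths the reward and pass-rate curves in the plotting script above can be sketched in isolation. This is a minimal illustration rather than the repository's `rolling_mean`: only the loop body is visible in this diff, so the function name `trailing_mean` and the iteration over indices are assumptions.

```python
def trailing_mean(values: list[float], window: int) -> list[float]:
    """Average each value with up to `window - 1` preceding values."""
    output: list[float] = []
    for index in range(len(values)):
        # Clamp the window start so early points average over fewer samples.
        start = max(0, index - window + 1)
        chunk = values[start : index + 1]
        output.append(sum(chunk) / len(chunk))
    return output


smoothed = trailing_mean([1.0, 2.0, 3.0, 4.0], window=2)
print(smoothed)
```

Because the window is clamped at the start of the series, the smoothed list has the same length as the input, which is why it can be plotted against the same `steps` axis as the raw values.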

training/train_grpo.py CHANGED

@@ -1,14 +1,25 @@
 from __future__ import annotations
 import argparse
 import json
 from dataclasses import dataclass, field
 from typing import Any
-from env.adapt_env import AdaptEnvironment
-from env.generator import GeneratorAgent
 from models import AdaptAction
 def extract_code(completion: str) -> str:
     text = completion.strip()
@@ -19,21 +30,45 @@ def extract_code(completion: str) -> str:
     return text
-def build_solver_prompt(problem: dict[str, Any]) -> str:
-    public_problem = {
         "problem_id": problem["problem_id"],
         "difficulty": problem["difficulty_label"],
-        "problem": problem["problem"],
         "input_format": problem["input_format"],
         "constraints": problem["constraints"],
     }
-    return (
-        "You are the Solver Agent for ADAPT.\n"
-        "Read the generated DSA task and reply with only runnable Python code.\n"
-        "The program must read from stdin and print to stdout.\n"
-        "No markdown, no explanation.\n\n"
-        f"{json.dumps(public_problem, indent=2)}"
-    )
 @dataclass
@@ -42,29 +77,37 @@ class CurriculumManager:
     current_idx: int = 0
     success_history: list[float] = field(default_factory=list)
     window_size: int = 10
     def current_difficulty(self) -> str:
         return self.difficulties[self.current_idx]
-    def update(self, batch_success_rate: float) -> None:
-        self.success_history.append(float(batch_success_rate))
         if len(self.success_history) > self.window_size:
             self.success_history.pop(0)
         moving_average = sum(self.success_history) / len(self.success_history)
-        if moving_average > 0.70 and self.current_idx < len(self.difficulties) - 1:
             self.current_idx += 1
             self.success_history.clear()
             print(
                 f"[curriculum] promoted to {self.current_difficulty()} "
-                f"(moving_success={moving_average:.2f})"
             )
-        elif moving_average < 0.25 and self.current_idx > 0:
             self.current_idx -= 1
             self.success_history.clear()
             print(
-                f"[curriculum] reduced to {self.current_difficulty()} "
-                f"(moving_success={moving_average:.2f})"
             )
@@ -72,6 +115,7 @@ class CurriculumManager:
 class GeneratorController:
     mode: str = "heuristic"
     deterministic: bool = True
     generator: GeneratorAgent = field(init=False)
     history: dict[str, Any] = field(
         default_factory=lambda: {
@@ -83,35 +127,80 @@ class GeneratorController:
         }
     )
     prompt_registry: dict[str, dict[str, Any]] = field(default_factory=dict)
     def __post_init__(self) -> None:
         self.generator = GeneratorAgent(deterministic=self.deterministic)
     def create_rollout_problem(self, difficulty: str) -> tuple[str, dict[str, Any]]:
-        problem = self.generator.generate(difficulty, self.history)
-        prompt = build_solver_prompt(problem)
         self.prompt_registry[prompt] = problem
         return prompt, problem
     def resolve_prompt(self, prompt: str) -> dict[str, Any]:
         if prompt not in self.prompt_registry:
             raise KeyError("Prompt was not registered with the generator controller.")
-        return self.prompt_registry[prompt]
-    def update(self, problem: dict[str, Any], pass_rate: float, generator_reward_signal: float) -> None:
         self.history["recent_pass_rates"].append(round(float(pass_rate), 4))
         self.history["problem_types"].append(problem.get("problem_type", ""))
         self.history["problem_signatures"].append(problem.get("problem_id", ""))
-        if self.mode == "reward_aware":
-            self.history["generator_rewards"].append(round(float(generator_reward_signal), 4))
-        else:
-            self.history["generator_rewards"].append(0.0)
         self.history["episode_index"] = int(self.history.get("episode_index", 0)) + 1
         for key in ("recent_pass_rates", "problem_types", "problem_signatures", "generator_rewards"):
             values = self.history[key]
-            if len(values) > 50:
-                del values[:-50]
     def stats_snapshot(self) -> dict[str, Any]:
         return {
@@ -120,6 +209,13 @@ class GeneratorController:
             "recent_pass_rates": list(self.history["recent_pass_rates"][-5:]),
             "recent_problem_types": list(self.history["problem_types"][-5:]),
             "recent_generator_rewards": list(self.history["generator_rewards"][-5:]),
         }
@@ -138,50 +234,273 @@ class GeneratorRolloutDataset:
         return {"prompt": prompt}
-def build_reward_func(curriculum: CurriculumManager, controller: GeneratorController):
     def reward_func(prompts, completions, **kwargs) -> list[float]:
         del kwargs
-        env = AdaptEnvironment(generator=controller.generator, generator_mode=controller.mode)
         rewards: list[float] = []
-        pass_rates: list[float] = []
         for prompt, completion in zip(prompts, completions):
             problem = controller.resolve_prompt(prompt)
             env.reset(
                 difficulty=problem["difficulty_label"],
                 generated_problem=problem,
                 generator_mode=controller.mode,
             )
-            observation = env.step(AdaptAction(code=extract_code(completion)))
             rewards.append(float(observation.reward))
-            pass_rates.append(float(observation.pass_rate))
-            controller.update(problem, observation.pass_rate, observation.generator_reward_signal)
-            print(
-                "[rollout]",
-                json.dumps(
-                    {
-                        "problem_id": problem["problem_id"],
-                        "problem_type": problem["problem_type"],
-                        "difficulty": problem["difficulty_label"],
-                        "solver_reward": observation.reward,
-                        "pass_rate": observation.pass_rate,
-                        "generator_reward": observation.generator_reward_signal,
-                        "status": observation.execution_status,
-                    }
-                ),
             )
-        if pass_rates:
-            curriculum.update(sum(pass_rates) / len(pass_rates))
-            print("[generator]", json.dumps(controller.stats_snapshot()))
         return rewards
     return reward_func
-def build_dataset(size: int, controller: GeneratorController, curriculum: CurriculumManager) -> GeneratorRolloutDataset:
-    return GeneratorRolloutDataset(size=size, controller=controller, curriculum=curriculum)
 def run_training(args: argparse.Namespace) -> None:
@@ -193,6 +512,9 @@ def run_training(args: argparse.Namespace) -> None:
             "Training dependencies are missing. Install `trl` and `unsloth` before running GRPO training."
         ) from exc
     PatchFastRL("GRPO", FastLanguageModel)
     model, tokenizer = FastLanguageModel.from_pretrained(
@@ -200,6 +522,8 @@ def run_training(args: argparse.Namespace) -> None:
         max_seq_length=args.max_seq_length,
         load_in_4bit=not args.disable_4bit,
     )
     model = FastLanguageModel.get_peft_model(
         model,
@@ -214,6 +538,31 @@ def run_training(args: argparse.Namespace) -> None:
         mode="reward_aware" if args.generator_mode == "reward_aware" else "heuristic",
         deterministic=not args.non_deterministic_generator,
     )
     training_args = GRPOConfig(
         output_dir=args.output_dir,
         learning_rate=args.learning_rate,
@@ -225,41 +574,68 @@ def run_training(args: argparse.Namespace) -> None:
         max_steps=args.max_steps,
         logging_steps=1,
         bf16=args.bf16,
     )
     trainer = GRPOTrainer(
         model=model,
-        reward_funcs=[build_reward_func(curriculum, controller)],
         args=training_args,
         train_dataset=build_dataset(args.dataset_size, controller, curriculum),
     )
     trainer.train()
     model.save_pretrained(args.output_dir)
     tokenizer.save_pretrained(args.output_dir)
 def build_parser() -> argparse.ArgumentParser:
     parser = argparse.ArgumentParser(description="GRPO training entrypoint for the ADAPT DSA environment.")
     parser.add_argument("--model-name", default="unsloth/Llama-3.2-3B-Instruct")
-    parser.add_argument("--output-dir", default="outputs_v2")
     parser.add_argument("--dataset-size", type=int, default=200)
     parser.add_argument("--max-steps", type=int, default=250)
     parser.add_argument("--batch-size", type=int, default=1)
     parser.add_argument("--gradient-accumulation-steps", type=int, default=8)
     parser.add_argument("--num-generations", type=int, default=8)
     parser.add_argument("--max-seq-length", type=int, default=2048)
-    parser.add_argument("--max-prompt-length", type=int, default=768)
     parser.add_argument("--max-completion-length", type=int, default=512)
     parser.add_argument("--learning-rate", type=float, default=5e-6)
     parser.add_argument("--lora-rank", type=int, default=16)
     parser.add_argument("--lora-alpha", type=int, default=16)
     parser.add_argument("--disable-4bit", action="store_true")
     parser.add_argument("--bf16", action="store_true")
     parser.add_argument(
         "--generator-mode",
         choices=["heuristic", "reward_aware"],
-        default="heuristic",
-        help="Use heuristic generation (V1/V2) or reward-aware bookkeeping for V3-ready training.",
     )
     parser.add_argument(
         "--non-deterministic-generator",

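The reward-aware mode in the updated `GeneratorController` keeps an exponential moving average of the generator reward per problem family and converts it into sampling weights with a temperature-scaled softmax. The sketch below isolates those two steps under stated assumptions: the helper names `ema_update` and `family_weights` and the family names `two_pointers` and `prefix_sum` are hypothetical, and only the constants (0.9 / 0.1 blend, temperature 0.5) come from the diff.

```python
import math


def ema_update(current: float, reward: float) -> float:
    # Same 0.9 / 0.1 blend that the diff applies to family_productivity.
    return 0.9 * current + 0.1 * reward


def family_weights(productivity: dict[str, float], temperature: float = 0.5) -> dict[str, float]:
    # Temperature-scaled logits; a lower temperature sharpens the preference.
    logits = {family: value / temperature for family, value in productivity.items()}
    # Subtracting the max logit keeps exp() numerically stable.
    max_logit = max(logits.values())
    # Weights are unnormalized softmax numerators, as in the diff.
    return {family: math.exp(logit - max_logit) for family, logit in logits.items()}


# Hypothetical family names, used only for illustration.
productivity = {"two_pointers": 0.0, "prefix_sum": 0.0}
productivity["prefix_sum"] = ema_update(productivity["prefix_sum"], 1.0)
weights = family_weights(productivity)
print(weights)
```

The family with the highest productivity always receives weight 1.0, and the others decay exponentially with their gap to it, so a sampler consuming these weights would need to normalize them or use a weighted choice that accepts relative weights.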
 from __future__ import annotations
 import argparse
+import csv
 import json
+import math
 from dataclasses import dataclass, field
+from pathlib import Path
 from typing import Any
+import torch
+from env.adapt_env import AdaptEnvironment, MAX_STEPS_PER_EPISODE
+from env.generator import DIFFICULTY_LABELS, GeneratorAgent
 from models import AdaptAction
+SYSTEM_PROMPT = """You are the Solver Agent for ADAPT.
+Write only runnable Python code.
+The program must read from stdin and print to stdout.
+If feedback is present, repair your previous solution instead of starting from scratch.
+Do not include markdown fences or explanations."""
 def extract_code(completion: str) -> str:
     text = completion.strip()
     return text
+def format_examples(problem: dict[str, Any]) -> str:
+    visible_cases = [test_case for test_case in problem.get("test_cases", []) if test_case.get("is_visible", False)]
+    if not visible_cases:
+        return problem["problem"]
+    chunks = []
+    for test_case in visible_cases:
+        chunks.append(f"Input:\n{test_case['input']}Expected Output:\n{test_case['output']}\n")
+    return f"{problem['problem']}\n\nExamples:\n" + "\n".join(chunks).rstrip()
+def build_solver_prompt(payload: dict[str, Any]) -> str:
+    feedback = payload.get("feedback") or "No previous attempt yet."
+    return (
+        f"{SYSTEM_PROMPT}\n\n"
+        f"Problem ID: {payload['problem_id']}\n"
+        f"Problem Family: {payload['problem_type']}\n"
+        f"Difficulty: {payload['difficulty']}\n"
+        f"Attempt: {payload.get('attempt_number', 0)}/{payload.get('max_steps', MAX_STEPS_PER_EPISODE)}\n\n"
+        f"Problem:\n{payload['problem']}\n\n"
+        f"Input Format:\n{payload['input_format']}\n\n"
+        f"Constraints:\n{payload['constraints']}\n\n"
+        f"Feedback:\n{feedback}\n"
+    )
+def build_prompt_from_problem(problem: dict[str, Any]) -> str:
+    payload = {
         "problem_id": problem["problem_id"],
+        "problem_type": problem["problem_type"],
         "difficulty": problem["difficulty_label"],
+        "attempt_number": 0,
+        "max_steps": MAX_STEPS_PER_EPISODE,
+        "problem": format_examples(problem),
         "input_format": problem["input_format"],
         "constraints": problem["constraints"],
+        "feedback": "No previous attempt yet. Solve the problem directly from the examples and constraints.",
     }
+    return build_solver_prompt(payload)
 @dataclass
     current_idx: int = 0
     success_history: list[float] = field(default_factory=list)
     window_size: int = 10
+    promote_threshold: float = 0.70
+    demote_threshold: float = 0.30
     def current_difficulty(self) -> str:
         return self.difficulties[self.current_idx]
+    def current_level(self) -> int:
+        return self.current_idx + 1
+    def update(self, episode_pass_rate: float) -> None:
+        self.success_history.append(float(episode_pass_rate))
         if len(self.success_history) > self.window_size:
             self.success_history.pop(0)
+        if len(self.success_history) < self.window_size:
+            return
         moving_average = sum(self.success_history) / len(self.success_history)
+        if moving_average >= self.promote_threshold and self.current_idx < len(self.difficulties) - 1:
             self.current_idx += 1
             self.success_history.clear()
             print(
                 f"[curriculum] promoted to {self.current_difficulty()} "
+                f"(moving_pass_rate={moving_average:.2f})"
             )
+        elif moving_average <= self.demote_threshold and self.current_idx > 0:
             self.current_idx -= 1
             self.success_history.clear()
             print(
+                f"[curriculum] demoted to {self.current_difficulty()} "
+                f"(moving_pass_rate={moving_average:.2f})"
             )
 class GeneratorController:
     mode: str = "heuristic"
     deterministic: bool = True
+    temperature: float = 0.5
     generator: GeneratorAgent = field(init=False)
     history: dict[str, Any] = field(
         default_factory=lambda: {
         }
     )
     prompt_registry: dict[str, dict[str, Any]] = field(default_factory=dict)
+    family_productivity: dict[str, float] = field(default_factory=dict)
     def __post_init__(self) -> None:
         self.generator = GeneratorAgent(deterministic=self.deterministic)
+        if not self.family_productivity:
+            self.family_productivity = {
+                template.problem_type: 0.0 for template in self.generator.templates
+            }
+    @property
+    def family_names(self) -> list[str]:
+        return sorted(self.family_productivity)
+    def sample_problem(self, difficulty: str) -> dict[str, Any]:
+        family_weights = self.family_weights_for_difficulty(difficulty)
+        problem = self.generator.generate_problem(
+            difficulty_level=difficulty,
+            history=self.history,
+            family_weights=family_weights,
+        )
+        return problem
     def create_rollout_problem(self, difficulty: str) -> tuple[str, dict[str, Any]]:
+        problem = self.sample_problem(difficulty)
+        prompt = build_prompt_from_problem(problem)
         self.prompt_registry[prompt] = problem
         return prompt, problem
     def resolve_prompt(self, prompt: str) -> dict[str, Any]:
         if prompt not in self.prompt_registry:
             raise KeyError("Prompt was not registered with the generator controller.")
+        return self.prompt_registry.pop(prompt)
+    def family_weights_for_difficulty(self, difficulty: str) -> dict[str, float] | None:
+        if self.mode != "reward_aware":
+            return None
+        eligible = [
+            template.problem_type
+            for template in self.generator.templates
+            if DIFFICULTY_LABELS[template.difficulty_tier] == difficulty
+        ]
+        if not eligible:
+            return None
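+        # Temperature-scaled softmax numerators over the eligible families.
+        # Subtracting the max logit keeps exp() numerically stable; the weights are left unnormalized.
+        logits = [self.family_productivity.get(family, 0.0) / self.temperature for family in eligible]
+        max_logit = max(logits)
+        exp_values = [math.exp(logit - max_logit) for logit in logits]
+        return {family: value for family, value in zip(eligible, exp_values)}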
+    def update(
+        self,
+        problem: dict[str, Any],
+        pass_rate: float,
+        generator_reward_signal: float,
+        *,
+        update_productivity: bool = True,
+    ) -> None:
         self.history["recent_pass_rates"].append(round(float(pass_rate), 4))
         self.history["problem_types"].append(problem.get("problem_type", ""))
         self.history["problem_signatures"].append(problem.get("problem_id", ""))
+        self.history["generator_rewards"].append(round(float(generator_reward_signal), 4))
         self.history["episode_index"] = int(self.history.get("episode_index", 0)) + 1
+        if self.mode == "reward_aware" and update_productivity:
+            family = problem.get("problem_type", "")
+            current = float(self.family_productivity.get(family, 0.0))
+            # Exponential moving average of the generator reward for this family.
+            updated = 0.9 * current + 0.1 * float(generator_reward_signal)
+            self.family_productivity[family] = round(updated, 6)
         for key in ("recent_pass_rates", "problem_types", "problem_signatures", "generator_rewards"):
             values = self.history[key]
+            if len(values) > 100:
+                del values[:-100]
     def stats_snapshot(self) -> dict[str, Any]:
         return {
             "recent_pass_rates": list(self.history["recent_pass_rates"][-5:]),
             "recent_problem_types": list(self.history["problem_types"][-5:]),
             "recent_generator_rewards": list(self.history["generator_rewards"][-5:]),
+            "family_productivity": self.productivity_snapshot(),
+        }
+    def productivity_snapshot(self) -> dict[str, float]:
+        return {
+            family: round(float(value), 6)
+            for family, value in sorted(self.family_productivity.items())
         }
         return {"prompt": prompt}
+@dataclass
+class TrainingLogger:
+    output_dir: Path
+    family_names: list[str]
+    use_wandb: bool = True
+    wandb_project: str = "adapt-dsa-tutor"
+    wandb_run_name: str | None = None
+    rows: list[dict[str, Any]] = field(default_factory=list)
+    global_step: int = 0
+    _wandb_run: Any = field(default=None, init=False, repr=False)
+    def __post_init__(self) -> None:
+        self.output_dir.mkdir(parents=True, exist_ok=True)
+        if not self.use_wandb:
+            return
+        try:
+            import wandb
+            self._wandb_run = wandb.init(
+                project=self.wandb_project,
+                name=self.wandb_run_name,
+                config={"family_names": self.family_names},
+                reinit=True,
+            )
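+        except Exception as exc:
+            # wandb is optional: report why it was skipped and fall back to CSV-only logging.
+            print(f"[wandb] logging disabled: {exc}")
+            self._wandb_run = None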
+    def log_event(
+        self,
+        *,
+        phase: str,
+        episode_reward: float,
+        pass_rate: float,
+        visible_pass_rate: float,
+        difficulty_tier: str,
+        problem_family: str,
+        curriculum_level: int,
+        execution_status: str,
+        attempt_number: int,
+        family_productivity: dict[str, float],
+        extra: dict[str, Any] | None = None,
+    ) -> None:
+        row: dict[str, Any] = {
+            "step": self.global_step,
+            "phase": phase,
+            "episode_reward": round(float(episode_reward), 4),
+            "pass_rate": round(float(pass_rate), 4),
+            "visible_pass_rate": round(float(visible_pass_rate), 4),
+            "difficulty_tier": difficulty_tier,
+            "problem_family": problem_family,
+            "curriculum_level": curriculum_level,
+            "execution_status": execution_status,
+            "attempt_number": int(attempt_number),
+        }
+        for family in self.family_names:
+            row[f"family_productivity__{family}"] = round(float(family_productivity.get(family, 0.0)), 6)
+        if extra:
+            row.update(extra)
+        self.rows.append(row)
+        if self._wandb_run is not None:
+            self._wandb_run.log(row, step=self.global_step)
+        self.global_step += 1
+    def write_csv(self) -> Path:
+        output_path = self.output_dir / "reward_curve.csv"
+        fieldnames: list[str] = []
+        for row in self.rows:
+            for key in row:
+                if key not in fieldnames:
+                    fieldnames.append(key)
+        with output_path.open("w", newline="", encoding="utf-8") as handle:
+            writer = csv.DictWriter(handle, fieldnames=fieldnames)
+            writer.writeheader()
+            writer.writerows(self.rows)
+        return output_path
+    def close(self) -> None:
+        if self._wandb_run is not None:
+            self._wandb_run.finish()
+def build_dataset(size: int, controller: GeneratorController, curriculum: CurriculumManager) -> GeneratorRolloutDataset:
+    return GeneratorRolloutDataset(size=size, controller=controller, curriculum=curriculum)
+def build_reward_func(
+    curriculum: CurriculumManager,
+    controller: GeneratorController,
+    logger: TrainingLogger,
+):
     def reward_func(prompts, completions, **kwargs) -> list[float]:
         del kwargs
         rewards: list[float] = []
         for prompt, completion in zip(prompts, completions):
             problem = controller.resolve_prompt(prompt)
+            env = AdaptEnvironment(generator=controller.generator, generator_mode=controller.mode)
             env.reset(
                 difficulty=problem["difficulty_label"],
                 generated_problem=problem,
                 generator_mode=controller.mode,
+                session_id=env.session_id,
+            )
+            observation = env.step(
+                AdaptAction(
+                    session_id=env.session_id,
+                    code=extract_code(completion),
+                )
             )
             rewards.append(float(observation.reward))
+            controller.update(
+                problem=problem,
+                pass_rate=observation.pass_rate,
+                generator_reward_signal=observation.generator_reward_signal,
             )
+            curriculum.update(observation.pass_rate)
+            logger.log_event(
+                phase="train",
+                episode_reward=float(observation.reward),
+                pass_rate=float(observation.pass_rate),
+                visible_pass_rate=float(observation.visible_pass_rate),
+                difficulty_tier=problem["difficulty_label"],
+                problem_family=problem["problem_type"],
+                curriculum_level=curriculum.current_level(),
+                execution_status=observation.execution_status,
+                attempt_number=int(observation.attempt_number),
+                family_productivity=controller.productivity_snapshot(),
+                extra={
+                    "generator_reward": round(float(observation.generator_reward_signal), 4),
+                    "problem_id": problem["problem_id"],
+                },
+            )
+            if controller.mode == "reward_aware" and controller.history["episode_index"] % 50 == 0:
+                print("[family_productivity]", json.dumps(controller.productivity_snapshot()))
         return rewards
     return reward_func
+def render_prompt(tokenizer: Any, prompt: str) -> str:
+    messages = [
+        {"role": "system", "content": SYSTEM_PROMPT},
+        {"role": "user", "content": prompt},
+    ]
+    if hasattr(tokenizer, "apply_chat_template"):
+        return tokenizer.apply_chat_template(
+            messages,
+            tokenize=False,
+            add_generation_prompt=True,
+        )
+    return f"{SYSTEM_PROMPT}\n\n{prompt}"
+def generate_completion(
+    model: Any,
+    tokenizer: Any,
+    prompt: str,
+    *,
+    max_new_tokens: int,
+) -> str:
+    rendered = render_prompt(tokenizer, prompt)
+    inputs = tokenizer(rendered, return_tensors="pt")
+    device = getattr(model, "device", None)
+    if device is None:
+        device = next(model.parameters()).device
+    inputs = {key: value.to(device) for key, value in inputs.items()}
+    with torch.no_grad():
+        outputs = model.generate(
+            **inputs,
+            max_new_tokens=max_new_tokens,
+            do_sample=False,
+            pad_token_id=tokenizer.eos_token_id,
+        )
+    generated_tokens = outputs[0][inputs["input_ids"].shape[1] :]
+    return tokenizer.decode(generated_tokens, skip_special_tokens=True)
+def run_policy_evaluation(
+    *,
+    model: Any,
+    tokenizer: Any,
+    generator_mode: str,
+    deterministic_generator: bool,
+    episodes: int,
+    logger: TrainingLogger,
+    phase: str,
+    max_new_tokens: int,
+) -> dict[str, Any]:
+    controller = GeneratorController(
+        mode=generator_mode,
+        deterministic=deterministic_generator,
+    )
+    # Split episodes as evenly as possible across tiers; any remainder goes to easy first, then medium.
+    schedule = ["easy"] * (episodes // 3 + (1 if episodes % 3 > 0 else 0))
+    schedule += ["medium"] * (episodes // 3 + (1 if episodes % 3 > 1 else 0))
+    schedule += ["hard"] * (episodes // 3)
+    schedule = schedule[:episodes]
+    tier_records: dict[str, list[float]] = {"easy": [], "medium": [], "hard": []}
+    for difficulty in schedule:
+        problem = controller.sample_problem(difficulty)
+        env = AdaptEnvironment(generator=controller.generator, generator_mode=generator_mode)
+        observation = env.reset(
+            difficulty=difficulty,
+            generated_problem=problem,
+            session_id=env.session_id,
+            generator_mode=generator_mode,
+        )
+        for _ in range(MAX_STEPS_PER_EPISODE):
+            prompt = build_solver_prompt(observation.model_dump())
+            completion = generate_completion(
+                model=model,
+                tokenizer=tokenizer,
+                prompt=prompt,
+                max_new_tokens=max_new_tokens,
+            )
+            observation = env.step(
+                AdaptAction(
+                    session_id=env.session_id,
+                    code=extract_code(completion),
+                )
+            )
+            if observation.done:
+                break
+        controller.update(
+            problem=problem,
+            pass_rate=observation.pass_rate,
+            generator_reward_signal=observation.generator_reward_signal,
+            update_productivity=False,
+        )
+        tier_records[difficulty].append(float(observation.pass_rate))
+        logger.log_event(
+            phase=phase,
+            episode_reward=float(observation.reward),
+            pass_rate=float(observation.pass_rate),
+            visible_pass_rate=float(observation.visible_pass_rate),
+            difficulty_tier=difficulty,
+            problem_family=problem["problem_type"],
+            curriculum_level={"easy": 1, "medium": 2, "hard": 3}[difficulty],
+            execution_status=observation.execution_status,
+            attempt_number=int(observation.attempt_number),
+            family_productivity=controller.productivity_snapshot(),
+            extra={
+                "generator_reward": round(float(observation.generator_reward_signal), 4),
+                "problem_id": problem["problem_id"],
+            },
+        )
+    summary = {
+        tier: (sum(values) / len(values) if values else 0.0)
+        for tier, values in tier_records.items()
+    }
+    summary["overall"] = (
+        sum(value for values in tier_records.values() for value in values) / episodes if episodes else 0.0
+    )
+    return summary
+def print_evaluation_summary(baseline: dict[str, Any], trained: dict[str, Any]) -> None:
+    print("\nBaseline vs trained pass rate summary")
+    print(f"{'Difficulty':<12} {'Baseline':>10} {'Trained':>10}")
+    print("-" * 34)
+    for tier in ("easy", "medium", "hard", "overall"):
+        print(f"{tier:<12} {baseline.get(tier, 0.0):>10.3f} {trained.get(tier, 0.0):>10.3f}")
 def run_training(args: argparse.Namespace) -> None:
             "Training dependencies are missing. Install `trl` and `unsloth` before running GRPO training."
         ) from exc
+    output_dir = Path(args.output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
     PatchFastRL("GRPO", FastLanguageModel)
     model, tokenizer = FastLanguageModel.from_pretrained(
         max_seq_length=args.max_seq_length,
         load_in_4bit=not args.disable_4bit,
     )
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
     model = FastLanguageModel.get_peft_model(
         model,
         mode="reward_aware" if args.generator_mode == "reward_aware" else "heuristic",
         deterministic=not args.non_deterministic_generator,
     )
+    logger = TrainingLogger(
+        output_dir=output_dir,
+        family_names=controller.family_names,
+        use_wandb=not args.disable_wandb,
+        wandb_project=args.wandb_project,
+        wandb_run_name=args.wandb_run_name,
+    )
+    baseline_summary = {"easy": 0.0, "medium": 0.0, "hard": 0.0, "overall": 0.0}
+    trained_summary = {"easy": 0.0, "medium": 0.0, "hard": 0.0, "overall": 0.0}
+    if args.baseline_eval:
+        FastLanguageModel.for_inference(model)
+        baseline_summary = run_policy_evaluation(
+            model=model,
+            tokenizer=tokenizer,
+            generator_mode=controller.mode,
+            deterministic_generator=not args.non_deterministic_generator,
+            episodes=args.evaluation_episodes,
+            logger=logger,
+            phase="baseline_eval",
+            max_new_tokens=args.eval_max_new_tokens,
+        )
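+        print(f"[baseline_eval] {json.dumps(baseline_summary)}")
+        # Switch the model back out of Unsloth's inference mode before GRPO training starts.
+        FastLanguageModel.for_training(model)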
     training_args = GRPOConfig(
         output_dir=args.output_dir,
         learning_rate=args.learning_rate,
         max_steps=args.max_steps,
         logging_steps=1,
         bf16=args.bf16,
+        report_to=[],
     )
     trainer = GRPOTrainer(
         model=model,
+        reward_funcs=[build_reward_func(curriculum, controller, logger)],
         args=training_args,
         train_dataset=build_dataset(args.dataset_size, controller, curriculum),
     )
     trainer.train()
     model.save_pretrained(args.output_dir)
     tokenizer.save_pretrained(args.output_dir)
+    if args.baseline_eval:
+        FastLanguageModel.for_inference(model)
+        trained_summary = run_policy_evaluation(
+            model=model,
+            tokenizer=tokenizer,
+            generator_mode=controller.mode,
+            deterministic_generator=not args.non_deterministic_generator,
+            episodes=args.evaluation_episodes,
+            logger=logger,
+            phase="trained_eval",
+            max_new_tokens=args.eval_max_new_tokens,
+        )
+        print(f"[trained_eval] {json.dumps(trained_summary)}")
+        print_evaluation_summary(baseline_summary, trained_summary)
+    csv_path = logger.write_csv()
+    logger.close()
+    print(f"[artifacts] reward curve CSV written to {csv_path}")
 def build_parser() -> argparse.ArgumentParser:
     parser = argparse.ArgumentParser(description="GRPO training entrypoint for the ADAPT DSA environment.")
     parser.add_argument("--model-name", default="unsloth/Llama-3.2-3B-Instruct")
+    parser.add_argument("--output-dir", default="outputs_v3")
     parser.add_argument("--dataset-size", type=int, default=200)
     parser.add_argument("--max-steps", type=int, default=250)
     parser.add_argument("--batch-size", type=int, default=1)
     parser.add_argument("--gradient-accumulation-steps", type=int, default=8)
     parser.add_argument("--num-generations", type=int, default=8)
     parser.add_argument("--max-seq-length", type=int, default=2048)
+    parser.add_argument("--max-prompt-length", type=int, default=1024)
     parser.add_argument("--max-completion-length", type=int, default=512)
     parser.add_argument("--learning-rate", type=float, default=5e-6)
     parser.add_argument("--lora-rank", type=int, default=16)
     parser.add_argument("--lora-alpha", type=int, default=16)
     parser.add_argument("--disable-4bit", action="store_true")
     parser.add_argument("--bf16", action="store_true")
+    parser.add_argument("--baseline-eval", action="store_true")
+    parser.add_argument("--evaluation-episodes", type=int, default=20)
+    parser.add_argument("--eval-max-new-tokens", type=int, default=512)
+    parser.add_argument("--disable-wandb", action="store_true")
+    parser.add_argument("--wandb-project", default="adapt-dsa-tutor")
+    parser.add_argument("--wandb-run-name", default=None)
     parser.add_argument(
         "--generator-mode",
         choices=["heuristic", "reward_aware"],
+        default="reward_aware",
+        help="Use heuristic generation or reward-aware family weighting.",
     )
     parser.add_argument(
         "--non-deterministic-generator",

verifier/metrics.py CHANGED

@@ -3,30 +3,54 @@ from __future__ import annotations
 from typing import Any
-def compute_pass_rate(results: list[dict[str, Any]]) -> tuple[float, dict[str, Any]]:
     total = len(results)
     passed = sum(1 for result in results if result["passed"])
     timeout_count = sum(1 for result in results if result["status"] == "timeout")
     runtime_error_count = sum(1 for result in results if result["status"] == "runtime_error")
     invalid_output_count = sum(1 for result in results if result["status"] == "invalid_output_format")
     wrong_answer_count = sum(1 for result in results if result["status"] == "wrong_answer")
     format_ok_count = sum(1 for result in results if result.get("format_ok", False))
-    pass_rate = passed / total if total else 0.0
     format_compliance = format_ok_count / total if total else 0.0
-    timeout_rate = timeout_count / total if total else 0.0
-    runtime_error_rate = runtime_error_count / total if total else 0.0
-    invalid_output_rate = invalid_output_count / total if total else 0.0
-    reward_components = {
-        "correctness": 0.8 * pass_rate,
-        "format": 0.1 * format_compliance,
-        "execution": 0.1 if timeout_count == 0 and runtime_error_count == 0 else 0.0,
-        "timeout_penalty": -0.2 * timeout_rate,
-        "runtime_penalty": -0.1 * runtime_error_rate,
-        "invalid_output_penalty": -0.1 * invalid_output_rate,
-    }
-    reward = max(0.0, min(1.0, sum(reward_components.values())))
     if timeout_count:
         execution_status = "timeout"
@@ -39,10 +63,23 @@ def compute_pass_rate(results: list[dict[str, Any]]) -> tuple[float, dict[str, A
     else:
         execution_status = "completed"
     return reward, {
         "passed": passed,
         "total": total,
         "pass_rate": round(pass_rate, 4),
         "timeout_count": timeout_count,
         "runtime_error_count": runtime_error_count,
         "invalid_output_count": invalid_output_count,
@@ -50,6 +87,8 @@ def compute_pass_rate(results: list[dict[str, Any]]) -> tuple[float, dict[str, A
         "format_compliance": round(format_compliance, 4),
         "execution_status": execution_status,
         "reward_components": {
-            key: round(float(value), 4) for key, value in reward_components.items()
         },
     }

 from typing import Any
+def compute_reward(
+    pass_rate: float,
+    step_number: int,
+    execution_status: str,
+    format_compliance: float,
+) -> float:
+    """
+    Clean, interpretable reward signal for GRPO training.
+    """
+    del format_compliance
+    step_discount = 1.0 if step_number == 1 else (0.85 if step_number == 2 else 0.70)
+    correctness = pass_rate
+    if execution_status == "timeout":
+        return 0.0
+    if execution_status == "syntax_error":
+        return 0.0
+    reward = correctness * step_discount
+    return round(min(max(reward, 0.0), 1.0), 4)
+def compute_pass_rate(
+    results: list[dict[str, Any]],
+    step_number: int = 1,
+) -> tuple[float, dict[str, Any]]:
     total = len(results)
+    hidden_results = [result for result in results if result.get("visibility") == "hidden"]
+    visible_results = [result for result in results if result.get("visibility") == "visible"]
+    hidden_total = len(hidden_results)
+    visible_total = len(visible_results)
+    hidden_passed = sum(1 for result in hidden_results if result["passed"])
+    visible_passed = sum(1 for result in visible_results if result["passed"])
     passed = sum(1 for result in results if result["passed"])
     timeout_count = sum(1 for result in results if result["status"] == "timeout")
     runtime_error_count = sum(1 for result in results if result["status"] == "runtime_error")
     invalid_output_count = sum(1 for result in results if result["status"] == "invalid_output_format")
     wrong_answer_count = sum(1 for result in results if result["status"] == "wrong_answer")
     format_ok_count = sum(1 for result in results if result.get("format_ok", False))
+    hidden_pass_rate = hidden_passed / hidden_total if hidden_total else 0.0
+    visible_pass_rate = visible_passed / visible_total if visible_total else 0.0
+    pass_rate = hidden_pass_rate if hidden_total else (passed / total if total else 0.0)
     format_compliance = format_ok_count / total if total else 0.0
     if timeout_count:
         execution_status = "timeout"
     else:
         execution_status = "completed"
+    reward = compute_reward(
+        pass_rate=pass_rate,
+        step_number=step_number,
+        execution_status=execution_status,
+        format_compliance=format_compliance,
+    )
     return reward, {
         "passed": passed,
         "total": total,
+        "hidden_passed": hidden_passed,
+        "hidden_total": hidden_total,
+        "visible_passed": visible_passed,
+        "visible_total": visible_total,
         "pass_rate": round(pass_rate, 4),
+        "hidden_pass_rate": round(hidden_pass_rate, 4),
+        "visible_pass_rate": round(visible_pass_rate, 4),
         "timeout_count": timeout_count,
         "runtime_error_count": runtime_error_count,
         "invalid_output_count": invalid_output_count,
         "format_compliance": round(format_compliance, 4),
         "execution_status": execution_status,
         "reward_components": {
+            "correctness": round(float(pass_rate), 4),
+            "step_discount": 1.0 if step_number == 1 else (0.85 if step_number == 2 else 0.70),
+            "reward": reward,
         },
     }
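
The reward logic this commit adds to verifier/metrics.py can be exercised in isolation. The sketch below is a standalone restatement for illustration, not the repository's code. `step_discount` and `select_pass_rate` are hypothetical helper names, the `format_compliance` argument (which the diff discards with `del`) is dropped, and the sample results are invented. It shows the two behaviours visible in the diff: the pass rate comes from hidden tests whenever any exist, and the reward is that pass rate scaled by an attempt-number discount, with timeouts and syntax errors scoring zero.

```python
from typing import Any


def step_discount(step_number: int) -> float:
    """Attempt-number discount as written in the diff: 1.0, 0.85, otherwise 0.70."""
    if step_number == 1:
        return 1.0
    return 0.85 if step_number == 2 else 0.70


def compute_reward(pass_rate: float, step_number: int, execution_status: str) -> float:
    # Timeouts and syntax errors score zero regardless of the pass rate.
    if execution_status in ("timeout", "syntax_error"):
        return 0.0
    reward = pass_rate * step_discount(step_number)
    # Clamp to [0, 1] and round to four decimals, matching the diff.
    return round(min(max(reward, 0.0), 1.0), 4)


def select_pass_rate(results: list[dict[str, Any]]) -> float:
    """Hypothetical helper: score on hidden tests if present, else on all tests."""
    hidden = [r for r in results if r.get("visibility") == "hidden"]
    pool = hidden if hidden else results
    if not pool:
        return 0.0
    return sum(1 for r in pool if r["passed"]) / len(pool)


# Invented sample: both visible tests pass, three of four hidden tests pass.
sample_results = [
    {"visibility": "visible", "passed": True},
    {"visibility": "visible", "passed": True},
    {"visibility": "hidden", "passed": True},
    {"visibility": "hidden", "passed": True},
    {"visibility": "hidden", "passed": True},
    {"visibility": "hidden", "passed": False},
]

rate = select_pass_rate(sample_results)        # 0.75, visible tests are ignored
print(compute_reward(rate, 1, "completed"))    # 0.75
print(compute_reward(rate, 2, "completed"))    # 0.6375
print(compute_reward(rate, 3, "completed"))    # 0.525
print(compute_reward(rate, 1, "timeout"))      # 0.0
```

Because visible tests do not enter the pass rate when hidden tests exist, a submission that only fits the examples shown in the prompt earns no credit for them. Any step number other than 1 or 2 falls through to the 0.70 discount.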