Just shipped reconcile_gst2b_env at OpenEnv Hackathon 2026 (Meta x Scaler India).
An RL environment for the monthly GST tax reconciliation that 14M Indian businesses do by hand. Trained Qwen3-4B with SFT + GRPO, using a custom Tier 2c length-shaping reward modification. Headline: mean composite reward 0.305 (n=5), +69% over the prompted baseline.
5 documented failure modes, including a novel research finding: the SAME composite reward design that defends against 6 red-team attacks ALSO makes a 3-step shortcut score higher than 50 steps of honest training. Empirically confirmed on-site (step-350 mean > step-375 mean).
Live demo + repo + writeup linked below.
Live demo + repo: huggingface.co/spaces/akashkathole/reconcile_gst2b_env
Video: youtube.com/watch?v=K-sZ8c1TMjw
Writeup: BLOG.md in the Space
akashkathole/reconcile_gst2b_env