Abstract
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
Community
LINKS
paper: https://arxiv.org/abs/2505.03335
project page: https://andrewzh112.github.io/absolute-zero-reasoner/
code: https://github.com/LeapLabTHU/Absolute-Zero-Reasoner
models: https://huggingface.co/collections/andrewzh/absolute-zero-reasoner-68139b2bca82afb00bc69e5b
logs: https://wandb.ai/andrewzhao112/AbsoluteZeroReasoner
Twitter thread: https://x.com/AndrewZ45732491/status/1919920459748909288
made an audio overview for this:
ciao
ciao
Hi everyone! Thanks for this great paper, great idea and execution! I wrote this summary:
LLMs can train without any external data 🤯
Has the "data wall" just been breached?
Recent RL paradigms often relied on a set of questions and answers that need to be manually curated. Researchers from Tsinghua University went: "why, though?"
🤔 Indeed, why learn from questions designed by a human teacher, when the model can start from its base knowledge and learn by experimenting in a code environment, proposing coding tasks itself and trying to solve them?
Thus they created the "Absolute Zero Reasoner" (AZR), an approach that removes any need for human-curated data.
Dual roles:
‣ Proposer: Generates challenging but solvable coding tasks
‣ Solver: Attempts to solve those self-proposed tasks (a rough loop sketch follows right after this list)
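As a rough mental model, here is a sketch of my own of one self-play step (the method names on `model` and `executor` are hypothetical stand-ins, not the paper's actual API):

```python
# Hypothetical sketch of one Absolute Zero self-play step.
# `propose_task`, `is_valid`, `solve`, and `verify` are illustrative names.

def self_play_step(model, executor, task_history):
    # Proposer role: the model proposes a new coding task,
    # conditioned on tasks it has generated before.
    task = model.propose_task(task_history)

    # The code executor validates the proposed task
    # (e.g. the program actually runs and yields a deterministic output).
    if not executor.is_valid(task):
        return 0.0

    # Solver role: the same model attempts its own task, and the
    # executor verifies the answer to produce an outcome reward.
    answer = model.solve(task)
    solve_reward = 1.0 if executor.verify(task, answer) else 0.0

    # In the paper, the proposer is also rewarded for tasks that maximize
    # the solver's learning progress (challenging but still solvable).
    task_history.append(task)
    return solve_reward
```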
Three task types: all are defined as triplets of (program, input, output); see the sketch after this list
‣ Deduction: Give the model an input and a program; it must deduce the output
‣ Abduction: Give the model a program and an output; it must find an input that produces that output
‣ Induction: Synthesize a program from input/output pairs
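To make the triplet idea concrete, here is a minimal sketch of my own (not the authors' code) of how a Python executor can check an answer in each mode, assuming each task's program defines a single function `f` and using made-up example programs:

```python
# Each task is a (program, input, output) triplet; the executor runs the
# program to verify answers for deduction, abduction, and induction.

def run(program: str, fn_input):
    """Execute a program string that defines `f`, then call f(fn_input)."""
    namespace = {}
    exec(program, namespace)
    return namespace["f"](fn_input)

program = "def f(x):\n    return sorted(x)[::-1]"
x, y = [3, 1, 2], [3, 2, 1]

# Deduction: given (program, input), the solver predicts the output.
assert run(program, x) == y

# Abduction: given (program, output), the solver proposes an input
# that reproduces the output.
guessed_input = [2, 1, 3]  # the solver's proposed input
assert run(program, guessed_input) == y

# Induction: given input/output pairs, the solver synthesizes a program.
synthesized = "def f(x):\n    return sorted(x, reverse=True)"
assert all(run(synthesized, i) == o for i, o in [(x, y), ([5, 4], [5, 4])])
```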
Btw this reminded me of my long-forgotten philosophy classes: Aristotle was more on the induction side, learning from real-world analogies, while Plato was more on the deduction side, trying to progress quite far with just one input and his reasoning.
Results:
‣ AZR post-training yields a nice improvement on known models like Qwen2.5-7B
‣ Shows strong cross-domain transfer: coding ↔ math reasoning
Other findings:
‣ A stronger base model (general or code-specific) amplifies the gains from Absolute Zero Reasoning
‣ The researchers warn about "uh-oh moments" (a wink at DeepSeek's "aha moments") where the model generates concerning goals like "make an extremely convoluted code to outsmart all these humans": so supervision is still needed!
This might signal a new era where models learn through experimentation rather than supervision! Maybe the solution to the "data wall" that many LLM companies are facing?