𝗔𝗯𝘀𝗼𝗹𝘂𝘁𝗲 𝗭𝗲𝗿𝗼: 𝗟𝗟𝗠𝘀 𝗰𝗮𝗻 𝘁𝗿𝗮𝗶𝗻 𝘄𝗶𝘁𝗵𝗼𝘂𝘁 𝗮𝗻𝘆 𝗲𝘅𝘁𝗲𝗿𝗻𝗮𝗹 𝗱𝗮𝘁𝗮 🤯
Has the "data wall" just been breached?
Recent RL paradigms have often relied on sets of questions and answers that need to be manually curated. Researchers from Tsinghua University went like "why though?"
🤔 Indeed, why learn from questions designed by a human teacher, when the model can start from its base knowledge and learn by experimenting in a code environment, proposing coding tasks itself and trying to solve them?
Thus they created "Absolute Zero Reasoning" (AZR), an approach that removes any need for human-curated data.
🎭 𝗗𝘂𝗮𝗹 𝗿𝗼𝗹𝗲𝘀:
‣ Proposer: Generates challenging but solvable coding tasks
‣ Solver: Attempts to solve those self-proposed tasks
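The propose/solve loop above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the `llm` and `run` callables are stand-ins for the model and a sandboxed code executor, and the reward shaping is heavily simplified compared to AZR's actual learnability reward.

```python
# Hypothetical sketch of AZR-style self-play. `llm` (the model) and `run`
# (a sandboxed executor) are invented stand-ins; rewards are simplified.

def self_play_step(llm, run, buffer):
    # Proposer: the model invents a new coding task, conditioned on recent ones.
    task = llm(f"Propose a new (program, input) task unlike: {buffer[-3:]}")
    reference_output = run(task)  # ground truth comes from execution, not humans

    # Solver: the same model attempts its own task.
    answer = llm(f"Predict the output of: {task}")
    solver_reward = 1.0 if answer == reference_output else 0.0

    # Proposer is rewarded for tasks that are valid but not trivially solved,
    # keeping difficulty near the edge of the model's current ability.
    proposer_reward = 1.0 - solver_reward if reference_output is not None else 0.0

    buffer.append(task)
    return solver_reward, proposer_reward
```

Note the key point this sketch captures: because the environment (code execution) supplies the ground truth, no human-labeled answers are needed anywhere in the loop.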
🧪 𝗧𝗵𝗿𝗲𝗲 𝘁𝗮𝘀𝗸 𝘁𝘆𝗽𝗲𝘀: all types are defined as triplets of program, input, and output
‣ Deduction: Give the model an input and a program; it must deduce the output
‣ Abduction: Give the model a program and an output; it must find an input that produces that output
‣ Induction: Synthesize a program from input/output pairs
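To make the three task types concrete, here is a toy triplet; the program and values are invented for illustration, not taken from the paper:

```python
# One (program, input, output) triplet, with all names invented for this sketch.

def program(x: int) -> int:
    """A toy task program."""
    return x * x + 1

triplet_input, triplet_output = 4, program(4)  # the triplet: (program, 4, 17)

# Deduction: given (program, input), predict the output.
deduced = program(triplet_input)  # -> 17

# Abduction: given (program, output), find an input producing that output.
abduced = next(x for x in range(100) if program(x) == triplet_output)  # -> 4

# Induction: given input/output pairs, synthesize a program that fits them.
pairs = [(x, program(x)) for x in range(5)]
induced = lambda x: x * x + 1  # a candidate the solver might propose
assert all(induced(x) == y for x, y in pairs)
```

Each mode hides a different element of the same triplet, so one executable environment yields three complementary reasoning exercises.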
Btw this reminded me of my long-forgotten philosophy classes: Aristotle was more on the induction side, learning from real-world examples, while Plato was more on the deduction side, trying to get quite far from a single premise through reasoning alone.
📊 𝗥𝗲𝘀𝘂𝗹𝘁𝘀:
‣ AZR post-training brings a clear improvement to known models like Qwen2.5-7B
‣ Shows strong cross-domain transfer: coding ↔️ math reasoning
🧠 𝗢𝘁𝗵𝗲𝗿 𝗳𝗶𝗻𝗱𝗶𝗻𝗴𝘀:
‣ A stronger base model (general or code-specific) amplifies the gains from Absolute Zero Reasoning
‣ The researchers warn about "uh-oh moments" (a wink at DeepSeek's "aha moments") where the model generates concerning goals like "make an extremely convoluted code to outsmart all these humans": so supervision is still needed!
Paper here: Absolute Zero: Reinforced Self-play Reasoning with Zero Data (2505.03335)