Add pipeline tag, GitHub link, and improved model description
#1 opened by nielsr (HF Staff)
README.md
CHANGED
|
@@ -1,5 +1,6 @@
|
|
| 1 |
---
|
| 2 |
license: mit
|
|
|
|
| 3 |
---
|
| 4 |
|
| 5 |
<h1 align="center">
|
|
@@ -23,8 +24,29 @@ Carnegie Mellon University, Language Technologies Institute
|
|
| 23 |
|
| 24 |
</div>
|
| 25 |
|
|
|
|
| 26 |
|
| 27 |
-
This repository contains mid-training related checkpoints in the extrapolation tasks.
|
| 28 |
|
| 29 |
## Citation
|
| 30 |
|
|
|
|
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
+
pipeline_tag: text-generation
|
| 4 |
---
|
| 5 |
|
| 6 |
<h1 align="center">
|
|
|
|
| 24 |
|
| 25 |
</div>
|
| 26 |
|
| 27 |
+
## Does Reinforcement Learning Truly Extend Reasoning?
|
| 28 |
|
| 29 |
+
This repository contains mid-training checkpoints for the extrapolation tasks. The accompanying work examines conflicting views on whether RL extends the reasoning abilities of language models: some characterize RL as a capability refiner, while others see it as inducing new compositional skills. The disagreement stems from a lack of control in modern training pipelines, which this work addresses through controlled analysis.
|
| 30 |
+
|
| 31 |
+
## Overview
|
| 32 |
+
|
| 33 |
+
Our paper builds a fully controlled experimental framework to analyze how pre-training, mid-training, and RL-based post-training jointly shape the reasoning abilities of language models. Using synthetic math-style reasoning tasks with explicit atomic operations and process-verifiable reasoning traces, we study:
|
| 34 |
+
|
| 35 |
+
* **Extrapolative generalization** to more complex compositions (deeper dependency graphs).
|
| 36 |
+
* **Contextual generalization** across diverse surface forms and linguistic contexts.
|
| 37 |
+
* How **RL interacts** with prior knowledge, and when it yields **genuine capability gains** beyond pre-training.
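
The task setup described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' code: the operation set `OPS`, the `Step` record, and the helpers `make_task` and `verify_trace` are hypothetical names, and the dependency graph is simplified to a single chain whose length stands in for depth.

```python
from __future__ import annotations

import random
from dataclasses import dataclass

# Hypothetical atomic operations; the paper's actual operation set is not stated here.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


@dataclass
class Step:
    name: str   # variable defined by this step
    op: str     # key into OPS
    left: str   # name of the earlier variable this step depends on
    right: int  # constant operand
    value: int  # result recorded in the reasoning trace


def make_task(depth: int, seed: int = 0) -> tuple[int, list[Step]]:
    """Build a chain-shaped dependency graph with `depth` atomic operations."""
    rng = random.Random(seed)
    start = rng.randint(1, 9)
    steps: list[Step] = []
    prev_name, prev_value = "x0", start
    for i in range(1, depth + 1):
        op = rng.choice(sorted(OPS))
        const = rng.randint(1, 9)
        value = OPS[op](prev_value, const)
        steps.append(Step(f"x{i}", op, prev_name, const, value))
        prev_name, prev_value = f"x{i}", value
    return start, steps


def verify_trace(start: int, steps: list[Step]) -> bool:
    """Check every intermediate step, not only the final answer."""
    env = {"x0": start}
    for s in steps:
        if s.left not in env or OPS[s.op](env[s.left], s.right) != s.value:
            return False
        env[s.name] = s.value
    return True
```

In this sketch, extrapolative generalization would correspond to evaluating on a larger `depth` than was seen in training, and `verify_trace` stands in for process-level verification of a reasoning trace.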
|
| 38 |
+
|
| 39 |
+
## 🧠 Key findings
|
| 40 |
+
<div align="center">
|
| 41 |
+
<h1 align="center">
|
| 42 |
+
<img src="assets/findings.png" width="500" />
|
| 43 |
+
</h1>
|
| 44 |
+
</div>
|
| 45 |
+
A comic summary generated by NotebookLM is available [here](assets/Interplay-LM-Reasoning.pdf).
|
| 46 |
+
|
| 47 |
+
## Code
|
| 48 |
+
|
| 49 |
+
The code and data for this work will be released soon at the following GitHub repository: [https://github.com/Interplay-LM-Reasoning/Interplay-LM-Reasoning](https://github.com/Interplay-LM-Reasoning/Interplay-LM-Reasoning)
|
| 50 |
|
| 51 |
## Citation
|
| 52 |
|