metadata

title: ReTool Implementation
emoji: 🔧
colorFrom: blue
colorTo: purple
sdk: static
app_file: README.md
pinned: false
license: mit
tags:
  - reinforcement-learning
  - tool-use
  - code-interpreter
  - mathematical-reasoning
  - rl-training
  - ppo
  - research-implementation
language: en
library_name: transformers

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

A PyTorch implementation of ReTool from the paper "ReTool: Reinforcement Learning for Strategic Tool Use in LLMs" by Feng et al. (2025).

ReTool enhances long-form reasoning by integrating code interpreter execution into the RL training loop, enabling models to learn when and how to invoke computational tools for mathematical problem solving.

Figure 2: Comparison of standard text-based RL vs ReTool's code-integrated training process

🚀 Key Features

Multi-turn Generation: Dynamic code execution during reasoning with KV-cache optimization
Strategic Tool Use: Learns when and how to invoke code interpreters through RL
Interpreter Masking: Excludes external tool outputs from gradient computation
Built on HuggingFace Transformers: Batched tensor handling, with distributed training support planned (see Current Status below)

📊 Performance

Figure 1: ReTool achieves 67% accuracy on AIME 2024, significantly outperforming text-based RL (40%)

🛠️ Installation

git clone https://github.com/yourusername/retool-implementation.git
cd retool-implementation/scr
pip install -r requirements.txt

🚧 Current Status

This is a research implementation based on the ReTool paper. The core components are implemented but not yet fully tested.

What's Implemented ✅

Multi-turn generation with KV-cache optimization
Interpreter token masking for RL training
Modified PPO loss computation
Complete training pipeline structure
Proper tensor handling and batching

What Needs Testing/Integration 🔧

End-to-end training verification
Code execution sandbox integration
Edge case handling for truncated sequences
Memory optimization for large models

For Researchers & Developers

This implementation serves as a foundation for:

Understanding ReTool's architecture
Building upon the multi-turn generation approach
Integrating custom code execution environments
Extending to other tool-use scenarios

📊 Dataset Format

Your dataset should contain dictionaries with:

{
    "prompt": "Solve this math problem: ...",
    "answer": "42"  # Ground truth for reward computation
}
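
As a rough illustration of how the "answer" field could feed a binary reward, here is a minimal sketch. It assumes the model writes its final answer as \boxed{...} and that an exact string match is good enough; both are assumptions for illustration, and the helper names extract_final_answer and binary_reward are hypothetical rather than part of this repository.

```python
import re


def extract_final_answer(completion: str):
    """Return the content of the last \\boxed{...} in the text, or None.

    Assumption: answers are boxed and contain no nested braces.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def binary_reward(completion: str, answer: str,
                  correct: float = 1.0, incorrect: float = 0.0) -> float:
    """Reward `correct` on an exact match with the ground truth, else `incorrect`.

    The reward values are parameters because the exact values are a design choice.
    """
    predicted = extract_final_answer(completion)
    if predicted is not None and predicted == answer.strip():
        return correct
    return incorrect


# Hypothetical dataset record in the format described above.
example = {"prompt": "Solve this math problem: what is 6 * 7?", "answer": "42"}
reward = binary_reward("6 * 7 = 42, so the answer is \\boxed{42}.", example["answer"])
```

A real setup would likely need a more tolerant equivalence check (for example, treating "42.0" and "42" as equal), which this sketch does not attempt.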

🔍 How It Works

Multi-turn Generation: Model generates reasoning step-by-step
Code Detection: When the model emits a closing </code> tag, generation pauses and the enclosed code is extracted and executed
Tool Integration: The execution result is appended to the context as <interpreter>result</interpreter>
Continued Reasoning: Model continues with tool feedback
Reward Computation: Binary reward based on final answer correctness
RL Training: PPO updates exclude interpreter tokens from loss
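
The generation steps above can be sketched as a simple loop. This is an illustrative outline, not the code in _retool_generate_with_interpreter(): generate_fn stands in for the model (assumed to stop after emitting </code> or when it is finished), and run_code_unsafe is a toy interpreter built on exec(), which is not a sandbox and should never be used on untrusted model output. All names here are hypothetical.

```python
import contextlib
import io
import re

CODE_BLOCK = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_code_unsafe(code: str) -> str:
    """Toy interpreter: run code with exec() and capture stdout. NOT a sandbox."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:
        # Report errors back to the model as text instead of crashing the rollout.
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def rollout(prompt: str, generate_fn, max_turns: int = 10) -> str:
    """Alternate between model generation and code execution for up to max_turns."""
    context = prompt
    for _ in range(max_turns):
        segment = generate_fn(context)
        context += segment
        if not segment.rstrip().endswith("</code>"):
            break  # No tool call at the end of this turn: the model is done.
        blocks = CODE_BLOCK.findall(segment)
        if not blocks:
            break  # Closing tag without a matching opening tag: stop here.
        result = run_code_unsafe(blocks[-1])
        context += f"<interpreter>{result}</interpreter>"
    return context


def scripted_model(context: str) -> str:
    """Stand-in for a real model: calls the tool once, then answers."""
    if "<interpreter>" not in context:
        return "Let me compute this. <code>print(6 * 7)</code>"
    return " The answer is \\boxed{42}."


transcript = rollout("What is 6 * 7? ", scripted_model)
```

In a real training run the exec() call would be replaced by a request to an isolated execution service, and the interpreter output would be tokenized and tracked so it can be masked out of the loss.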

⚙️ Key Components

ReToolTrainer Class

_retool_generate_with_interpreter(): Multi-turn generation with tool execution
_create_interpreter_mask(): Creates masks for excluding tool outputs
_compute_loss(): Modified PPO loss with interpreter masking
_compute_rewards_and_advantages(): Binary reward computation
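
To make the masking idea concrete, here is a framework-free sketch of what an interpreter mask and a masked clipped-PPO loss could look like. It is not the body of _create_interpreter_mask() or _compute_loss(): it uses plain Python lists instead of tensors, and it assumes (for illustration only) that <interpreter> and </interpreter> each map to a single token ID, passed in as open_id and close_id.

```python
def create_interpreter_mask(token_ids, open_id, close_id):
    """Return 1 for model-generated tokens and 0 for interpreter output.

    The <interpreter> and </interpreter> tag tokens are masked as well,
    since they are appended by the environment rather than the model.
    """
    mask = []
    inside = False
    for tok in token_ids:
        if tok == open_id:
            inside = True
        mask.append(0 if inside else 1)
        if tok == close_id:
            inside = False
    return mask


def clipped_ppo_token_losses(ratios, advantages, eps=0.2):
    """Per-token clipped surrogate loss: -min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    losses = []
    for r, a in zip(ratios, advantages):
        clipped = max(1 - eps, min(r, 1 + eps))
        losses.append(-min(r * a, clipped * a))
    return losses


def masked_mean(values, mask):
    """Average only over positions where mask == 1."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(v * m for v, m in zip(values, mask)) / kept


# Hypothetical token IDs: 100 = <interpreter>, 101 = </interpreter>.
ids = [5, 6, 100, 7, 8, 101, 9]
mask = create_interpreter_mask(ids, open_id=100, close_id=101)
losses = clipped_ppo_token_losses([1.0] * len(ids), [1.0] * len(ids))
loss = masked_mean(losses, mask)
```

Averaging over only the unmasked positions keeps the gradient signal tied to tokens the policy actually produced, which is the point of interpreter masking.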

Configuration Options

trainer = ReToolTrainer(
    # ... model and data ...
    max_turns=10,                     # Maximum reasoning turns
    temperature=0.7,                  # Generation temperature
    max_completion_length=1024,       # Max tokens per turn
    mask_truncated_completions=True,  # Exclude truncated (incomplete) completions from the loss
)

💡 Usage Example (Conceptual)

from retool_trainer import ReToolTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

# This shows the intended API - full testing in progress
trainer = ReToolTrainer(
    model=AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct"),
    processing_class=AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct"),
    args=TrainingArguments(...),
    train_dataset=your_math_dataset,
    max_turns=10,
)

# trainer.train()  # Full integration testing in progress

📈 Results From Paper

AIME 2024: 67% accuracy (vs 40% text-based RL)
AIME 2025: 49.3% accuracy (vs 36.7% text-based RL)
Training Efficiency: 400 training steps vs 1080 for the text-based RL baseline
Token Efficiency: 40% reduction in response length

🚧 Limitations & TODOs

Code execution sandbox integration
Support for multiple reward functions
Advanced error handling for malformed code
Distributed training optimizations
Tool selection beyond code interpreter

📚 Citation

@article{feng2025retool,
  title={ReTool: Reinforcement Learning for Strategic Tool Use in LLMs},
  author={Feng, Jiazhan and Huang, Shijue and Qu, Xingwei and Zhang, Ge and Qin, Yujia and Zhong, Baoquan and Jiang, Chengquan and Chi, Jinxin and Zhong, Wanjun},
  journal={arXiv preprint arXiv:2504.11536},
  year={2025}
}

📄 License

MIT License - see LICENSE file for details.

🤝 Collaboration Welcome (But Not Required)

I'm perfectly happy working on this solo, but collaboration can be rewarding when there's mutual value and good fit.

🛠️ Areas Where I'd Value Expertise

Distributed Sandbox Engineering:

Asynchronous code execution environment with load balancing
Worker pool architecture for parallel code execution
Systems engineering and containerization expertise

Dataset Engineering:

Mathematical reasoning dataset curation and validation
Cold-start data pipeline design
Quality control and formatting workflows

🚀 Collaboration Approach

Start small: Open an issue to discuss your approach first
Show, don't tell: Small proof-of-concept before larger contributions
Quality focused: Code review and documentation required
Clear attribution: All substantial contributors get proper credit

💰 The Compute Reality

Full training requires significant resources:

~8x A100s for complete AIME validation
Currently exploring compute sponsorship options
Happy to validate on smaller models first

🎯 What I'm Looking For

People who bring complementary skills (not just ML knowledge)
Contributors who can work independently and deliver quality
Collaborative mindset without drama or politics

Interested? Open an issue with your background and what you'd like to work on. Let's see if there's a good fit!

No pressure though - I genuinely enjoy the solo research implementation process too. 😊

🙏 Acknowledgments

Original paper authors for the ReTool framework
HuggingFace team for the transformers library
TRL team for GRPO implementation patterns

Built with ❤️ for advancing AI reasoning capabilities