ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models Paper • 2310.10505 • Published Oct 16, 2023 • 3
Spectral Policy Optimization: Coloring your Incorrect Reasoning in GRPO Paper • 2505.11595 • Published May 16, 2025 • 1
TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling Paper • 2508.17445 • Published Aug 24, 2025 • 80
Bridging Formal Language with Chain-of-Thought Reasoning to Geometry Problem Solving Paper • 2508.09099 • Published Aug 12, 2025
Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment Paper • 2505.04113 • Published May 7, 2025
Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation Paper • 2509.25849 • Published Sep 30, 2025 • 47
Scaling Flaws of Verifier-Guided Search in Mathematical Reasoning Paper • 2502.00271 • Published Feb 1, 2025 • 1
OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation Paper • 2505.23885 • Published May 29, 2025
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models Paper • 2508.06471 • Published Aug 8, 2025 • 195
Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents Paper • 2509.09265 • Published Sep 11, 2025 • 47
SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning Paper • 2509.02479 • Published Sep 2, 2025 • 83
FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction Paper • 2508.11987 • Published Aug 16, 2025 • 71
RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques Paper • 2501.14492 • Published Jan 24, 2025 • 29
SnapKV: LLM Knows What You are Looking for Before Generation Paper • 2404.14469 • Published Apr 22, 2024 • 27