SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios Paper • 2512.18470 • Published 6 days ago • 8
Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows Paper • 2512.16969 • Published 9 days ago • 105
MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments Paper • 2512.19432 • Published 4 days ago • 10
QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models Paper • 2512.19526 • Published 4 days ago • 10
Reinforcement Learning for Self-Improving Agent with Skill Library Paper • 2512.17102 • Published 8 days ago • 20
Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision Paper • 2512.15489 • Published 9 days ago • 6
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows Paper • 2512.13168 • Published 12 days ago • 49
The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality Paper • 2512.10791 • Published 15 days ago • 7
Evaluating Gemini Robotics Policies in a Veo World Simulator Paper • 2512.10675 • Published 15 days ago • 16
Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale Paper • 2512.10398 • Published 16 days ago • 6
Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving Paper • 2512.10739 • Published 15 days ago • 45
RefineBench: Evaluating Refinement Capability of Language Models via Checklists Paper • 2511.22173 • Published 30 days ago • 13