LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces Paper • 2602.14337 • Published 21 days ago • 13
Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs Paper • 2602.21198 • Published 12 days ago • 4
On Data Engineering for Scaling LLM Terminal Capabilities Paper • 2602.21193 • Published 12 days ago • 91
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs Paper • 2602.12705 • Published 23 days ago • 65
Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs Paper • 2602.10388 • Published 26 days ago • 240
SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks Paper • 2602.12670 • Published 23 days ago • 54
Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training Paper • 2602.07824 • Published 28 days ago • 16
How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs Paper • 2602.08808 • Published 27 days ago • 8
SAGE: Benchmarking and Improving Retrieval for Deep Research Agents Paper • 2602.05975 • Published about 1 month ago • 12
SWE-World: Building Software Engineering Agents in Docker-Free Environments Paper • 2602.03419 • Published Feb 3 • 40
PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR Paper • 2601.18207 • Published Jan 26 • 19
TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents Paper • 2602.02196 • Published Feb 2 • 35
UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing Paper • 2602.02437 • Published Feb 2 • 77
Scientific Image Synthesis: Benchmarking, Methodologies, and Downstream Utility Paper • 2601.17027 • Published Jan 17 • 41
Innovator-VL: A Multimodal Large Language Model for Scientific Discovery Paper • 2601.19325 • Published Jan 27 • 79
MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods Paper • 2601.21821 • Published Jan 29 • 60
Idea2Story: An Automated Pipeline for Transforming Research Concepts into Complete Scientific Narratives Paper • 2601.20833 • Published Jan 28 • 182
MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning Paper • 2601.21468 • Published Jan 29 • 25
PaperBanana: Automating Academic Illustration for AI Scientists Paper • 2601.23265 • Published Jan 30 • 213