
Accuracy vs. Cost Frontier

This plot shows the trade-off between each agent's accuracy and the total cost of evaluating it on the entire benchmark. Costs are calculated from the input and output token counts of each agent's underlying model. For benchmarks that support it, you can dynamically adjust the token prices in the "Token Pricing Configuration" panel above to see how different pricing models affect the relative cost-effectiveness of the agents. For agents that we ran with a per-task cost limit, the limit is indicated on the leaderboard. Note: Our cost calculations currently do not account for prompt caching; total costs are based on the raw token counts.
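As an illustration of the calculation described above, here is a minimal sketch of how a total cost could be derived from raw token counts. The names `TokenUsage` and `total_cost`, the per-million-token price unit, and the example prices are all assumptions made for this sketch, not the leaderboard's actual code or any provider's real rates.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Raw token counts summed over every task in a benchmark run (hypothetical type)."""
    input_tokens: int
    output_tokens: int


def total_cost(usage: TokenUsage,
               input_price_per_million: float,
               output_price_per_million: float) -> float:
    """Return the cost in dollars from raw token counts.

    Assumes prices are quoted per one million tokens. Prompt caching is
    ignored, so every input token is billed at the full input price.
    """
    return (usage.input_tokens * input_price_per_million
            + usage.output_tokens * output_price_per_million) / 1_000_000


# Hypothetical run: 2M input tokens and 500k output tokens.
usage = TokenUsage(input_tokens=2_000_000, output_tokens=500_000)

# Hypothetical prices: $3 per 1M input tokens, $15 per 1M output tokens.
cost = total_cost(usage, input_price_per_million=3.0, output_price_per_million=15.0)

# Changing the prices, as the pricing panel allows, reprices the same token counts.
repriced = total_cost(usage, input_price_per_million=1.0, output_price_per_million=2.0)

print(f"cost: ${cost:.2f}, repriced: ${repriced:.2f}")
```

Because the cost depends only on stored token counts and the current prices, adjusting the prices does not require rerunning any agent.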