benediktstroebl committed on
Commit
c27a759
·
1 Parent(s): 1baa168

added cost and heatmap explanation

Files changed (2)
  1. cost_explanation.md +2 -0
  2. heatmap_explanation.md +2 -0
cost_explanation.md ADDED
@@ -0,0 +1,2 @@
 
 
 
+ ## Accuracy vs. Cost Frontier
+ This plot shows the trade-off between each agent's accuracy and the total cost of evaluating it on the entire benchmark. Costs are calculated from the input and output token counts for each agent's underlying model. For benchmarks that support it, you can adjust the token prices in the "Token Pricing Configuration" panel above to see how different pricing models affect the relative cost-effectiveness of the agents. For agents we ran with a per-task cost limit, the limit is indicated on the leaderboard. *Note:* Our cost calculations currently do not account for prompt caching; total costs are based on the raw token counts.
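
As a rough illustration of the calculation described above, the sketch below computes a run's total cost from raw token counts and per-million-token prices, then picks out the agents on an accuracy vs. cost frontier. The names `AgentRun`, `total_cost`, and `pareto_frontier` are hypothetical, and reading "frontier" as a Pareto frontier (no other agent is both cheaper and more accurate) is an assumption; this is not the leaderboard's actual code.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    # Hypothetical record of one agent's full-benchmark evaluation.
    name: str
    accuracy: float
    input_tokens: int
    output_tokens: int


def total_cost(run, input_price_per_m, output_price_per_m):
    """Cost in dollars from raw token counts (no prompt-caching discount).

    Prices are assumed to be quoted per one million tokens.
    """
    return (
        run.input_tokens * input_price_per_m
        + run.output_tokens * output_price_per_m
    ) / 1_000_000


def pareto_frontier(points):
    """Return names of points not dominated by a cheaper, more accurate one.

    `points` is a list of (name, cost, accuracy) tuples.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, higher accuracy first.
    for name, cost, accuracy in sorted(points, key=lambda p: (p[1], -p[2])):
        if accuracy > best_accuracy:
            frontier.append(name)
            best_accuracy = accuracy
    return frontier
```

Changing `input_price_per_m` and `output_price_per_m` and recomputing plays the same role as adjusting prices in the pricing panel: the token counts stay fixed while the costs, and possibly the frontier, shift.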
heatmap_explanation.md ADDED
@@ -0,0 +1,2 @@
 
 
 
+ ## Task success heatmap
+ The task success heatmap shows which agents can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks are sorted in increasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right are solved by the fewest). For agents that have been run more than once, the run with the highest score is shown.
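
The ordering rules above can be sketched as follows. This is a minimal illustration under stated assumptions: `build_heatmap` and its input layout (agent name mapped to a list of runs, each run mapping task id to a success flag) are hypothetical, agents are assumed to be listed from most to least accurate, and ties are broken alphabetically.

```python
def build_heatmap(runs_by_agent):
    """Return (agent_order, task_order, matrix) for a task success heatmap.

    `runs_by_agent` maps an agent name to a list of runs; each run maps a
    task id to True/False. This layout is an assumption for illustration.
    """
    # For agents with several runs, keep only the highest-scoring run.
    best = {
        agent: max(runs, key=lambda run: sum(run.values()))
        for agent, runs in runs_by_agent.items()
    }

    tasks = sorted({task for run in best.values() for task in run})

    # Count how many agents solve each task; more solvers means an easier task.
    solved_by = {
        task: sum(bool(run.get(task, False)) for run in best.values())
        for task in tasks
    }

    # Easiest tasks first (left), hardest last (right).
    task_order = sorted(tasks, key=lambda t: (-solved_by[t], t))

    # Agents with the most solved tasks first (assumed direction).
    agent_order = sorted(best, key=lambda a: (-sum(best[a].values()), a))

    matrix = [
        [int(bool(best[agent].get(task, False))) for task in task_order]
        for agent in agent_order
    ]
    return agent_order, task_order, matrix
```

Each row of `matrix` corresponds to one agent and each column to one task, with 1 marking a solved task, so the filled cells cluster toward the top left.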