agent-evals/leaderboard

benediktstroebl committed on Dec 4, 2024

Commit c27a759 · 1 Parent(s): 1baa168

added cost and heatmap explanation

Files changed (2)

cost_explanation.md +2 -0
heatmap_explanation.md +2 -0

cost_explanation.md ADDED

	@@ -0,0 +1,2 @@


1	+ ## Accuracy vs. Cost Frontier
2	+ This plot shows the trade-off between each agent's accuracy and the total cost of evaluating it on the entire benchmark. Costs are calculated from the input and output token counts of each agent's underlying model and the corresponding token prices. For benchmarks that support it, you can dynamically adjust the token prices using the "Token Pricing Configuration" panel above to see how different pricing models affect the relative cost-effectiveness of different agents. For agents we ran with a per-task cost limit, we indicate the cost limit on the leaderboard. Note: We currently do not consider prompt caching in our cost calculations; total costs are based on the raw token counts.
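As a minimal sketch of the cost calculation described above: the function name `run_cost`, the per-million-token price unit, and all numbers are illustrative assumptions, not taken from the leaderboard's code. It multiplies raw token counts by a price for each direction and ignores prompt caching, as the note states.

```python
def run_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Return the total cost in USD from raw token counts.

    Prices are assumed to be quoted in USD per one million tokens.
    Prompt caching is not modeled: every token is billed at the full price.
    """
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Hypothetical run: 2M input tokens and 500k output tokens.
base_cost = run_cost(2_000_000, 500_000, input_price_per_m=3.0, output_price_per_m=15.0)

# Re-pricing the same token counts, as the pricing panel would allow.
cheaper_cost = run_cost(2_000_000, 500_000, input_price_per_m=1.0, output_price_per_m=4.0)

print(base_cost, cheaper_cost)
```

Because only the prices change between the two calls, an agent's position on the cost axis can shift without rerunning it.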

heatmap_explanation.md ADDED

	@@ -0,0 +1,2 @@


1	+ ## Task success heatmap
2	+ The task success heatmap shows which agents can solve which tasks. Agents are sorted by total accuracy (higher is better); tasks are sorted in increasing order of difficulty (tasks on the left are solved by the most agents; tasks on the right are solved by the fewest). For agents that have been run more than once, the run with the highest score is shown.
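The ordering described above can be sketched as follows. This is an illustration under assumptions, not the leaderboard's implementation: the function name `heatmap_order`, the input shape (a list of `(agent, {task: solved})` pairs), the choice to list the most accurate agent first, and the sample data are all hypothetical.

```python
def heatmap_order(runs):
    """Order agents and tasks for a task success heatmap.

    runs: list of (agent_name, {task_id: bool}) pairs, one per run.
    Returns (agents, tasks, matrix), where matrix[i][j] is 1 if
    agents[i] solved tasks[j] in its best run, else 0.
    """
    # Keep only the highest-scoring run for each agent.
    best = {}
    for agent, results in runs:
        score = sum(results.values()) / len(results)
        if agent not in best or score > best[agent][0]:
            best[agent] = (score, results)

    # Agents: highest accuracy first (assumed display order).
    agents = sorted(best, key=lambda a: best[a][0], reverse=True)

    # Tasks: solved by the most agents first, i.e. easiest on the left.
    tasks = sorted({t for _, results in best.values() for t in results})
    solved_by = {t: sum(best[a][1].get(t, False) for a in agents) for t in tasks}
    tasks = sorted(tasks, key=lambda t: solved_by[t], reverse=True)

    matrix = [[int(best[a][1].get(t, False)) for t in tasks] for a in agents]
    return agents, tasks, matrix


# Hypothetical data: agent_a has two runs; only the better one is shown.
runs = [
    ("agent_a", {"t1": True, "t2": False, "t3": False}),
    ("agent_a", {"t1": True, "t2": True, "t3": False}),
    ("agent_b", {"t1": True, "t2": False, "t3": False}),
    ("agent_c", {"t1": True, "t2": True, "t3": True}),
]
agents, tasks, matrix = heatmap_order(runs)
print(agents, tasks, matrix)
```

Ties are broken by the stable sort, so tasks solved by the same number of agents stay in alphabetical order here.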