ReliableMath-Leaderboard / ReliableMath.tsv
AmourWaltz
model	size	prompt	Prec.Avg	Prud.Avg	Prec.(A)	Prud.(A)	Len.(A)	Prec.(U)	Prud.(U)	Len.(U)
deepseek-ai/DeepSeek-R1	671	Reliable	0.642	0.004	0.735	0.000	3.81k	0.549	0.007	4.40k
OpenAI/o3-mini	???	Reliable	0.504	0.006	0.716	0.006	1.57k	0.293	0.005	4.20k
deepseek-ai/DeepSeek-V3	671	Reliable	0.521	0.001	0.665	0.000	1.34k	0.377	0.003	1.50k
OpenAI/GPT-4o	???	Reliable	0.397	0.015	0.460	0.006	0.58k	0.335	0.025	0.60k
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B	32	Reliable	0.551	0.001	0.684	0.000	5.05k	0.418	0.002	9.40k
deepseek-ai/DeepSeek-R1-Distill-Qwen-14B	14	Reliable	0.547	0.000	0.629	0.000	6.23k	0.465	0.001	11.00k
deepseek-ai/DeepSeek-R1-Distill-Qwen-7B	7	Reliable	0.289	0.000	0.575	0.000	6.24k	0.003	0.000	6.60k
deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B	1.5	Reliable	0.198	0.000	0.396	0.000	9.37k	0.000	0.000	9.70k
Qwen/Qwen2.5-Math-7B-Instruct	7	Reliable	0.266	0.000	0.505	0.000	0.82k	0.027	0.000	0.90k
Qwen/Qwen2.5-Math-1.5B-Instruct	1.5	Reliable	0.218	0.000	0.422	0.000	0.74k	0.015	0.000	0.80k