| Model | Size (B) | Prompt | Prec. Avg | Prud. Avg | Prec. (A) | Prud. (A) | Len. (A) | Prec. (U) | Prud. (U) | Len. (U) |
|---|---|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1 | 671 | Reliable | 0.642 | 0.004 | 0.735 | 0.000 | 3.81k | 0.549 | 0.007 | 4.40k |
| OpenAI/o3-mini | ??? | Reliable | 0.504 | 0.006 | 0.716 | 0.006 | 1.57k | 0.293 | 0.005 | 4.20k |
| deepseek-ai/DeepSeek-V3 | 671 | Reliable | 0.521 | 0.001 | 0.665 | 0.000 | 1.34k | 0.377 | 0.003 | 1.50k |
| OpenAI/GPT-4o | ??? | Reliable | 0.397 | 0.015 | 0.460 | 0.006 | 0.58k | 0.335 | 0.025 | 0.60k |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 32 | Reliable | 0.551 | 0.001 | 0.684 | 0.000 | 5.05k | 0.418 | 0.002 | 9.40k |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 14 | Reliable | 0.547 | 0.000 | 0.629 | 0.000 | 6.23k | 0.465 | 0.001 | 11.00k |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | 7 | Reliable | 0.289 | 0.000 | 0.575 | 0.000 | 6.24k | 0.003 | 0.000 | 6.60k |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | 1.5 | Reliable | 0.198 | 0.000 | 0.396 | 0.000 | 9.37k | 0.000 | 0.000 | 9.70k |
| Qwen/Qwen2.5-Math-7B-Instruct | 7 | Reliable | 0.266 | 0.000 | 0.505 | 0.000 | 0.82k | 0.027 | 0.000 | 0.90k |
| Qwen/Qwen2.5-Math-1.5B-Instruct | 1.5 | Reliable | 0.218 | 0.000 | 0.422 | 0.000 | 0.74k | 0.015 | 0.000 | 0.80k |