AIME eval script does not correctly score some answers
#132 · opened by ggerganov
I am running some evals and I am noticing that from time to time the model produces a correct answer, but the logic in the AIME eval script does not parse it correctly. Here are a few examples:
Of course, most of the time the answers are correctly bracketed, but every now and then this happens. The occurrence rate is not negligible, so I think it can actually lead to an underestimation of the model's actual AIME score.
I am using gpt-oss-120b at high reasoning with the llama.cpp Metal backend.
- Anyone else observing this?
- Do the reference vLLM outputs contain such incorrectly skipped answers?
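For illustration, here is a minimal sketch of a more lenient extraction step that would also catch unbracketed final answers. This assumes the script currently looks only for a `\boxed{...}` answer; the function name and the fallback heuristic are hypothetical and not the actual eval logic:

```python
import re
from typing import Optional


def extract_aime_answer(response: str) -> Optional[int]:
    """Extract an AIME answer (an integer 0-999) from a model response.

    Tries the usual \\boxed{...} form first; if it is missing, falls back
    to the last standalone 1-3 digit number near the end of the response.
    """
    # Preferred path: answer wrapped in \boxed{...}
    boxed = re.findall(r"\\boxed\{\s*(\d{1,3})\s*\}", response)
    if boxed:
        return int(boxed[-1])

    # Fallback: e.g. "Therefore the answer is 204." on one of the last lines
    tail = "\n".join(response.strip().splitlines()[-3:])
    numbers = re.findall(r"(?<!\d)(\d{1,3})(?!\d)", tail)
    if numbers:
        return int(numbers[-1])

    return None


if __name__ == "__main__":
    # Correctly bracketed answer -> 204
    print(extract_aime_answer(r"... so the total is \boxed{204}."))
    # Unbracketed answer that a strict \boxed{} parser would skip -> 204
    print(extract_aime_answer("... therefore the answer is 204."))
```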