AIME eval script does not score correctly some answers

#132
by ggerganov - opened

I am running some evals and noticing that, from time to time, the model produces a correct answer but the answer-extraction logic in the AIME eval script fails to parse it. Here are a few examples:

[screenshots: four example completions where the correct answer was not extracted by the eval script]

Of course, most of the time the answers are correctly bracketed, but every now and then this happens. The occurrence rate is not negligible, and I think it can lead to an underestimation of the model's actual AIME score.

I am using gpt-oss-120b at high reasoning with llama.cpp Metal backend.

  • Anyone else observing this?
  • Do the reference vLLM outputs have such incorrectly skipped answers?
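
In case it helps, here is a minimal sketch of a more forgiving extraction fallback, assuming the script currently only accepts a `\boxed{...}` answer. The function name and regexes below are hypothetical illustrations, not the actual eval code:

```python
import re


def extract_aime_answer(text: str) -> str | None:
    """Best-effort extraction of an AIME answer (an integer 0-999).

    Hypothetical fallback logic, not the actual eval script:
    1. Prefer the last \\boxed{...} in the completion.
    2. Otherwise look for phrasings like "the answer is 104" / "Answer: 104".
    3. Otherwise fall back to the last standalone 1-3 digit integer.
    """
    # 1. Last \boxed{...} occurrence, if any.
    boxed = re.findall(r"\\boxed\{\s*(\d{1,3})\s*\}", text)
    if boxed:
        return boxed[-1]

    # 2. Common phrasings such as "the answer is 104" or "Answer: 104".
    phrased = re.findall(r"answer\s*(?:is|:)\s*(\d{1,3})\b", text, flags=re.IGNORECASE)
    if phrased:
        return phrased[-1]

    # 3. Last standalone integer in the valid AIME range (0-999).
    numbers = [n for n in re.findall(r"\b(\d{1,3})\b", text) if 0 <= int(n) <= 999]
    return numbers[-1] if numbers else None
```

A fallback like this trades a small false-positive risk for not dropping otherwise-correct completions, so it would change the scoring semantics slightly; comparing against the reference vLLM outputs first is probably the safer check.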
