AIME eval script does not score correctly some answers

#132
by ggerganov - opened

I am running some evals and noticing that, from time to time, the model produces a correct answer but the answer-extraction logic in the AIME eval script fails to parse it. Here are a few examples:

[screenshots: four example completions where the correct answer was not extracted by the eval script]

Of course, most of the time the answers are correctly bracketed, but every now and then this happens. The occurrence rate is not negligible, and I think it can lead to an underestimation of the model's actual AIME score.

I am using gpt-oss-120b at high reasoning with llama.cpp Metal backend.

  • Anyone else observing this?
  • Do the reference vLLM outputs have such incorrectly skipped answers?
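
In case it helps, here is a minimal sketch of a more forgiving extraction fallback, assuming the script currently only accepts a `\boxed{...}` answer. The function name and regexes below are hypothetical illustrations, not the actual eval code:

```python
import re


def extract_aime_answer(text: str) -> str | None:
    """Best-effort extraction of an AIME answer (an integer 0-999).

    Hypothetical fallback logic, not the actual eval script:
    1. Prefer the last \\boxed{...} in the completion.
    2. Otherwise look for phrasings like "the answer is 104" / "Answer: 104".
    3. Otherwise fall back to the last standalone 1-3 digit integer.
    """
    # 1. Last \boxed{...} occurrence, if any.
    boxed = re.findall(r"\\boxed\{\s*(\d{1,3})\s*\}", text)
    if boxed:
        return boxed[-1]

    # 2. Common phrasings such as "the answer is 104" or "Answer: 104".
    phrased = re.findall(r"answer\s*(?:is|:)\s*(\d{1,3})\b", text, flags=re.IGNORECASE)
    if phrased:
        return phrased[-1]

    # 3. Last standalone integer in the valid AIME range (0-999).
    numbers = [n for n in re.findall(r"\b(\d{1,3})\b", text) if 0 <= int(n) <= 999]
    return numbers[-1] if numbers else None
```

A fallback like this trades a small false-positive risk for not dropping otherwise-correct completions, so it would change the scoring semantics slightly; comparing against the reference vLLM outputs first is probably the safer check.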
