Top-performing model has incorrect benchmark results.

#11
by resitaydin - opened

The top-performing model on this leaderboard (Metin/Gemma-2-9b-it-TR-DPO-V1) is listed with incorrect benchmark results, which need to be corrected.
On the model page itself on HF, the results are different.
Reference: https://huggingface.co/Metin/Gemma-2-9b-it-TR-DPO-V1/blob/main/README.md
The benchmark results reported on another Turkish fine-tuned model's page are also consistent with those on the Metin/Gemma-2-9b-it-TR-DPO-V1 model page.
Reference: https://huggingface.co/WiroAI/wiroai-turkish-llm-9b#benchmark-scores
I've also tested it myself on Google Colab with an A100 GPU, using the truthful_qa-v0.2 benchmark, and my result is consistent with the two sources above.

Screenshot 2025-05-21 at 14.10.16.png

Therefore, the benchmark results for this model need to be updated as soon as possible.

resitaydin changed discussion status to closed
