Plans to include additional models?
Hello, I'm just inquiring as to whether there are any plans to further update this benchmark/leaderboard with additional models. Would there be any way for us to request models to be tested/benchmarked?
Hello! I'm doing my best to maintain the leaderboard with the time I have between other projects. 🙂
Absolutely — feel free to suggest models! Ideally, they should be runnable with vLLM and have a context length of at least ~8k tokens. You’re welcome to post suggestions here or open a new issue.
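For anyone who wants to pre-screen a suggestion against the context-length criterion, here is a rough sketch. It is not the leaderboard's actual tooling. It assumes the model's `config.json` exposes its context window as `max_position_embeddings` (common in Hugging Face configs, but not universal) and reads "~8k" as 8192 tokens, which is my own interpretation. The helper name and the example config are hypothetical, and the check says nothing about vLLM compatibility.

```python
import json

# Assumption: "~8k tokens" is interpreted as 8192; the thread gives no exact number.
MIN_CONTEXT_TOKENS = 8192


def meets_context_requirement(config: dict, min_tokens: int = MIN_CONTEXT_TOKENS) -> bool:
    """Return True if the config declares a context window of at least min_tokens.

    Hypothetical helper: it only inspects `max_position_embeddings`, so it
    returns False for configs that store the context length under another key.
    """
    max_len = config.get("max_position_embeddings")
    return isinstance(max_len, int) and max_len >= min_tokens


# Made-up config for illustration, not taken from any real model.
example_config = json.loads(
    '{"model_type": "hypothetical", "max_position_embeddings": 32768}'
)
print(meets_context_requirement(example_config))
```

In practice you would load the dictionary from the `config.json` file in the model repository instead of the inline string.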
Would we be able to test the following models?
https://huggingface.co/shuttleai/shuttle-3.5
https://huggingface.co/THUDM/GLM-4-32B-0414
https://huggingface.co/Qwen/Qwen3-235B-A22B
https://huggingface.co/Qwen/Qwen3-30B-A3B
https://huggingface.co/Qwen/Qwen3-32B
https://huggingface.co/Qwen/Qwen3-8B
https://huggingface.co/Qwen/Qwen3-4B
These are recently released models for which I've seen creative writing benchmarks/evaluations, but hardly any on role play.
Edit: Added Qwen3-235B-A22B to the list
The models that were easily runnable with vLLM have been added. On another note, keep in mind that this leaderboard captures only one aspect of role play: the population-level stability of value expression across various contexts.