flowers-team/StickToYourRoleLeaderboard · Plans to include additional models?

Apr 21

Hello, I'm just inquiring as to whether there are any plans to further update this benchmark/leaderboard with additional models. Would there be any way for us to request models to be tested/benchmarked?

grg

Flowers AI & CogSci Lab org Apr 22

Hello! I'm doing my best to maintain the leaderboard with the time I have between other projects. 🙂
Absolutely — feel free to suggest models! Ideally, they should be runnable with vLLM and have a context length of at least ~8k tokens. You’re welcome to post suggestions here or open a new issue.
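
As a rough pre-check before suggesting a model, the context-length criterion above could be sketched as follows. This is only an illustration and not the leaderboard's actual procedure: the helper names and the 8192 threshold (one reading of "~8k tokens") are hypothetical, and it assumes the model's config exposes `max_position_embeddings`, which many but not all Hugging Face configs do. It says nothing about whether the model runs with vLLM.

```python
from typing import Any, Mapping, Optional

# Assumed reading of "at least ~8k tokens"; the maintainer gives no exact number.
MIN_CONTEXT_TOKENS = 8192


def declared_context_length(config: Mapping[str, Any]) -> Optional[int]:
    """Return the context length declared in a config dict, or None if absent.

    Assumes the common `max_position_embeddings` key; some architectures
    use a different name, in which case this returns None.
    """
    value = config.get("max_position_embeddings")
    return int(value) if value is not None else None


def meets_context_requirement(
    config: Mapping[str, Any], minimum: int = MIN_CONTEXT_TOKENS
) -> bool:
    """True only if the config declares a context length of at least `minimum`."""
    length = declared_context_length(config)
    return length is not None and length >= minimum


# Illustrative, made-up configs (not taken from any real model card).
long_context = {"max_position_embeddings": 32768}
short_context = {"max_position_embeddings": 4096}

print(meets_context_requirement(long_context))
print(meets_context_requirement(short_context))
```

For a real model, the dict could come from `transformers.AutoConfig.from_pretrained(model_id).to_dict()`; a config that lacks the key is treated here as not meeting the requirement, so it would need a manual look.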

SamuraiBarbi

May 1

edited May 1

Would we be able to test the following models?

https://huggingface.co/shuttleai/shuttle-3.5

https://huggingface.co/THUDM/GLM-4-32B-0414

https://huggingface.co/Qwen/Qwen3-235B-A22B

https://huggingface.co/Qwen/Qwen3-30B-A3B

https://huggingface.co/Qwen/Qwen3-32B

https://huggingface.co/Qwen/Qwen3-8B

https://huggingface.co/Qwen/Qwen3-4B

These are recently released models for which I've seen creative-writing benchmarks and evaluations, but hardly any on role play.

Edit: Added Qwen3-235B-A22B to the list

grg

Flowers AI & CogSci Lab org Jun 30

The models that were easily runnable with vLLM were added. On another note, keep in mind that this leaderboard captures only one aspect of role play: the population-level stability of value expression across various contexts.

grg changed discussion status to closed Jun 30