I don't think YouTube views is a valid task
When you sort each individual category and subcategory on the leaderboard, performance is strongly correlated with size (except W/10, but it makes sense why that one wouldn't be). Proprietary and large models trend towards the top of the leaderboard while tiny models fill the bottom ranks. When you click through the world model subcategories you see the same trend, with the top of the leaderboard filling with the white text that indicates proprietary models... except when you get to YouTube views. There the rankings are extremely random.
The number of views a YouTube video gets is in large part down to the whims of the YouTube algorithm and external happenstance, like whether the video was linked somewhere else on the net. Videos within a single channel can also have a huge spread, with one video getting several times the views of a similar one for no discernible reason. It's simply too unpredictable, and it's unreasonable to expect a model to have any real insight into it. A model scoring higher or lower on this task is more indicative of a lucky guess than of any actual capability.
In short, I believe that YouTube views is almost entirely noise. I would suggest that it be removed and the world model scores be recalculated.
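To be concrete about what I mean by recalculating, here's a rough sketch that assumes the world model score is just the mean of its subtask scores (the leaderboard's actual aggregation may differ), with made-up subtask names and numbers:

```python
# Hypothetical sketch: recompute a category score with one subtask dropped.
# Assumes a simple mean over subtask scores; the real aggregation may differ.

def category_score(subtask_scores, exclude=()):
    """Mean of subtask scores, skipping any subtask named in `exclude`."""
    kept = [v for name, v in subtask_scores.items() if name not in exclude]
    return sum(kept) / len(kept)

# Illustrative numbers only, not real leaderboard data.
world_model = {
    "physics": 0.62,
    "geography": 0.71,
    "youtube_views": 0.18,  # the noisy subtask under discussion
}

print(category_score(world_model))                             # with the noisy task
print(category_score(world_model, exclude={"youtube_views"}))  # after removing it
```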
I don't think the models are scoring well in the YouTube views ranking because of lucky guesses, since I'm testing them on many prompts and averaging their scores, and the results are pretty consistent. However, all of the prompts were view predictions for videos from the same channel, which I didn't think would matter much when I made the task, but which I now realize is an issue: it rewards knowledge of a very niche subject.
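For reference, the aggregation is roughly this kind of thing — a minimal sketch with made-up per-prompt scores and illustrative names, not the benchmark's actual code:

```python
# Hypothetical sketch: average a model's per-prompt scores and report the spread.
# A low standard deviation across prompts is what I mean by "pretty consistent".
import statistics

def aggregate(per_prompt_scores):
    mean = statistics.mean(per_prompt_scores)
    spread = statistics.stdev(per_prompt_scores) if len(per_prompt_scores) > 1 else 0.0
    return mean, spread

scores = [0.42, 0.38, 0.45, 0.40, 0.41]  # invented scores for one model
mean, spread = aggregate(scores)
print(f"mean={mean:.3f}, stdev={spread:.3f}")
```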
So it's definitely measuring something, but its ranking might not be particularly useful to people, and I'm fine with removing it.
Checking whether the rankings on a benchmark follow the trend of increasing score with increasing size is probably a good "sanity check" for whether something is wrong with the benchmark, unless there's a logical reason why that trend shouldn't hold. Of course there is variance across model families and generations, with some models scoring particularly high or low for their size. The Qwen models, for example, consistently have less knowledge for their size than their contemporaries, but that doesn't change the fact that larger Qwen models know more than smaller ones. That models gain intelligence and retained knowledge with increased parameter count for a given training process is fairly well established, so it should be safe to fall back on as an assumption. A rough version of that check is sketched below.
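As a hedged sketch of what that sanity check could look like (the model sizes and scores are invented, it assumes scipy is available, and it isn't tied to the leaderboard's actual data):

```python
# Hypothetical sanity check: does score increase with parameter count?
# Spearman rank correlation only cares about ordering, so scale doesn't matter.
from scipy.stats import spearmanr

params_b = [0.5, 1.8, 7, 14, 70, 405]               # model sizes in billions of parameters (made up)
task_score = [0.21, 0.30, 0.44, 0.51, 0.63, 0.71]   # scores on the task being checked (made up)

rho, p_value = spearmanr(params_b, task_score)
print(f"Spearman rho={rho:.2f} (p={p_value:.3f})")
# rho near +1 matches the expected size trend; rho near 0, as with the
# YouTube views task, suggests the task is mostly noise or very niche knowledge.
```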
It may very well be that the YouTube views task was measuring something rather than being completely random, but my intuition is that if it was, it's probably something REALLY specific that can't be generalized into an indication of a model's broader "understanding capabilities".