nomadicsynth PRO

AI & ML interests

Yes.

Recent Activity

reacted to clefourrier's post with 👀 4 days ago
Saying Claude 4 is "the best coding model in the world" based on its SWE-bench scores is super misleading, and here is why.

If you look at the announcement table, their model has the best scores, but... if you look at the very bottom, in font size 4, you'll see that the metric they report is not the same metric as the one used for the other models! Comparing "pass@1 averaged over 10 runs" to "normal pass@1" is like grading one student by letting them take the test 10 times and averaging their per-question scores, when the other students only get one attempt.

The first way to grade (avg@10) is actually quite good statistically, much better than what model creators usually report, because models tend to be quite inconsistent - sometimes good, sometimes bad... But! You then want to do it for all models, and report the results with error bars. The issue is that, if you do... well, it's going to be harder to say your model is the best, because the error bars will overlap between models, by a lot.

Also, you'll see that 2 numbers are reported: the first one uses avg@10 (what I explained above), and the second, higher one uses this plus many other tricks:
- test-time compute (so having the model generate a tree of answers and selecting the best as you go, more or less)
- removing the times when the model breaks the tests
- and using another model to select the most promising solution!

You can't really say it's better than the rest, mostly because it's **way less efficient** to achieve a similar result.

It's honestly a bit sad because, from user reports, the model sounds good - however, this announcement is overblown numbers-wise, and I'm quite sure it's more a problem of "too much marketing" than of "bad science".

Another thing which makes the comparison invalid is the complete absence of open source models from the report - don't think they are aware of DeepSeek, Qwen, the new Mistral for code, and all the cool specialised models found on the hub?
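To make the avg@10 versus single pass@1 point concrete, here is a minimal Python sketch. It is a toy simulation, not the procedure used in any announcement: it assumes every benchmark task is an independent coin flip with the same success probability, and the success rates (0.72 and 0.70), the task count (500), and all function names are hypothetical values chosen for illustration.

```python
import random
import statistics


def run_benchmark(true_rate, n_tasks, rng):
    """One full pass over the benchmark: fraction of tasks solved (pass@1).

    Assumption: each task is solved independently with probability true_rate.
    """
    solved = sum(rng.random() < true_rate for _ in range(n_tasks))
    return solved / n_tasks


def avg_at_k(true_rate, n_tasks, k, rng):
    """Mean and sample standard deviation of pass@1 over k independent runs."""
    scores = [run_benchmark(true_rate, n_tasks, rng) for _ in range(k)]
    return statistics.mean(scores), statistics.stdev(scores)


def intervals_overlap(mean_a, err_a, mean_b, err_b):
    """True if [mean_a - err_a, mean_a + err_a] overlaps the interval for b."""
    return abs(mean_a - mean_b) <= err_a + err_b


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible

    # Hypothetical models: A is truly 2 points better than B.
    mean_a, sd_a = avg_at_k(0.72, n_tasks=500, k=10, rng=rng)
    mean_b, sd_b = avg_at_k(0.70, n_tasks=500, k=10, rng=rng)
    single_b = run_benchmark(0.70, n_tasks=500, rng=rng)  # one-shot pass@1

    print(f"A avg@10: {mean_a:.3f} +/- {sd_a:.3f}")
    print(f"B avg@10: {mean_b:.3f} +/- {sd_b:.3f}")
    print(f"B single pass@1: {single_b:.3f}")
    print("error bars overlap:", intervals_overlap(mean_a, sd_a, mean_b, sd_b))
```

Under these assumptions, the run-to-run spread of a single pass@1 score is of the same order as the gap between the two models, which is why a single run for one model is not comparable to a 10-run average for another, and why error bars reported for every model would often overlap.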
Organizations

Neon Cortex

Posts 4

Anyone using AI and ML to help neurodivergent people? I'd love to hear what you're doing.
How do you talk about AI’s promise without sounding like you’re selling out to the next tech gold rush?