Benchmarks for increasing and reducing experts?

#3
by benhaotang - opened

Wonderful work! I was wondering how much of a difference changing the number of experts has on the model's intelligence. Can you do some benchmarks in the future for your models with 4 and 16 experts vs. the base one to test this quantitatively? Thanks in advance.

With MoEs this can help or average out performance, depending on:

1 - Expert training
2 - Number of experts
3 - Prompt content.

With #3 being the "roadmap" for which experts are selected (i.e. the "best of" 4, 8, 12, 16, etc.).

With a set of diverse experts, results may sometimes drop or improve on a per-prompt basis, because the prompt content is mission-critical to which experts get selected.

Now... this completely changes if the experts are closely related - i.e. all medical, all creative, etc.
In this case more experts can actually improve generation/reasoning results - vastly so in some cases.

In the case of this Qwen3 model, there are 128 experts on every layer of the model (48 layers), with 8 activated in the default model and 24 at this repo.
Oddly, even 24 experts is only a fraction of all of them.
I am mentioning this because it is another factor.
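For anyone wondering what "selecting experts" looks like mechanically, here is a minimal sketch of per-token top-k routing, assuming a plain softmax gate - the shapes and names are made up for illustration, and the real Qwen3 router will differ in detail:

```python
# Toy top-k MoE routing sketch (illustrative only - not Qwen3's actual code).
# Each token is scored against all 128 experts, but only the top-k experts
# are actually run; their outputs are mixed using the gate weights.
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 128   # experts per MoE layer in Qwen3-30B-A3B
TOP_K = 8           # activated experts per token (24 in this repo's config)
HIDDEN = 64         # toy hidden size, just for the sketch

router_w = rng.normal(size=(HIDDEN, NUM_EXPERTS))                 # gating weights
experts_w = rng.normal(size=(NUM_EXPERTS, HIDDEN, HIDDEN)) * 0.01 # one toy expert per slot

def moe_layer(x, top_k=TOP_K):
    """Route one token's hidden state through top_k of the NUM_EXPERTS experts."""
    logits = x @ router_w                        # (NUM_EXPERTS,) router scores
    top = np.argsort(logits)[-top_k:]            # indices of the k best-scoring experts
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                           # softmax over the selected k only
    # Only the selected experts are computed; the other 120 are skipped entirely.
    out = sum(g * (x @ experts_w[i]) for g, i in zip(gate, top))
    return out, sorted(top.tolist())

x = rng.normal(size=HIDDEN)                      # stand-in for one token's hidden state
_, chosen = moe_layer(x)
print("experts chosen for this token:", chosen)
print(f"active fraction: {TOP_K}/{NUM_EXPERTS}")
```

Raising the activated count (8 default, 24 here) only changes `top_k`; the 128 experts and the router itself stay the same, which is why the prompt content still decides which experts end up in the mix.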

This information is derived from reading papers on MoEs, building MoEs, and testing them.

I also don't have the compute power available to benchmark - I am quanting/testing/tweaking almost non-stop.
If someone would like to benchmark, I will post the results.

With Qwen3 and this model specifically:

There is a drop-off in quality past 24 experts.
This is not so much a Qwen issue as a MoE issue, rooted in the "diverse experts" problem...

i.e. you don't want a "cook" telling you how to "fix" a car.

With all this being said, there may be "ultra complex" prompts/content (and multi-turn conversations) which could benefit from more than 24 experts.

There must be an expert selector somewhere. The idea is that if this selector makes errors, having many experts lowers those errors; on the other hand, if the experts are not well balanced this will also produce many errors, because the wrong (weaker) expert will take the word (this is my imagination, I really do not know exactly how it works in practice).

A perfect MoE should work perfectly with only 2 experts (or 1? If we had a perfect selector, why use 2 experts - it would always pick the single perfect expert). If not, it is not perfect. The main advantage of MoE is fast inference (not 100% sure about this, it may have other advantages too :) ). Without a selector, this is like using 2 (or more) LLMs sharing the same context and competing for the best next-token result!

"128 experts on every layer of the model. (48 layers)
With 8 activated in the default model "

Splitting across layers, with dynamic prediction and selection of the next layer's expert, is another matter; if Qwen MoE works this way and the active experts are just the active paths it runs, then cheers.

This way the combinations are the layer's expert count (128) times the next layer's expert count (128), and so on across all 48 layers, which is 128^48 (a very big number of available expert paths!),

and this is for only one active path!
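Just to put numbers on that (pure counting, assuming the choice at each layer is independent - not a claim about how Qwen3 actually routes):

```python
# Counting possible expert paths through the model, assuming an independent
# choice at each of the 48 MoE layers. Arithmetic only, not Qwen3 internals.
from math import comb

LAYERS = 48
EXPERTS = 128
ACTIVE = 8

single_path = EXPERTS ** LAYERS                 # one expert per layer: 128^48
full_combos = comb(EXPERTS, ACTIVE) ** LAYERS   # choosing 8 of 128 per layer

print(f"single-expert paths: 128^48 ~ 10^{len(str(single_path)) - 1}")
print(f"8-of-128 choices per layer: {comb(EXPERTS, ACTIVE):,}")
print(f"across all 48 layers: ~ 10^{len(str(full_combos)) - 1}")
```

So 128^48 is already roughly 10^101 single-expert paths, and choosing 8 of 128 at every layer makes the count of possible combinations per token far larger still.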

(If the path-selection results are the same, there is no need to recompute and inference speed can go up - up to 8 times (when using 8 active paths) if all the computed paths land on the same selected layer expert and only one of them is the correct path. Maybe this is wrong if the data differs!)

Also, it must somehow predict and select the best expert for the next layer (for example from the previous layer, or from a parallel net specialized in predicting the next layer's expert) - one choice out of 128.

We need an exact, well-illustrated (3D) explanation of how this works. This could point to new kinds of MoE innovations.

I have tested this model down to 2 experts; it does not work (well).

At 4 experts - this seems to be the lower limit.
Likewise, past 24 experts there is an "averaging out" issue - too many cooks in the kitchen.

However, a MoE like 8x3B, 8x1B, 8x8B, etc., can function with just 1 expert activated, with 2 as the default.

In terms of the Qwen3 MoE -> the experts are smaller and more specialized.
In fact, if you do the math -> 30B / 128 => each expert is roughly 235 million parameters.
8 experts -> roughly 1.9B.
Funny thing is -> Qwen says 3B are activated. (?)

NOTE: Qwen MoEs have a shared expert in addition to the activated experts, so this may account for the missing ~1.1B.
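Writing that arithmetic out (the even 30B/128 split is a simplifying assumption - attention, embeddings, the router, and any shared/always-on weights sit outside the routed experts, which is presumably where most of the gap lives):

```python
# Back-of-the-envelope parameter math for Qwen3-30B-A3B, naively assuming all
# 30B parameters sit in the 128 routed experts. They don't - attention,
# embeddings, and any shared/always-on weights are extra - so the gap below is
# what the NOTE above attributes to those components.
TOTAL_PARAMS = 30e9
NUM_EXPERTS = 128
ACTIVE = 8
OFFICIAL_ACTIVE = 3e9     # Qwen's stated activated parameter count

per_expert = TOTAL_PARAMS / NUM_EXPERTS        # ~234M per expert
via_experts = per_expert * ACTIVE              # ~1.9B through 8 routed experts
gap = OFFICIAL_ACTIVE - via_experts            # ~1.1B not explained by the experts alone

print(f"per expert:        ~{per_expert / 1e6:.0f}M")
print(f"8 routed experts:  ~{via_experts / 1e9:.2f}B")
print(f"gap to stated 3B:  ~{gap / 1e9:.1f}B")
```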

My take from this is that this MoE structure is unique, as are the experts, the training, and so on.
There is more going on in this new SOTA MoE structure.

Min 4 / max 24 is personal observation and belief.

I play with the settings in llama-server; I find they tune models well :)

How do people evaluate models when different settings change everything?

I believe there is a golden setting for each model.

Thank you a lot for all your insights, really interesting! I'm really wondering how the original dataset was set up to compensate for all these mini experts.

LM Studio recently updated their product; you can now select the number of experts at load time.
Tested - works great.
You can select 1 to 128 experts.

"--override-kv llama.expert_used_count=int:2"
this does not work with Qwen3 MoEs and llama.cpp

@21world

Change:
llama.expert_used_count

TO:
qwen3moe.expert_used_count

"kv"s are specific to each model.
