I also have a hypothesis that this model can be efficiently downsized not by pruning experts, but by merging them and capturing what remains of each expert as a LoRA-style low-rank delta. The merge would hold most of the parameters as shared weights, each expert would keep only its small delta, and the routing table need not change.
I'm building a new version of my pipeline to test this hypothesis. I suspect it'd let us retain most of the performance at under 12B parameters.
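The core of the idea can be sketched in a few lines of numpy. This is a toy illustration, not the actual pipeline: it assumes each expert is a single weight matrix, takes the shared base to be the elementwise mean of the experts, and factors each expert's residual with a truncated SVD to get a LoRA-style pair. The function name `merge_experts_to_lora` is made up for this sketch.

```python
import numpy as np

def merge_experts_to_lora(experts, rank):
    """Merge expert weight matrices into one shared base plus
    per-expert low-rank (LoRA-style) deltas via truncated SVD.

    Toy sketch: the base is the mean of the experts; each expert
    is then approximated as base + A @ B with A, B of rank `rank`.
    """
    base = np.mean(experts, axis=0)            # shared parameters
    loras = []
    for W in experts:
        U, S, Vt = np.linalg.svd(W - base, full_matrices=False)
        A = U[:, :rank] * S[:rank]             # shape (d_out, rank)
        B = Vt[:rank, :]                       # shape (rank, d_in)
        loras.append((A, B))                   # W ~= base + A @ B
    return base, loras

# Toy numbers: 8 experts of shape 64x64, rank-8 deltas.
rng = np.random.default_rng(0)
experts = [rng.normal(size=(64, 64)) for _ in range(8)]
base, loras = merge_experts_to_lora(experts, rank=8)

# Parameter count: shared base + per-expert deltas vs. full experts.
full = 8 * 64 * 64
compressed = 64 * 64 + 8 * (64 * 8 + 8 * 64)
print(compressed / full)  # 0.375 of the original unique parameter count
```

How well the rank-r deltas actually reconstruct the experts is exactly the empirical question the pipeline would test; the compression ratio above just shows why the parameter savings could be large.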