Mixture-of-experts (MoE) architectures, designed to scale model size while keeping training and inference efficient, are difficult to optimize because their routing decisions are discrete and non-differentiable. Traditional MoEs use a router network to direct input data to expert modules, a process that is complex and can lead to inefficiencies and under-specialized experts.…
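
To make the routing problem concrete, here is a minimal sketch of top-1 routing in plain Python. It is an illustration under stated assumptions, not the method of any particular MoE system: the names `matvec`, `top1_route`, `router_w`, and `expert_ws` are hypothetical, each expert is assumed to be a single linear map, and the router is assumed to pick exactly one expert per input.

```python
def matvec(x, w):
    """Multiply vector x (length d_in) by matrix w (d_in rows, d_out columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def top1_route(x, router_w, expert_ws):
    """Send x to the single expert with the highest router score.

    router_w:  d_in x n_experts matrix (hypothetical router weights)
    expert_ws: list of d_in x d_out matrices, one per expert (assumed linear)
    """
    scores = matvec(x, router_w)  # one score per expert
    # argmax is a discrete choice: the result is an integer index, so no
    # gradient flows back from the output to the router weights through it.
    choice = max(range(len(scores)), key=scores.__getitem__)
    return matvec(x, expert_ws[choice]), choice


router_w = [[1.0, 0.0], [0.0, 1.0]]
expert_ws = [
    [[1.0], [1.0]],  # expert 0
    [[2.0], [0.5]],  # expert 1
]

out_a, choice_a = top1_route([1.0, 2.0], router_w, expert_ws)
out_b, choice_b = top1_route([3.0, 1.0], router_w, expert_ws)

# Nudging the router weights leaves the output unchanged unless the argmax flips.
nudged_router_w = [[1.0, 0.0], [0.0, 1.001]]
out_c, choice_c = top1_route([1.0, 2.0], nudged_router_w, expert_ws)
```

In this sketch the output is piecewise constant in the router weights: `out_c` equals `out_a` even though the router changed. That is one way to see why gradient-based training gets no direct signal for the routing decision.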
