mixture of experts

Token generation is handled by many sub-modules, called experts. Each token is processed by only a few of them, chosen by a routing or weighting module, so different tokens may be handled by different experts.
How is this different from an ensemble? Typically, each input is handled by only a small number of experts, as determined by the router. In an ensemble, by contrast, every input is seen by all models.
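
The contrast can be sketched in a few lines of plain Python. This is a toy illustration, not a real implementation: the experts are hypothetical scaling functions, the router scores are hard-coded rather than computed from the token, and the names `route_top_k`, `moe_forward`, and `ensemble_forward` are made up for this example. Top-k selection with softmax-normalized weights is one common routing scheme, assumed here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    # Hypothetical stand-in for an expert network: it just scales its input.
    return lambda x: [scale * v for v in x]


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Weights are renormalized over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, router_logits, k=2):
    """Mixture of experts: only the k selected experts run on this input."""
    out = [0.0] * len(x)
    used = []
    for idx, w in route_top_k(router_logits, k):
        y = experts[idx](x)
        out = [o + w * v for o, v in zip(out, y)]
        used.append(idx)
    return out, used


def ensemble_forward(x, models):
    """Ensemble: every model runs on the input and outputs are averaged."""
    outs = [m(x) for m in models]
    return [sum(col) / len(models) for col in zip(*outs)]


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
# Assumed router scores; a real router would compute these from the token.
router_logits = [0.1, 2.0, -1.0, 2.0]

moe_out, used = moe_forward(token, experts, router_logits, k=2)
ens_out = ensemble_forward(token, experts)
```

Here `used` lists the two experts that actually ran for this token, while `ensemble_forward` calls all four. With many experts and a small `k`, the compute per token stays close to that of `k` experts even as the total parameter count grows.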