mixture of experts
- Token generation is handled by many sub-modules (experts). Each token is processed by a small subset of these experts, selected and weighted by a routing (gating) module, so different tokens can be sent to different experts.
- How is this different from an ensemble? In a mixture of experts, each input is typically handled by only a small number of experts, chosen by the router. In an ensemble, by contrast, every model sees every input.
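
A rough sketch of the contrast, in plain Python. Everything here is illustrative: `moe_forward`, `ensemble_forward`, the toy experts, and the fixed `router_scores` are hypothetical names and values, and top-k selection with softmax weights is just one common routing scheme. In a real model the scores would come from a learned router that looks at the token.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, router_scores, k=2):
    """Sparse MoE: only the top-k experts by router score are evaluated."""
    top = sorted(range(len(experts)),
                 key=lambda i: router_scores[i], reverse=True)[:k]
    # Renormalize the scores of the selected experts into mixing weights.
    weights = softmax([router_scores[i] for i in top])
    output = sum(w * experts[i](x) for w, i in zip(weights, top))
    return output, top

def ensemble_forward(x, models):
    """Ensemble: every model is evaluated and the outputs are averaged."""
    return sum(m(x) for m in models) / len(models)

# Hypothetical toy experts: each scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
# Assumed fixed scores; a learned router would compute these per token.
router_scores = [0.1, 2.0, 0.3, 2.0]

moe_out, used = moe_forward(1.0, experts, router_scores, k=2)
ens_out = ensemble_forward(1.0, experts)
```

With these scores only two of the four experts run in `moe_forward`, while `ensemble_forward` calls all four. That is the compute saving MoE is after: parameter count grows with the number of experts, but per-token cost grows only with `k`.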