
multi-armed bandit problem

1. brief description

You have \(N\) rounds to play \(M\) slot machines. The reward given by each machine is governed by an unknown distribution. In each round, you can play \(k\) machines. The goal is to maximize the total reward collected over the \(N\) rounds.
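
As a concrete picture, here is a minimal simulation of the setup. It assumes Bernoulli rewards, \(k = 1\) plays per round, and a player that picks machines uniformly at random; those choices are illustrative assumptions, not part of the problem statement.

import random

# M slot machines with unknown (here, Bernoulli) reward probabilities,
# N rounds, and k = 1 plays per round. The probabilities are hidden from
# the player, who only observes the reward of the machine it picks.

M = 5
N = 1000
true_probs = [random.random() for _ in range(M)]  # unknown to the player

def pull(machine):
    """Play one machine once and return its random reward (0 or 1)."""
    return 1 if random.random() < true_probs[machine] else 0

# a naive player that picks a machine uniformly at random each round
total_reward = sum(pull(random.randrange(M)) for _ in range(N))
print("total reward from random play:", total_reward)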

2. explore vs exploit

You want to explore in order to find which machines are the best to play, and you may need multiple samples per machine to get a good sense of each distribution. But if you spend too much time exploring, you won't have enough rounds left to exploit the good machines. One standard way to balance the two is sketched below.
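
A common strategy is epsilon-greedy: with a small probability, explore a random machine; otherwise, exploit the machine with the highest empirical mean reward so far. This is a minimal sketch; the value of epsilon and the Bernoulli rewards are assumptions for illustration.

import random

# epsilon-greedy: with probability epsilon, explore (play a random machine);
# otherwise exploit (play the machine with the highest empirical mean so far).
# epsilon = 0.1 and the Bernoulli rewards are illustrative assumptions.

M, N, epsilon = 5, 1000, 0.1
true_probs = [random.random() for _ in range(M)]  # unknown to the player

def pull(machine):
    return 1 if random.random() < true_probs[machine] else 0

counts = [0] * M          # plays per machine
estimates = [0.0] * M     # empirical mean reward per machine
total_reward = 0

for _ in range(N):
    if random.random() < epsilon:
        machine = random.randrange(M)                        # explore
    else:
        machine = max(range(M), key=lambda m: estimates[m])  # exploit
    reward = pull(machine)
    counts[machine] += 1
    # incremental update of the running mean reward for this machine
    estimates[machine] += (reward - estimates[machine]) / counts[machine]
    total_reward += reward

print("total reward with epsilon-greedy:", total_reward)

Larger epsilon means more exploration; smaller epsilon means more exploitation of the current best estimate.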

3. comparison with markov decision process

In an MDP, every time you take an action, you get put in a different state. In MAB, you stay in the same state no matter which slot machine you play.

For comparison, think about training a semantic parser using weak supervision. Given an input, the parser spits out an output string. Note that you only get a reward for the entire output string; you don't get an incremental reward for each token it outputs. In this sense, we can think of this problem as a multi-armed bandit problem, where every possible output string corresponds to a machine.

4. useful links

Created: 2024-07-15 Mon 01:26