Main Results
The SR+CR MCTS Scheme
- Selects an action at the current root aimed at minimizing the simple regret.
- Deeper in the tree, selects actions according to UCB, which approximately minimizes the cumulative regret.
Rollout(node, depth=1)
    if IsLeaf(node, depth)
        return 0
    else
        if depth = 1 then action $\gets$ FirstAction(node)
        else action $\gets$ NextAction(node)
        next $\gets$ NextState(node, action)
        reward $\gets$ Reward(node, action, next) + Rollout(next, depth+1)
        UpdateStats(node, action, reward)
        return reward
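
A minimal executable sketch of the scheme, for illustration only: the names `Node`, `first_action`, `next_action`, and the `is_leaf`/`next_state`/`reward_fn` hooks are assumptions, not part of the original pseudocode. The root action is chosen by one of the simple-regret schemes listed below ($\varepsilon$-greedy with $\varepsilon=\frac 1 2$), while deeper nodes use UCB1.

```python
import math
import random

class Node:
    """Search-tree node holding per-action visit counts and mean rewards."""
    def __init__(self, actions):
        self.actions = list(actions)
        self.n = {a: 0 for a in self.actions}       # visits per action
        self.mean = {a: 0.0 for a in self.actions}  # empirical mean reward per action

    def total_visits(self):
        return sum(self.n.values())

def first_action(node, eps=0.5):
    """Root selection aimed at simple regret: epsilon-greedy with eps = 1/2."""
    unseen = [a for a in node.actions if node.n[a] == 0]
    if unseen:
        return random.choice(unseen)
    if random.random() < eps:
        return max(node.actions, key=lambda a: node.mean[a])  # sample current best
    return random.choice(node.actions)                        # otherwise sample uniformly

def next_action(node, c=math.sqrt(2)):
    """Non-root selection by UCB1, which approximately minimizes cumulative regret."""
    unseen = [a for a in node.actions if node.n[a] == 0]
    if unseen:
        return random.choice(unseen)
    t = node.total_visits()
    return max(node.actions,
               key=lambda a: node.mean[a] + c * math.sqrt(math.log(t) / node.n[a]))

def rollout(node, depth, is_leaf, next_state, reward_fn):
    """One SR+CR rollout; is_leaf/next_state/reward_fn are problem-specific hooks."""
    if is_leaf(node, depth):
        return 0.0
    action = first_action(node) if depth == 1 else next_action(node)
    nxt = next_state(node, action)
    reward = reward_fn(node, action, nxt) + rollout(nxt, depth + 1,
                                                    is_leaf, next_state, reward_fn)
    # incremental update of the running mean for this action
    node.n[action] += 1
    node.mean[action] += (reward - node.mean[action]) / node.n[action]
    return reward
```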
Sampling for Simple Regret
- $\varepsilon$-greedy sampling $\left(\varepsilon=\frac 1 2\right)$.
- Modified version of UCB (optimized for simple regret).
- VOI-aware sampling, where $\alpha$ denotes the action with the highest empirical mean $\overline X_\alpha$, $\beta$ the second-best action, and rewards are assumed to lie in $[0, 1]$ (a selection sketch follows the estimates):
$$VOI_\alpha\approx\frac {\overline X_\beta} {n_\alpha+1}\exp\left(-2(\overline X_\alpha - \overline X_\beta)^2 n_\alpha\right)$$
$$VOI_i\approx\frac {1-\overline X_\alpha} {n_i+1}\exp\left(-2(\overline X_\alpha - \overline X_i)^2 n_i\right),\; i\ne\alpha$$
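
A small Python sketch of VOI-aware root selection under these assumptions: it evaluates the two estimates above for each action and samples the action with the largest VOI. The function name `voi_aware_choice` and the example values are illustrative only.

```python
import math

def voi_aware_choice(means, counts):
    """Pick the root action with the largest estimated value of information (VOI).

    means[i], counts[i]: empirical mean reward (in [0, 1]) and sample count of
    action i. alpha is the current best action, beta the second best.
    """
    k = len(means)
    assert k >= 2
    order = sorted(range(k), key=lambda i: means[i], reverse=True)
    alpha, beta = order[0], order[1]

    voi = [0.0] * k
    for i in range(k):
        if i == alpha:
            # value of sampling the current best: the second best may overtake it
            voi[i] = (means[beta] / (counts[alpha] + 1)) * \
                     math.exp(-2.0 * (means[alpha] - means[beta]) ** 2 * counts[alpha])
        else:
            # value of sampling a challenger: it may overtake the current best
            voi[i] = ((1.0 - means[alpha]) / (counts[i] + 1)) * \
                     math.exp(-2.0 * (means[alpha] - means[i]) ** 2 * counts[i])
    return max(range(k), key=lambda i: voi[i])

# usage: with means [0.6, 0.55, 0.2] and counts [40, 35, 5],
# the under-sampled third arm gets the next sample.
print(voi_aware_choice([0.6, 0.55, 0.2], [40, 35, 5]))
```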