MCTS Based on Simple Regret
AAAI 2012

David Tolpin, Solomon Eyal Shimony
Ben-Gurion University of the Negev,
Beer Sheva, Israel

Multi-armed Bandits

UCB and UCT

Metareasoning

Acknowledgments

Experiments

Random Trees

[Figure: regret vs. number of samples]

Sailing Domain

[Figure: path cost vs. number of samples]

[Figure: path cost vs. exploration factor]

Monte-Carlo Sampling in Trees

Main Results

The SR+CR MCTS Scheme

Rollout(node, depth=1)
  if IsLeaf(node, depth)
    return 0
  else
    if depth = 1 then action $\gets$ FirstAction(node)
    else action $\gets$ NextAction(node)
    next $\gets$ NextState(node, action)
    reward $\gets$ Reward(node, action, next)
                 + Rollout(next, depth+1)
    UpdateStats(node, action, reward)
    return reward
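
The rollout above can be sketched in Python. This is a minimal illustration, not the authors' code: `first_action` (used at the root, depth 1) minimizes simple regret with $\varepsilon$-greedy sampling, while `next_action` (deeper levels) uses UCB1 for cumulative regret; the `step` and `is_leaf` helpers are assumed to be supplied by the problem domain.

```python
import math
import random

class Node:
    """Per-action visit counts and running mean rewards for one state."""
    def __init__(self, actions):
        self.n = {a: 0 for a in actions}       # samples taken per action
        self.mean = {a: 0.0 for a in actions}  # running mean reward per action

    def first_action(self, eps=0.5):
        """Root-level choice: epsilon-greedy, aimed at simple regret."""
        untried = [a for a in self.n if self.n[a] == 0]
        if untried:
            return random.choice(untried)
        if random.random() < eps:
            return max(self.n, key=lambda a: self.mean[a])  # exploit best arm
        return random.choice(list(self.n))                  # explore uniformly

    def next_action(self, c=2.0):
        """Deeper levels: UCB1, aimed at cumulative regret."""
        untried = [a for a in self.n if self.n[a] == 0]
        if untried:
            return random.choice(untried)
        total = sum(self.n.values())
        return max(self.n, key=lambda a: self.mean[a]
                   + math.sqrt(c * math.log(total) / self.n[a]))

    def update(self, action, reward):
        self.n[action] += 1
        self.mean[action] += (reward - self.mean[action]) / self.n[action]

def rollout(tree, state, depth, max_depth, step, is_leaf):
    """One SR+CR rollout; returns the accumulated reward."""
    if is_leaf(state, depth) or depth > max_depth:
        return 0.0
    node = tree[state]
    action = node.first_action() if depth == 1 else node.next_action()
    next_state, reward = step(state, action)
    reward += rollout(tree, next_state, depth + 1, max_depth, step, is_leaf)
    node.update(action, reward)
    return reward
```

Running many such rollouts and then recommending the root action with the highest mean mirrors the scheme: the root sampling rule is tuned for the final recommendation, while the interior rule keeps intermediate estimates accurate.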

Sampling for Simple Regret

  1. $\varepsilon$-greedy sampling $\left(\varepsilon=\frac 1 2\right)$.
  2. Modified version of UCB (optimized for simple regret).
  3. VOI-aware sampling: $$VOI_\alpha\approx\frac {\overline X_\beta} {n_\alpha+1}\exp\left(-2(\overline X_\alpha - \overline X_\beta)^2 n_\alpha\right)$$ $$VOI_i\approx\frac {1-\overline X_\alpha} {n_i+1}\exp\left(-2(\overline X_\alpha - \overline X_i)^2 n_i\right),\; i\ne\alpha$$
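
The VOI formulas above can be computed directly. A minimal sketch (function name and list-based interface are illustrative, not from the poster), assuming rewards in $[0, 1]$, where $\alpha$ is the arm with the best mean and $\beta$ the second-best:

```python
import math

def voi_estimates(means, counts):
    """Estimate the value of information of sampling each arm, per the
    SR formulas: VOI of resampling the best arm alpha depends on the
    runner-up beta; VOI of any other arm i depends on its gap to alpha."""
    alpha = max(range(len(means)), key=lambda i: means[i])   # best arm
    beta = max((i for i in range(len(means)) if i != alpha),
               key=lambda i: means[i])                       # second-best arm
    voi = []
    for i in range(len(means)):
        if i == alpha:
            # sampling alpha pays off only if beta turns out better
            v = (means[beta] / (counts[alpha] + 1)) \
                * math.exp(-2 * (means[alpha] - means[beta]) ** 2
                           * counts[alpha])
        else:
            # sampling arm i pays off only if it overtakes alpha
            v = ((1 - means[alpha]) / (counts[i] + 1)) \
                * math.exp(-2 * (means[alpha] - means[i]) ** 2 * counts[i])
        voi.append(v)
    return voi
```

VOI-aware sampling then picks the arm with the largest estimate; note how the exponential factor suppresses arms whose means are already far from the best arm's.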

Contributions


Future Work

http://www.offtopia.net/aaai-2012-srcr-poster/