Lecture 15 Monte Carlo Tree Search

Alpha-Beta Pruning Recap

Alpha-beta pruning is a search algorithm optimization technique used to reduce the number of nodes that need to be evaluated in the search tree by the minimax algorithm. It is an adversarial search algorithm used most often for machine playing of two-player games (Tic-tac-toe, Chess, Go, etc.). It stops completely evaluating a node when at least one possibility has been found that proves the node to be worse than a previously examined node. Such nodes need not be evaluated further. For realistic games, optimality may be sacrificed using heuristic evaluation functions to cut off the search, which means accepting a potentially less-than-perfect solution to make the search process feasible within a reasonable amount of time.
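As a reminder of how the cut-off works, here is a minimal sketch of minimax with alpha-beta pruning. It assumes a toy tree given as nested Python lists whose leaves are utilities for the maximizing player; the function name `alphabeta`, the `visited` list, and the tree values are illustrative only.

```python
import math

def alphabeta(node, alpha=-math.inf, beta=math.inf, maximizing=True, visited=None):
    """Minimax value of `node` with alpha-beta pruning.

    A node is either a number (leaf utility) or a list of child nodes.
    If `visited` is a list, every evaluated leaf is appended to it.
    """
    if not isinstance(node, list):
        if visited is not None:
            visited.append(node)
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False, visited))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # the MIN player above will never allow this branch
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True, visited))
        beta = min(beta, value)
        if alpha >= beta:
            break  # the MAX player above already has a better option
    return value

# Hypothetical two-ply tree: MAX at the root, three MIN nodes below it.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
visited = []
value = alphabeta(tree, visited=visited)
```

On this tree the second MIN node is abandoned after its first leaf (2), because MAX already has a guaranteed 3 from the first branch, so only 7 of the 9 leaves are evaluated.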

AlphaGo and Monte Carlo Tree Search

AlphaGo combined Monte Carlo tree search (or a variant) with neural networks to create an AI Go player that competed with and eventually beat top human players. This was a landmark achievement in the field of AI, demonstrating the power of combining search algorithms with machine learning techniques.
Go is more challenging than chess due to its large branching factor (the number of possible moves at each turn) and the difficulty in designing a good evaluation function (a method of assessing the quality of a game state). The vast number of possibilities in Go makes it impossible to explore every possible move, necessitating the use of intelligent search algorithms and evaluation functions.
AlphaGo uses a deep learning-trained evaluation function (neural network). This neural network is trained on a large dataset of Go games and is capable of accurately predicting the outcome of a game from a given state. This allows AlphaGo to make informed decisions about which moves to play.

Why Monte Carlo Tree Search?

Monte Carlo algorithms use random sampling to evaluate a function. This approach is particularly useful when dealing with complex systems where it is difficult to calculate the exact value of a function.
In this case, random games estimate the utility of a move. By simulating a large number of random games, the algorithm can get a good estimate of the value of a particular move.
Randomly generating games is not sufficient on its own for complex games like Go, where purely random play is unlikely to lead to good results. The algorithm needs to be more intelligent in how it explores the game tree.
Monte Carlo Tree Search balances exploration (random actions) and exploitation (using learned information). This means that the algorithm will sometimes try new, potentially risky moves (exploration) and sometimes stick with moves that have worked well in the past (exploitation).
A formula balances exploration vs. exploitation during game tree search. This formula helps the algorithm decide when to explore new moves and when to exploit existing knowledge.

Basic Monte Carlo Tree Search

Basic Monte Carlo Tree Search doesn't use an evaluation function. Instead, it relies solely on random simulations to estimate the quality of a move. This makes it applicable to games where it is difficult to design a good evaluation function.
It plays random games to estimate the quality of a move. For each possible move, random games are started, with both players choosing moves randomly. The average quality of these random games determines the move's quality. This approach is based on the idea that the more often a move leads to a win in a random game, the better that move is likely to be.
For chess, the win/loss percentage among randomly generated games is used. Draws can be factored in (e.g., 1 for win, 0.5 for draw). The algorithm keeps track of the number of wins, losses, and draws for each possible move and uses these statistics to estimate the quality of the move.
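The idea can be sketched on a much smaller game than chess. The example below uses tic-tac-toe purely for illustration (an assumption, since a chess implementation would not fit here) and scores each move as the average result of random games, with 1 for a win, 0.5 for a draw, and 0 for a loss. All function names are hypothetical.

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell == ' ']

def random_playout(board, to_move, rng):
    """Both players move uniformly at random until the game ends.
    Returns 'X', 'O', or None for a draw."""
    board = list(board)
    while winner(board) is None and legal_moves(board):
        board[rng.choice(legal_moves(board))] = to_move
        to_move = 'O' if to_move == 'X' else 'X'
    return winner(board)

def estimate_moves(board, player, n_playouts=500, seed=0):
    """Average playout result per legal move (win=1, draw=0.5, loss=0)."""
    rng = random.Random(seed)
    opponent = 'O' if player == 'X' else 'X'
    scores = {}
    for move in legal_moves(board):
        child = list(board)
        child[move] = player
        total = 0.0
        for _ in range(n_playouts):
            result = random_playout(child, opponent, rng)
            total += 1.0 if result == player else 0.5 if result is None else 0.0
        scores[move] = total / n_playouts
    return scores

scores = estimate_moves([' '] * 9, 'X')
best = max(scores, key=scores.get)  # move with the highest average result
```

A move that wins on the spot always scores exactly 1.0, because every play out from it ends immediately with a win.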

Play Out/Roll Out Policy

The strategy used to perform these random game simulations is called the "play out policy" or "roll out policy." This policy determines how the game is played out from a given state to the end of the game.
Many random simulations (thousands or more) are needed to estimate the win percentage accurately. The more simulations that are performed, the more accurate the estimate of the win percentage will be.
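To get a feel for "thousands or more", one can look at the standard error of an estimated win rate. The sketch below assumes the play outs are independent with a fixed true win probability `p` (a simplification that only approximately holds inside a growing search tree); the function name is illustrative.

```python
import math

def win_rate_standard_error(p, n):
    """Standard error of a win-rate estimate from n independent play outs
    when the true win probability is p (binomial model)."""
    return math.sqrt(p * (1.0 - p) / n)

for n in (100, 1_000, 10_000):
    print(n, round(win_rate_standard_error(0.5, n), 4))
```

For `p = 0.5` this gives 0.05 with 100 play outs, about 0.0158 with 1,000, and 0.005 with 10,000: the error shrinks only with the square root of the number of simulations, so ten times the precision costs a hundred times the play outs.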
For complex games such as Go, play outs with purely random moves aren't effective, because random play is unlikely to resemble realistic games.
A more intelligent strategy simulates real games, using a simple neural network model to choose moves. This neural network model can be trained on a large dataset of game data and can learn to make more informed decisions about which moves to play.

Balancing Exploration and Exploitation

Complete exploration of the game tree can be inefficient. Exploring every possible move in the game tree can take a very long time, especially in complex games.
A more intelligent strategy focuses on parts of the game tree already identified as interesting or leading to high utility. This means that the algorithm will focus on exploring the parts of the game tree that are most likely to lead to a good outcome.

Example Scenario

Consider a node with win statistics for white (e.g., 37 wins out of 100 games). This node represents a particular game state, and the win statistics indicate how often white has won from that state in the past.
The next level has nodes for black's possible moves, with win statistics for black. Each of these nodes represents a possible move that black can make from the current state, along with the win statistics for black after making that move.
Pure Monte Carlo Tree Search would start random simulations from each of black's moves. This means that the algorithm would play a large number of random games from each of black's possible moves to estimate the value of each move.
Instead, we use previously learned information and traverse the tree to a leaf node to start simulations. This means that the algorithm will use the win statistics to guide its search and explore the parts of the game tree that are most likely to lead to a good outcome.

Selection Policy

The "selection policy" chooses which part of the stored game tree to explore. This policy determines which node in the game tree the algorithm will visit next.
The selection policy balances the win percentage with exploration, considering nodes with fewer simulations. This means that the algorithm will favor nodes that have a high win percentage but will also explore nodes that have not been visited as often. A formula is used to select the action, considering both win percentage and the number of simulations.

Expansion and Simulation

The selection policy helps traverse the tree and find a leaf node. Once the algorithm has reached a leaf node, it will expand that node by adding a new child node.
A new node is added to the leaf node, corresponding to a randomly chosen move. This new node represents a possible move that can be made from the current state.
A simulated game is started from the new node using the roll out policy, which determines how the game is played out from the new node to the end of the game.
The game results in a win, loss, or draw, and the new leaf node's value is updated. The algorithm updates the win statistics for the new node based on the outcome of the simulated game.

Back Propagation

Win statistics are updated not only at the new leaf node but also along the path from the root to the leaf node. This means that the algorithm updates the win statistics for all of the nodes that were visited on the path from the root node to the new leaf node.
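A minimal sketch of this update, assuming each node keeps a pointer to its parent and stores wins from the point of view of the player who moved into it (so consecutive levels alternate between the two players, as in the white/black example above). The `Node` class and the `1 - result` flip are illustrative conventions, not the only possible bookkeeping.

```python
class Node:
    """Hypothetical tree node holding only what back propagation needs."""
    def __init__(self, parent=None):
        self.parent = parent
        self.wins = 0.0
        self.visits = 0

def backpropagate(node, result):
    """Walk from the new leaf up to the root and update every node.

    `result` is 1.0 (win), 0.5 (draw) or 0.0 (loss) from the point of
    view of the player who moved into `node`.
    """
    while node is not None:
        node.visits += 1
        node.wins += result
        result = 1.0 - result  # the parent level belongs to the other player
        node = node.parent

root = Node()
child = Node(parent=root)
leaf = Node(parent=child)
backpropagate(leaf, 1.0)
```

After this single call every node on the path has one visit; the leaf and the root record a win, while the node between them (the opponent's level) records a loss.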

Four Main Steps of Monte Carlo Tree Search

Selection: Select a leaf node in the tree using the selection policy.
Expansion: Expand the leaf node by adding a new node (randomly choosing a move).
Simulation: Perform a simulation starting at the new node, using the roll out policy.
Back Propagation: Update win/loss statistics along the path from the new leaf node to the root node.

These steps are repeated until a time limit or iteration limit is reached.

Algorithm Pseudo Code

The four main steps (selection, expansion, simulation, back propagation) are performed repeatedly until time runs out.
After the time limit, a move is chosen based on the tree's statistics.
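The loop can be sketched in Python as follows. This is a minimal illustration under several assumptions, not a reference implementation: the game is a toy version of Nim (players alternately take 1 to 3 stones, and whoever takes the last stone wins), the iteration limit stands in for a time limit, selection uses the UCB1 formula with $c = \sqrt{2}$, and the play out policy is uniformly random. The names `Node`, `ucb1`, and `mcts` are hypothetical.

```python
import math
import random

class Node:
    def __init__(self, stones, just_moved, parent=None, move=None):
        self.stones = stones          # stones left after `move` was played
        self.just_moved = just_moved  # player (0 or 1) who played `move`
        self.parent, self.move = parent, move
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.wins, self.visits = 0.0, 0  # wins counted for `just_moved`

def ucb1(child, c=math.sqrt(2)):
    return (child.wins / child.visits
            + c * math.sqrt(math.log(child.parent.visits) / child.visits))

def mcts(stones, to_move, iterations=2000, seed=0):
    """Return the move (1, 2 or 3) chosen for `to_move`; needs stones >= 1."""
    rng = random.Random(seed)
    root = Node(stones, just_moved=1 - to_move)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded.
        while not node.untried and node.children:
            node = max(node.children, key=ucb1)
        # 2. Expansion: add one child for a randomly chosen untried move.
        if node.untried:
            move = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(node.stones - move, 1 - node.just_moved, node, move)
            node.children.append(child)
            node = child
        # 3. Simulation: random play out until no stones remain.
        remaining, last = node.stones, node.just_moved
        while remaining > 0:
            remaining -= rng.randint(1, min(3, remaining))
            last = 1 - last
        winner = last  # the player who took the last stone
        # 4. Back propagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.wins += 1.0 if winner == node.just_moved else 0.0
            node = node.parent
    # Final choice: the root child with the most simulations.
    return max(root.children, key=lambda ch: ch.visits).move

move = mcts(stones=5, to_move=0)
```

In this toy game the known winning strategy is to leave the opponent a multiple of four stones, which gives a simple way to sanity-check the move the search settles on.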

Choosing a Move After the Search

One idea is to pick the move with the highest win percentage. However, the corresponding node may have been visited only once, so choosing the action whose node has the highest estimated win probability is not a good idea. The number of simulations should be taken into account.
The original algorithm often picks the action with the highest number of simulations, assuming the selection policy is intelligent and considers win percentage.
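The difference between the two rules shows up clearly on a small example. The statistics below are made up for illustration, and both function names are hypothetical.

```python
def highest_win_rate(stats):
    """Pick the move with the best wins / simulations ratio."""
    return max(stats, key=lambda m: stats[m][0] / stats[m][1])

def most_simulated(stats):
    """Pick the move whose node has the most simulations."""
    return max(stats, key=lambda m: stats[m][1])

# Hypothetical root statistics: move -> (wins, simulations)
stats = {"a": (60, 100), "b": (1, 1), "c": (20, 50)}

by_rate = highest_win_rate(stats)    # "b": 1 win out of 1 simulation
by_visits = most_simulated(stats)    # "a": 100 simulations, 60% wins
```

Move "b" has a perfect win rate but rests on a single play out, while "a" has been examined a hundred times. Picking by simulation count avoids trusting the lucky single sample.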

Exploitation and Exploration

Exploitation (finding a good leaf node to expand) is performed in the selection step.
Exploration (visiting new states) is also incorporated in the selection policy.
This balances exploiting current knowledge with exploring new states. This exploration/exploitation trade-off also occurs in reinforcement learning.

Four Main Steps Revisited

Selection: Choose a path through the tree using the selection policy.
Expansion: Add a new child node at the leaf node.
Simulation: Run a simulated game using the roll out policy.
Back Propagation: Update statistics in the tree.

UCB1 Formula

The play out policy can be completely random or slightly more intelligent.
A crucial component is choosing which leaf node to visit in the existing tree.
The UCB1 (Upper Confidence Bound 1) formula computes a numeric value for each possible node/action.
$UCB1 = \frac{\mathrm{wins}}{\mathrm{simulations}} + c \cdot \sqrt{\frac{\log(\mathrm{parent\_visits})}{\mathrm{visits}}}$
Where:
- wins = number of wins for the node
- simulations = total number of simulations through the node
- c = exploration parameter to tune (theoretically $\sqrt{2}$, in practice it can differ)
- parent_visits = number of visits to the parent node
- visits = number of visits to the current node (the same count as simulations)

The node with the largest UCB1 score is selected.
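A small sketch of the formula and the resulting choice. The child statistics are hypothetical numbers chosen for illustration, the logarithm is taken as the natural log, and giving unvisited nodes an infinite score is a common convention assumed here rather than something stated above.

```python
import math

def ucb1(wins, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score of a node; wins / visits plus the exploration term."""
    if visits == 0:
        return math.inf  # assumption: unvisited nodes are tried first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, parent_visits, c=math.sqrt(2)):
    """children: list of (name, wins, visits); returns the best name."""
    best = max(children, key=lambda ch: ucb1(ch[1], ch[2], parent_visits, c))
    return best[0]

# Hypothetical parent with 100 visits and three children.
children = [("A", 60, 79), ("B", 1, 10), ("C", 2, 11)]
chosen = select_child(children, parent_visits=100)
chosen_more_exploration = select_child(children, parent_visits=100, c=2.0)
```

With $c = \sqrt{2}$ the scores are roughly 1.101 for A, 1.060 for B, and 1.097 for C, so the well-performing child A is selected, but only narrowly. Raising $c$ to 2 tips the balance toward the rarely visited child C.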

Components of the UCB1 Formula

The first component is the estimated utility of the node (wins / simulations).
The second component incorporates exploration, encouraging visits to less-visited nodes.
$c$ is a tunable parameter; in practice its best value may differ from $\sqrt{2}$.
The exploration term is larger for nodes that have been visited less often, so for those nodes its relative contribution to the score is larger.

Formula Interpretation

The win percentage (exploitation) is balanced with a term encouraging exploration of less-visited states.
By appropriately combining the win percentage with how often the state has been visited, you can balance the trade-off between exploitation and exploration.
The first term of the formula is the estimated win ratio (exploitation term).
The second term looks at how often the node has been visited (exploration term).

Advantages of Monte Carlo Tree Search

Can be applied to games where human experts haven't developed good evaluation functions.
In AlphaGo, Monte Carlo Tree Search was combined with self-play reinforcement learning to learn new neural network models. The AlphaGo agent used a version of the selection formula in which a probability estimated by a neural network is included in the exploration term. The neural network was optimized using self-play.
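One commonly described form of such a formula (often called PUCT) multiplies the exploration term by the network's prior probability for the move. The sketch below is an assumption about the general shape only: the exact expression, the constant `c_puct`, and all names are illustrative, and the prior is passed in as a plain number rather than computed by a real network.

```python
import math

def puct_score(q, prior, parent_visits, visits, c_puct=1.0):
    """Selection score with a prior-weighted exploration term.

    q             : current value estimate of the move (e.g. win rate)
    prior         : probability the (hypothetical) network assigns to the move
    parent_visits : number of visits to the parent node
    visits        : number of visits to this move's node
    """
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + visits)

# Two moves with identical statistics but different priors.
favoured = puct_score(q=0.5, prior=0.8, parent_visits=100, visits=10)
unfavoured = puct_score(q=0.5, prior=0.1, parent_visits=100, visits=10)
```

With equal statistics, the move the network considers more plausible gets the larger score and is explored first; as its visit count grows, the exploration bonus shrinks just as in UCB1.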

Limitations of Monte Carlo Tree Search

It's stochastic, and there's no guarantee that all parts of the tree will be visited (unless run for an infinite number of simulations).
It's possible that a good move will be missed.
In games like chess, human-generated evaluation functions might identify clearly better moves faster than basic Monte Carlo Tree Search.

Improvements and Enhancements

AlphaGo utilizes machine learning to generate evaluation functions for early play out termination.

Key Takeaways

A single play out is cheap to compute.
Depending on the complexity of the game, the selection formula can generate a very deep path through the tree, and updating the statistics along it is cheap.

Upcoming Topics

Next week: Logical reasoning using propositional logic.
Test next Friday.