Paper - Mastering the Game of Go with Deep Neural Networks and Tree Search
This was the paper that introduced AlphaGo Zero, which was the first Go computer program to beat a human professional player in a full-sized game of Go.
It’s a lot harder to get a computer to play Go because the search space is much larger (the average branching factor is around ~250, compared to ~35 for chess), and because it’s difficult to accurately evaluate the board.
The main innovation of this paper was augmenting Monte-Carlo tree search with two deep neural networks: a “value network”, which evaluates board positions, and a “policy network” that selects moves, which were trained through a combination of supervised and reinforcement learning.
More notes
- All games of perfect information have an optimal value function $v^\ast (s)$ which determines the outcome under perfect play by all players.
- Minimax is an algorithm for computing this optimal value function, but this is often intractable.
- The search space is typically reduced by two principles:
    - Reducing the depth by cutting off search when a depth limit is reached, and then using an approximation, or
- Reducing the breadth by sampling actions from a policy $p(a \mid s)$, a distribution of the possible moves in a given state, rather than considering all actions in a given state.
 
- Monte-Carlo rollouts evaluates nodes by randomly sampling actions until the game is over.
- Monte-Carlo tree search combines traditional search with Monte-Carlo rollouts, and the policy is improved over time.
- AlphaZero uses a value network for evaluating states, and a policy network for sampling actions.
- Initially, two supervised learning policy networks $p _ \sigma$ and $p _ \pi$ are trained from expert human moves. $p _ \sigma$ is a much larger network than $p _ \pi$.
- Then a reinforcement learning policy network $p _ \rho$ is trained which improves the $p _ \sigma$ network by optimising for the final outcome of games of self-play. This means that the network predicts actions which will lead to winning outcomes, rather than the most likely move a human would play.
- Then a value network $v _ \theta$ is trained that predicts the winner of games played by $p _ \rho$ against itself.
- These are all integrated into the standard Monte-Carlo tree search algorithm, which is modified to also be able to run at scale.