Network structure of AlphaGo and CleverGo

AlphaGo, commonly nicknamed "Alpha Dog" in Chinese media, is the first AI in the world to defeat a professional human Go champion. In October 2015, AlphaGo defeated the European Go champion Fan Hui (professional 2-dan) 5-0. In March 2016, AlphaGo defeated world champion Lee Sedol 4-1. In 2017, the new version, AlphaGo Zero, which relies on no human game data and learns from scratch purely through self-play, defeated AlphaGo 100-0.

AlphaGo uses a policy network and a value network to guide Monte Carlo tree search, reducing both the depth and the breadth of the search. The move-selection strategy of CleverGo (Qiqiao Go) is based entirely on the AlphaGo Zero algorithm. This article describes the states and actions of the game of Go in the language of reinforcement learning and introduces the policy network and value network built in AlphaGo and CleverGo.

1. Actions and states

The Go board is a 19 × 19 grid. Black and White take turns placing stones on the intersections of the lines, so there are 19 × 19 = 361 positions where a stone can be placed. In addition, a player may choose PASS (giving up the right to place a stone this turn), so the action space is $\mathcal{A} = \{0, 1, 2, \cdots, 361\}$, where action $i$ (numbered from 0) means placing a stone at position $i$, and action 361 represents PASS.

CleverGo is an artificial intelligence program for 9×9 Go, that is, the board is a 9 × 9 grid. The corresponding action space is $\mathcal{A} = \{0, 1, 2, \cdots, 81\}$, with action 81 representing PASS.
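
As a concrete illustration of this action encoding, here is a minimal sketch (not code from the CleverGo project; the helper name action_to_coordinate is made up) that maps an action index to a board coordinate:

def action_to_coordinate(action: int, board_size: int = 9):
    """Map an action index to a (row, col) coordinate, or None for PASS.

    Actions 0 .. board_size*board_size - 1 index the intersections row by row;
    the last action (index board_size*board_size) represents PASS.
    """
    if action == board_size * board_size:
        return None  # PASS
    return divmod(action, board_size)  # (row, col)

print(action_to_coordinate(0))   # (0, 0): the first intersection
print(action_to_coordinate(80))  # (8, 8): the last intersection on a 9 x 9 board
print(action_to_coordinate(81))  # None: PASS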

The 2016 version of AlphaGo uses a 19 × 19 × 48 tensor to represent a state, while AlphaGo Zero uses a 19 × 19 × 17 tensor. As shown in Figure 1, the state tensor used in AlphaGo Zero has the following meaning:

  • Each slice of the state tensor is a 19 × 19 matrix corresponding to the 19 × 19 board. A 19 × 19 matrix can represent the positions of all black stones on the board: if a position holds a black stone, the corresponding matrix element is 1, otherwise it is 0. Similarly, a 19 × 19 matrix can represent the positions of all white stones.
  • The AlphaGo Zero state tensor contains 17 such matrices. Eight of them record the positions of the black stones on the board over the last 8 moves, and another eight record the positions of the white stones over the last 8 moves. The remaining matrix indicates which side moves next: if black is to move, all of its elements are 1; if white is to move, all of its elements are 0.

To reduce the amount of computation, CleverGo simplifies the state tensor. CleverGo represents a state with a 9 × 9 × 10 tensor: four 9 × 9 matrices record the positions of the black stones over the last four moves, and four matrices record the positions of the white stones. One matrix indicates which side moves next: if black is to move, all of its elements are 0; if white is to move, they are all 1. The last matrix marks the position of the previous move: the element at that position is 1 and all other elements are 0; if the previous move was PASS, all elements are 0.
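
To make this layout concrete, here is a minimal sketch, not the CleverGo implementation (the function name build_state and the way the board history is passed in are assumptions), of how such a 9 × 9 × 10 state tensor could be assembled with NumPy:

import numpy as np

def build_state(black_history, white_history, black_to_move, last_move, board_size=9):
    """Assemble a (10, board_size, board_size) state tensor.

    black_history / white_history: the last 4 binary planes (most recent first),
        each a (board_size, board_size) array with 1 where a stone of that color sits.
    black_to_move: True if black places the next stone.
    last_move: (row, col) of the previous move, or None if the previous move was PASS.
    """
    state = np.zeros((10, board_size, board_size), dtype=np.float32)
    for i in range(4):
        state[i] = black_history[i]      # planes 0-3: black stones over the last 4 moves
        state[4 + i] = white_history[i]  # planes 4-7: white stones over the last 4 moves
    state[8] = 0.0 if black_to_move else 1.0  # plane 8: side to move (0 = black, 1 = white)
    if last_move is not None:                 # plane 9: position of the previous move
        state[9, last_move[0], last_move[1]] = 1.0
    return state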

The state tensor of the 2016 version of AlphaGo is more complex and is not covered in detail here; see the figure below for details:

2. Policy network

The structure of the AlphaGo Zero policy network $\pi(a \mid s; \theta)$ is shown in Figure 3. Its input is the 19 × 19 × 17 state $s$, and its output is a 362-dimensional vector $f$ whose elements each correspond to one action in the action space. The output layer of the policy network uses the Softmax activation, so all elements of $f$ are positive and sum to 1.
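
For example, given the output vector $f$, an action can be drawn in proportion to the predicted probabilities. This is only a small sketch with a placeholder distribution, not code from AlphaGo Zero or CleverGo:

import numpy as np

f = np.full(362, 1.0 / 362)          # placeholder: a uniform policy output for illustration
action = np.random.choice(362, p=f)  # sample an action index; index 361 means PASS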

3. Value network

AlphaGo Zero also contains a value network $v_\pi(s; \omega)$, an approximation of the state-value function $V_\pi(s)$; its structure is shown in Figure 4. The input of the value network is the 19 × 19 × 17 state $s$, and the output is a real number in $[-1, +1]$ whose size evaluates how good or bad the current state $s$ is.

The policy network and the value network take the same input, the state $s$, and both use convolution layers to map $s$ to a feature vector, so in AlphaGo Zero the two networks share the convolutional layers.

4. CleverGo network structure

AlphaGo Zero was trained with 5000 TPUs. To reduce the amount of computation, CleverGo greatly simplifies the policy network and value network. CleverGo extracts features from the state $s$ with three convolution layers:

  • a 32-channel 3 × 3 convolution with stride 1;
  • a 64-channel 3 × 3 convolution with stride 1;
  • a 128-channel 3 × 3 convolution with stride 1.

In the policy-network head, an 8-channel 1 × 1 convolution first integrates the features, then a fully connected layer compresses the feature vector to 256 dimensions, and finally the output layer follows. In the value-network head, a 4-channel 1 × 1 convolution first integrates the features, followed by two fully connected layers and the output layer. The code is as follows:

# -*- coding: utf-8 -*-
# @Time    : 2021/3/29 21:01
# @Author  : He Ruizhi
# @File    : policy_value_net.py
# @Software: PyCharm

import paddle


class PolicyValueNet(paddle.nn.Layer):
    def __init__(self, input_channels: int = 10,
                 board_size: int = 9):
        """

        :param input_channels: The number of channels entered is 10 by default. For the last 4 steps of both sides, add a plane representing the current falling square and a plane representing the nearest hand position
        :param board_size: Chessboard size
        """
        super(PolicyValueNet, self).__init__()

        # AlphaGo Zero Network Architecture: one body, two heads
        # Feature extraction network part
        self.conv_layer = paddle.nn.Sequential(
            paddle.nn.Conv2D(in_channels=input_channels, out_channels=32, kernel_size=3, padding=1),
            paddle.nn.ReLU(),
            paddle.nn.Conv2D(in_channels=32, out_channels=64, kernel_size=3, padding=1),
            paddle.nn.ReLU(),
            paddle.nn.Conv2D(in_channels=64, out_channels=128, kernel_size=3, padding=1),
            paddle.nn.ReLU()
        )

        # Policy network part
        self.policy_layer = paddle.nn.Sequential(
            paddle.nn.Conv2D(in_channels=128, out_channels=8, kernel_size=1),
            paddle.nn.ReLU(),
            paddle.nn.Flatten(),
            paddle.nn.Linear(in_features=board_size*board_size*8, out_features=256),
            paddle.nn.ReLU(),
            paddle.nn.Linear(in_features=256, out_features=board_size*board_size+1),
            paddle.nn.Softmax()
        )

        # Value network part
        self.value_layer = paddle.nn.Sequential(
            paddle.nn.Conv2D(in_channels=128, out_channels=4, kernel_size=1),
            paddle.nn.ReLU(),
            paddle.nn.Flatten(),
            paddle.nn.Linear(in_features=board_size*board_size*4, out_features=128),
            paddle.nn.ReLU(),
            paddle.nn.Linear(in_features=128, out_features=64),
            paddle.nn.ReLU(),
            paddle.nn.Linear(in_features=64, out_features=1),
            paddle.nn.Tanh()
        )

    def forward(self, x):
        x = self.conv_layer(x)         # shared convolutional feature extraction
        policy = self.policy_layer(x)  # action probabilities, shape [batch, board_size*board_size + 1]
        value = self.value_layer(x)    # state evaluation in [-1, +1], shape [batch, 1]
        return policy, value
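
A quick sanity check of the network above (a usage sketch; the random input batch is made up for illustration):

if __name__ == '__main__':
    model = PolicyValueNet(input_channels=10, board_size=9)
    x = paddle.randn([2, 10, 9, 9])  # a batch of two random 10-channel 9 x 9 states
    policy, value = model(x)
    print(policy.shape)  # [2, 82]: probabilities over the 81 board positions plus PASS
    print(value.shape)   # [2, 1]:  evaluation of each state, in [-1, +1]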

5. Conclusion

This article introduced the two deep neural networks in AlphaGo, the policy network and the value network, and explained how they are implemented in CleverGo. In AlphaGo and CleverGo, the network structure is not fixed and can be adjusted freely according to personal experience or preference. In general, a shallower network reduces the amount of computation and speeds up training and move generation, while a deeper network may hold more promise for training a stronger player.

Finally, I hope you will like this article and give the CleverGo project a Star on GitHub~

CleverGo project link: https://github.com/QPT-Family/QPT-CleverGo

Tags: AI
