Deep learning: toward deeper networks

This post is part of my reading notes on "getting started with deep learning based on Python theory and implementation".
The code and figures are taken from that book.

Contents

Motivation for going deeper

Reduce the number of network parameters
Make learning more efficient

Speeding up deep learning

GPU-based acceleration
Distributed learning
Reducing the bit width of arithmetic precision

Implementing a deep CNN
Applications of deep learning (overview)

Object recognition
Object detection
Image segmentation

Motivation for going deeper

Reduce the number of network parameters

To be more specific, a deeper network can reach the same (or better) performance with fewer parameters than a network whose layers have not been deepened. This is easiest to see by looking at the filter size in the convolution operation.

With a 5 × 5 filter, each output node is computed from a 5 × 5 region of the input data.

As the figure above shows, after two successive 3 × 3 convolution layers each output value is likewise computed from a 5 × 5 region of the input data: an output node of the second layer sees a 3 × 3 patch of the intermediate data, and each of those values in turn sees a 3 × 3 patch of the input.

In other words, the region covered by one 5 × 5 convolution can be covered by two 3 × 3 convolutions, and with fewer parameters (2 × 3 × 3 = 18 versus 5 × 5 = 25). The gap widens as the network gets deeper. For example, three stacked 3 × 3 convolutions have 27 parameters in total, whereas a single convolution that "observes" the same region needs a 7 × 7 filter with 49 parameters.
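The counts quoted above can be checked with a few lines of Python. This is only a minimal sketch: it counts the weights of one filter for a single input and output channel, ignores biases, and assumes stride 1. The helper names are made up for illustration.

```python
def stacked_3x3_params(num_layers):
    """Weights in `num_layers` stacked 3x3 filters (one channel, no bias)."""
    return num_layers * 3 * 3


def receptive_field(num_layers, kernel=3):
    """Side length of the input region seen after stacking stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


for n in (2, 3):
    side = receptive_field(n)
    print(f"{n} x (3x3): {stacked_3x3_params(n)} weights, "
          f"one {side}x{side}: {side * side} weights")
```

Each extra 3 × 3 layer widens the observed region by 2 pixels per side but adds only 9 weights, while the equivalent single filter grows quadratically.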

This shows the advantages of deepening the network by stacking small filters: fewer parameters and a larger receptive field (the local region of the input whose changes affect a given neuron). In addition, stacking layers places activation functions such as ReLU between the convolution layers, which further improves the expressive power of the network. Each activation function adds non-linearity, and by composing non-linear functions the network can represent more complex things.

Make learning more efficient

In the first convolution layers, neurons respond to simple shapes such as edges. As the layers get deeper, neurons begin to respond to more complex things such as textures and object parts. The convolution layers of a CNN extract information hierarchically, which makes learning efficient.

For example, consider the problem of recognizing dogs. To solve it with a shallow network, the convolution layer has to grasp many characteristics of a "dog" all at once. There are many kinds of dogs, and their appearance varies greatly with the shooting conditions. Understanding the characteristics of a "dog" therefore requires a large and varied training set, which in turn means a long training time. By deepening the network, however, the problem to be learned can be decomposed hierarchically, so that each layer has a simpler problem to learn. For example, the first layer only needs to concentrate on learning edges, and it can do so efficiently with less training data. Why? Because images containing edges are far more plentiful than images of a "dog", and edge patterns are structurally simpler than the pattern of a "dog".

Speeding up deep learning

GPU-based acceleration

Baidu's AI Studio lets you use a GPU for 12 hours every day 😏 Really nice.

The GPU was originally a dedicated graphics card for image processing, but recently it has also been used for general numerical computation. Because a GPU can perform parallel numerical computation at high speed, the goal of GPU computing is to put this overwhelming computing power to work for all kinds of purposes.

Deep learning requires a large number of multiply-accumulate operations (or products of large matrices). This kind of massively parallel computation is exactly what a GPU is good at (conversely, a CPU is good at sequential, complex computation). Compared with using a CPU alone, running deep learning on a GPU can therefore give a dramatic speedup.

GPUs are mainly supplied by two companies, NVIDIA and AMD. Although GPUs from both can be used for general numerical computation, NVIDIA GPUs are a "closer" fit for deep learning. In fact, most deep learning frameworks only benefit from NVIDIA GPUs. This is because the frameworks use CUDA, NVIDIA's integrated development environment for GPU computing.

cuDNN, shown in the figure below, is a library that runs on top of CUDA and implements functions optimized for deep learning.

Note also that the NumPy library does not detect or use a GPU on its own. To compute on a GPU, you can use a substitute such as minpy or switch to a deep learning framework. Because the implementations in these notes have all used NumPy, the networks here can only be trained on the CPU.

Distributed learning

To speed up deep learning computation even further, you can consider distributed computation across multiple GPUs or multiple machines.

"How to do distributed computing" is a very hard topic. It involves many problems that cannot be solved easily, such as communication between machines and data synchronization. These problems can be left to good frameworks such as TensorFlow.

Reducing the bit width of arithmetic precision

When speeding up deep learning, memory capacity and bus bandwidth can become bottlenecks in addition to the amount of computation. For memory capacity, the issue is that a large number of weight parameters and intermediate data must be held in memory. For bus bandwidth, the bus becomes a bottleneck once the data flowing through the GPU (or CPU) bus exceeds a certain limit. With this in mind, we want to keep the number of bits of the data flowing through the network as small as possible.

Deep learning does not need high numerical precision in terms of bit width. This is an important property of neural networks, and it comes from their robustness. Robustness here means that, for example, the output stays the same even if the input image carries some small noise. Because of this robustness, the output is affected only slightly even if the data flowing through the network is "degraded".

According to earlier experimental results, deep learning trains without problems even with 16-bit half-precision floating-point numbers (half float). In fact, Pascal, NVIDIA's next-generation GPU architecture, also supports half-precision arithmetic, so half precision can be expected to become the standard in the future.

The implementations so far have paid no attention to numerical precision; Python generally uses 64-bit floating-point numbers. NumPy provides a 16-bit half-precision floating-point type (however, this is only a 16-bit storage type, and the arithmetic itself is not carried out in 16 bits). Even with NumPy's half-precision type, recognition accuracy does not drop.
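To make the storage saving concrete, here is a small sketch using NumPy's `float16` type. The weight array is random and its shape is an arbitrary stand-in for a trained convolution weight, so it says nothing about recognition accuracy; it only shows how much memory the cast saves and how large the rounding error of the cast is.

```python
import numpy as np

rng = np.random.default_rng(0)
w64 = rng.standard_normal((64, 32, 3, 3))   # float64 by default
w16 = w64.astype(np.float16)                # stored in 16 bits per value

print(w64.dtype, w64.nbytes)   # bytes used by the float64 copy
print(w16.dtype, w16.nbytes)   # one quarter of the above

# Largest rounding error introduced by the cast to half precision.
max_err = np.max(np.abs(w64 - w16.astype(np.float64)))
print(max_err)
```

The half-precision copy takes a quarter of the memory, and for values of this scale the rounding error stays small compared with the weights themselves.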

Reducing the bit width is especially important when deep learning is used in embedded applications.

Implementing a deep CNN

The network structure is modeled on VGG. All convolution layers use small 3 × 3 filters with pad = 1 and stride = 1, so the height and width of the data are preserved while the number of channels grows (16, 16, 32, 32, 64, 64 from the first layer onward). The one exception is a convolution layer with pad = 2, which makes sure the height and width are even before the last pooling layer. The pooling layers gradually shrink the spatial size of the intermediate data: they use a 2 × 2 window with pad = 0 and stride = 2, so each pooling layer leaves the number of channels unchanged and halves the height and width. Dropout layers are also added to suppress overfitting.
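The spatial sizes described above follow from the usual output-size formula, (size + 2 × pad − filter) / stride + 1. The sketch below traces the height (and width) of a 28 × 28 input through the nine convolution and pooling layers; the helper name `conv_output_size` and the `layers` list are made up for illustration.

```python
def conv_output_size(size, filter_size, pad, stride):
    """Output height/width of a convolution or pooling window."""
    return (size + 2 * pad - filter_size) // stride + 1


size = 28
trace = [size]
# (filter_size, pad, stride) for conv, conv, pool, repeated three times;
# the fourth convolution uses pad=2 as described above.
layers = [(3, 1, 1), (3, 1, 1), (2, 0, 2),
          (3, 1, 1), (3, 2, 1), (2, 0, 2),
          (3, 1, 1), (3, 1, 1), (2, 0, 2)]
for filter_size, pad, stride in layers:
    size = conv_output_size(size, filter_size, pad, stride)
    trace.append(size)

print(trace)  # [28, 28, 28, 14, 14, 16, 8, 8, 8, 4]
```

With pad = 1 on every convolution the size before the last pooling layer would be 7, which is odd; the single pad = 2 layer turns 14 into 16 so that the remaining two poolings give 8 and then 4.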

import os
import sys
import pickle

dir_path = os.path.dirname(os.path.abspath(__file__))  # Path to the current folder
pardir_path = os.path.dirname(dir_path)
sys.path.append(pardir_path)  # Add the parent directory to the module search path

import numpy as np
from func.gradient import gradient_check
from layer.common import *


class DeepConvNet:
    """
    Network structure:
        conv - relu - conv - relu - pool -
        conv - relu - conv - relu - pool -
        conv - relu - conv - relu - pool -
        affine - relu - dropout - affine - dropout - softmax

    Shape of the data as it passes through, starting from (1, 28, 28):
        (16, 28, 28) - (16, 28, 28) - (16, 14, 14) -
        (32, 14, 14) - (32, 16, 16) - (32, 8, 8) -
        (64, 8, 8) - (64, 8, 8) - (64, 4, 4) -
        (50) - (10)

    To change the convolution parameters or the number of neurons in the
    fully connected layer, adjusting the constructor arguments is not enough:
    the weight-initialization standard deviations and the input shape of the
    first hidden layer must also be changed by hand in the code.
    """

    def __init__(self, input_dim=(1, 28, 28),
                 conv_param_1={'filter_num': 16, 'filter_size': 3, 'pad': 1, 'stride': 1},
                 conv_param_2={'filter_num': 16, 'filter_size': 3, 'pad': 1, 'stride': 1},
                 conv_param_3={'filter_num': 32, 'filter_size': 3, 'pad': 1, 'stride': 1},
                 conv_param_4={'filter_num': 32, 'filter_size': 3, 'pad': 2, 'stride': 1},
                 conv_param_5={'filter_num': 64, 'filter_size': 3, 'pad': 1, 'stride': 1},
                 conv_param_6={'filter_num': 64, 'filter_size': 3, 'pad': 1, 'stride': 1},
                 hidden_size=50, output_size=10, dropout_ratio=0.5,
                 pretrain_flag=True, pkl_file_name=None):
        self.pkl_file_name = pkl_file_name
        # Guard against pkl_file_name=None, which os.path.exists cannot handle.
        if pretrain_flag and pkl_file_name is not None and os.path.exists(pkl_file_name):
            self.load_pretrain_model()
        else:
            # Initialize weights ===========
            # Number of neurons in the previous layer that each neuron is connected to
            pre_node_nums = np.array([
                1*3*3,        # conv1
                16*3*3,       # conv2
                16*3*3,       # conv3
                32*3*3,       # conv4
                32*3*3,       # conv5
                64*3*3,       # conv6
                64*4*4,       # affine1
                hidden_size,  # affine2
            ])
            weight_init_scales = np.sqrt(2.0 / pre_node_nums)  # Recommended initial value with ReLU

            self.params = {}
            pre_channel_num = input_dim[0]
            conv_params = [conv_param_1, conv_param_2, conv_param_3,
                           conv_param_4, conv_param_5, conv_param_6]
            for idx, conv_param in enumerate(conv_params):
                self.params['W' + str(idx+1)] = weight_init_scales[idx] * np.random.randn(
                    conv_param['filter_num'], pre_channel_num,
                    conv_param['filter_size'], conv_param['filter_size'])
                self.params['b' + str(idx+1)] = np.zeros(conv_param['filter_num'])
                pre_channel_num = conv_param['filter_num']
            self.params['W7'] = weight_init_scales[6] * np.random.randn(64*4*4, hidden_size)
            self.params['b7'] = np.zeros(hidden_size)
            self.params['W8'] = weight_init_scales[7] * np.random.randn(hidden_size, output_size)
            self.params['b8'] = np.zeros(output_size)

            # Build the layers ===========
            self.layers = []
            for block in range(3):
                # Each block is conv - relu - conv - relu - pool
                for idx in (2*block, 2*block + 1):
                    conv_param = conv_params[idx]
                    self.layers.append(Convolution(self.params['W' + str(idx+1)],
                                                   self.params['b' + str(idx+1)],
                                                   conv_param['stride'], conv_param['pad']))
                    self.layers.append(Relu())
                self.layers.append(Pooling(pool_h=2, pool_w=2, stride=2))
            self.layers.append(Affine(self.params['W7'], self.params['b7']))
            self.layers.append(Relu())
            self.layers.append(Dropout(dropout_ratio))
            self.layers.append(Affine(self.params['W8'], self.params['b8']))
            self.layers.append(Dropout(dropout_ratio))

            self.last_layer = SoftmaxWithLoss()

    def load_pretrain_model(self):
        with open(self.pkl_file_name, 'rb') as f:
            model = pickle.load(f)
        # Copy the saved attributes without resorting to exec.
        for key in ('params', 'layers', 'last_layer'):
            setattr(self, key, getattr(model, key))
        print('params loaded!')

    def predict(self, x, train_flg=False):
        for layer in self.layers:
            if isinstance(layer, Dropout):
                x = layer.forward(x, train_flg)
            else:
                x = layer.forward(x)
        return x

    def loss(self, x, t):
        y = self.predict(x, train_flg=True)
        return self.last_layer.forward(y, t)

    def accuracy(self, x, t, batch_size=100):
        if t.ndim != 1:
            t = np.argmax(t, axis=1)

        acc = 0.0
        for i in range(int(x.shape[0] / batch_size)):
            tx = x[i*batch_size:(i+1)*batch_size]
            tt = t[i*batch_size:(i+1)*batch_size]
            y = self.predict(tx, train_flg=False)
            y = np.argmax(y, axis=1)
            acc += np.sum(y == tt)

        return acc / x.shape[0]

    def gradient(self, x, t):
        # forward
        self.loss(x, t)

        # backward
        dout = 1
        dout = self.last_layer.backward(dout)
        for layer in reversed(self.layers):
            dout = layer.backward(dout)

        # Collect gradients from the layers that hold weights
        grads = {}
        for i, layer_idx in enumerate((0, 2, 5, 7, 10, 12, 15, 18)):
            grads['W' + str(i+1)] = self.layers[layer_idx].dW
            grads['b' + str(i+1)] = self.layers[layer_idx].db

        return grads


if __name__ == '__main__':
    from dataset.mnist import load_mnist
    from trainer.trainer import Trainer

    (x_train, t_train), (x_test, t_test) = load_mnist(
        normalize=True, flatten=False, one_hot_label=True, shuffle_data=True)

    # Settings
    train_flag = 1      # 1: train, 0: only evaluate
    gradcheck_flag = 0  # 1: run a gradient check on the trained network
    pkl_file_name = os.path.join(dir_path, 'deep_convnet.pkl')
    fig_name = os.path.join(dir_path, 'deep_convnet.png')

    net = DeepConvNet(input_dim=(1, 28, 28),
                      conv_param_1={'filter_num': 16, 'filter_size': 3, 'pad': 1, 'stride': 1},
                      conv_param_2={'filter_num': 16, 'filter_size': 3, 'pad': 1, 'stride': 1},
                      conv_param_3={'filter_num': 32, 'filter_size': 3, 'pad': 1, 'stride': 1},
                      conv_param_4={'filter_num': 32, 'filter_size': 3, 'pad': 2, 'stride': 1},
                      conv_param_5={'filter_num': 64, 'filter_size': 3, 'pad': 1, 'stride': 1},
                      conv_param_6={'filter_num': 64, 'filter_size': 3, 'pad': 1, 'stride': 1},
                      hidden_size=50, output_size=10, dropout_ratio=0.5,
                      pretrain_flag=True, pkl_file_name=pkl_file_name)

    trainer = Trainer(net, x_train, t_train, x_test, t_test,
                      epochs=5, mini_batch_size=128,
                      optimizer='Adam', optimizer_param={},
                      save_model_flag=True, pkl_file_name=pkl_file_name,
                      plot_flag=True, fig_name=fig_name,
                      evaluate_sample_num_per_epoch=1000, verbose=True)

    if gradcheck_flag == 1:
        gradient_check(net, x_train[:2], t_train[:2])

    if train_flag:
        trainer.train()
    else:
        acc = net.accuracy(x_train, t_train)
        print('accuracy:', acc)
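The docstring in the code above warns that `pre_node_nums` has to be edited by hand whenever the layer parameters change. As a possible alternative, the fan-in of each layer can be derived from the parameters themselves. This is only a sketch under the same assumptions as the network above (six convolutions, a 4 × 4 final feature map, two affine layers); the names `conv_params`, `final_map_size`, `fan_ins` and `he_scales` are made up for illustration.

```python
import numpy as np

conv_params = [
    {'filter_num': 16, 'filter_size': 3},
    {'filter_num': 16, 'filter_size': 3},
    {'filter_num': 32, 'filter_size': 3},
    {'filter_num': 32, 'filter_size': 3},
    {'filter_num': 64, 'filter_size': 3},
    {'filter_num': 64, 'filter_size': 3},
]
input_channels, final_map_size, hidden_size = 1, 4, 50

fan_ins = []
channels = input_channels
for p in conv_params:
    # Each output value of a convolution sees channels * k * k inputs.
    fan_ins.append(channels * p['filter_size'] ** 2)
    channels = p['filter_num']
fan_ins.append(channels * final_map_size ** 2)  # first affine layer
fan_ins.append(hidden_size)                     # second affine layer

# Standard deviation sqrt(2 / n), the initial value recommended with ReLU.
he_scales = np.sqrt(2.0 / np.array(fan_ins))
print(fan_ins)
```

The resulting list matches the hand-written values 1*3*3, 16*3*3, 16*3*3, 32*3*3, 32*3*3, 64*3*3, 64*4*4 and hidden_size.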

After three epochs of training, the accuracy already exceeded 99%, with no sign of overfitting.

=============== Final Test Accuracy ===============
test acc:0.9917

Applications of deep learning (overview)

Object recognition

Object detection

Image segmentation

Image segmentation means classifying an image at the pixel level. As shown in the figure below, training uses supervised data in which each object is colored pixel by pixel. At inference time, every pixel of the input image is then classified.

The simplest way to segment an image with a neural network is to run inference once for every pixel. For example, you could prepare a network that classifies the pixel at the center of a rectangular region and run it with every pixel as the target. As you can imagine, this requires as many forward passes as there are pixels, so it takes a great deal of time (more precisely, the problem is that the convolution operations pointlessly recompute many overlapping regions). To remove this wasted computation, a method called FCN (Fully Convolutional Network) was proposed. It classifies all pixels in a single forward pass.

Whereas an ordinary CNN contains fully connected layers, FCN replaces them with convolution layers that play the same role. In the fully connected layers of networks used for object recognition, the spatial volume of the intermediate data is flattened into a single row of nodes, whereas in a network made up only of convolution layers the spatial volume can be preserved all the way to the final output.

In a fully connected layer, every output is connected to all of the inputs. A convolution layer can realize exactly the same connections. For example, a fully connected layer applied to data of size 32 × 10 × 10 (32 channels, height 10, width 10) can be replaced by a convolution layer whose filter size is 32 × 10 × 10. If the fully connected layer has 100 output nodes, a convolution layer with 100 filters of size 32 × 10 × 10 performs exactly the same processing. In this way, a fully connected layer can be replaced by a convolution layer that does the same job.
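The 32 × 10 × 10 example above can be checked numerically. The sketch below uses random values and omits biases; since each filter is as large as the input and no padding is used, the filter fits in exactly one position, so the "convolution" reduces to one sum of products per filter, written here with `np.einsum`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 10, 10))             # one input: channels, height, width
filters = rng.standard_normal((100, 32, 10, 10))  # 100 filters as large as the input

# Fully connected layer: flatten the input and multiply by a (3200, 100) weight matrix.
W = filters.reshape(100, -1).T
fc_out = x.reshape(-1) @ W

# Convolution without padding: each filter fits the input in exactly one position,
# so the output map is 1 x 1 per filter.
conv_out = np.einsum('chw,fchw->f', x, filters)

print(fc_out.shape, conv_out.shape)
print(np.allclose(fc_out, conv_out))
```

Both paths produce the same 100 values, because the weight matrix of the fully connected layer is just the filters flattened in the same order as the input.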

As shown in the figure below, a distinctive feature of FCN is that it introduces a step at the end that enlarges the spatial size. With this step, the shrunken intermediate data can be enlarged to the same size as the input image. The final enlargement in FCN is based on bilinear interpolation, and FCN implements this bilinear enlargement with a deconvolution (transposed convolution) operation.
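To illustrate how a deconvolution can carry out interpolation, here is a one-dimensional sketch, kept to 1D for brevity rather than taken from FCN itself. A transposed convolution with stride 2 and the fixed kernel [0.5, 1, 0.5] reproduces linear interpolation away from the borders; in two dimensions the corresponding bilinear kernel is the outer product of this kernel with itself. The function name `transposed_conv1d` is made up for illustration.

```python
import numpy as np


def transposed_conv1d(x, kernel, stride):
    """Scatter each input value, weighted by `kernel`, into a longer output."""
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        out[i * stride:i * stride + len(kernel)] += value * kernel
    return out


x = np.array([1.0, 3.0, 2.0])
bilinear_kernel = np.array([0.5, 1.0, 0.5])   # fixed weights for 2x upsampling

up = transposed_conv1d(x, bilinear_kernel, stride=2)
interior = up[1:-1]                            # drop the two border samples

# Reference: plain linear interpolation at half-integer steps.
expected = np.interp(np.linspace(0, 2, 5), np.arange(3), x)
print(interior)                        # [1.  2.  3.  2.5 2. ]
print(np.allclose(interior, expected))
```

The original samples 1, 3, 2 are kept, and the new samples in between are the averages of their neighbors.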