# Intelligent system Lab1_part2 experiment document

## Basic code structure

In part 2 of this lab, I built the model in PyTorch: a convolutional neural network (CNN) based on LeNet.

In the model.py file, we build a CNN with the structure convolution layer - ReLU layer - pooling layer - convolution layer - ReLU layer - pooling layer - fully connected linear layer. The train.py file instantiates this network and loads the data for training and testing.

model.py

In the constructor, the network is initialized. The first block, conv1, has the structure convolution layer - ReLU layer - pooling layer, and its parameters are set according to the input tensor. The second block, conv2, has the same structure and differs only in the parameters of the convolution layer; its output has shape (32, 7, 7). Finally, the output passes through a fully connected linear layer to produce a 12-dimensional vector.

In PyTorch, the dimensions of an image tensor (i.e. the input and output of each layer) are ordered as [batch, channel, height, width].

```python
import torch.nn as nn


class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels=1,     # depth of the input feature map
                      out_channels=16,   # depth of the output feature map = number of kernels
                      kernel_size=5,     # size of the convolution kernel
                      stride=1,          # stride of the convolution kernel
                      padding=2),        # zero padding
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)  # 2 x 2 downsampling
        )

        # Output size after convolution: (W - F + 2P) / S + 1

        self.conv2 = nn.Sequential(
            nn.Conv2d(16, 32, 5, 1, 2),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )

        # Fully connected layer: 32 * 7 * 7 features -> 12 classes
        self.out = nn.Linear(32 * 7 * 7, 12)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = x.view(x.size(0), -1)  # flatten to (batch, 32 * 7 * 7)
        output = self.out(x)
        return output
```
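As a quick check of the (32, 7, 7) output shape quoted above, the following sketch applies the output-size formula from the code comment to each layer. It assumes a 28 × 28 single-channel input, which the text does not state but which is the size consistent with `32 * 7 * 7`; the helper names `conv_out`, `pool_out` and `lenet_shapes` are illustrative and not part of the lab code.

```python
def conv_out(w, f, p, s):
    """Output width of a convolution: (W - F + 2P) / S + 1."""
    return (w - f + 2 * p) // s + 1


def pool_out(w, k):
    """Output width of pooling with window k and stride k."""
    return w // k


def lenet_shapes(size=28):  # 28 is an assumed input size
    """Trace (channels, height, width) through conv1 and conv2."""
    shapes = [(1, size, size)]
    for out_channels in (16, 32):  # conv1, then conv2
        size = conv_out(size, f=5, p=2, s=1)
        shapes.append((out_channels, size, size))
        size = pool_out(size, k=2)
        shapes.append((out_channels, size, size))
    return shapes


for shape in lenet_shapes():
    print(shape)
```

With kernel size 5, stride 1 and padding 2, each convolution keeps the width unchanged, and each 2 × 2 pooling halves it, so 28 becomes 14 and then 7.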



The structure of the initial train.py file is:

train.py

train.py uses DataLoader to load the data set, converts the images of the training and test sets to tensors, builds the network, and defines the optimizer and loss function.

```python
import torch
import torch.nn as nn
import torch.utils.data as data
import torchvision.datasets as dset
import torchvision.transforms as transforms
from model import LeNet

torch.manual_seed(1)

# Hyperparameters
EPOCH = 20
BATCH_SIZE = 5
LR = 0.001

# Convert images to single-channel tensors
transform = transforms.Compose([
    transforms.Grayscale(1),  # single channel
    transforms.ToTensor()     # convert image data to tensor format
])

# Load the training set and the test set as tensor data
train_data = dset.ImageFolder('train', transform=transform)
test_data = dset.ImageFolder('test', transform=transform)

# The original listing uses train_loader and test_loader without defining them.
# Assumed definitions: the evaluation loop below needs a test batch size of 1.
train_loader = data.DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
test_loader = data.DataLoader(test_data, batch_size=1)

cnn = LeNet()
# Optimizer and loss function
optimizer = torch.optim.Adam(cnn.parameters(), lr=LR)
loss_func = nn.CrossEntropyLoss()

for epoch in range(EPOCH):
    for i, (batch_x, batch_y) in enumerate(train_loader):
        # Forward pass on the training batch
        output = cnn(batch_x)
        loss = loss_func(output, batch_y)
        # Clear the gradients from the previous step
        optimizer.zero_grad()
        # Backpropagate the error
        loss.backward()
        # Update the parameters
        optimizer.step()
    print("Finished epoch", epoch + 1)

# Evaluate on the test set after training
count = 0
total_loss = 0
with torch.no_grad():
    for step, (test_x, test_y) in enumerate(test_loader):
        output = cnn(test_x)
        total_loss += loss_func(output, test_y).item()
        # The index of the largest output is the predicted class
        index = torch.max(output, 1)[1].item()
        if index == test_y.item():
            count += 1

print("accuracy: " + str(count / len(test_data)))
```
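The evaluation loop compares one prediction with one label at a time, which only works when each test batch holds a single image. As a sketch of the same accuracy computation for batches of any size, the plain-Python function below takes lists of per-class scores and integer labels. The names `argmax` and `batch_accuracy` and the sample numbers are made up for illustration.

```python
def argmax(scores):
    """Index of the largest score (the predicted class)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def batch_accuracy(outputs, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = sum(1 for scores, label in zip(outputs, labels)
                  if argmax(scores) == label)
    return correct / len(labels)


# Hypothetical scores for 3 samples and 3 classes
outputs = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9]]
labels = [1, 0, 1]
print(batch_accuracy(outputs, labels))  # 2 of 3 predictions are correct
```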


The experimental results are as follows:

## Design experiment improvement network

Training deep network architectures requires a lot of data and high computing power. The many connection weights between neurons have to be adjusted iteratively through gradient descent and its variants. In addition, some architectures may overfit the training data because of their strong representational capacity. Regularization and optimization techniques can be used to address these two problems.

These techniques include data augmentation, L1 regularization, L2 regularization, Dropout, DropConnect and early stopping.

### Regularization technique

Regularization is an effective tool for ensuring that an algorithm generalizes. It is also key for deep learning models that have more trainable parameters than training samples, and it helps avoid overfitting (overfitting usually occurs when the training data does not reflect the true distribution and contains noise).

#### L1/L2 regularization

L1/L2 regularization is the most commonly used regularization method. L1 regularization adds a term to the objective function that penalizes the sum of the absolute values of the parameters; L2 regularization adds a term that penalizes the sum of their squares. According to previous studies, L1 regularization drives many parameters close to 0, so the resulting parameter vectors are sparse, which is why it is often used for feature selection. The most commonly used regularization method in machine learning is to impose an L2 norm constraint on the weights.

The standard regularized cost function is:

$$\theta = \arg\min \frac{1}{N}\sum_{i=1}^{N}\left(L(y_i, y) + \lambda R(w)\right)$$

where the regularization term $R(w)$ for L2 regularization is:

$$R_{L_2}(w) = \|W\|_2^2$$

and for L1 regularization:

$$R_{L_1}(w) = \sum_{k=1}^{Q}\|W\|_1$$

L1 regularization is not differentiable at zero, so the weights are pushed toward zero by a constant amount at each step. Many neural networks use a first-order procedure in the weight decay formulation to solve the L1-regularized problem, which is non-convex. A smooth approximation of the L1 norm is:

$$|W|_1 = \sum_{k=1}^{Q}\sqrt{w_k^2 + \varepsilon}$$
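
To make the three penalty terms concrete, here is a minimal plain-Python sketch that evaluates them on a small weight vector. The function names, the example weights and the value of `eps` are assumptions chosen for illustration; in the lab itself the penalties are computed on the network parameters.

```python
import math


def l2_penalty(weights):
    """L2 term: sum of squared weights."""
    return sum(w * w for w in weights)


def l1_penalty(weights):
    """L1 term: sum of absolute weights."""
    return sum(abs(w) for w in weights)


def smooth_l1_penalty(weights, eps=1e-8):
    """Differentiable approximation: sum of sqrt(w_k^2 + eps)."""
    return sum(math.sqrt(w * w + eps) for w in weights)


weights = [0.5, -1.5, 0.0, 2.0]  # made-up example weights
print(l2_penalty(weights))         # 6.5
print(l1_penalty(weights))         # 4.0
print(smooth_l1_penalty(weights))  # slightly above 4.0
```

The smooth version differs from the exact L1 value mainly at weights equal to zero, where it contributes `sqrt(eps)` instead of 0 but has a well-defined derivative.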

##### L1 regularization

Instead of using only the value of CrossEntropyLoss() as the loss, an additional regularization term is added before backpropagation:

```python
# Using L1 regularization
reg_loss = 0
for param in cnn.parameters():
    reg_loss += torch.sum(torch.abs(param))
classify_loss = loss_func(output, batch_y)
loss = classify_loss + 0.01 * reg_loss
```


| Epoch | Accuracy with L1 regularization | Accuracy without L1 regularization |
| --- | --- | --- |
| 8 | 0.865555 | 0.974444 |
| 10 | 0.887777 | 0.987777 |
| 12 | 0.87 | 0.98 |
| 15 | 0.878333 | 0.979 |
| 20 | 0.896666 | 0.983888 |

The results above were obtained experimentally.

The results clearly show that accuracy drops when L1 regularization is applied. This may be because the 12-class classification task is too simple, the network structure is simple with few parameters, and there is no obvious overfitting. Therefore, this strategy is not adopted.

##### L2 regularization

In this lab, L2 regularization is applied by setting the weight_decay parameter when creating the optimizer:

```python
# Using L2 regularization
optimizer = torch.optim.Adam(cnn.parameters(), lr=LR, weight_decay=0.01)
```


The following data can be obtained through experiments:

| Epoch | Accuracy with L2 regularization | Accuracy without L2 regularization |
| --- | --- | --- |
| 8 | 0.9594444 / 0.972222 | 0.974444 |
| 10 | 0.9383333 / 0.974444 | 0.987777 |
| 12 | 0.9472222 / 0.98222 | 0.98 |
| 15 | 0.9527777 / 0.979444 | 0.979 |
| 20 | 0.9633333 / 0.976666 | 0.983888 |

It can be seen that when weight_decay is set to 1e-5, the accuracy is slightly higher than that without L2 regularization.
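
To illustrate why a `weight_decay` argument acts as L2 regularization, the sketch below compares two plain gradient-descent steps on a single weight. One adds `weight_decay * w` to the gradient, which to my understanding is how PyTorch's non-decoupled optimizers apply the option; the other differentiates the loss plus the penalty `(lam / 2) * w^2` by hand. The toy loss, the function names and the numbers are assumptions for illustration, and since Adam additionally rescales the gradient, this only covers the plain SGD case.

```python
def grad_loss(w):
    """Gradient of a toy loss L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def step_with_weight_decay(w, lr, weight_decay):
    """SGD step where the optimizer adds weight_decay * w to the gradient."""
    grad = grad_loss(w) + weight_decay * w
    return w - lr * grad


def step_with_l2_penalty(w, lr, lam):
    """SGD step on L(w) + (lam / 2) * w^2, differentiated by hand."""
    grad = grad_loss(w) + lam * w  # d/dw of (lam / 2) * w^2 is lam * w
    return w - lr * grad


w0, lr, wd = 1.0, 0.1, 0.01
print(step_with_weight_decay(w0, lr, wd))
print(step_with_l2_penalty(w0, lr, wd))
```

Both steps give the same new weight, and that weight is slightly closer to zero than the one obtained with `weight_decay=0`.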

### Dropout

Bagging is a technique for reducing generalization error by combining multiple models. The main idea is to train several different models separately and then let all of them vote on the output for each test sample. Dropout can be seen as a bagging method that ensembles a large number of deep neural networks: it provides a cheap approximation to bagging that can train and evaluate an exponential number of neural networks.

Dropout temporarily discards some neurons together with their connections. Randomly discarding neurons helps prevent overfitting and efficiently combines exponentially many different network architectures. Each neuron is discarded with probability 1 − p, which reduces co-adaptation between neurons. Hidden-layer neurons are usually discarded with a probability of 0.5. At test time, the complete network is used with the outgoing weights of each node scaled by p, which approximates the average over all 2^n dropout networks. Dropout significantly reduces overfitting, and it improves the learning speed of the algorithm because not every node is trained on every training example.
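
The train-time and test-time behaviour described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: `p_keep` plays the role of p, and the activations, seed and function names are made up.

```python
import random


def dropout_train(values, p_keep, rng):
    """Training: keep each activation with probability p_keep, else zero it."""
    return [v if rng.random() < p_keep else 0.0 for v in values]


def dropout_test(values, p_keep):
    """Test: keep every activation but scale it by p_keep."""
    return [v * p_keep for v in values]


rng = random.Random(0)       # fixed seed so the run is repeatable
activations = [1.0] * 10000  # made-up activations
p_keep = 0.5

train_mean = sum(dropout_train(activations, p_keep, rng)) / len(activations)
test_mean = sum(dropout_test(activations, p_keep)) / len(activations)
print(train_mean, test_mean)  # the two means should be close
```

Note that PyTorch's `nn.Dropout(p)` takes the drop probability as `p` and rescales the kept activations by 1 / (1 − p) during training instead, so no scaling is needed at test time.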

When building the CNN, a Dropout layer is added before the fully connected layer:

```python
nn.Dropout()  # dropout layer (default drop probability 0.5)
```

| Epoch | Accuracy with Dropout | Accuracy without Dropout |
| --- | --- | --- |
| 8 | 0.972222 | 0.974444 |
| 10 | 0.9688889 | 0.987777 |
| 12 | 0.9783333 | 0.98 |
| 15 | 0.975 | 0.979 |
| 20 | 0.9794 | 0.983888 |

### Early stopping

After each epoch, the accuracy of that epoch is compared with the accuracy of the previous epoch. If the accuracy has decreased, training stops immediately to prevent overfitting.
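
The stopping rule can be sketched as a small function over a list of per-epoch accuracies. This is only an illustration: the function name and the accuracy curve are made up, and in the real training loop each accuracy would come from evaluating the network after that epoch.

```python
def early_stop_epoch(accuracies):
    """Return the 1-based epoch at which training stops.

    Training stops right after the first epoch whose accuracy is lower than
    that of the previous epoch; if that never happens, all epochs run.
    """
    previous = None
    for epoch, accuracy in enumerate(accuracies, start=1):
        if previous is not None and accuracy < previous:
            return epoch
        previous = accuracy
    return len(accuracies)


# Made-up accuracy curve that dips at epoch 4
print(early_stop_epoch([0.90, 0.95, 0.97, 0.96, 0.98]))
```

Because accuracy fluctuates from epoch to epoch, this rule can stop on a temporary dip, as in the made-up curve where epoch 5 would have been better.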

### Different optimizers

I tried several optimizers, such as stochastic gradient descent and mini-batch gradient descent. With all other conditions the same and 20 epochs of training, no obvious difference in accuracy was found.

```python
# Adam with L2 regularization
optimizer = torch.optim.Adam(cnn.parameters(), lr=LR, weight_decay=1e-5)

# Stochastic gradient descent
optimizer = torch.optim.SGD(cnn.parameters(), lr=LR)

# Averaged stochastic gradient descent
optimizer = torch.optim.ASGD(cnn.parameters(), lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)

# RMSprop
optimizer = torch.optim.RMSprop(cnn.parameters(), lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)

# Adamax (infinity-norm variant of Adam)
optimizer = torch.optim.Adamax(cnn.parameters(), lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)

# SparseAdam (expects sparse gradients)
optimizer = torch.optim.SparseAdam(cnn.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-08)

# L-BFGS (optimizer.step() must be given a closure that re-evaluates the loss)
optimizer = torch.optim.LBFGS(cnn.parameters(), lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-05, tolerance_change=1e-09, history_size=100, line_search_fn=None)

# Rprop (resilient backpropagation)
optimizer = torch.optim.Rprop(cnn.parameters(), lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))
```


### Batch Normalization

When constructing the CNN, a BatchNorm2d layer is added after the convolution layer:

```python
nn.BatchNorm2d(16),  # batch normalization over 16 channels
```

| Epoch | Accuracy with Batch Normalization | Accuracy without Batch Normalization |
| --- | --- | --- |
| 8 | 0.9661111 | 0.974444 |
| 10 | 0.9711111 | 0.987777 |
| 12 | 0.9816666 | 0.98 |
| 15 | 0.9833333 | 0.979 |
| 20 | 0.9877777 | 0.983888 |
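
For readers unfamiliar with what the BatchNorm2d layer computes, the following plain-Python sketch shows the core normalization step on the values of a single channel. It is a simplification under stated assumptions: the real layer normalizes each channel over the batch and spatial dimensions and also applies a learned scale and shift, which are omitted here; the function name and sample values are made up.

```python
import math


def batch_norm(values, eps=1e-5):
    """Normalize a batch of values to zero mean and (almost) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in values]


batch = [2.0, 4.0, 6.0, 8.0]  # made-up activations of one channel
normalized = batch_norm(batch)
print(normalized)
```

The small constant `eps` keeps the division stable when the variance of a batch is close to zero.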

## Understanding of network design

### Introduction to CNN structure

The structure diagram shows a simple CNN. The input image at the first layer is convolved to obtain the feature maps of the second layer, with a depth of 3. Pooling the feature maps of the second layer gives the feature maps of the third layer, also with a depth of 3. Repeating these operations gives the feature maps of the fifth layer, with a depth of 5. Finally, the five feature maps, i.e. five matrices, are flattened row by row, concatenated into a vector, and passed to the fully connected layer, which is a BP neural network. Each feature map in the figure can be regarded as neurons arranged in matrix form. The convolution and pooling computations are described below.

#### Convolution

An input image is represented as a matrix whose elements are the pixel values. Suppose a 5 × 5 image is convolved with a 3 × 3 convolution kernel to obtain a 3 × 3 feature map. The convolution kernel is also called a filter. The kernel slides over the input matrix; at each position, the overlapping numbers are multiplied element-wise and summed to give one element of the feature map. Here the kernel slides one unit at a time, and the stride can be adjusted as needed. If the stride is greater than 1, the kernel may not reach the edge exactly; in that case, zeros can be padded around the outside of the matrix.
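
The sliding multiply-and-sum operation can be written out directly. The sketch below is a naive, unpadded version for square inputs; the function name, the image and the kernel are made-up examples. Like deep learning frameworks, it does not flip the kernel.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image; multiply element-wise and sum."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    output = []
    for row in range(out_size):
        out_row = []
        for col in range(out_size):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[row * stride + i][col * stride + j] * kernel[i][j]
            out_row.append(total)
        output.append(out_row)
    return output


# Made-up 5 x 5 image and 3 x 3 kernel
image = [[1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 0],
         [0, 1, 1, 0, 0]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]

for row in conv2d(image, kernel):
    print(row)
```

With stride 1 this produces a 3 × 3 feature map; calling `conv2d(image, kernel, stride=2)` on the same input produces a 2 × 2 one.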

In general, the input image matrix, the convolution kernel and the feature map are square matrices. Let the size of the input matrix be $w$, the size of the convolution kernel be $k$, the stride be $s$, and the number of zero-padding layers be $p$. The size of the feature map generated by the convolution is then:

$$w' = \frac{w + 2p - k}{s} + 1$$

The above describes using one convolution kernel to produce one feature map. To extract more features, multiple convolution kernels can be applied separately, yielding multiple feature maps. When the input is a group of matrices, such as a three-channel color image or the third-layer feature maps, the convolution kernel is no longer a single layer but has a depth equal to that of the input.

#### Pooling

Pooling is also called downsampling, as opposed to upsampling. The feature maps produced by convolution are generally passed through a pooling layer to reduce the amount of data. The pooling operation is shown in the following figure:

Like convolution, pooling also has a sliding kernel, which can be called a sliding window. In the figure above, the sliding window has size $2 \times 2$ and the stride is $2$. At each position, the maximum value in the window is taken as the output; this operation is called max pooling. The mean value can be output instead, which is called mean pooling.
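
As an illustration of both variants, the sketch below pools a small square matrix with a 2 × 2 window and stride 2. The function names and the 4 × 4 feature map are made up for the example.

```python
def pool2d(matrix, size=2, stride=2, reducer=max):
    """Slide a size x size window over the matrix and reduce each window."""
    out_size = (len(matrix) - size) // stride + 1
    output = []
    for row in range(out_size):
        out_row = []
        for col in range(out_size):
            window = [matrix[row * stride + i][col * stride + j]
                      for i in range(size) for j in range(size)]
            out_row.append(reducer(window))
        output.append(out_row)
    return output


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Made-up 4 x 4 feature map
feature_map = [[1, 3, 2, 1],
               [4, 6, 5, 0],
               [7, 2, 9, 8],
               [1, 0, 3, 4]]

print(pool2d(feature_map))                # max pooling
print(pool2d(feature_map, reducer=mean))  # mean pooling
```

Either way, the 4 × 4 input is reduced to 2 × 2, which is the data reduction the pooling layer is used for.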

#### Fully connected layer

After several layers of convolution and pooling, the resulting feature maps are flattened row by row, concatenated into a vector, and fed into the fully connected network.

### Comparison between CNN and BP

**Too many parameters** In a fully connected neural network, every node in one layer is connected to every node in the next layer, so the number of parameters grows as the product of the numbers of nodes in adjacent layers. This is too large for image classification, because each image has hundreds of pixels or more as input. CNN addresses this problem through locally connected convolution, sharing the same parameters across local regions, and pooling.
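
A rough count makes the difference concrete. The sketch below compares the first convolution layer of the LeNet above with a fully connected layer that would produce an output of the same size. The helper names are illustrative, and the 28 × 28 input size is an assumption inferred from the `32 * 7 * 7` in the model.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a convolution layer (shared across positions)."""
    return out_channels * in_channels * kernel_size * kernel_size + out_channels


# conv1 of the LeNet above: 1 -> 16 channels, 5 x 5 kernel
print(conv_params(1, 16, 5))                 # 416
# A fully connected layer mapping a 28 x 28 input to a 16 x 28 x 28 output
print(linear_params(28 * 28, 16 * 28 * 28))  # 9847040
```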

**The position information between pixels is not used** In a fully connected neural network, all nodes in the same layer are treated equally, because each is connected indiscriminately to all nodes of the adjacent layers. In an image, however, a pixel is usually closely related to the pixels around it. CNN addresses this through locally connected convolution.

**Network layer limit** Because of the limitations of gradient descent, it is very difficult to build fully connected neural networks with many layers: the gradient vanishes as it propagates back through the layers. CNN mitigates this by several means, such as ReLU.
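
A small numeric sketch shows why the choice of activation matters here. It multiplies the activation derivatives along a chain of layers and ignores the weights entirely, which is a strong simplification. Using sigmoid as the contrast case is my assumption, since the text only names ReLU.

```python
import math


def sigmoid_grad(x):
    """Derivative of the sigmoid function; its maximum is 0.25 at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def chained_grad(grad_fn, x, layers):
    """Product of `layers` activation derivatives, all evaluated at x."""
    product = 1.0
    for _ in range(layers):
        product *= grad_fn(x)
    return product


print(chained_grad(sigmoid_grad, 0.0, 10))  # 0.25 ** 10, about 9.5e-07
print(chained_grad(relu_grad, 1.0, 10))     # 1.0
```

Even in the best case for sigmoid, ten layers shrink the factor below one millionth, while ReLU passes the gradient through unchanged for positive inputs.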

Posted on Tue, 02 Nov 2021 05:45:04 -0400 by fahrvergnuugen