Intelligent Systems Lab1_part2 Experiment Report
Basic code structure
In Part 2 of this lab I use a PyTorch model: a convolutional neural network based on LeNet.
In model.py we build a CNN with the structure convolution layer - ReLU layer - pooling layer - convolution layer - ReLU layer - pooling layer - fully connected (linear) layer; train.py then instantiates the network and loads the data for training and testing.
model.py
The constructor initializes the network. The first block, conv1, is a convolution layer - ReLU layer - pooling layer sequence whose parameters are set according to the input tensor. The second block, conv2, has the same convolution layer - ReLU layer - pooling layer structure with slightly different convolution parameters, and its output has shape (32, 7, 7). Finally, the output passes through a fully connected linear layer to produce a 12-dimensional vector.
In PyTorch, tensor dimensions (i.e. the input/output of each layer) are ordered as [batch, channel, height, width].
```python
import torch.nn as nn


class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels=1,    # depth of the input feature matrix
                      out_channels=16,  # depth of the output feature matrix, i.e. the number of convolution kernels
                      kernel_size=5,    # size of the convolution kernel
                      stride=1,         # stride of the convolution kernel
                      padding=2),       # zero padding
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2) # 2 * 2 down-sampling
        )
        # Output size after convolution: (W - F + 2P) / S + 1
        self.conv2 = nn.Sequential(
            nn.Conv2d(16, 32, 5, 1, 2),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.out = nn.Linear(32 * 7 * 7, 12)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = x.view(x.size(0), -1)
        output = self.out(x)
        return output
```
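As a quick sanity check (a sketch, assuming 28 * 28 grayscale inputs, which is what the 32 * 7 * 7 flatten size implies), a dummy forward pass shows the expected shapes:

```python
import torch
from model import LeNet

cnn = LeNet()
dummy = torch.randn(4, 1, 28, 28)   # [batch, channel, height, width]
out = cnn(dummy)
print(out.shape)                    # torch.Size([4, 12]) -> 12 class scores per image
```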
The structure of the initial train.py file is:
train.py
DataLoader is used to load the dataset; the training-set and test-set images are converted into tensor format, the network is built, and the optimizer and loss function are defined.
```python
import torch
import torchvision
from model import *
from torch.autograd import Variable
import torch.utils.data as data
import torchvision.datasets as dset
import matplotlib.pyplot as plt
import numpy as np
import torchvision.transforms as transforms

torch.manual_seed(1)

# Hyperparameters
EPOCH = 20
BATCH_SIZE = 5
LR = 0.001

# Load the training set and convert it into easy-to-handle tensor data
train_data = dset.ImageFolder('train', transform=transforms.Compose([
    transforms.Grayscale(1),  # single channel
    transforms.ToTensor()     # convert image data to tensor format
]))
train_loader = data.DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)

# Load the test set and convert it into easy-to-handle tensor data
test_data = dset.ImageFolder('test', transform=transforms.Compose([
    transforms.Grayscale(1),  # single channel
    transforms.ToTensor()     # convert image data to tensor format
]))
test_loader = data.DataLoader(dataset=test_data)

cnn = LeNet()

# Optimizer and loss function
optimizer = torch.optim.Adam(cnn.parameters(), lr=LR)
loss_func = nn.CrossEntropyLoss()

for epoch in range(EPOCH):
    for i, (train_x, train_y) in enumerate(train_loader):
        batch_x = Variable(train_x)
        batch_y = Variable(train_y)
        # Feed in the training data
        output = cnn(batch_x)
        loss = loss_func(output, batch_y)
        # Clear the previous gradients
        optimizer.zero_grad()
        # Back-propagate the error
        loss.backward()
        # Update the parameters
        optimizer.step()
    print("Completed iteration", epoch + 1)

    # Evaluate on the test set (test_loader uses the default batch_size=1)
    count = 0
    total_loss = 0
    for step, (test_x, test_y) in enumerate(test_loader):
        batch_x = Variable(test_x)
        batch_y = Variable(test_y)
        output = cnn(batch_x)
        total_loss += loss_func(output, batch_y).item()
        index = torch.max(output, 1)[1].data.numpy().squeeze()
        if index == batch_y.item():
            count += 1
    print("accuracy: " + str(count / len(test_data)))
```
The experimental results are as follows:
Designing experiments to improve the network
Learning a deep network architecture requires a lot of data and substantial computing power: the large number of connections and parameters between neurons must be adjusted iteratively through gradient descent and its variants. In addition, architectures with strong representational capacity may overfit. Regularization and optimization techniques can be used to address these two problems, including data augmentation, L1 regularization, L2 regularization, Dropout, DropConnect and early stopping.
Regularization techniques
Regularization is an effective tool for ensuring the generalization ability of an algorithm. It is also key for deep learning models that have more trainable parameters than training samples, since it helps avoid overfitting (overfitting typically occurs when the data the algorithm learns from does not reflect the true distribution or contains noise).
L1/L2 regularization
L1/L2 regularization is the most commonly used regularization method. L1 regularization adds a term to the objective function that penalizes the sum of the absolute values of the parameters; L2 regularization adds a term that penalizes the sum of the squares of the parameters. Previous work shows that L1 regularization tends to produce sparse parameter vectors, because it drives many parameters close to 0, so it is often used for feature selection. The most commonly used regularization in machine learning is the L2 norm constraint on the weights.
The standard regularization cost function is as follows:
$$\theta = \arg\min_{\theta} \frac{1}{N}\sum_{i=1}^{N}\left(L(y_i, \hat{y}_i) + \lambda R(w)\right)$$
where the regularization term R(w) for L2 regularization is:
$$R_{L_2}(w) = \lVert W \rVert_2^2$$
L1 regularization:
$$R_{L_1}(w) = \sum_{k=1}^{Q} \lVert W \rVert_1$$
The L1 norm is not differentiable at zero, so the weights shrink toward zero by a constant factor. Many neural networks handle the non-convex L1 regularization problem with a first-order step in the weight decay formula. An approximate, differentiable variant of the L1 norm is:
$$|W|_1 = \sum_{k=1}^{Q}\sqrt{w_k^2 + \varepsilon}$$
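For illustration only, this smoothed penalty could be computed over the network parameters as follows (a sketch: `eps` and the 0.01 weight are illustrative values, and `cnn`, `loss_func`, `output`, `batch_y` are assumed to come from the training loop in train.py):

```python
import torch

eps = 1e-8  # small constant so the square root is differentiable at 0
# Smoothed L1 penalty: sum of sqrt(w_k^2 + eps) over all parameters
smooth_l1 = sum(torch.sqrt(p ** 2 + eps).sum() for p in cnn.parameters())
loss = loss_func(output, batch_y) + 0.01 * smooth_l1
```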
L1 regularization
Instead of simply calling CrossEntropyLoss() when computing the loss, an extra regularization term is added before back-propagation:
```python
# Using L1 regularization
reg_loss = 0
for param in cnn.parameters():
    reg_loss += torch.sum(torch.abs(param))
classify_loss = loss_func(output, batch_y)
loss = classify_loss + 0.01 * reg_loss
```
epoch | Accuracy using L1 regularization | Accuracy without L1 regularization |
---|---|---|
8 | 0.865555 | 0.974444 |
10 | 0.887777 | 0.987777 |
12 | 0.87 | 0.98 |
15 | 0.878333 | 0.979 |
20 | 0.896666 | 0.983888 |
The results above show clearly that accuracy drops when L1 regularization is applied. This is probably because the 12-class task is simple, the network structure is small with few parameters, and there is no obvious overfitting; therefore this strategy is not adopted.
L2 regularization
In this lab, L2 regularization is applied by setting the weight_decay parameter of the optimizer:
```python
# Using L2 regularization
optimizer = torch.optim.Adam(cnn.parameters(), lr=LR, weight_decay=0.01)
```
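The second column of the table below shows two values per epoch; judging from the conclusion under the table, these presumably correspond to weight_decay = 0.01 and weight_decay = 1e-5. The lighter setting would be configured like this (a sketch):

```python
# L2 regularization with a smaller coefficient; reported below as slightly
# better than training without weight decay.
optimizer = torch.optim.Adam(cnn.parameters(), lr=LR, weight_decay=1e-5)
```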
The following data can be obtained through experiments:
epoch | Accuracy using L2 regularization | Accuracy without L2 regularization |
---|---|---|
8 | 0.9594444/0.972222 | 0.974444 |
10 | 0.9383333/0.974444 | 0.987777 |
12 | 0.9472222/0.98222 | 0.98 |
15 | 0.9527777/0.979444 | 0.979 |
20 | 0.9633333/0.976666 | 0.983888 |
It can be seen that with weight_decay set to 1e-5, the accuracy is slightly higher than without L2 regularization.
Dropout
Bagging is a technique that reduces generalization error by combining multiple models: several different models are trained separately, and all models then vote on the output for each test sample. Dropout can be regarded as a bagging method that integrates a large number of deep neural networks; it provides a cheap approximation to bagging that can train and evaluate an exponential number of sub-networks.
Dropout temporarily discards some neurons and their connections. Randomly dropping neurons prevents overfitting while efficiently combining exponentially many different network architectures. Each neuron is dropped with probability 1 − p, which reduces co-adaptation between neurons; hidden-layer neurons are usually dropped with probability 0.5. At test time the full network is used, with each node's output weight scaled by p, which approximates the sample average of all 2^n dropped-out sub-networks. Dropout significantly reduces overfitting and also speeds up learning by avoiding training some nodes on the training data.
When building the CNN, a Dropout layer is added before the fully connected layer:
```python
nn.Dropout()  # dropout
```
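One possible placement (a sketch of how the layer might be inserted, not the exact lab code; the class name LeNetDropout is illustrative): put Dropout between the flattened features and the final linear layer, with the default drop probability of 0.5.

```python
import torch.nn as nn


class LeNetDropout(nn.Module):
    """Sketch: the lab's LeNet with a Dropout layer before the fully connected layer."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(1, 16, 5, 1, 2), nn.ReLU(), nn.MaxPool2d(2))
        self.conv2 = nn.Sequential(nn.Conv2d(16, 32, 5, 1, 2), nn.ReLU(), nn.MaxPool2d(2))
        self.drop = nn.Dropout(p=0.5)   # drop activations with probability 0.5
        self.out = nn.Linear(32 * 7 * 7, 12)

    def forward(self, x):
        x = self.conv2(self.conv1(x))
        x = x.view(x.size(0), -1)
        x = self.drop(x)                # applied just before the linear layer
        return self.out(x)
```

Note that a model containing Dropout should be switched to evaluation mode with cnn.eval() before testing (and back with cnn.train() for training), so that dropout is disabled when measuring accuracy.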
epoch | Accuracy with Dropout | Accuracy without Dropout |
---|---|---|
8 | 0.972222 | 0.974444 |
10 | 0.9688889 | 0.987777 |
12 | 0.9783333 | 0.98 |
15 | 0.975 | 0.979 |
20 | 0.9794 | 0.983888 |
Early stopping
After each epoch, the accuracy of that epoch is compared with the accuracy of the previous epoch; when the accuracy drops, training stops immediately, which helps prevent overfitting.
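A minimal sketch of this rule inside the training loop (the `evaluate` helper is hypothetical, standing for the per-epoch test-set evaluation already present in train.py; `EPOCH`, `cnn` and `test_loader` are the names used there):

```python
prev_acc = 0.0
for epoch in range(EPOCH):
    # ... one epoch of training, as in train.py ...
    acc = evaluate(cnn, test_loader)   # hypothetical helper returning test accuracy
    if acc < prev_acc:                 # accuracy fell compared with the previous epoch
        print("Early stopping at epoch", epoch + 1)
        break
    prev_acc = acc
```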
Different optimizers
I tried several different optimizers, such as stochastic gradient descent and mini-batch gradient descent. With all other settings equal and EPOCH set to 20, there was no obvious difference in accuracy.
```python
# In the lines below, params stands for cnn.parameters()
optimizer = torch.optim.Adam(cnn.parameters(), lr=LR, weight_decay=1e-5)  # Adam (with L2 regularization)
optimizer = torch.optim.SGD(cnn.parameters(), lr=LR)  # stochastic gradient descent
optimizer = torch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)  # averaged stochastic gradient descent
optimizer = torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0)  # AdaGrad
optimizer = torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)  # Adadelta (adaptive learning rate)
optimizer = torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)  # RMSprop
optimizer = torch.optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)  # Adamax (infinity-norm variant of Adam)
optimizer = torch.optim.SparseAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08)  # SparseAdam
optimizer = torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-05, tolerance_change=1e-09, history_size=100, line_search_fn=None)  # L-BFGS
optimizer = torch.optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))  # resilient back-propagation (Rprop)
```
Batch Normalization
When constructing the CNN, a BatchNorm2d layer is added after the convolution layer:
```python
nn.BatchNorm2d(16),  # batch normalization
```
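For example, the first convolution block could become the following (a sketch; the 16 matches conv1's number of output channels in model.py):

```python
import torch.nn as nn

# conv1 from model.py with a BatchNorm2d layer inserted after the convolution
conv1 = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride=1, padding=2),
    nn.BatchNorm2d(16),  # normalizes each of the 16 output channels over the batch
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)
```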
epoch | Accuracy after Batch Normalization | Accuracy without Batch Normalization |
---|---|---|
8 | 0.9661111 | 0.974444 |
10 | 0.9711111 | 0.987777 |
12 | 0.9816666 | 0.98 |
15 | 0.9833333 | 0.979 |
20 | 0.9877777 | 0.983888 |
Understanding of network design
Introduction to CNN structure
A simple CNN structure works as follows. The input image in the first layer undergoes a convolution operation, producing a second layer of feature maps with depth 3. Pooling the second-layer feature maps gives the third layer, also of depth 3. Repeating these operations yields a fifth layer with 5 feature maps. Finally the five feature maps, i.e. five matrices, are unrolled row by row and concatenated into a vector, which is passed to the fully connected layer; the fully connected layer is a BP neural network. Each feature map can be regarded as neurons arranged in matrix form. The computation of convolution and pooling is described below.
Convolution
An input image is converted into a matrix whose elements are the corresponding pixel values. Suppose we have a 5 * 5 image and convolve it with a 3 * 3 convolution kernel (also called a filter) to obtain a 3 * 3 feature map. The kernel slides over the input matrix; at each position the overlapping numbers are multiplied and summed to produce one element of the feature-map matrix. In this example the kernel slides one unit at a time, but the stride can be adjusted as needed. If the stride is greater than 1, the kernel may not reach the edge of the matrix; in that case the outermost layers of the matrix can be padded with zeros.
In general, the input image matrix, the convolution kernel and the resulting feature map are all square. Let the size of the input matrix be $w$, the kernel size be $k$, the stride be $s$, and the number of zero-padding layers be $p$. The size of the feature map produced by the convolution is then:
$$w' = \frac{w + 2p - k}{s} + 1$$
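Applying this formula to the lab's network (a 28 * 28 grayscale input, which is what the 32 * 7 * 7 flatten size in model.py implies) with $k = 5$, $p = 2$, $s = 1$ gives

$$w' = \frac{28 + 2\cdot 2 - 5}{1} + 1 = 28,$$

so each convolution preserves the spatial size; the two 2 * 2 pooling layers then halve it twice, 28 → 14 → 7, which is why the linear layer takes 32 * 7 * 7 inputs.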
The above describes convolving one feature map with a single kernel. To extract more features, several kernels can be applied, giving several feature maps. For a three-channel color image, or for a stack of feature maps such as the third layer above, the input is a group of matrices and the kernel is no longer a single layer but has the corresponding depth.
Pooling
Pooling, also called down-sampling (as opposed to up-sampling), is usually applied to the feature maps produced by convolution to reduce the amount of data. Like convolution, pooling uses a sliding kernel, often called a sliding window. With a 2 * 2 window and a stride of 2, taking the maximum value of each region as the output is called Max Pooling; outputting the mean value instead is called Mean Pooling.
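A tiny numeric sketch of 2 * 2 Max Pooling with stride 2 (the values are arbitrary, chosen only for illustration):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 3., 2., 0.],
                  [4., 6., 5., 1.],
                  [7., 2., 8., 3.],
                  [0., 9., 4., 4.]]).reshape(1, 1, 4, 4)  # [batch, channel, H, W]

# Each 2x2 region is reduced to its maximum value
print(F.max_pool2d(x, kernel_size=2))  # tensor([[[[6., 5.], [9., 8.]]]])
```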
Fully connected layer
After several layers of convolution and pooling, the resulting feature maps are unrolled row by row, concatenated into a vector, and fed into the fully connected network.
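For the lab's network this flattening turns the [batch, 32, 7, 7] output of conv2 into a [batch, 1568] matrix (32 * 7 * 7 = 1568), which is exactly what nn.Linear(32 * 7 * 7, 12) expects. A small sketch:

```python
import torch

x = torch.randn(5, 32, 7, 7)   # 32 feature maps of size 7 * 7, batch of 5
flat = x.view(x.size(0), -1)   # unroll each sample into a vector
print(flat.shape)              # torch.Size([5, 1568])
```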
Comparison between CNN and fully connected (BP) networks
**Too many parameters** In a fully connected neural network every node in one layer is connected to every node in the next layer, so the number of parameters grows as the product of the layer sizes. This is far too large for image classification, where each image contributes hundreds of pixels or more as input. CNN solves this with locally connected convolutions, weight sharing within the same local region, and pooling.
**The position information between pixels is not used** In a fully connected neural network, all nodes in a layer appear interchangeable, because each connects indiscriminately to every node of the previous and next layers. In an image, however, a pixel is usually closely related to the pixels around it. CNN exploits this through locally connected convolutions.
**Limit on the number of layers** Because of the limitations of gradient descent, it is hard to make a fully connected neural network very deep: the vanishing-gradient problem appears, which CNN mitigates by several means, such as the ReLU activation.