Week 5 assignment: convolutional neural network (Part3)



HybridSN hyperspectral classification

Problems and feelings


This week we study MobileNet V1 and MobileNet V2.

MobileNet was proposed to reduce the number of parameters of a neural network and make it faster without losing much accuracy. The basic idea is to decompose a standard convolution into a depthwise convolution followed by a pointwise convolution, which reduces both the trainable parameters and the computational complexity.

Specifically, the depthwise convolution convolves each input channel separately, and the pointwise convolution uses a 1 * 1 convolution to fuse the feature maps produced by the depthwise convolution across channels. For example, with a 3-channel input image, a 64-channel output feature map and 3 * 3 kernels, a traditional convolution needs 3 * 3 * 3 * 64 = 1728 parameters, while the MobileNet form needs only 3 * 3 * 3 + 3 * 64 = 219 parameters. The number of parameters is greatly reduced.
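The arithmetic above can be checked directly in PyTorch. The following is a minimal sketch, assuming 3 * 3 kernels and no bias terms; the helper name count_params is made up for this illustration.

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    """Total number of trainable parameters in a module (hypothetical helper)."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

in_ch, out_ch, k = 3, 64, 3

# Standard 3x3 convolution: every output channel sees every input channel.
standard = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=1, bias=False)

# Depthwise (one 3x3 kernel per input channel) followed by pointwise (1x1).
depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=k, padding=1, groups=in_ch, bias=False)
pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

standard_params = count_params(standard)
separable_params = count_params(depthwise) + count_params(pointwise)
print(standard_params, separable_params)
```

The two printed numbers correspond to the 1728 and 219 computed by hand.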

The code analysis follows. The code is from the OucTheoryGroup. The data processing and training code is the same as for the previous networks, so the focus here is on the implementation of the depthwise and pointwise convolutions.

class Block(nn.Module):
    '''Depthwise conv + Pointwise conv'''
    def __init__(self, in_planes, out_planes, stride=1):
        super(Block, self).__init__()
        # Depthwise convolution: 3 * 3 kernel with groups=in_planes, so each input channel is convolved separately
        self.conv1 = nn.Conv2d(in_planes, in_planes, kernel_size=3, stride=stride, padding=1, groups=in_planes, bias=False)
        self.bn1 = nn.BatchNorm2d(in_planes)
        # Pointwise convolution, 1 * 1 convolution kernel
        self.conv2 = nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=1, padding=0, bias=False)
        self.bn2 = nn.BatchNorm2d(out_planes)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        return out

conv1 is the depthwise convolution. It is implemented with the groups parameter of PyTorch's nn.Conv2d. groups splits the input feature map into that many groups along the channel dimension, and each group is processed by its own subset of the convolution kernels. The number of kernels per group equals the number of output channels divided by the number of groups. Finally, the outputs of all groups are concatenated along the channel dimension to form the output feature map. If we set the number of output channels equal to the number of input channels and set groups to the number of input channels, each input channel gets exactly one convolution kernel, which is depthwise convolution.
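The effect of groups is easiest to see from the weight shapes. The sketch below uses small, arbitrary channel counts (4 in, 8 out) chosen only for illustration. It also checks that in the depthwise case one output channel depends on exactly one input channel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 4, 8, 8)  # (batch, channels, height, width)

# groups=2: the 4 input channels are split into 2 groups of 2 channels,
# and each group produces 8 / 2 = 4 of the output channels.
grouped = nn.Conv2d(4, 8, kernel_size=3, padding=1, groups=2, bias=False)
# Weight layout is (out_channels, in_channels / groups, kH, kW).
grouped_weight_shape = tuple(grouped.weight.shape)

# groups == in_channels == out_channels: one 3x3 kernel per channel (depthwise).
depthwise = nn.Conv2d(4, 4, kernel_size=3, padding=1, groups=4, bias=False)
depthwise_weight_shape = tuple(depthwise.weight.shape)

# Depthwise check: output channel 2 is input channel 2 convolved with kernel 2.
with torch.no_grad():
    manual = F.conv2d(x[:, 2:3], depthwise.weight[2:3], padding=1)
    same = torch.allclose(depthwise(x)[:, 2:3], manual, atol=1e-5)

print(grouped_weight_shape, depthwise_weight_shape, same)
```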

conv2 is the pointwise convolution. It uses a 1 * 1 kernel, and its number of output channels is the desired number of output features. This fuses the results of the depthwise convolution across channels.

The final training result is as follows:

Training for 10 epochs on the CIFAR-10 dataset reaches an accuracy of 76.5%.

MobileNet V2 is an improvement on MobileNet V1. MobileNet V1 does not use skip connections, and many of the parameters in its trained convolution kernels turn out to be empty. MobileNet V2 adds the idea of residual learning and proposes the inverted residual block.

The main idea of the inverted residual block is to first raise the dimension of the input features (increase the number of feature map channels) with a 1 * 1 convolution, apply the depthwise convolution in this wider space, and then reduce the dimension again with another 1 * 1 convolution. This is the reverse of the ResNet bottleneck, which reduces the dimension first and restores it afterwards. MobileNet V2 can afford this because depthwise convolution greatly reduces the number of parameters and the computational cost, so it does not need a 1 * 1 convolution to shrink the channels before the 3 * 3 convolution as ResNet does.
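To make the channel flow concrete, here is a small sketch that traces the channel count through the three convolutions of an inverted residual block. The sizes (24 channels, expansion factor 6) and the function name inverted_residual_convs are assumptions picked for illustration; batch normalization and ReLU are left out to keep it short.

```python
import torch
import torch.nn as nn

def inverted_residual_convs(in_ch: int, out_ch: int, expansion: int) -> nn.Sequential:
    """Expand (1x1) -> depthwise (3x3) -> project (1x1); BN and ReLU omitted."""
    hidden = in_ch * expansion
    return nn.Sequential(
        nn.Conv2d(in_ch, hidden, kernel_size=1, bias=False),                             # expand
        nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden, bias=False),  # depthwise
        nn.Conv2d(hidden, out_ch, kernel_size=1, bias=False),                            # project
    )

convs = inverted_residual_convs(in_ch=24, out_ch=24, expansion=6)
x = torch.randn(1, 24, 32, 32)

# Record the channel count after each convolution: narrow -> wide -> wide -> narrow.
channels = [x.shape[1]]
for conv in convs:
    x = conv(x)
    channels.append(x.shape[1])

n_params = sum(p.numel() for p in convs.parameters())
print(channels, n_params)
```

For comparison, a single ordinary 3 * 3 convolution over the 144 expanded channels would need 144 * 144 * 9 = 186624 weights on its own, which is why the expansion is only affordable with a depthwise convolution in the middle.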

The code of the changed part is as follows:

class Block(nn.Module):
    '''expand + depthwise + pointwise'''
    def __init__(self, in_planes, out_planes, expansion, stride):
        super(Block, self).__init__()
        self.stride = stride
        # Increase the number of feature map channels through expansion
        planes = expansion * in_planes
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, stride=1, padding=0, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, padding=1, groups=planes, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, out_planes, kernel_size=1, stride=1, padding=0, bias=False)
        self.bn3 = nn.BatchNorm2d(out_planes)

        # When the stride is 1 and the input and output channel counts differ, use a 1 * 1 convolution to change the number of channels
        if stride == 1 and in_planes != out_planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=1, padding=0, bias=False))
        # When the stride is 1 and the input and output channel counts are the same, the input is passed through unchanged
        if stride == 1 and in_planes == out_planes:
            self.shortcut = nn.Sequential()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        # Stride 1: add the shortcut (residual connection)
        if self.stride == 1:
            return out + self.shortcut(x)
        # Stride 2: output directly, without a shortcut
        return out

The main changes are planes and shortcut. planes is the number of output channels of the first convolution; its value is the number of input channels multiplied by the expansion factor, so it raises the dimension. shortcut is the residual learning part. Because the block may change the number of channels, the input and output channel counts can differ, so a 1 * 1 convolution is used to match them. These improvements make it better than MobileNet V1.

Similarly, after training for 10 epochs the accuracy exceeds 80%.

HybridSN hyperspectral classification

What distinguishes this network is its use of 3D convolution. The convolutions we used before were 2D, implemented with nn.Conv2d in the PyTorch API. What is the difference between 3D and 2D convolution? In short, it is the dimensionality of the feature map output by a single convolution kernel.

For a single convolution kernel, 2D convolution means that the output feature map is 2-dimensional. For example, given a 3-channel input image, the feature map produced by one kernel has only one channel (the kernel itself is a 3 * 3 * 3 tensor, and convolving it with one patch of the image outputs a single value), that is, a two-dimensional feature map.

3D convolution means that the output of one kernel has three dimensions, which also means that the input is no longer three-dimensional but four-dimensional (not counting the batch dimension). For example, an input of shape (100, 3, 512, 512) means 100 images of size 512 * 512 * 3. What does the 100 mean? Generally speaking, it is a "sequence". For example, a video is a sequence of images, and a CT scan in medical imaging is also a sequence. The input of a 3D convolution therefore has four dimensions (sequence, feature channel, height and width), and the kernel slides along the sequence, height and width. Note that PyTorch's nn.Conv3d expects the channel dimension before the sequence dimension, so such an input would be stored as (3, 100, 512, 512).

In other words, 3D convolution does not treat the images in a sequence as independent. It looks at several neighbouring images at once, so features are computed and fused along the sequence dimension. In video data, the features at the same position in several frames are considered together; in medical images, the features within a spatial range are considered together (the sequence of a video represents time, and the sequence of a CT scan represents space).

In practice we use PyTorch's nn.Conv3d, whose kernel size has three components (sequence size, height and width). The sequence size is how many images the kernel processes at once. The other two are the spatial size (window size) of the kernel, as in 2D convolution.
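A short sketch of nn.Conv3d in use may help here. The input size (1 channel, 30 sequence steps, 25 * 25 pixels) and the kernel size (7, 3, 3) are assumptions chosen for illustration, not values stated in the text above.

```python
import torch
import torch.nn as nn

# nn.Conv3d expects input of shape (batch, channels, depth, height, width),
# where "depth" is the sequence dimension (frames, CT slices, spectral bands).
x = torch.randn(2, 1, 30, 25, 25)

# kernel_size = (depth, height, width): each kernel spans 7 sequence steps
# and a 3x3 spatial window.
conv = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=(7, 3, 3))
y = conv(x)

out_shape = tuple(y.shape)
print(out_shape)
```

Without padding, each dimension shrinks by the kernel size minus one, so the sequence dimension goes from 30 to 24 and the spatial dimensions from 25 to 23. Each of the 8 kernels produces a three-dimensional output of size 24 * 23 * 23.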

The code is again taken from the OucTheoryGroup. The completed parts are as follows:

class_num = 16

class HybridSN(nn.Module):
  def __init__(self):
    self.conv3d_part = nn.Sequential(
    self.conv2d_part = nn.Sequential(
    self.linear_part = nn.Sequential(nn.Linear(18496,256),

    self.out = nn.Linear(128,class_num)
  def forward(self,x):
    out_3d = self.conv3d_part(x)
    out_3d = nn.Flatten(-2)(out_3d.permute(0,-2,-1,1,2))
    ipt_2d = out_3d.permute(0,-1,1,2)
    out_2d = self.conv2d_part(ipt_2d)
    out_linear = self.linear_part(out_2d)
    output = self.out(out_linear)
    return output
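The listing shows only the outline of the three parts. The following is one possible way to fill them in, not the author's actual code. It assumes the layer sizes commonly used for HybridSN (three 3D convolutions with 8, 16 and 32 kernels, one 2D convolution with 64 kernels, dropout of 0.4) and an input patch of shape (1, 30, 25, 25). These assumptions are consistent with the 18496 and 128 that appear in the listing. The class name HybridSNSketch is made up for this example.

```python
import torch
import torch.nn as nn

class_num = 16

class HybridSNSketch(nn.Module):
    """Assumed layer sizes; expects input of shape (batch, 1, 30, 25, 25)."""
    def __init__(self):
        super().__init__()
        self.conv3d_part = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(7, 3, 3)), nn.ReLU(),
            nn.Conv3d(8, 16, kernel_size=(5, 3, 3)), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=(3, 3, 3)), nn.ReLU(),
        )
        # 32 channels * 18 remaining bands = 576 input channels for the 2D part
        self.conv2d_part = nn.Sequential(
            nn.Conv2d(576, 64, kernel_size=3), nn.ReLU(),
            nn.Flatten(),  # 64 * 17 * 17 = 18496
        )
        self.linear_part = nn.Sequential(
            nn.Linear(18496, 256), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.4),
        )
        self.out = nn.Linear(128, class_num)

    def forward(self, x):
        out_3d = self.conv3d_part(x)                               # (B, 32, 18, 19, 19)
        out_3d = nn.Flatten(-2)(out_3d.permute(0, -2, -1, 1, 2))   # (B, 19, 19, 576)
        ipt_2d = out_3d.permute(0, -1, 1, 2)                       # (B, 576, 19, 19)
        out_2d = self.conv2d_part(ipt_2d)                          # (B, 18496)
        out_linear = self.linear_part(out_2d)                      # (B, 128)
        return self.out(out_linear)                                # (B, class_num)

model = HybridSNSketch().eval()
with torch.no_grad():
    y = model(torch.randn(2, 1, 30, 25, 25))
print(tuple(y.shape))
```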

Let's check the input and output sizes against the content on the teacher's blog.

Input and output of our code  

The first dimension is batch_size. You can see that the feature dimensions are consistent.

After the 3D convolutions, we need to rearrange the feature dimensions for the 2D convolution.

We use permute to move the channel and sequence dimensions of the features to the end.

Then we flatten the last two dimensions into one.

Finally, we use permute again to move the merged dimension to the second position, the channel position, so that it matches the description in the teacher's blog.
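The three reshaping steps can be traced on a dummy tensor. This is a sketch under an assumed 3D-convolution output of shape (2, 32, 18, 19, 19); the sizes are illustrative. It also shows that a single reshape gives the same result, because the channel and sequence dimensions are already next to each other.

```python
import torch

# (batch, channels, bands, height, width), assumed sizes
out_3d = torch.randn(2, 32, 18, 19, 19)

step1 = out_3d.permute(0, -2, -1, 1, 2)      # move channels and bands to the end
step2 = torch.flatten(step1, start_dim=-2)   # merge them into one dimension
ipt_2d = step2.permute(0, -1, 1, 2)          # move the merged dimension to position 1

shapes = [tuple(t.shape) for t in (step1, step2, ipt_2d)]
print(shapes)

# Merging the adjacent channel and band dimensions directly is equivalent.
same = torch.equal(ipt_2d, out_3d.reshape(2, 32 * 18, 19, 19))
print(same)
```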

Then comes the 2D convolution:

The number of output features matches the teacher's blog. A linear mapping then outputs the classification result. We can look at the final output dimension of the network:

It meets the requirements.

The remaining code, which builds the dataset and trains the network, is taken directly from the teacher's blog. The final results are as follows:

The accuracy is about 79%, which is not as good as the result on the teacher's blog, but still reasonable.

Problems and feelings

Going forward, we should practice writing more code, and at least improve some of the current results.

The result of cat and dog classification using MobileNet is not good. The main reason may be that the network designed for the CIFAR-10 dataset was applied directly, and the cat and dog dataset was not processed appropriately. The network structure needs to be adapted to the cat and dog dataset, which will hopefully give better results.

3D convolution is interesting. Although the network model meets the requirements, it is not clear why the results differ so much for the same number of training epochs. Perhaps the cause is initialization. It will take some time later to improve this.

Tags: neural networks Pytorch Deep Learning

Posted on Sun, 03 Oct 2021 21:41:04 -0400 by latino.ad7