DenseNet Paper Details and PyTorch Replication

DenseNet

1. ResNet and DenseNet

ResNet (Deep Residual Network): establishes "shortcut connections" between earlier and later layers, which helps gradients propagate backward during training and makes it possible to train much deeper CNNs.

DenseNet: uses a dense connection mechanism in which every layer is connected to all the layers before it. Each layer concatenates the feature maps of all preceding layers along the channel dimension and uses the result as its input, achieving feature reuse.

Figure 1 Shortcut connection mechanism of the ResNet network (where + denotes element-wise addition)

This not only alleviates gradient vanishing, but also allows DenseNet to achieve better performance than ResNet with fewer parameters and less computation.


Figure 2 Dense connection mechanism of the DenseNet network (where C denotes channel-wise concatenation)

Formula representation

In a traditional network, the output of layer $l$ is:
$$x_l = H_l(x_{l-1})$$
For ResNet, the input of the previous layer is added to the output:
$$x_l = H_l(x_{l-1}) + x_{l-1}$$
In DenseNet, the outputs of all preceding layers are concatenated and used as the input:
$$x_l = H_l([x_0, x_1, \dots, x_{l-1}])$$

$H_l(\cdot)$ denotes a non-linear transformation function, a composite operation that may include a series of BN (Batch Normalization), ReLU, Pooling, and Conv operations.
Rather than drawing an arrow from every layer to all later layers, feature transfer is realized by concatenating the feature maps of all preceding layers and passing the result directly to the next layer.
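The dense connectivity above can be sketched in a few lines of PyTorch; the helper function, channel counts, and loop below are chosen only for illustration and are not part of the full implementation given later.

import torch
import torch.nn as nn

# H_l: a composite function, here simply BN + ReLU + 3x3 Conv for illustration
def make_h(in_channels, growth_rate):
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False),
    )

growth_rate = 12
x0 = torch.randn(1, 24, 32, 32)                      # initial feature map with k0 = 24 channels
features = [x0]
for l in range(3):                                   # three dense layers
    h = make_h(24 + l * growth_rate, growth_rate)    # layer l+1 sees k0 + l*k input channels
    x_l = h(torch.cat(features, dim=1))              # x_l = H_l([x_0, ..., x_{l-1}])
    features.append(x_l)

print(torch.cat(features, dim=1).shape)              # torch.Size([1, 60, 32, 32]): 24 + 3*12 channels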

Network structure

DenseNet's dense connections require the concatenated feature maps to have the same size, so the network adopts the DenseBlock + Transition structure.
A DenseBlock is a module containing many layers; all layers in a block produce feature maps of the same size and are densely connected to one another.

The Transition module connects two adjacent DenseBlocks and reduces the feature map size through pooling.

Figure 4 DenseNet network using the DenseBlock+Transition structure

Figure 5 Network structure of DenseNet

DenseBlock

In a DenseBlock, the feature maps of every layer have the same size and can be concatenated along the channel dimension. The non-linear composite function $H_l(\cdot)$ in a DenseBlock uses the structure $BN + ReLU + 3\times 3\,Conv$.
Assume the feature map entering the DenseBlock has $k_0$ channels and that each layer in the block outputs $k$ feature maps after its convolution; then layer $l$ receives $k_0 + (l-1)k$ input channels. We call $k$ the growth rate of the network.
Because each layer receives the feature maps of all preceding layers (the features of all previous layers are concatenated and passed directly to the next layer), a small growth rate $k$ (such as 12) is generally sufficient; note that $k$ is the number of features newly extracted by each layer.
A DenseBlock places the activation before the convolution, i.e. BN-ReLU-Conv, which is known as pre-activation. The usual order puts the activation after the convolution and batch normalization, i.e. Conv-BN-ReLU, known as post-activation. The authors show that performance deteriorates if the post-activation design is used.
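A minimal sketch of the two orderings (the channel counts are chosen only for illustration):

import torch.nn as nn

# Pre-activation, as used inside a DenseBlock: BN -> ReLU -> Conv
pre_activation = nn.Sequential(
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 32, kernel_size=3, padding=1, bias=False),
)

# Post-activation, the more common ordering: Conv -> BN -> ReLU
post_activation = nn.Sequential(
    nn.Conv2d(64, 32, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
)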

Bottleneck layer

Since the input to later layers grows very large, a bottleneck layer can be used inside the DenseBlock to reduce computation, mainly by adding a 1x1 Conv in front of the original structure, i.e.
$$BN + ReLU + 1\times 1\,Conv + BN + ReLU + 3\times 3\,Conv$$
This is called the DenseNet-B structure. Here the 1x1 Conv produces $4k$ feature maps, which reduces the number of input features and improves computational efficiency.

The number of output channels of every bottleneck layer is the same.
The 1x1 convolution fixes the number of output channels and thereby reduces the dimensionality; its output channel count is usually four times the growth rate. When dozens of bottleneck layers are stacked, the number of channels after concatenation grows into the hundreds or thousands, and without the 1x1 convolution to reduce dimensions first, the parameter count of the following 3x3 convolution would increase dramatically.
For example, with 64 input channels and growth rate k=32, the number of channels after 15 bottleneck layers is 64 + 15*32 = 544.
Without the 1x1 convolution, the 16th bottleneck layer needs 3*3*544*32 = 156,672 parameters.
With the 1x1 convolution, it needs 1*1*544*128 + 3*3*128*32 = 106,496 parameters; the parameter count is greatly reduced.
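The parameter counts above can be verified with a few lines of Python; the numbers simply follow the example (they are not results from the paper):

in_channels = 64 + 15 * 32               # 544 channels after 15 bottleneck layers
growth_rate = 32
bottleneck_channels = 4 * growth_rate    # the 1x1 conv outputs 4k = 128 channels

# 3x3 conv applied directly to all 544 input channels
params_plain = 3 * 3 * in_channels * growth_rate
# 1x1 conv down to 128 channels, then 3x3 conv on those 128 channels
params_bottleneck = 1 * 1 * in_channels * bottleneck_channels + 3 * 3 * bottleneck_channels * growth_rate

print(params_plain)        # 156672
print(params_bottleneck)   # 106496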

Transition Layer

The Transition layer connects two adjacent DenseBlocks and reduces the feature map size. It consists of a 1x1 convolution followed by a 2x2 average pooling, structured as
$$BN + ReLU + 1\times 1\,Conv + 2\times 2\,AvgPooling$$
The Transition layer also acts as a model compressor. Suppose the DenseBlock in front of a Transition layer produces feature maps with $m$ channels; the Transition layer generates $\lfloor \theta m \rfloor$ feature maps (through its convolution layer), where $\theta \in (0, 1]$ is the compression factor. When $\theta < 1$, the structure is called DenseNet-C, and $\theta = 0.5$ is commonly used. A network whose DenseBlocks use bottleneck layers and whose Transition layers use $\theta < 1$ is called DenseNet-BC.
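A quick sketch of the compression (the channel count m and the variable names are illustrative):

import torch.nn as nn

theta = 0.5
m = 512                                   # channels produced by the preceding DenseBlock
out_channels = int(theta * m)             # 256 channels after compression

transition = nn.Sequential(
    nn.BatchNorm2d(m),
    nn.ReLU(inplace=True),
    nn.Conv2d(m, out_channels, kernel_size=1, bias=False),
    nn.AvgPool2d(kernel_size=2, stride=2),    # halves the spatial size
)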

Experimental results and discussion

The paper describes four network configurations (DenseNet-121, -169, -201, and -264), as shown in the table:

Comparison of DenseNet with ResNet on the CIFAR-100 and ImageNet datasets.
As Figure 8 shows, DenseNet-100 with only 0.8M parameters already outperforms ResNet-1001, which has 10.2M parameters. As Figure 9 shows, DenseNet also outperforms ResNet when the two have the same number of parameters.

Figure 8 ResNet vs DenseNet on CIFAR-100 dataset

Figure 9 ResNet vs DenseNet on ImageNet dataset

Advantages of DenseNet

1. Stronger gradient flow:
DenseNet strengthens the backward flow of gradients through its dense connections, making the network easier to train. Because every layer has a direct path to the final error signal, an implicit "deep supervision" is achieved: error signals propagate easily to earlier layers, so those layers are effectively supervised directly by the final classification layer.
The vanishing-gradient problem is alleviated. The deeper a network is, the more likely vanishing gradients become, because input and gradient information must pass through many layers. The dense connections effectively wire the input and the loss directly to every layer, which alleviates gradient vanishing and makes very deep networks practical.

2. Fewer parameters

3. Low-level features are preserved

In a standard convolutional network, the final classifier only uses the highest-level features extracted by the last layer.

DenseNet, by contrast, uses features of all levels of complexity, which tends to give smoother decision boundaries. This also explains why DenseNet still performs well when training data is insufficient.


Deficiencies of DenseNet

The drawback of DenseNet is that its many concatenation operations require the data to be copied repeatedly, so GPU memory usage can grow very quickly and some memory-optimization techniques are needed. In addition, DenseNet is a rather specialized architecture while ResNet is more general, so ResNet has a wider range of applications.
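One common memory optimization is gradient checkpointing, which recomputes intermediate features during the backward pass instead of storing them (torchvision's memory-efficient DenseNet variant takes this approach). A minimal sketch, where checkpointed_dense_layer is a hypothetical helper wrapping a dense layer such as the _DenseLayer defined below:

import torch.utils.checkpoint as cp

def checkpointed_dense_layer(layer, features):
    # Trade extra computation for lower memory: the layer's activations are
    # recomputed during backward instead of being kept in GPU memory.
    return cp.checkpoint(layer, features)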

PyTorch implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import OrderedDict

class _DenseLayer(nn.Sequential):
    """Bottleneck dense layer (DenseNet-B): BN + ReLU + 1x1 Conv + BN + ReLU + 3x3 Conv."""
    def __init__(self, num_input_features, growth_rate, bn_size, drop_rate):
        super(_DenseLayer, self).__init__()
        # 1x1 bottleneck conv reduces the input to bn_size * growth_rate channels
        self.add_module('norm1', nn.BatchNorm2d(num_input_features))
        self.add_module('relu1', nn.ReLU(inplace=True))
        self.add_module('conv1', nn.Conv2d(num_input_features, bn_size * growth_rate,
                                           kernel_size=1, stride=1, bias=False))
        # 3x3 conv produces growth_rate new feature maps
        self.add_module('norm2', nn.BatchNorm2d(bn_size * growth_rate))
        self.add_module('relu2', nn.ReLU(inplace=True))
        self.add_module('conv2', nn.Conv2d(bn_size * growth_rate, growth_rate,
                                           kernel_size=3, stride=1, padding=1, bias=False))

        self.drop_rate = drop_rate

    def forward(self, x):
        new_features = super(_DenseLayer, self).forward(x)
        if self.drop_rate > 0:
            new_features = F.dropout(new_features, p=self.drop_rate, training=self.training)
        return torch.cat([x, new_features], 1)

class _DenseBlock(nn.Sequential):
    def __init__(self, num_layers, num_input_features, bn_size, growth_rate, drop_rate):
        super(_DenseBlock, self).__init__()
        for i in range(num_layers):
            # Layer i receives the block input plus the growth_rate channels added by each previous layer
            layer = _DenseLayer(num_input_features + i * growth_rate, growth_rate, bn_size, drop_rate)
            self.add_module('denselayer%d' % (i+1), layer)


class _Transition(nn.Sequential):
    def __init__(self, num_input_features, num_output_features):
        super(_Transition, self).__init__()
        self.add_module('norm', nn.BatchNorm2d(num_input_features))
        self.add_module('relu', nn.ReLU(inplace=True))
        self.add_module('conv', nn.Conv2d(num_input_features, num_output_features,
                                          kernel_size=1, stride=1, bias=False))
        self.add_module('pool', nn.AvgPool2d(kernel_size=2, stride=2))

class DenseNet(nn.Module):
    def __init__(self, growth_rate=32, block_config=(6, 12, 24, 16),
                 num_init_features=64, bn_size=4, drop_rate=0, num_classes=1000):
        super(DenseNet, self).__init__()

        # First convolution
        self.features = nn.Sequential(OrderedDict([
            ('conv0', nn.Conv2d(3, num_init_features, kernel_size=7, stride=2, padding=3, bias=False)),
            ('norm0', nn.BatchNorm2d(num_init_features)),
            ('relu0', nn.ReLU(inplace=True)),
            ('pool0', nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
        ]))

        # Each denseblock
        num_features = num_init_features
        for i, num_layers in enumerate(block_config):
            block = _DenseBlock(num_layers=num_layers, num_input_features=num_features,
                                bn_size=bn_size, growth_rate=growth_rate, drop_rate=drop_rate)
            self.features.add_module('denseblock%d' % (i+1), block)
            num_features = num_features + num_layers * growth_rate
            if i != len(block_config) - 1:
                trans = _Transition(num_input_features=num_features, num_output_features=num_features // 2)
                self.features.add_module('transition%d' % (i+1), trans)
                num_features = num_features // 2

        # Final batch norm
        self.features.add_module('norm5', nn.BatchNorm2d(num_features))

        # Linear layer
        self.classifier = nn.Linear(num_features, num_classes)

        # Official init from torch repo.
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight)
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                m.bias.data.zero_()

    def forward(self, x):
        features = self.features(x)
        out = F.relu(features, inplace=True)
        # Global average pooling over the final 7x7 feature map, then flatten and classify
        out = F.avg_pool2d(out, kernel_size=7, stride=1).view(features.size(0), -1)
        out = self.classifier(out)
        return out

def densenet121(**kwargs):
    model = DenseNet(num_init_features=64, growth_rate=32, block_config=(6, 12, 24, 16), **kwargs)
    return model

def densenet169(**kwargs):
    model = DenseNet(num_init_features=64, growth_rate=32, block_config=(6, 12, 32, 32), **kwargs)
    return model

def densenet201(**kwargs):
    model = DenseNet(num_init_features=64, growth_rate=32, block_config=(6, 12, 48, 32), **kwargs)
    return model

def densenet264(**kwargs):
    model = DenseNet(num_init_features=64, growth_rate=32, block_config=(6, 12, 64, 48), **kwargs)
    return model

if __name__ == '__main__':
    # Available constructors: DenseNet, densenet121, densenet169, densenet201, densenet264
    # Example
    net = DenseNet()
    print(net)
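    # Quick sanity check: a 224x224 input matches the 7x7 average pooling in forward,
    # so the classifier receives a flat feature vector.
    x = torch.randn(1, 3, 224, 224)
    print(densenet121()(x).shape)  # expected: torch.Size([1, 1000])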

Content reference:
Station B: https://www.bilibili.com/video/BV1Ly4y1z7Gh?spm_id_from=333.999.0.0
https://www.cnblogs.com/lyp1010/p/11820967.html
