Let's focus on a local part of the neural network. As shown in the figure below, assume that the original input is $\mathbf{x}$ and that the ideal mapping we want to learn is $f(\mathbf{x})$ (used as the input to the activation function at the top of the figure). The part inside the dotted-line box on the left of the figure needs to fit the mapping $f(\mathbf{x})$ directly, while the part inside the dotted-line box on the right needs to fit the residual mapping $f(\mathbf{x})-\mathbf{x}$. In practice, the residual mapping is often easier to optimize. Take the identity mapping mentioned at the beginning of this section as the ideal mapping $f(\mathbf{x})$: we only need to set the weight and bias parameters of the upper weighted operation (such as an affine transformation) inside the right dotted-line box to zero, and $f(\mathbf{x})$ becomes the identity mapping. In practice, when the ideal mapping $f(\mathbf{x})$ is very close to the identity mapping, the residual mapping also makes it easier to capture small fluctuations around the identity mapping. The right figure shows the basic building block of ResNet, the residual block. In a residual block, the input can propagate forward faster through the skip connection across layers.
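To make this concrete, here is a minimal sketch (not from the original text; the layer size of 4 is chosen arbitrarily for illustration) showing that when the weights and bias of the residual branch are set to zero, the block reduces to the identity mapping:

import torch
from torch import nn

# Toy residual block: output = g(x) + x, where g is the residual branch.
branch = nn.Linear(4, 4)
nn.init.zeros_(branch.weight)  # zero weights ...
nn.init.zeros_(branch.bias)    # ... and zero bias make g(x) = 0

x = torch.randn(2, 4)
y = branch(x) + x              # the skip connection adds the input back
print(torch.allclose(y, x))    # True: the block acts as the identity mapping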
ResNet follows VGG's design of full $3\times 3$ convolutional layers. A residual block first has two $3\times 3$ convolutional layers with the same number of output channels. Each convolutional layer is followed by a batch normalization layer and a ReLU activation function. The skip connection then bypasses these two convolution operations and adds the input directly before the final ReLU activation function. This design requires the output of the two convolutional layers to have the same shape as the input, so that they can be added. If we want to change the number of channels, we need to introduce an additional $1\times 1$ convolutional layer that transforms the input into the required shape before the addition. The residual block is implemented as follows:
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l


class Residual(nn.Module):
    def __init__(self, input_channels, num_channels,
                 use_1x1conv=False, strides=1):
        super().__init__()
        # Two 3x3 convolutions with the same number of output channels
        self.conv1 = nn.Conv2d(input_channels, num_channels,
                               kernel_size=3, padding=1, stride=strides)
        self.conv2 = nn.Conv2d(num_channels, num_channels,
                               kernel_size=3, padding=1)
        # Optional 1x1 convolution to match channels/resolution on the shortcut
        if use_1x1conv:
            self.conv3 = nn.Conv2d(input_channels, num_channels,
                                   kernel_size=1, stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(num_channels)
        self.bn2 = nn.BatchNorm2d(num_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        Y += X  # add the (possibly transformed) input before the final ReLU
        return F.relu(Y)
As shown in the figure below, this code generates two kinds of networks: one where, with use_1x1conv=False, the input is added to the output before the ReLU nonlinearity is applied; and one where, with use_1x1conv=True, a $1\times 1$ convolution is used to adjust the channels and the resolution before the addition.
Let's see if the input and output shapes are consistent.
blk = Residual(3, 3)
X = torch.rand(4, 3, 6, 6)
Y = blk(X)
Y.shape
torch.Size([4, 3, 6, 6])
We can also increase the number of output channels and halve the output height and width.
blk = Residual(3, 6, use_1x1conv=True, strides=2)
blk(X).shape
torch.Size([4, 6, 3, 3])
ResNet model
The first two layers of ResNet are a $7\times 7$ convolutional layer with 64 output channels and a stride of 2, followed by a $3\times 3$ max pooling layer with a stride of 2. The difference from earlier architectures is that ResNet adds a batch normalization layer after each convolutional layer.
b1 = nn.Sequential(nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
                   nn.BatchNorm2d(64), nn.ReLU(),
                   nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
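As a quick check (this snippet is not part of the original text), we can pass a dummy single-channel $224\times 224$ input through b1 and confirm that it produces 64 channels at one quarter of the original resolution:

X = torch.rand(1, 1, 224, 224)  # dummy input for shape checking
print(b1(X).shape)              # expected: torch.Size([1, 64, 56, 56])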
ResNet uses four modules made up of residual blocks, and each module uses several residual blocks with the same number of output channels. The number of channels of the first module equals the number of input channels. Since a max pooling layer with a stride of 2 has already been used, there is no need to reduce the height and width here. In the first residual block of each subsequent module, the number of channels is doubled compared with the previous module, and the height and width are halved.
Let's implement this module. Note that the first module receives special treatment.
def resnet_block(input_channels, num_channels, num_residuals,
                 first_block=False):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            # First block of a later module: change channels and halve height/width
            blk.append(Residual(input_channels, num_channels,
                                use_1x1conv=True, strides=2))
        else:
            blk.append(Residual(num_channels, num_channels))
    return blk
Then we add all the residual blocks to ResNet. Here, each module uses two residual blocks.
b2 = nn.Sequential(*resnet_block(64, 64, 2, first_block=True))
b3 = nn.Sequential(*resnet_block(64, 128, 2))
b4 = nn.Sequential(*resnet_block(128, 256, 2))
b5 = nn.Sequential(*resnet_block(256, 512, 2))
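As a small sanity check (not part of the original text), we can verify that b2 keeps the shape unchanged while b3 doubles the channels and halves the height and width:

X = torch.rand(1, 64, 56, 56)
print(b2(X).shape)  # torch.Size([1, 64, 56, 56])   -- first module keeps the shape
print(b3(X).shape)  # torch.Size([1, 128, 28, 28])  -- channels doubled, resolution halved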
Finally, a global average pooling layer and a fully connected layer for the output are added to ResNet.
net = nn.Sequential(b1, b2, b3, b4, b5,
                    nn.AdaptiveAvgPool2d((1, 1)),
                    nn.Flatten(), nn.Linear(512, 10))
Each module has 4 convolutional layers (not counting the $1\times 1$ convolutional layers on the shortcut paths). Together with the first $7\times 7$ convolutional layer and the final fully connected layer, there are 18 layers in total. Therefore, this model is commonly known as ResNet-18. Different ResNet models, such as the deeper 152-layer ResNet-152, can be obtained by configuring different numbers of channels and residual blocks in the modules. ResNet's structure is simple and easy to modify. These factors have led to the rapid and widespread use of ResNet. The figure below depicts the full ResNet-18.
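As a rough sanity check (this counting code is not from the original text), we can count the layers that give ResNet-18 its name: every convolution except the $1\times 1$ shortcut convolutions, plus the final fully connected layer:

# Count all convolutional layers except the 1x1 shortcut convolutions,
# plus the fully connected output layer.
num_layers = sum(1 for m in net.modules()
                 if (isinstance(m, nn.Conv2d) and m.kernel_size != (1, 1))
                 or isinstance(m, nn.Linear))
print(num_layers)  # expected: 18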
Before training ResNet, let's observe how the input shape changes across the different modules in ResNet. As in all previous architectures, the resolution decreases while the number of channels increases, until the global average pooling layer aggregates all the features.
X = torch.rand(size=(1, 1, 224, 224))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:\t', X.shape)
Sequential output shape:         torch.Size([1, 64, 56, 56])
Sequential output shape:         torch.Size([1, 64, 56, 56])
Sequential output shape:         torch.Size([1, 128, 28, 28])
Sequential output shape:         torch.Size([1, 256, 14, 14])
Sequential output shape:         torch.Size([1, 512, 7, 7])
AdaptiveAvgPool2d output shape:  torch.Size([1, 512, 1, 1])
Flatten output shape:            torch.Size([1, 512])
Linear output shape:             torch.Size([1, 10])
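As a quick sanity check (this arithmetic is not spelled out in the original text), the spatial size halves at each downsampling step:

$$224 \xrightarrow{7\times 7\ \text{conv},\ \text{stride } 2} 112 \xrightarrow{3\times 3\ \text{max pool},\ \text{stride } 2} 56 \xrightarrow{b_3} 28 \xrightarrow{b_4} 14 \xrightarrow{b_5} 7.$$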
Training the model
As before, we train ResNet on the Fashion-MNIST dataset.
lr, num_epochs, batch_size = 0.05, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())
loss 0.014, train acc 0.996, test acc 0.895
4680.2 examples/sec on cuda:0