preface
This article records some classic convolutional network architectures and the corresponding PyTorch code.
1. EfficientNetV2
1.1 network structure
EfficientNetV2-s network structure
In the source code, the number of output channels of stage 6 is 256, and that of stage 7 is 1280.
Fused-MBConv module
In addition, the Fused-MBConv module in the source code has no SE module, and the Dropout it uses does not randomly deactivate individual nodes; it randomly drops the main branch of the whole block (stochastic depth).
As in MBConv, the shortcut branch exists only when stride = 1 and the input and output channels are the same.
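The three properties above (no SE, block-level drop, conditional shortcut) can be sketched as follows. This is a minimal illustration under stated assumptions, not the official implementation: the class names `DropPath` and `FusedMBConv`, the SiLU activation, and the default `expand_ratio` and `drop_prob` values are all chosen here for the example.

```python
import torch
import torch.nn as nn
from torch import Tensor


class DropPath(nn.Module):
    """Stochastic depth (hypothetical helper): zero the whole branch per sample."""

    def __init__(self, drop_prob: float = 0.0):
        super().__init__()
        if not 0.0 <= drop_prob < 1.0:
            raise ValueError("drop_prob must be in [0, 1)")
        self.drop_prob = drop_prob

    def forward(self, x: Tensor) -> Tensor:
        # Identity at inference time or when dropping is disabled.
        if self.drop_prob == 0.0 or not self.training:
            return x
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over C, H, W.
        mask = torch.rand(x.shape[0], 1, 1, 1, device=x.device) < keep_prob
        # Rescale the kept samples so the expected value is unchanged.
        return x * mask.to(x.dtype) / keep_prob


class FusedMBConv(nn.Module):
    """Fused-MBConv sketch: 3x3 conv (expansion) + 1x1 projection, no SE."""

    def __init__(self, in_c: int, out_c: int, stride: int = 1,
                 expand_ratio: int = 4, drop_prob: float = 0.0):
        super().__init__()
        if stride not in (1, 2):
            raise ValueError("illegal stride value")
        # Shortcut only when the spatial size and channel count are preserved.
        self.use_shortcut = stride == 1 and in_c == out_c
        mid_c = in_c * expand_ratio
        if expand_ratio == 1:
            # No expansion: a single 3x3 conv maps in_c directly to out_c.
            layers = [nn.Conv2d(in_c, out_c, 3, stride, 1, bias=False),
                      nn.BatchNorm2d(out_c), nn.SiLU()]
        else:
            layers = [nn.Conv2d(in_c, mid_c, 3, stride, 1, bias=False),
                      nn.BatchNorm2d(mid_c), nn.SiLU(),
                      nn.Conv2d(mid_c, out_c, 1, bias=False),
                      nn.BatchNorm2d(out_c)]
        self.block = nn.Sequential(*layers)
        self.drop_path = DropPath(drop_prob)

    def forward(self, x: Tensor) -> Tensor:
        out = self.block(x)
        if self.use_shortcut:
            # The main branch may be dropped entirely; the shortcut always passes.
            out = x + self.drop_path(out)
        return out


x = torch.randn(2, 24, 32, 32)
same = FusedMBConv(24, 24, stride=1, drop_prob=0.2).eval()
down = FusedMBConv(24, 48, stride=2).eval()
print(same(x).shape, down(x).shape)
```

Note that the drop is applied only when the shortcut exists; without a shortcut, dropping the main branch would zero the block's entire output.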
1.1.1 ideas of the paper
The authors use training-aware NAS together with model scaling to jointly optimize the training speed and parameter efficiency of the network.
During training, progressive learning adjusts the regularization strength (dropout and data augmentation) according to the image size, which speeds up training while limiting the loss in accuracy.
In their study, the authors found that:
1. Training is slow with very large image sizes.
2. Depthwise convolutions are slow in the shallow layers of the network.
3. Scaling up every stage by the same factor is suboptimal.
Based on the above analysis, the authors propose the Fused-MBConv structure and progressive learning.
For progressive learning, the authors argue that different image sizes call for different degrees of regularization. Early in training, the network is trained with small images and weak regularization; the image size is then gradually increased and stronger regularization is added. This speeds up training without reducing accuracy.
Depthwise convolutions are slow in the early layers, mainly because they often cannot take full advantage of modern accelerators. The authors therefore remove the depthwise (DW) convolution from the shallow stages of the network by using Fused-MBConv there.
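To make the trade-off concrete, the weight counts of the two block layouts can be compared by hand. The sketch below is only an illustration: it assumes a 3x3 kernel, an expansion ratio of 4, no bias terms, and it ignores BatchNorm and SE. The function names are hypothetical.

```python
def mbconv_weights(in_c: int, out_c: int, expand_ratio: int = 4, k: int = 3) -> int:
    """Conv weights in an MBConv block: 1x1 expand + k x k depthwise + 1x1 project."""
    mid_c = in_c * expand_ratio
    expand = in_c * mid_c          # 1x1 expansion conv
    depthwise = k * k * mid_c      # one k x k filter per expanded channel
    project = mid_c * out_c        # 1x1 projection conv
    return expand + depthwise + project


def fused_mbconv_weights(in_c: int, out_c: int, expand_ratio: int = 4, k: int = 3) -> int:
    """Conv weights in a Fused-MBConv block: regular k x k expand + 1x1 project."""
    mid_c = in_c * expand_ratio
    fused = k * k * in_c * mid_c   # a dense k x k conv performs the expansion
    project = mid_c * out_c        # 1x1 projection conv
    return fused + project


# Example with 24 input and 24 output channels.
print("MBConv      :", mbconv_weights(24, 24))        # 5472
print("Fused-MBConv:", fused_mbconv_weights(24, 24))  # 23040
```

The fused layout has more weights, so its advantage is not a smaller model: the dense 3x3 convolution simply maps better onto accelerators than the depthwise one. This cost is why the replacement is limited to the shallow stages, where channel counts are small.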
For non-uniform scaling, the authors do not explain how the corresponding scaling parameters are obtained.
Additional optimization
(1) We limit the maximum inference image size to 480, because very large images usually lead to expensive memory and training speed overhead;
(2) As a heuristic, we also gradually add more layers to later stages (such as stages 5 and 6 in Table 4) to increase network capacity without adding too much runtime overhead.
Training starts with a small image size and weak regularization (epoch = 1), and then gradually increases the learning difficulty with larger image sizes and stronger regularization: a larger dropout rate, RandAugment magnitude, and mixup ratio.
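The schedule described above can be sketched as a linear interpolation between a starting and a final setting. This is a minimal sketch, not the authors' training code: the function name `progressive_schedule`, the choice of 4 stages, and all numeric ranges below are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Range = Tuple[float, float]  # (value at the first stage, value at the last stage)


def progressive_schedule(num_stages: int, image_size: Range, dropout: Range,
                         randaug_magnitude: Range, mixup_alpha: Range) -> List[Dict[str, float]]:
    """Linearly interpolate image size and regularization strength per stage."""

    def lerp(bounds: Range, t: float) -> float:
        lo, hi = bounds
        return lo + (hi - lo) * t

    schedule = []
    for i in range(num_stages):
        # t goes from 0 (first stage) to 1 (last stage).
        t = i / (num_stages - 1) if num_stages > 1 else 1.0
        schedule.append({
            "image_size": int(round(lerp(image_size, t))),
            "dropout": lerp(dropout, t),
            "randaug_magnitude": lerp(randaug_magnitude, t),
            "mixup_alpha": lerp(mixup_alpha, t),
        })
    return schedule


# Illustrative ranges only (assumed for this example).
stages = progressive_schedule(
    num_stages=4,
    image_size=(128, 300),
    dropout=(0.1, 0.3),
    randaug_magnitude=(5, 15),
    mixup_alpha=(0.0, 0.2),
)
for i, cfg in enumerate(stages):
    print(i, cfg)
```

In a training loop, each stage would rebuild the data pipeline with its `image_size` and augmentation settings and update the model's dropout rate before training for that stage's share of the epochs.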
1.1.2 summary and highlights
1. Remove the DW convolution in the shallow layers of the network and use the Fused-MBConv module instead.
2. Use progressive learning with gradually strengthened regularization to accelerate network training.
3. Use a non-uniform scaling strategy.
reference
EfficientNetV2: Smaller Models and Faster Training
1.2 code
# MobileNetV3 (large / small) built from SE and inverted-residual blocks
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Callable, List, Optional
from functools import partial
from torch import Tensor


def _make_divisible(ch, divisor=8, min_ch=None):
    if min_ch is None:
        min_ch = divisor
    # Round to the nearest multiple of divisor (up or down)
    new_ch = max(min_ch, int(ch + divisor / 2) // divisor * divisor)
    # Make sure rounding down does not remove more than 10% of the channels
    if new_ch < 0.9 * ch:
        new_ch += divisor
    return new_ch


class ConvBNActivation(nn.Sequential):
    def __init__(self,
                 in_planes: int,
                 out_planes: int,
                 kernel_size: int = 3,
                 stride: int = 1,
                 groups: int = 1,
                 norm_layer: Optional[Callable[..., nn.Module]] = None,
                 activation_layer: Optional[Callable[..., nn.Module]] = None):
        # "Same" padding for odd kernel sizes
        padding = (kernel_size - 1) // 2
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if activation_layer is None:
            activation_layer = nn.ReLU6
        super(ConvBNActivation, self).__init__(
            nn.Conv2d(in_planes, out_planes,
                      kernel_size=kernel_size,
                      stride=stride,  # the stride must be passed on to the convolution
                      padding=padding,
                      groups=groups,
                      bias=False),
            norm_layer(out_planes),
            activation_layer(inplace=True)
        )


# SE module
class SqueezeExcitation(nn.Module):
    def __init__(self, input_c: int, squeeze_factor: int = 4):
        super(SqueezeExcitation, self).__init__()
        squeeze_c = _make_divisible(input_c // squeeze_factor, 8)
        self.fc1 = nn.Conv2d(input_c, squeeze_c, 1)
        self.fc2 = nn.Conv2d(squeeze_c, input_c, 1)

    def forward(self, x: Tensor) -> Tensor:
        # Global average pooling -> FC -> ReLU -> FC -> hard sigmoid
        scale = F.adaptive_avg_pool2d(x, output_size=(1, 1))
        scale = self.fc1(scale)
        scale = F.relu(scale, inplace=True)
        scale = self.fc2(scale)
        scale = F.hardsigmoid(scale)
        return scale * x


# width_factor is the hyperparameter that scales the number of channels
class InvertedResidualConfig:
    # The argument order matches the bneck_conf(...) calls below:
    # input_c, kernel_size, expsize, output_c, use_se, activation_func, stride
    def __init__(self,
                 input_c: int,
                 kernel_size: int,
                 expsize: int,
                 output_c: int,
                 use_se: bool,
                 activation_func: str,
                 stride: int,
                 width_factor: float):
        self.input_c = self.changeSize(input_c, width_factor)
        self.kernel_size = kernel_size
        self.expsize = self.changeSize(expsize, width_factor)
        self.output_c = self.changeSize(output_c, width_factor)
        self.use_se = use_se
        self.use_hs = activation_func == "HS"
        self.stride = stride

    @staticmethod
    def changeSize(ch: int, factor: float, divisor: int = 8):
        return _make_divisible(ch * factor, divisor)


class InvertedResidual(nn.Module):
    def __init__(self,
                 cfg: InvertedResidualConfig,
                 norm_layer: Callable[..., nn.Module]):
        super(InvertedResidual, self).__init__()
        if cfg.stride not in [1, 2]:
            raise ValueError('illegal stride value')
        # The shortcut exists only when the shape is preserved
        self.use_shortcut = cfg.stride == 1 and cfg.input_c == cfg.output_c

        layers: List[nn.Module] = []
        activation_func = nn.Hardswish if cfg.use_hs else nn.ReLU

        # 1x1 expansion conv (skipped when there is no expansion)
        if cfg.input_c != cfg.expsize:
            layers.append(ConvBNActivation(cfg.input_c,
                                           cfg.expsize,
                                           kernel_size=1,
                                           norm_layer=norm_layer,
                                           activation_layer=activation_func))
        # Depthwise conv
        layers.append(ConvBNActivation(cfg.expsize,
                                       cfg.expsize,
                                       groups=cfg.expsize,
                                       kernel_size=cfg.kernel_size,
                                       stride=cfg.stride,
                                       norm_layer=norm_layer,
                                       activation_layer=activation_func))
        if cfg.use_se:
            layers.append(SqueezeExcitation(cfg.expsize))
        # 1x1 projection conv with linear activation
        layers.append(ConvBNActivation(cfg.expsize,
                                       cfg.output_c,
                                       kernel_size=1,
                                       norm_layer=norm_layer,
                                       activation_layer=nn.Identity))
        self.block = nn.Sequential(*layers)
        self.out_channel = cfg.output_c
        self.is_strided = cfg.stride > 1

    def forward(self, x: Tensor) -> Tensor:
        result = self.block(x)
        if self.use_shortcut:
            result = result + x
        return result


class MobileNetV3(nn.Module):
    def __init__(self,
                 inverted_setting: List[InvertedResidualConfig],
                 last_channel: int,
                 num_classes: int = 1000,
                 block: Optional[Callable[..., nn.Module]] = None,
                 norm_layer: Optional[Callable[..., nn.Module]] = None):
        super(MobileNetV3, self).__init__()
        if not inverted_setting:
            raise ValueError("The inverted_setting should not be empty")
        elif not (isinstance(inverted_setting, list) and
                  all(isinstance(s, InvertedResidualConfig) for s in inverted_setting)):
            raise TypeError("illegal type of inverted_setting")
        if block is None:
            block = InvertedResidual
        if norm_layer is None:
            norm_layer = partial(nn.BatchNorm2d, eps=0.001, momentum=0.01)

        layers: List[nn.Module] = []
        # First conv layer
        firstconv_output_c = inverted_setting[0].input_c
        layers.append(ConvBNActivation(3, firstconv_output_c,
                                       kernel_size=3, stride=2,
                                       norm_layer=norm_layer,
                                       activation_layer=nn.Hardswish))
        # Inverted residual blocks
        for cnf in inverted_setting:
            layers.append(block(cnf, norm_layer))
        # Last conv layer; the paper fixes its width at 6x the input channels
        lastconv_input_c = inverted_setting[-1].output_c
        lastconv_output_c = 6 * lastconv_input_c
        layers.append(ConvBNActivation(lastconv_input_c,
                                       lastconv_output_c,
                                       kernel_size=1,
                                       norm_layer=norm_layer,
                                       activation_layer=nn.Hardswish))
        self.features = nn.Sequential(*layers)
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Sequential(
            nn.Linear(lastconv_output_c, last_channel),
            nn.Hardswish(inplace=True),
            nn.Dropout(0.2, inplace=True),
            nn.Linear(last_channel, num_classes)
        )

    def forward_impl(self, x: Tensor) -> Tensor:
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

    def forward(self, x: Tensor) -> Tensor:
        return self.forward_impl(x)


def mobilenet_v3_large(num_classes: int = 100, reduced_tail: bool = False) -> MobileNetV3:
    # Width multiplier that scales the number of channels
    width_multi = 1.0
    bneck_conf = partial(InvertedResidualConfig, width_factor=width_multi)
    changeSize = partial(InvertedResidualConfig.changeSize, factor=width_multi)
    # Option from the official PyTorch implementation: reduces the channels of the last three blocks
    reduce_divider = 2 if reduced_tail else 1
    inverted_residual_setting = [
        # input_c, kernel_size, expsize, output_c, use_se, activation, stride
        bneck_conf(16, 3, 16, 16, False, "RE", 1),
        bneck_conf(16, 3, 64, 24, False, "RE", 2),
        bneck_conf(24, 3, 72, 24, False, "RE", 1),
        bneck_conf(24, 5, 72, 40, True, "RE", 2),
        bneck_conf(40, 5, 120, 40, True, "RE", 1),
        bneck_conf(40, 5, 120, 40, True, "RE", 1),
        bneck_conf(40, 3, 240, 80, False, "HS", 2),
        bneck_conf(80, 3, 200, 80, False, "HS", 1),
        bneck_conf(80, 3, 184, 80, False, "HS", 1),
        bneck_conf(80, 3, 184, 80, False, "HS", 1),
        # output_c must be 112 here so that it matches the next block's input_c
        bneck_conf(80, 3, 480, 112, True, "HS", 1),
        bneck_conf(112, 3, 672, 112, True, "HS", 1),
        bneck_conf(112, 5, 672, 160 // reduce_divider, True, "HS", 2),
        bneck_conf(160 // reduce_divider, 5, 960 // reduce_divider, 160 // reduce_divider, True, "HS", 1),
        bneck_conf(160 // reduce_divider, 5, 960 // reduce_divider, 160 // reduce_divider, True, "HS", 1)
    ]
    last_channel = changeSize(1280 // reduce_divider)
    return MobileNetV3(inverted_setting=inverted_residual_setting,
                       last_channel=last_channel,
                       num_classes=num_classes)


def mobilenet_v3_small(num_classes: int = 100, reduced_tail: bool = False) -> MobileNetV3:
    # Width multiplier that scales the number of channels
    width_multi = 1.0
    bneck_conf = partial(InvertedResidualConfig, width_factor=width_multi)
    changeSize = partial(InvertedResidualConfig.changeSize, factor=width_multi)
    # Option from the official PyTorch implementation: reduces the channels of the last three blocks
    reduce_divider = 2 if reduced_tail else 1
    inverted_residual_setting = [
        # input_c, kernel_size, expsize, output_c, use_se, activation, stride
        bneck_conf(16, 3, 16, 16, True, "RE", 2),
        bneck_conf(16, 3, 72, 24, False, "RE", 2),
        bneck_conf(24, 3, 88, 24, False, "RE", 1),
        bneck_conf(24, 5, 96, 40, True, "RE", 2),
        bneck_conf(40, 5, 240, 40, True, "HS", 1),
        bneck_conf(40, 5, 240, 40, True, "HS", 1),
        bneck_conf(40, 5, 120, 48, True, "HS", 1),
        bneck_conf(48, 5, 144, 48, True, "HS", 1),
        bneck_conf(48, 5, 288, 96 // reduce_divider, False, "HS", 1),
        bneck_conf(96 // reduce_divider, 5, 576 // reduce_divider, 96 // reduce_divider, True, "HS", 1),
        bneck_conf(96 // reduce_divider, 5, 576 // reduce_divider, 96 // reduce_divider, True, "HS", 1)
    ]
    last_channel = changeSize(1024 // reduce_divider)
    return MobileNetV3(inverted_setting=inverted_residual_setting,
                       last_channel=last_channel,
                       num_classes=num_classes)