This project is a PaddlePaddle reproduction of the CVPR 2018 paper "Context Encoding for Semantic Segmentation", built on the PaddleSeg suite, and achieves fairly good results.
The paper introduces a context encoding module that captures global contextual information and highlights the category information associated with the scene, which is equivalent to injecting prior knowledge about the scene, similar to an attention mechanism. Combining this module with a dilated convolution strategy and a multi-scale strategy, the paper proposes the semantic segmentation framework EncNet (Context Encoding Network), which obtains state-of-the-art results on multiple datasets.
Encoding Module: a fully connected (FC) layer encodes the rich features extracted by the backbone. One branch of the encoding is fed directly to SE-loss to predict which categories appear in the scene; the other branch predicts a scaling factor for each category, which reweights the corresponding feature channels. The reweighted features are then upsampled and the segmentation loss is computed.
Featuremap Attention: the feature map passes through an encoding layer to obtain a context embedding, which an FC layer then maps to a class-wise score used as channel weights.
Semantic Encoding Loss (SE-loss): an FC layer with sigmoid activation is added on top of the encoding layer to separately predict which target categories are present in the scene, trained with a binary cross-entropy loss. Unlike the per-pixel loss, SE-loss contributes equally for targets of different sizes, which improves the detection of small targets.
2. Reproduction accuracy
| Model | mIoU (%) |
| --- | --- |
| EncNet + ResNet101 (this project, Paddle) | 78.3 |
Due to the randomness of training, this is slightly lower than the existing reproduction, but it still counts as a successful reproduction ~
NOTE: the result was trained on 4 V100 cards with batch size 8 (2 per card) for 80k iterations.
3. Dataset
The dataset used is Cityscapes:
- Dataset size: dense pixel annotation over 19 categories; 5,000 high-quality pixel-level annotated images at 1024×2048 resolution / 20,000 weakly annotated frames
- Training set: 2975 images
- Validation set: 500 images
- Test set: 1525 images
The dataset should have the following structure:

```
data/
├── cityscapes
│   ├── gtFine
│   │   ├── test
│   │   ├── train
│   │   └── val
│   ├── leftImg8bit
│   │   ├── test
│   │   │   ├── berlin
│   │   │   ├── ...
│   │   │   └── munich
│   │   ├── train
│   │   │   ├── aachen
│   │   │   ├── ...
│   │   │   └── zurich
│   │   └── val
│   │       ├── frankfurt
│   │       ├── lindau
│   │       └── munster
│   ├── train.txt
│   ├── val.txt
│   ├── test.txt
```
The details of the paper are best understood together with the code. The project is built on the PaddleSeg development kit, and all training and evaluation methods are the same as in PaddleSeg; refer to the PaddleSeg documentation for details.
The code is mainly divided into two parts: model and loss.
- MODEL: located in ENCNet/paddleseg/models/encnet.py
```python
class EncModule(nn.Layer):
    """Encoding Module used in EncNet.

    Args:
        in_channels (int): Input channels.
        num_codes (int): Number of code words.
    """

    def __init__(self, in_channels, num_codes):
        super(EncModule, self).__init__()
        self.encoding_project = layers.ConvBNReLU(in_channels, in_channels, 1)
        # change to 1d
        self.encoding = nn.Sequential(
            Encoding(in_channels, num_codes),
            nn.BatchNorm1D(num_codes),
            nn.ReLU())
        self.fc = nn.Sequential(
            nn.Linear(in_channels, in_channels),
            nn.Sigmoid())

    def forward(self, x):
        """Forward function."""
        encoding_projection = self.encoding_project(x)
        encoding_feat = self.encoding(encoding_projection).mean(axis=1)
        batch_size, channels, _, _ = x.shape
        gamma = self.fc(encoding_feat)
        y = gamma.reshape((batch_size, channels, 1, 1))
        output = F.relu_(x + x * y)
        return encoding_feat, output
```
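The `Encoding` layer used above is not shown in this snippet. Conceptually, it learns a dictionary of `num_codes` codewords, soft-assigns every spatial feature to each codeword, and aggregates the residuals. A minimal NumPy sketch of that idea (function name, shapes, and the plain softmax assignment are illustrative assumptions, not the project's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoding_forward(x, codewords, scale):
    """Soft-assign spatial features to codewords and aggregate residuals.

    x:         (B, C, H, W) input feature map
    codewords: (K, C) learnable dictionary
    scale:     (K,)  learnable smoothing factors
    returns:   (B, K, C) encoded statistics
    """
    b, c, h, w = x.shape
    feats = x.reshape(b, c, h * w).transpose(0, 2, 1)         # (B, N, C)
    residuals = feats[:, :, None, :] - codewords[None, None]  # (B, N, K, C)
    sq_dist = (residuals ** 2).sum(-1)                        # (B, N, K)
    assign = softmax(-scale[None, None] * sq_dist, axis=-1)   # (B, N, K)
    return (assign[..., None] * residuals).sum(axis=1)        # (B, K, C)
```

In `EncModule` this (B, K, C) output is batch-normalized over the K codewords and averaged along `axis=1` to give the (B, C) vector `encoding_feat` fed to the FC branches.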
- SE-LOSS: `EncodingBCELoss`, located in ENCNet/paddleseg/models/losses/encoding_cross_entropy_loss.py
It converts the semantic segmentation label into a one-hot encoding of the categories present in the scene, then computes the binary cross-entropy loss against the semantic encoding output of the model.
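That conversion can be sketched as follows. This is a NumPy illustration of the idea, not the project's exact code; the ignore index of 255 and the standard BCE form are assumptions:

```python
import numpy as np

def se_loss_target(label, num_classes, ignore_index=255):
    """Turn an (H, W) segmentation label map into a (num_classes,)
    multi-hot vector marking which categories appear in the scene."""
    target = np.zeros(num_classes, dtype=np.float32)
    present = np.unique(label)
    present = present[present != ignore_index]  # drop ignored pixels
    target[present] = 1.0
    return target

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy between sigmoid outputs and multi-hot targets."""
    pred = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()
```

Because each category contributes one entry to the target vector regardless of how many pixels it covers, small objects weigh as much as large ones, which is the point of SE-loss.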
Paper: Context Encoding for Semantic Segmentation
Original project: https://github.com/zhanghang1989/PyTorch-Encoding
```
# Create cityscapes folder
!mkdir data/cityscapes/
# Unzip gtFine in the dataset
!unzip -nq -d data/gtFine/ data/data48855/gtFine_train.zip
!unzip -nq -d data/gtFine/ data/data48855/gtFine_val.zip
!unzip -nq -d data/gtFine/ data/data48855/gtFine_test.zip
!mv data/gtFine/ data/cityscapes/
# Unzip leftImg8bit in the dataset
!unzip -nq -d data/leftImg8bit/ data/data48855/leftImg8bit_train.zip
!unzip -nq -d data/leftImg8bit/ data/data48855/leftImg8bit_val.zip
!unzip -nq -d data/leftImg8bit/ data/data48855/leftImg8bit_test.zip
!mv data/leftImg8bit/ data/cityscapes/
```
```
# Generate train.txt, val.txt and other list files
!python ENCNet/tools/create_dataset_list.py /home/aistudio/data/cityscapes/ --type cityscapes --separator ","
```
The project provides pre-trained models and logs, which are attached to the dataset and can be viewed after decompression. The output folder contains the training logs and VisualDL logs.
```
!python ENCNet/train.py --config ENCNet/configs/encnet/encnet_r101_d8_512x1024_80k_cityscapes.yml --num_workers 0 --use_vdl --do_eval --save_interval 1000 --save_dir encnet_r101_d8_512x1024_80k_cityscapes
```
Select a pre-trained model and evaluate it by modifying "--model_path /home/aistudio/output/model.pdparams".
```
!python ENCNet/val.py --config ENCNet/configs/encnet/encnet_r101_d8_512x1024_80k_cityscapes.yml --model_path data/data118662/model.pdparams
```
Context is very important for pixel-level tasks; modeling the relationships between pixels is highly beneficial for semantic segmentation results.
If you like this project, feel free to follow it ~