[Semantic Segmentation Special Topic] Work related to semantic segmentation -- an introduction to the U-Net network

U-Net: Convolutional Networks for Biomedical Image Segmentation

The U-Net architecture consists of a contracting path that captures context and a symmetric expanding path that enables precise localization. The paper showed that such a network can be trained end-to-end from very few images, and that it outperformed the prior best method on the ISBI biomedical segmentation challenges.

The contracting path consists of the repeated application of two 3×3 convolutions, each followed by a ReLU activation, and a 2×2 max pooling operation with stride 2 for downsampling.

Each step of the expanding path upsamples the feature map with a 2×2 transposed convolution ("up-convolution"), which halves the number of feature channels while doubling the spatial resolution, and concatenates the result with the corresponding feature map from the contracting path.
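
As a minimal sketch of one expanding step in Keras (the full decoder later in this post uses UpSampling2D plus a 3×3 convolution instead; the filter counts and "same" padding here are illustrative, while the paper uses unpadded convolutions):

from keras.layers import Conv2D, Conv2DTranspose, concatenate

def up_block(x, skip, filters):
    # 2x2 transposed convolution: doubles spatial size, halves channels
    x = Conv2DTranspose(filters, (2, 2), strides=(2, 2), padding="same")(x)
    # skip connection: concatenate the encoder feature map along the channel axis
    x = concatenate([skip, x], axis=-1)
    # two 3x3 convolutions, each followed by a ReLU, as in the paper
    x = Conv2D(filters, (3, 3), activation="relu", padding="same")(x)
    x = Conv2D(filters, (3, 3), activation="relu", padding="same")(x)
    return x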

At the final layer, a 1×1 convolution maps each feature vector to the desired number of classes.

  • a. U-Net is based on the FCN architecture. The authors modify and extend this framework so that it yields very accurate segmentation results from very few training images.
  • b. An upsampling stage is added with a large number of feature channels, which allows the network to propagate texture information from the original image to the higher-resolution layers.
  • c. U-Net has no fully connected layers and uses valid (unpadded) convolutions throughout, which guarantees that each output pixel is predicted from its full context. As a consequence, the input and output image sizes differ (although the Keras code below uses "same" padding, so they match). For large images, an overlap-tile strategy yields seamless segmentation output.
  • d. To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image (see the mirror-padding sketch after this list). Feeding in a larger image would also work, but this strategy matters when GPU memory is limited.
  • e. Another difficulty in cell segmentation is separating touching cells of the same class. The paper therefore proposes a weighted loss, which gives a large weight to the background labels separating two touching cells (see the weight-map sketch after this list).
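
As a rough sketch of how such a weight map could be computed with SciPy distance transforms (my reconstruction of Eq. 2 of the paper, not the authors' code; w0=10 and sigma=5 are the values reported in the paper, and at least two cell masks are assumed):

import numpy as np
from scipy.ndimage import distance_transform_edt

def unet_weight_map(masks, w_c, w0=10.0, sigma=5.0):
    # masks: (num_cells, H, W) boolean array, one binary mask per cell
    # w_c:   (H, W) class-balancing weight map
    dists = np.stack([distance_transform_edt(~m) for m in masks])
    dists.sort(axis=0)
    d1, d2 = dists[0], dists[1]          # distance to nearest / second-nearest cell
    background = ~np.any(masks, axis=0)  # the extra weight applies to background only
    return w_c + w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2)) * background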

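A minimal sketch of the mirror extrapolation from point d, for an (H, W, C) image (the default pad width of 92 pixels matches the paper's 572→388 tile geometry, but any width works):

import numpy as np

def mirror_pad(image, pad=92):
    # Extrapolate the missing border context by reflecting the image
    return np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
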
Innovation and key points

  • 1. U-Net simply concatenates the encoder feature map with the corresponding upsampled decoder feature map at each stage, forming the characteristic U-shaped (ladder) structure.
  • 2. This skip-connection architecture allows the decoder to recover, at each stage, fine-grained features that were lost during downsampling in the encoder.
  • 3. Transposed convolution is used for upsampling (the Keras code below substitutes UpSampling2D followed by a 3×3 convolution).

Thesis structure

  • Introduction
  • Network Architecture
  • Training
    • The weight map is computed in advance so that each pixel gets its own weight in the loss function; this compensates for the different frequencies of the pixel classes in the training data.
    • The softmax and cross-entropy formulas are given for reference (see the Formula part below).
    • Data augmentation: the paper relies heavily on augmentation, and dropout layers at the end of the contracting path perform further implicit data augmentation (a sketch follows this list).
    • An improved, weighted loss is proposed.
  • Experiments
  • Conclusion
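
The paper singles out elastic deformations as the key augmentation for microscopy images. A common way to implement them (a sketch following the classic elastic-distortion recipe of Simard et al., not the authors' code; alpha and sigma are illustrative, and the same deformation must be applied to the image and its label mask):

import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, alpha=34.0, sigma=4.0, rng=np.random):
    # image: 2-D array (H, W); returns a smoothly deformed copy
    h, w = image.shape
    # Random per-pixel displacements, smoothed by a Gaussian and scaled by alpha
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys + dy, xs + dx])
    # Bilinear interpolation at the displaced coordinates
    return map_coordinates(image, coords, order=1, mode="reflect")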

Formula part
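
The key formulas from the paper are the pixel-wise softmax, the weighted cross-entropy loss, and the weight map:

$$p_k(\mathbf{x}) = \frac{\exp(a_k(\mathbf{x}))}{\sum_{k'=1}^{K} \exp(a_{k'}(\mathbf{x}))}$$

$$E = \sum_{\mathbf{x} \in \Omega} w(\mathbf{x}) \log p_{\ell(\mathbf{x})}(\mathbf{x})$$

$$w(\mathbf{x}) = w_c(\mathbf{x}) + w_0 \cdot \exp\left(-\frac{(d_1(\mathbf{x}) + d_2(\mathbf{x}))^2}{2\sigma^2}\right)$$

Here a_k(x) is the activation in feature channel k at pixel x, K is the number of classes, ℓ(x) is the true label of each pixel, w_c is the class-balancing weight map, and d_1 and d_2 denote the distances to the border of the nearest and second-nearest cell; the paper sets w_0 = 10 and σ ≈ 5 pixels.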

Code architecture (a Keras implementation with a VGG16-style encoder)

encoder part:

from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D, concatenate, BatchNormalization, Activation, Reshape
from keras.models import Model


def vgg_encoder(n_classes, input_height, input_width):

    assert  input_height % 32 == 0
    assert  input_width % 32 == 0

    img_input = Input(shape=(input_height, input_width, 3))

    # Block1
    x = Conv2D(64,(3,3),activation="relu",padding="same",name="block1_conv1")(img_input)
    x = Conv2D(64,(3,3),activation="relu",padding="same",name="block1_conv2")(x)
    x = MaxPooling2D((2,2),strides=(2,2),name="block1_pool")(x)
    f1 = x

    # Block2
    x = Conv2D(128,(3,3),activation="relu",padding="same",name="block2_conv1")(x)
    x = Conv2D(128,(3,3),activation="relu",padding="same",name="block2_conv2")(x)
    x = MaxPooling2D((2,2),strides=(2,2),name="block2_pool")(x)
    f2 = x

    # Block3
    x = Conv2D(256,(3,3),activation="relu",padding="same",name="block3_conv1")(x)
    x = Conv2D(256,(3,3),activation="relu",padding="same",name="block3_conv2")(x)
    x = Conv2D(256,(3,3),activation="relu",padding="same",name="block3_conv3")(x)
    x = MaxPooling2D((2,2),strides=(2,2),name="block3_pool")(x)
    f3 = x

    # Block4
    x = Conv2D(512,(3,3),activation="relu",padding="same",name="block4_conv1")(x)
    x = Conv2D(512,(3,3),activation="relu",padding="same",name="block4_conv2")(x)
    x = Conv2D(512,(3,3),activation="relu",padding="same",name="block4_conv3")(x)
    x = MaxPooling2D((2,2),strides=(2,2),name="block4_pool")(x)
    f4 = x

    # Block5
    x = Conv2D(512,(3,3),activation="relu",padding="same",name="block5_conv1")(x)
    x = Conv2D(512,(3,3),activation="relu",padding="same",name="block5_conv2")(x)
    x = Conv2D(512,(3,3),activation="relu",padding="same",name="block5_conv3")(x)
    x = MaxPooling2D((2,2),strides=(2,2),name="block5_pool")(x)
    f5 = x

    return img_input, [f1, f2, f3, f4, f5]



decoder part:

def UNet_decoder(n_classes,levels):

    [f1, f2, f3, f4, f5] = levels

    # Decoding layer 1
    y = UpSampling2D((2,2))(f5)
    y = concatenate([f4,y],axis=-1)
    y = Conv2D(512,(3,3),padding="same")(y)
    y = BatchNormalization()(y)

    # Decoding layer 2
    y = UpSampling2D((2,2))(y)
    y = concatenate([f3,y],axis=-1)
    y = Conv2D(256,(3,3),padding="same")(y)
    y = BatchNormalization()(y)

    # Decoding layer 3
    y = UpSampling2D((2,2))(y)
    y = concatenate([f2,y],axis=-1)
    y = Conv2D(128,(3,3),padding="same")(y)
    y = BatchNormalization()(y)

    # Decoding layer 4
    y = UpSampling2D((2,2))(y)
    y = concatenate([f1,y],axis=-1)
    y = Conv2D(64,(3,3),padding="same")(y)
    y = BatchNormalization()(y)

    # Decoding layer 5
    y = UpSampling2D((2,2))(y)
    y = Conv2D(64,(3,3),padding="same")(y)
    y = BatchNormalization()(y)

    # The last softmax layer
    y = Conv2D(n_classes,(1,1),padding="same")(y)
    y = BatchNormalization()(y)
    y = Activation("relu")(y)

    y = Reshape((-1,n_classes))(y)
    y = Activation("softmax")(y)

    return y

UNet part:

def UNet(n_classes,input_height,input_width):

    assert input_height % 32 == 0
    assert input_width % 32 == 0

    img_input, levels = vgg_encoder(n_classes, input_height, input_width)

    output = UNet_decoder(n_classes, levels)

    Vgg_UNet = Model(img_input, output)
    return Vgg_UNet
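
A quick usage check (a sketch; the 224×224 input size and 21 classes are illustrative). Note that the decoder flattens its output to (H*W, n_classes), so the prediction must be reshaped back into a 2-D label map:

import numpy as np

model = UNet(n_classes=21, input_height=224, input_width=224)
model.summary()

dummy = np.zeros((1, 224, 224, 3), dtype="float32")
pred = model.predict(dummy)                         # shape: (1, 224*224, 21)
label_map = pred.argmax(axis=-1).reshape(224, 224)  # per-pixel class indices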
