Deep residual shrinkage network: a new deep attention mechanism algorithm

This article introduces a new deep attention algorithm: the deep residual shrinkage network. Functionally, the deep residual shrinkage network is a feature learning method for strongly noisy or highly redundant data. The article first reviews the relevant background, then presents the motivation and implementation of the deep residual shrinkage network.

1. Relevant background

The deep residual shrinkage network builds mainly on three components: the deep residual network, the soft threshold function, and the attention mechanism.

1.1 Deep residual network

The deep residual network is undoubtedly one of the most successful deep learning algorithms of recent years, cited more than 40,000 times on Google Scholar. Compared with an ordinary convolutional neural network, the deep residual network adds cross-layer identity shortcuts, which ease the training of deep networks. The backbone of a deep residual network is a stack of many residual modules, one of which is shown in the figure below.
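The identity shortcut can be summarized as y = F(x) + x. A minimal NumPy sketch of this idea follows; the two-layer transform F, the weight shapes, and the small weight scale are illustrative, not the exact residual module of the paper:

```python
import numpy as np

def residual_block(x, w1, w2):
    """A minimal residual module: y = F(x) + x.

    F is a two-layer transform with ReLU; the identity shortcut adds the
    unchanged input back, so gradients can bypass F during backpropagation.
    """
    h = np.maximum(x @ w1, 0.0)   # first layer + ReLU
    fx = h @ w2                   # second layer: the residual function F(x)
    return fx + x                 # cross-layer identity shortcut

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((8, 8)) * 0.01   # near-zero weights, so F(x) is tiny
w2 = rng.standard_normal((8, 8)) * 0.01
y = residual_block(x, w1, w2)
# With near-zero weights, F(x) is close to 0 and the block is close to identity,
# which illustrates why residual modules are easy to train from initialization.
print(np.allclose(y, x, atol=0.01))
```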

1.2 Soft threshold function

The soft threshold function is a core step in most denoising methods. First, a positive threshold must be set. The threshold cannot be too large; specifically, it must not exceed the maximum absolute value of the input data, or the output would be all zeros. The soft threshold function then sets inputs whose absolute value is below the threshold to zero, and shrinks the remaining inputs toward zero. The relationship between input and output is shown in (a) below.

The derivative of the output y of the soft threshold function with respect to the input x is shown in (b) above. The derivative is either 0 or 1, which makes the soft threshold function somewhat similar to the ReLU activation function and likewise friendly to gradient backpropagation when training deep learning algorithms. It is worth noting that the choice of threshold directly affects the result of soft thresholding, and selecting it well remains an open problem.
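The two paragraphs above can be condensed into a few lines of NumPy; the sample values are illustrative:

```python
import numpy as np

def soft_threshold(x, tau):
    """Zero out inputs with |x| < tau; shrink the rest toward zero by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

x = np.array([-3.0, -0.5, 0.0, 0.4, 2.0])
y = soft_threshold(x, 1.0)   # tau = 1, which does not exceed max(|x|) = 3
print(y.tolist())            # entries below the threshold become 0; others shrink by 1
# The derivative dy/dx is 1 wherever the output is nonzero, and 0 elsewhere,
# mirroring the 0-or-1 derivative of ReLU.
```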

1.3 Attention mechanism

In recent years, the attention mechanism has been a hot topic in deep learning, and the Squeeze-and-Excitation Network (SENet) is one of the most classic attention algorithms. As shown in the figure below, SENet learns a set of weight coefficients through a small embedded network and uses them to weight the feature channels. This is in essence an attention mechanism: first evaluate the importance of each feature channel, then assign each channel an appropriate weight according to its importance.

As shown in the figure below, SENet can be integrated with the residual module. In this form, the cross-layer identity shortcut makes the network easier to train. It is also worth noting that the weight coefficients are computed from each sample's own features; in other words, every sample can have its own unique set of weight coefficients.
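The squeeze-excite-reweight sequence can be sketched in NumPy as follows; the tiny two-layer network and its random weights are illustrative stand-ins for the learned sub-network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_weighting(feature_map, w1, w2):
    """Squeeze-and-excitation sketch: global pooling -> small net -> channel weights."""
    # Squeeze: global average pooling over the spatial axes, (H, W, C) -> (C,)
    s = feature_map.mean(axis=(0, 1))
    # Excitation: a tiny two-layer network producing one weight in (0, 1) per channel
    h = np.maximum(s @ w1, 0.0)
    w = sigmoid(h @ w2)
    # Reweight: because the weights depend on s, every sample gets its own set
    return feature_map * w

rng = np.random.default_rng(1)
fmap = rng.standard_normal((8, 8, 16))   # a hypothetical (H, W, C) feature map
w1 = rng.standard_normal((16, 4))        # bottleneck down to C/4 channels
w2 = rng.standard_normal((4, 16))        # and back up to C channels
out = se_weighting(fmap, w1, w2)
print(out.shape)  # (8, 8, 16)
```

Since each weight lies in (0, 1), every channel is attenuated in proportion to its estimated importance rather than amplified.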

2. Deep residual shrinkage network

Next, this part introduces the motivation, implementation, and advantages of the deep residual shrinkage network.

2.1 Motivation

First, most real-world data, including images, speech, and vibration signals, contain some amount of noise or redundant information. Broadly speaking, any information in a sample that is irrelevant to the current pattern recognition task can be regarded as noise or redundancy, and such information is likely to hurt performance on that task.

Second, the amount of noise or redundancy usually differs between any two samples: some samples contain more, others less. This requires that the algorithm be able to set the relevant parameters individually, according to the characteristics of each sample.

Driven by these two points, can we introduce the soft threshold function from traditional signal denoising algorithms into the deep residual network? And how should the threshold of the soft threshold function be chosen? The deep residual shrinkage network gives an answer.

2.2 Implementation

The deep residual shrinkage network combines the deep residual network, SENet, and the soft threshold function. As shown in the figure below, it replaces the "reweighting" step of the residual-mode SENet with "soft thresholding". In SENet, the embedded small network outputs a set of weight coefficients; in the deep residual shrinkage network, the small network outputs a set of thresholds.

To obtain appropriate thresholds, the structure of the small network is also adjusted relative to the original SENet. Specifically, the threshold output by the small network is (the average of the absolute values of each feature channel) × (a coefficient between 0 and 1). This construction guarantees that every threshold is positive, and also that no threshold is too large, so the outputs cannot all become zero.
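The threshold construction above can be sketched in NumPy; the small network's random weights are illustrative stand-ins for the learned ones:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def drsn_thresholds(feature_map, w1, w2):
    """Channel-wise thresholds: (average of |features|) x (a coefficient in (0, 1))."""
    abs_mean = np.abs(feature_map).mean(axis=(0, 1))  # average absolute value per channel
    h = np.maximum(abs_mean @ w1, 0.0)                # small network, first layer + ReLU
    alpha = sigmoid(h @ w2)                           # sigmoid keeps each coefficient in (0, 1)
    return abs_mean * alpha                           # positive and below the channel mean

rng = np.random.default_rng(2)
fmap = rng.standard_normal((8, 8, 16))     # a hypothetical (H, W, C) feature map
w1 = rng.standard_normal((16, 16)) * 0.1   # illustrative random weights; in the
w2 = rng.standard_normal((16, 16)) * 0.1   # real network these are learned
tau = drsn_thresholds(fmap, w1, w2)

# Apply soft thresholding with the learned-style channel-wise thresholds
shrunk = np.sign(fmap) * np.maximum(np.abs(fmap) - tau, 0.0)

abs_mean = np.abs(fmap).mean(axis=(0, 1))
print(np.all((tau > 0) & (tau < abs_mean)))  # True: positive and not too large
```

Because the sigmoid coefficient is strictly between 0 and 1, each threshold is strictly below the channel's mean absolute value, and hence below its maximum absolute value, so the all-zero output case is ruled out.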

As shown in the figure below, the overall structure of the deep residual shrinkage network is the same as that of an ordinary deep residual network: an input layer, an initial convolutional layer, a series of basic modules, and finally global average pooling and a fully connected output layer.

2.3 Advantages

First, the threshold required by the soft threshold function is set automatically by a small network, avoiding the expert knowledge that manual threshold setting would require.

Second, the deep residual shrinkage network guarantees that the thresholds of the soft threshold function are positive and within a suitable range, avoiding the situation where the output is all zeros.

At the same time, each sample gets its own unique set of thresholds, which makes the deep residual shrinkage network suitable for situations where different samples contain different amounts of noise.

3. Conclusion

Because noise and redundant information are ubiquitous, the deep residual shrinkage network, or more generally the idea of "attention mechanism" + "soft thresholding", may have broad room for extension and application.

Original paper:

M. Zhao, S. Zhong, X. Fu, B. Tang, and M. Pecht, "Deep Residual Shrinkage Networks for Fault Diagnosis," IEEE Transactions on Industrial Informatics, 2019, DOI: 10.1109/TII.2019.2943898

Keras code:

```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sat Dec 28 23:24:05 2019
Implemented using TensorFlow 1.0.1 and Keras 2.2.1

M. Zhao, S. Zhong, X. Fu, et al., Deep Residual Shrinkage Networks for Fault Diagnosis,
IEEE Transactions on Industrial Informatics, 2019, DOI: 10.1109/TII.2019.2943898
@author: super_9527
"""

from __future__ import print_function
import keras
import numpy as np
from keras.datasets import mnist
from keras.layers import Dense, Conv2D, BatchNormalization, Activation
from keras.layers import AveragePooling2D, Input, GlobalAveragePooling2D
from keras.regularizers import l2
from keras import backend as K
from keras.models import Model
from keras.layers.core import Lambda
K.set_learning_phase(1)

# Input image dimensions
img_rows, img_cols = 28, 28

# The data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

# Noised data
x_train = x_train.astype('float32') / 255. + 0.5*np.random.random([x_train.shape[0], img_rows, img_cols, 1])
x_test = x_test.astype('float32') / 255. + 0.5*np.random.random([x_test.shape[0], img_rows, img_cols, 1])
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# Convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

def abs_backend(inputs):
    return K.abs(inputs)

def expand_dim_backend(inputs):
    return K.expand_dims(K.expand_dims(inputs, 1), 1)

def sign_backend(inputs):
    return K.sign(inputs)

def pad_backend(inputs, in_channels, out_channels):
    # Zero-pad the channel dimension so the identity path matches the residual path
    pad_dim = (out_channels - in_channels) // 2
    inputs = K.expand_dims(inputs, -1)
    inputs = K.spatial_3d_padding(inputs, ((0, 0), (0, 0), (pad_dim, pad_dim)), 'channels_last')
    return K.squeeze(inputs, -1)

# Residual Shrinkage Block
def residual_shrinkage_block(incoming, nb_blocks, out_channels, downsample=False,
                             downsample_strides=2):

    residual = incoming
    in_channels = incoming.get_shape().as_list()[-1]

    for i in range(nb_blocks):

        identity = residual

        if not downsample:
            downsample_strides = 1

        residual = BatchNormalization()(residual)
        residual = Activation('relu')(residual)
        residual = Conv2D(out_channels, 3, strides=(downsample_strides, downsample_strides),
                          padding='same', kernel_initializer='he_normal',
                          kernel_regularizer=l2(1e-4))(residual)

        residual = BatchNormalization()(residual)
        residual = Activation('relu')(residual)
        residual = Conv2D(out_channels, 3, padding='same', kernel_initializer='he_normal',
                          kernel_regularizer=l2(1e-4))(residual)

        # Calculate global means
        residual_abs = Lambda(abs_backend)(residual)
        abs_mean = GlobalAveragePooling2D()(residual_abs)

        # Calculate scaling coefficients
        scales = Dense(out_channels, activation=None, kernel_initializer='he_normal',
                       kernel_regularizer=l2(1e-4))(abs_mean)
        scales = BatchNormalization()(scales)
        scales = Activation('relu')(scales)
        scales = Dense(out_channels, activation='sigmoid', kernel_regularizer=l2(1e-4))(scales)
        scales = Lambda(expand_dim_backend)(scales)

        # Calculate thresholds
        thres = keras.layers.multiply([abs_mean, scales])

        # Soft thresholding
        sub = keras.layers.subtract([residual_abs, thres])
        zeros = keras.layers.subtract([sub, sub])
        n_sub = keras.layers.maximum([sub, zeros])
        residual = keras.layers.multiply([Lambda(sign_backend)(residual), n_sub])

        # Downsampling (it is important to use the pool size of (1, 1))
        if downsample_strides > 1:
            identity = AveragePooling2D(pool_size=(1, 1), strides=(2, 2))(identity)

        # Zero-padding to match channels (it is important to use zero padding
        # rather than 1x1 convolution)
        if in_channels != out_channels:
            identity = Lambda(pad_backend, arguments={'in_channels': in_channels,
                                                      'out_channels': out_channels})(identity)
            in_channels = out_channels

        residual = keras.layers.add([residual, identity])

    return residual

# Define and train a model
inputs = Input(shape=input_shape)
net = Conv2D(8, 3, padding='same', kernel_initializer='he_normal', kernel_regularizer=l2(1e-4))(inputs)
net = residual_shrinkage_block(net, 1, 8, downsample=True)
net = BatchNormalization()(net)
net = Activation('relu')(net)
net = GlobalAveragePooling2D()(net)
outputs = Dense(10, activation='softmax', kernel_initializer='he_normal', kernel_regularizer=l2(1e-4))(net)
model = Model(inputs=inputs, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=100, epochs=5, verbose=1, validation_data=(x_test, y_test))

# Get results
K.set_learning_phase(0)
DRSN_train_score = model.evaluate(x_train, y_train, batch_size=100, verbose=0)
print('Train loss:', DRSN_train_score[0])
print('Train accuracy:', DRSN_train_score[1])
DRSN_test_score = model.evaluate(x_test, y_test, batch_size=100, verbose=0)
print('Test loss:', DRSN_test_score[0])
print('Test accuracy:', DRSN_test_score[1])
```

TFLearn code:

```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Dec 23 21:23:09 2019
Implemented using TensorFlow 1.0 and TFLearn 0.3.2

M. Zhao, S. Zhong, X. Fu, B. Tang, M. Pecht, Deep Residual Shrinkage Networks for Fault Diagnosis,
IEEE Transactions on Industrial Informatics, 2019, DOI: 10.1109/TII.2019.2943898

@author: super_9527
"""

from __future__ import division, print_function, absolute_import

import tflearn
import numpy as np
import tensorflow as tf
from tflearn.layers.conv import conv_2d

from tflearn.datasets import cifar10
(X, Y), (testX, testY) = cifar10.load_data()

X = X + np.random.random((50000, 32, 32, 3))*0.1
testX = testX + np.random.random((10000, 32, 32, 3))*0.1

# Transform labels to one-hot format
Y = tflearn.data_utils.to_categorical(Y, 10)
testY = tflearn.data_utils.to_categorical(testY, 10)

def residual_shrinkage_block(incoming, nb_blocks, out_channels, downsample=False,
                             downsample_strides=2, activation='relu', batch_norm=True,
                             bias=True, weights_init='variance_scaling',
                             bias_init='zeros', regularizer='L2', weight_decay=0.0001,
                             trainable=True, restore=True, reuse=False, scope=None,
                             name="ResidualBlock"):

    # Residual shrinkage blocks with channel-wise thresholds

    residual = incoming
    in_channels = incoming.get_shape().as_list()[-1]

    # Variable scope fix for older TF
    try:
        vscope = tf.variable_scope(scope, default_name=name, values=[incoming],
                                   reuse=reuse)
    except Exception:
        vscope = tf.variable_op_scope([incoming], scope, name, reuse=reuse)

    with vscope as scope:
        name = scope.name

        for i in range(nb_blocks):

            identity = residual

            if not downsample:
                downsample_strides = 1

            if batch_norm:
                residual = tflearn.batch_normalization(residual)
            residual = tflearn.activation(residual, activation)
            residual = conv_2d(residual, out_channels, 3,
                               downsample_strides, 'same', 'linear',
                               bias, weights_init, bias_init,
                               regularizer, weight_decay, trainable,
                               restore)

            if batch_norm:
                residual = tflearn.batch_normalization(residual)
            residual = tflearn.activation(residual, activation)
            residual = conv_2d(residual, out_channels, 3, 1, 'same',
                               'linear', bias, weights_init,
                               bias_init, regularizer, weight_decay,
                               trainable, restore)

            # Get thresholds and apply soft thresholding
            abs_mean = tf.reduce_mean(tf.reduce_mean(tf.abs(residual), axis=2, keep_dims=True),
                                      axis=1, keep_dims=True)
            scales = tflearn.fully_connected(abs_mean, out_channels//4, activation='linear',
                                             regularizer='L2', weight_decay=0.0001,
                                             weights_init='variance_scaling')
            scales = tflearn.batch_normalization(scales)
            scales = tflearn.activation(scales, 'relu')
            scales = tflearn.fully_connected(scales, out_channels, activation='linear',
                                             regularizer='L2', weight_decay=0.0001,
                                             weights_init='variance_scaling')
            scales = tf.expand_dims(tf.expand_dims(scales, axis=1), axis=1)
            thres = tf.multiply(abs_mean, tflearn.activations.sigmoid(scales))
            # Soft thresholding
            residual = tf.multiply(tf.sign(residual), tf.maximum(tf.abs(residual) - thres, 0))

            # Downsampling (it is important to use the pool size of 1)
            if downsample_strides > 1:
                identity = tflearn.avg_pool_2d(identity, 1,
                                               downsample_strides)

            # Zero-padding the identity path to match the new channel dimension
            if in_channels != out_channels:
                if (out_channels - in_channels) % 2 == 0:
                    ch = (out_channels - in_channels)//2
                    identity = tf.pad(identity,
                                      [[0, 0], [0, 0], [0, 0], [ch, ch]])
                else:
                    ch = (out_channels - in_channels)//2
                    identity = tf.pad(identity,
                                      [[0, 0], [0, 0], [0, 0], [ch, ch+1]])
                in_channels = out_channels

            residual = residual + identity

    return residual

# Real-time data preprocessing
img_prep = tflearn.ImagePreprocessing()

# Real-time data augmentation
img_aug = tflearn.ImageAugmentation()

# Build a Deep Residual Shrinkage Network with 3 blocks
net = tflearn.input_data(shape=[None, 32, 32, 3],
                         data_preprocessing=img_prep,
                         data_augmentation=img_aug)
net = tflearn.conv_2d(net, 16, 3, regularizer='L2', weight_decay=0.0001)
net = residual_shrinkage_block(net, 1, 16)
net = residual_shrinkage_block(net, 1, 32, downsample=True)
net = residual_shrinkage_block(net, 1, 32, downsample=True)
net = tflearn.batch_normalization(net)
net = tflearn.activation(net, 'relu')
net = tflearn.global_avg_pool(net)
# Regression
net = tflearn.fully_connected(net, 10, activation='softmax')
mom = tflearn.Momentum(0.1, lr_decay=0.1, decay_step=20000, staircase=True)
net = tflearn.regression(net, optimizer=mom, loss='categorical_crossentropy')
# Training
model = tflearn.DNN(net, checkpoint_path='model_cifar10',
                    max_checkpoints=10, tensorboard_verbose=0,
                    clip_gradients=0.)
model.fit(X, Y, n_epoch=200, snapshot_epoch=False, snapshot_step=500,
          show_metric=True, batch_size=100, shuffle=True,
          run_id='model_cifar10')
```