# 2020CVPR -- Deep Unfolding Network for Image Super-Resolution

[paper] : Deep Unfolding Network for Image Super-Resolution

[github] : https://github.com/cszn/USRNet

## Abstract

Learning-based single image super-resolution (SISR) methods are continuously showing superior effectiveness and efficiency over traditional model-based methods, largely due to the end-to-end training. However, different from model-based methods that can handle the SISR problem with different scale factors, blur kernels and noise levels under a unified MAP (maximum a posteriori) framework, learning-based methods generally lack such flexibility.

Problem statement:

First, the abstract affirms the advantages of deep learning-based SISR methods; then it points out their main weakness:

Model-based (non-deep-learning) SISR methods can handle LR images with different scale factors, blur kernels and noise levels under a single maximum a posteriori (MAP) framework, whereas deep learning-based SISR methods lack this flexibility.

To address this issue, this paper proposes an end-to-end trainable unfolding network which leverages both learning-based methods and model-based methods. Specifically, by unfolding the MAP inference via a half-quadratic splitting algorithm, a fixed number of iterations consisting of alternately solving a data subproblem and a prior subproblem can be obtained. The two subproblems then can be solved with neural modules, resulting in an end-to-end trainable, iterative network. As a result, the proposed network inherits the flexibility of model-based methods to super-resolve blurry, noisy images for different scale factors via a single model, while maintaining the advantages of learning-based methods.

Proposed solution:

To solve this problem, an end-to-end trainable unfolding network is proposed, which draws on both learning-based and model-based methods.

Specifically, unfolding the MAP inference with the half-quadratic splitting algorithm yields a fixed number of iterations, each of which alternately solves a data subproblem and a prior subproblem.

These two subproblems can then be solved with neural modules, forming an end-to-end trainable iterative network.

As a result, the network inherits the flexibility of model-based methods and can super-resolve blurry, noisy images at different scale factors with a single model, while keeping the advantages of learning-based methods.

Extensive experiments demonstrate the superiority of the proposed deep unfolding network in terms of flexibility, effectiveness and also generalizability.

This sentence summarizes the experimental results.

## Introduction

Let's see what the Introduction says.

Single image super-resolution (SISR) refers to the process of recovering the natural and sharp detailed high-resolution (HR) counterpart from a low-resolution (LR) image. It is one of the classical ill-posed inverse problems in low-level computer vision and has a wide range of real-world applications, such as enhancing the image visual quality on high-definition displays [42, 53] and improving the performance of other high-level vision tasks [13].

A general introduction to what SISR is.

Despite decades of studies, SISR still requires further study for academic and industrial purposes [35, 64].

The difficulty is mainly caused by the inconsistency between the simplistic degradation assumption of existing SISR methods and the complex degradations of real images [16]. Actually, for a scale factor of $s$, the classical (traditional) degradation model of SISR [17, 18, 37] assumes the LR image $\mathbf{y}$ is a blurred, decimated, and noisy version of an HR image $\mathbf{x}$. Mathematically, it can be expressed by

$$\mathbf{y} = (\mathbf{x} \otimes \mathbf{k})\downarrow_s + \mathbf{n}, \tag{1}$$

where $\otimes$ represents two-dimensional convolution of $\mathbf{x}$ with blur kernel $\mathbf{k}$, $\downarrow_s$ denotes the standard $s$-fold downsampler, i.e., keeping the upper-left pixel for each distinct $s \times s$ patch and discarding the others, and $\mathbf{n}$ is usually assumed to be additive white Gaussian noise (AWGN) specified by standard deviation (or noise level) $\sigma$ [71]. With a clear physical meaning, Eq. (1) can approximate a variety of LR images by setting proper blur kernels, scale factors and noises for an underlying HR image. In particular, Eq. (1) has been extensively studied in model-based methods which solve a combination of a data term and a prior term under the MAP framework.
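To make the degradation model concrete, here is a minimal NumPy sketch of Eq. (1). It is an illustration rather than the authors' code: the function names `kernel_to_otf` and `degrade` are hypothetical, and the blur is assumed to use circular boundary conditions so that it can be computed with the FFT.

```python
import numpy as np

def kernel_to_otf(kernel, shape):
    """Zero-pad a small blur kernel to `shape` and move its center to index (0, 0)."""
    kh, kw = kernel.shape
    padded = np.zeros(shape)
    padded[:kh, :kw] = kernel
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.fft.fft2(padded)

def degrade(x, kernel, s, sigma, rng=None):
    """Classical degradation: blur x with `kernel`, downsample by s, add Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    # Circular 2-D convolution carried out in the Fourier domain (an assumption).
    blurred = np.real(np.fft.ifft2(kernel_to_otf(kernel, x.shape) * np.fft.fft2(x)))
    # s-fold downsampler: keep the upper-left pixel of each distinct s x s patch.
    low_res = blurred[::s, ::s]
    # Additive white Gaussian noise with standard deviation sigma.
    return low_res + sigma * rng.standard_normal(low_res.shape)

# Example: a 12x12 single-channel image, 3x3 box blur, scale factor 3.
x = np.random.default_rng(0).random((12, 12))
y = degrade(x, np.ones((3, 3)) / 9.0, s=3, sigma=0.01)
print(y.shape)
```

Changing `kernel`, `s` and `sigma` produces different LR images from the same HR image, which is the flexibility that Eq. (1) describes.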

This continues the introduction of SISR from the viewpoint of the mathematical model: the difficulty mainly comes from the mismatch between the simplistic degradation assumptions of existing SISR methods and the complex degradations of real images. The mathematical form of the problem is then explained in detail. For a given HR image, the LR image is determined by the blur kernel, the scale factor (downsampling ratio), and the noise.

Eq. (1) has been extensively studied in model-based approaches, which solve a combination of a data term and a prior term under the MAP framework.

Building on Eq. (1), the next paragraph discusses model-based and deep learning-based SISR methods.

Though model-based methods are usually algorithmically interpretable, they typically lack a standard criterion for their evaluation because, apart from the scale factor, Eq. (1) additionally involves a blur kernel and added noise. For convenience, researchers resort to bicubic degradation without consideration of blur kernel and noise level [14, 56, 60]. However, bicubic degradation is mathematically complicated [25], which in turn hinders the development of model-based methods. For this reason, recently proposed SISR solutions are dominated by learning-based methods that learn a mapping function from a bicubicly downsampled LR image to its HR estimation. Indeed, significant progress on improving PSNR [26, 70] and perceptual quality [31, 47, 58] for the bicubic degradation has been achieved by learning-based methods, among which convolutional neural network (CNN) based methods are the most popular, due to their powerful learning capacity and the speed of parallel computing. Nevertheless, little work has been done on applying CNNs to tackle Eq. (1) via a single model. Unlike model-based methods, CNNs usually lack flexibility to super-resolve blurry, noisy LR images for different scale factors via a single end-to-end trained model (see Fig. 1).

Figure 1. While a single degradation model (i.e., Eq. (1)) can result in various LR images for an HR image, with different blur kernels, scale factors and noise, the study of learning a single deep model to invert all such LR images to HR image is still lacking.

Problems raised:

SISR methods fall into two types: model-based (non-deep-learning) and deep learning-based. Each has its own problems.

Model-based methods: for convenience, bicubic degradation is used without considering the blur kernel and noise level [14, 56, 60]. However, bicubic degradation is mathematically complicated [25], which in turn hinders the development of model-based methods.

Deep learning-based methods: CNNs usually lack the flexibility to super-resolve blurry, noisy LR images at different scale factors with a single end-to-end trained model.

Having stated the problem, the next paragraph describes what this paper does about it.

In this paper, we propose a deep unfolding super-resolution network (USRNet) to bridge the gap between learning-based methods and model-based methods. On one hand, similar to model-based methods, USRNet can effectively handle the classical degradation model (i.e., Eq. (1)) with different blur kernels, scale factors and noise levels via a single model. On the other hand, similar to learning-based methods, USRNet can be trained in an end-to-end fashion to guarantee effectiveness and efficiency. To achieve this, we first unfold the model-based energy function via a half-quadratic splitting algorithm. Correspondingly, we can obtain an inference which iteratively alternates between solving two subproblems, one related to a data term and the other to a prior term. We then treat the inference as a deep network, by replacing the solutions to the two subproblems with neural modules. Since the two subproblems correspond respectively to enforcing degradation consistency knowledge and guaranteeing denoiser prior knowledge, USRNet is well-principled with explicit degradation and prior constraints, which is a distinctive advantage over existing learning-based SISR methods.

It is worth noting that since USRNet involves a hyper-parameter for each subproblem, the network contains an additional module for hyper-parameter generation. Moreover, in order to reduce the number of parameters, all the prior modules share the same architecture and same parameters.

This work:

The authors introduce their work in the following order:

1. First, the core idea of the paper is introduced: combining learning-based methods with model-based methods.

On the one hand, similar to model-based methods, USRNet can effectively handle the classical degradation model (i.e., Eq. (1)) with different blur kernels, scale factors and noise levels via a single model.

On the other hand, similar to learning-based methods, USRNet can be trained end to end to ensure effectiveness and efficiency.

2. How is the core idea realized?

First, the model-based energy function is unfolded with the half-quadratic splitting algorithm. This yields an inference procedure that iteratively alternates between two subproblems, one related to the data term and the other to the prior term.

Next, the inference is treated as a deep network, with neural modules replacing the solutions to the two subproblems.

3. Explanation of the proposed network:

Since the two subproblems correspond to enforcing degradation consistency and guaranteeing denoiser prior knowledge, respectively, USRNet is well-principled, with explicit degradation and prior constraints, which is a significant advantage over existing learning-based SISR methods.

It is worth noting that since USRNet involves a hyper-parameter for each subproblem, the network contains an additional module for generating hyper-parameters.

In addition, to reduce the number of parameters, all prior modules share the same architecture and parameters.

(Of course, having read only this far, it is hard to tell what these sentences really mean. The Method section should make them clearer.)

The main contributions of this work are as follows:

1) An end-to-end trainable unfolding super-resolution network (USRNet) is proposed. USRNet is the first attempt to handle the classical degradation model with different scale factors, blur kernels and noise levels via a single end-to-end trained model.

2) USRNet integrates the flexibility of model-based methods and the advantages of learning-based methods, providing an avenue to bridge the gap between model-based and learning-based methods.

3) USRNet intrinsically imposes a degradation constraint (i.e., the estimated HR image should accord with the degradation process) and a prior constraint (i.e., the estimated HR image should have natural characteristics) on the solution.

4) USRNet performs favorably on LR images with different degradation settings, showing great potential for practical applications.

The main contributions of this paper (direct translation):

1) USRNet, an end-to-end trainable unfolding super-resolution network, is proposed. It is the first single end-to-end trained model to handle the classical degradation model across different scale factors, blur kernels and noise levels.

2) USRNet combines the flexibility of model-based methods with the advantages of learning-based methods, offering a way to bridge the gap between the two.

3) USRNet inherently imposes a degradation constraint (the estimated HR image should conform to the degradation process) and a prior constraint (the estimated HR image should have natural characteristics) on the solution.

4) USRNet performs well on LR images with different degradation settings, showing great potential for practical use.

## Related work

### Degradation models

Knowledge of the degradation model is crucial for the success of SISR [16, 59] because it defines how the LR image is degraded from an HR image. Apart from the classical degradation model and bicubic degradation model, several others have also been proposed in the SISR literature.

In some early works, the degradation model assumes the LR image is directly downsampled from the HR image without blurring, which corresponds to the problem of image interpolation [8]. In [34, 52], the bicubicly downsampled image is further assumed to be corrupted by Gaussian noise or JPEG compression noise. In [15, 42], the degradation model focuses on Gaussian blurring and a subsequent downsampling with scale factor 3. Note that, different from Eq. (1), their downsampling keeps the center rather than upper-left pixel for each distinct 3×3 patch. In [67], the degradation model assumes the LR image is the blurred, bicubicly downsampled HR image with some Gaussian noise. By assuming the bicubicly downsampled clean HR image is also clean, [68] treats the degradation model as a composition of deblurring on the LR image and SISR with bicubic degradation.

While many degradation models have been proposed, CNN-based SISR for the classical degradation model has received little attention and deserves further study.

### Flexible SISR methods

Although CNN-based SISR methods have achieved impressive success to handle bicubic degradation, applying them to deal with other more practical degradation models is not straightforward. For the sake of practicability, it is preferable to design a flexible super-resolver that takes the three key factors, i.e., scale factor, blur kernel and noise level, into consideration.

Several methods have been proposed to tackle bicubic degradation with different scale factors via a single model, such as LapSR [30] with progressive upsampling, MDSR [36] with scales-specific branches, Meta-SR [23] with meta-upscale module. To flexibly deal with a blurry LR image, the methods proposed in [44, 67] take the PCA dimension reduced blur kernel as input. However, these methods are limited to Gaussian blur kernels. Perhaps the most flexible CNN-based works which can handle various blur kernels, scale factors and noise levels, are the deep plug-and-play methods [65, 68]. The main idea of such methods is to plug the learned CNN prior into the iterative solution under the MAP framework. Unfortunately, these are essentially model-based methods which suffer from a high computational burden and they involve manually selected hyper-parameters. How to design an end-to-end trainable model so that better results can be achieved with fewer iterations remains uninvestigated.

While learning-based blind image restoration has recently received considerable attention [12, 39, 43, 50, 62], we note that this work focuses on non-blind SISR which assumes the LR image, blur kernel and noise level are known beforehand. In fact, non-blind SISR is still an active research direction. First, the blur kernel and noise level can be estimated, or are known based on other information (e.g., camera setting). Second, users can control the preference of sharpness and smoothness by tuning the blur kernel and noise level. Third, non-blind SISR can be an intermediate step towards solving blind SISR.

### Deep unfolding image restoration

Apart from the deep plug-and-play methods (see, e.g., [7, 10, 22, 57]), deep unfolding methods can also integrate model-based methods and learning-based methods. Their main difference is that the latter optimize the parameters in an end-to-end manner by minimizing the loss function over a large training set, and thus generally produce better results even with fewer iterations. The early deep unfolding methods can be traced back to [4, 48, 54] where a compact MAP inference based on gradient descent algorithm is proposed for image denoising. Since then, a flurry of deep unfolding methods based on certain optimization algorithms (e.g., half-quadratic splitting [2], alternating direction method of multipliers [6] and primal-dual [1, 9]) have been proposed to solve different image restoration tasks, such as image denoising [11, 32], image deblurring [29, 49], image compressive sensing [61, 63], and image demosaicking [28].

Compared to plain learning-based methods, deep unfolding methods are interpretable and can fuse the degradation constraint into the learning model. However, most of them suffer from one or several of the following drawbacks. (i) The solution of the prior subproblem without using a deep CNN is not powerful enough for good performance. (ii) The data subproblem is not solved by a closed-form solution, which may hinder convergence. (iii) The whole inference is trained via a stage-wise and fine-tuning manner rather than a complete end-to-end manner. Furthermore, given that there exists no deep unfolding SISR method to handle the classical degradation model, it is of particular interest to propose such a method that overcomes the above mentioned drawbacks.

## Method

### Degradation model: classical vs. bicubic

Since bicubic degradation is well-studied, it is interesting to investigate its relationship to the classical degradation model. Actually, the bicubic degradation can be approximated by setting a proper blur kernel in Eq. (1). To achieve this, we adopt the data-driven method to solve the following kernel estimation problem by minimizing the reconstruction error over a large set of HR/bicubic-LR pairs $\{(\mathbf{x}, \mathbf{y})\}$

$$\mathbf{k}^{\times s}_{bicubic} = \arg\min_{\mathbf{k}} \|(\mathbf{x} \otimes \mathbf{k})\downarrow_s - \mathbf{y}\|. \tag{2}$$

Fig. 2 shows the approximated bicubic kernels for scale factors 2, 3 and 4. It should be noted that since the downsampling operation selects the upper-left pixel for each distinct $s \times s$ patch, the bicubic kernels for scale factors 2, 3 and 4 have a center shift of 0.5, 1 and 1.5 pixels to the upper-left direction, respectively.

Since bicubic degradation has been well studied, it is interesting to examine its relationship with the classical degradation model. In fact, bicubic degradation can be approximated by choosing an appropriate blur kernel in Eq. (1). To do this, a data-driven approach is used to solve the kernel estimation problem of Eq. (2), i.e., minimizing the reconstruction error over a large set of HR/bicubic-LR pairs.

Figure 2 shows the approximated bicubic kernels for scale factors 2, 3 and 4. Note that because the downsampling operation keeps the upper-left pixel of each distinct $s \times s$ patch, the bicubic kernels for scale factors 2, 3 and 4 are shifted by 0.5, 1 and 1.5 pixels toward the upper left, respectively.

Figure 2. Approximated bicubic kernels for scale factors 2, 3 and 4 under the classical SISR degradation model assumption. Note that these kernels contain negative values.
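The kernel estimation idea can be sketched as an ordinary linear least-squares problem, because the LR image is linear in the kernel entries. The code below is a sketch under several assumptions: the error is measured with the squared L2 norm, the convolution is circular, and the helper names `shifted_samples` and `estimate_kernel` are hypothetical. The paper does not say which solver the authors used.

```python
import numpy as np

def shifted_samples(x, ksize, s):
    """Design matrix: one column per kernel entry, holding the shifted and s-fold downsampled x."""
    c = ksize // 2
    cols = [np.roll(x, (a - c, b - c), axis=(0, 1))[::s, ::s].ravel()
            for a in range(ksize) for b in range(ksize)]
    return np.stack(cols, axis=1)

def estimate_kernel(hr_images, lr_images, ksize, s):
    """Least-squares kernel k so that blurring and downsampling each HR image matches its LR image."""
    A = np.concatenate([shifted_samples(x, ksize, s) for x in hr_images])
    b = np.concatenate([y.ravel() for y in lr_images])
    k, *_ = np.linalg.lstsq(A, b, rcond=None)
    return k.reshape(ksize, ksize)

# Toy check with synthetic pairs (real use would pair HR images with bicubic LR images).
rng = np.random.default_rng(0)
true_k = rng.random((5, 5))
true_k /= true_k.sum()
hr = [rng.random((20, 20)) for _ in range(3)]
lr = [(shifted_samples(x, 5, 2) @ true_k.ravel()).reshape(10, 10) for x in hr]
print(np.abs(estimate_kernel(hr, lr, 5, 2) - true_k).max())
```

In this toy setting the LR images are generated by the same linear operator, so the kernel is recovered up to numerical error. With true bicubic LR images the fit would only be approximate.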

### Unfolding optimization

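According to the MAP framework, the HR image could be estimated by minimizing the following energy function

$$E(\mathbf{x}) = \frac{1}{2\sigma^2}\|\mathbf{y} - (\mathbf{x} \otimes \mathbf{k})\downarrow_s\|^2 + \lambda\Phi(\mathbf{x}), \tag{3}$$

where $\frac{1}{2\sigma^2}\|\mathbf{y} - (\mathbf{x} \otimes \mathbf{k})\downarrow_s\|^2$ is the data term, $\Phi(\mathbf{x})$ is the prior term, and $\lambda$ is a trade-off parameter.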

The MAP framework is given.

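In order to obtain an unfolding inference for Eq. (3), the half-quadratic splitting (HQS) algorithm is selected due to its simplicity and fast convergence in many applications. HQS tackles Eq. (3) by introducing an auxiliary variable $\mathbf{z}$, leading to the following approximate equivalence

$$E_\mu(\mathbf{x}, \mathbf{z}) = \frac{1}{2\sigma^2}\|\mathbf{y} - (\mathbf{z} \otimes \mathbf{k})\downarrow_s\|^2 + \lambda\Phi(\mathbf{x}) + \frac{\mu}{2}\|\mathbf{z} - \mathbf{x}\|^2, \tag{4}$$

where $\mu$ is the penalty parameter.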

This gives the form in which HQS handles the MAP problem. (I am not familiar with HQS; roughly, it replaces the $\mathbf{x}$ in the data term with $\mathbf{z}$ and adds an L2 penalty between $\mathbf{x}$ and $\mathbf{z}$ so that $\mathbf{z}$ stays as close as possible to $\mathbf{x}$.)

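Such a problem can be addressed by iteratively solving subproblems for $\mathbf{x}$ and $\mathbf{z}$

$$\mathbf{z}_k = \arg\min_{\mathbf{z}} \|\mathbf{y} - (\mathbf{z} \otimes \mathbf{k})\downarrow_s\|^2 + \mu\sigma^2\|\mathbf{z} - \mathbf{x}_{k-1}\|^2, \tag{5}$$

$$\mathbf{x}_k = \arg\min_{\mathbf{x}} \frac{\mu}{2}\|\mathbf{z}_k - \mathbf{x}\|^2 + \lambda\Phi(\mathbf{x}). \tag{6}$$

According to Eq. (5), $\mu$ should be large enough so that $\mathbf{x}$ and $\mathbf{z}$ are approximately equal to the fixed point. However, this would also result in slow convergence. Therefore, a good rule of thumb is to iteratively increase $\mu$. For convenience, the $\mu$ in the $k$-th iteration is denoted by $\mu_k$.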
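It may help to see where the two subproblems come from. The following is a sketch, assuming the HQS objective has the usual form $E_\mu(\mathbf{x}, \mathbf{z}) = \frac{1}{2\sigma^2}\|\mathbf{y} - (\mathbf{z} \otimes \mathbf{k})\downarrow_s\|^2 + \lambda\Phi(\mathbf{x}) + \frac{\mu}{2}\|\mathbf{z} - \mathbf{x}\|^2$. Holding $\mathbf{x} = \mathbf{x}_{k-1}$ fixed, the term $\lambda\Phi(\mathbf{x})$ is constant in $\mathbf{z}$, and multiplying by the positive constant $2\sigma^2$ does not change the minimizer:

$$
\begin{aligned}
\mathbf{z}_k &= \arg\min_{\mathbf{z}} \frac{1}{2\sigma^2}\|\mathbf{y} - (\mathbf{z} \otimes \mathbf{k})\downarrow_s\|^2 + \frac{\mu}{2}\|\mathbf{z} - \mathbf{x}_{k-1}\|^2 \\
&= \arg\min_{\mathbf{z}} \|\mathbf{y} - (\mathbf{z} \otimes \mathbf{k})\downarrow_s\|^2 + \mu\sigma^2\|\mathbf{z} - \mathbf{x}_{k-1}\|^2 .
\end{aligned}
$$

Holding $\mathbf{z} = \mathbf{z}_k$ fixed instead, the data term is constant in $\mathbf{x}$, which leaves

$$
\mathbf{x}_k = \arg\min_{\mathbf{x}} \frac{\mu}{2}\|\mathbf{z}_k - \mathbf{x}\|^2 + \lambda\Phi(\mathbf{x}) .
$$

The first problem involves only the degradation model and the second involves only the prior, which is why the two can be handled by separate modules.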

This gives the method for solving Eq. (4): split it into the two subproblems (5) and (6). Eq. (5) is used to solve for $\mathbf{z}$, and Eq. (6) is used to solve for $\mathbf{x}$.

Here, $\mu$ should be large enough that $\mathbf{x}$ and $\mathbf{z}$ are approximately equal at the fixed point. However, this also leads to slow convergence. A good rule of thumb is therefore to increase $\mu$ gradually over the iterations. For convenience, $\mu$ in the $k$-th iteration is written as $\mu_k$.

Eq. (5) corresponds to the data term mentioned earlier.

Eq. (6) corresponds to the prior term mentioned earlier.

It can be observed that the data term and the prior term are decoupled into Eq. (5) and Eq. (6), respectively.

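For the solution of Eq. (5), the fast Fourier transform (FFT) can be utilized by assuming the convolution is carried out with circular boundary conditions. Notably, it has a closed-form expression [71]

$$\mathbf{z}_k = \mathcal{F}^{-1}\left(\frac{1}{\alpha_k}\left(\mathbf{d} - \overline{\mathcal{F}(\mathbf{k})} \odot_s \frac{(\mathcal{F}(\mathbf{k})\mathbf{d})\Downarrow_s}{(\overline{\mathcal{F}(\mathbf{k})}\mathcal{F}(\mathbf{k}))\Downarrow_s + \alpha_k}\right)\right), \tag{7}$$

where $\mathbf{d}$ is defined as

$$\mathbf{d} = \overline{\mathcal{F}(\mathbf{k})}\mathcal{F}(\mathbf{y}\uparrow_s) + \alpha_k\mathcal{F}(\mathbf{x}_{k-1})$$

with $\alpha_k \triangleq \mu_k\sigma^2$ and where $\mathcal{F}(\cdot)$ and $\mathcal{F}^{-1}(\cdot)$ denote FFT and inverse FFT, $\overline{\mathcal{F}(\cdot)}$ denotes the complex conjugate of $\mathcal{F}(\cdot)$, $\odot_s$ denotes the distinct block processing operator with element-wise multiplication, i.e., applying element-wise multiplication to the $s \times s$ distinct blocks of $\overline{\mathcal{F}(\mathbf{k})}$, $\Downarrow_s$ denotes the distinct block downsampler, i.e., averaging the $s \times s$ distinct blocks, and $\uparrow_s$ denotes the standard $s$-fold upsampler, i.e., upsampling the spatial size by filling the new entries with zeros.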

It is especially noteworthy that Eq. (7) also works for the special case of deblurring when $s = 1$.
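The deblurring case can be written out explicitly. This is a sketch that assumes the closed form reads $\mathbf{z}_k = \mathcal{F}^{-1}\big(\tfrac{1}{\alpha_k}\big(\mathbf{d} - \overline{\mathcal{F}(\mathbf{k})} \odot_s \tfrac{(\mathcal{F}(\mathbf{k})\mathbf{d})\Downarrow_s}{(\overline{\mathcal{F}(\mathbf{k})}\mathcal{F}(\mathbf{k}))\Downarrow_s + \alpha_k}\big)\big)$ with $\mathbf{d} = \overline{\mathcal{F}(\mathbf{k})}\mathcal{F}(\mathbf{y}\uparrow_s) + \alpha_k\mathcal{F}(\mathbf{x}_{k-1})$. For $s = 1$ the block operators $\odot_s$, $\Downarrow_s$ and $\uparrow_s$ reduce to identities, so

$$
\mathbf{z}_k
= \mathcal{F}^{-1}\left(\frac{1}{\alpha_k}\left(\mathbf{d} - \frac{|\mathcal{F}(\mathbf{k})|^2\,\mathbf{d}}{|\mathcal{F}(\mathbf{k})|^2 + \alpha_k}\right)\right)
= \mathcal{F}^{-1}\left(\frac{\overline{\mathcal{F}(\mathbf{k})}\mathcal{F}(\mathbf{y}) + \alpha_k\mathcal{F}(\mathbf{x}_{k-1})}{|\mathcal{F}(\mathbf{k})|^2 + \alpha_k}\right),
$$

which has the form of a regularized (Wiener-like) inverse filter for non-blind deblurring.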

For the solution of Eq. (6), it is known that, from a Bayesian perspective, it actually corresponds to a denoising problem with noise level $\beta_k \triangleq \sqrt{\lambda/\mu_k}$ [10].
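The denoising interpretation can be made explicit with one line of algebra. This is a sketch, assuming the prior subproblem has the form $\mathbf{x}_k = \arg\min_{\mathbf{x}} \frac{\mu_k}{2}\|\mathbf{z}_k - \mathbf{x}\|^2 + \lambda\Phi(\mathbf{x})$. Dividing the objective by $\lambda > 0$ does not change the minimizer:

$$
\mathbf{x}_k = \arg\min_{\mathbf{x}} \frac{1}{2\left(\sqrt{\lambda/\mu_k}\right)^2}\|\mathbf{z}_k - \mathbf{x}\|^2 + \Phi(\mathbf{x})
= \arg\min_{\mathbf{x}} \frac{1}{2\beta_k^2}\|\mathbf{z}_k - \mathbf{x}\|^2 + \Phi(\mathbf{x}), \qquad \beta_k = \sqrt{\lambda/\mu_k} .
$$

This is the MAP estimate of a clean image $\mathbf{x}$ from an observation $\mathbf{z}_k = \mathbf{x} + \mathbf{e}$, where $\mathbf{e}$ is Gaussian noise with standard deviation $\beta_k$ and the image prior is $p(\mathbf{x}) \propto \exp(-\Phi(\mathbf{x}))$. In other words, solving the prior subproblem amounts to denoising $\mathbf{z}_k$ at noise level $\beta_k$.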

This gives the solutions of Eq. (5) and Eq. (6). (I do not fully understand the solution of Eq. (6); reference [10] may help.)

Eq. (5) is solved with the fast Fourier transform; see reference [71] for details.

It is worth noting that when $s = 1$, Eq. (7) also applies to deblurring.

### Deep unfolding network

Once the unfolding optimization is determined, the next step is to design the unfolding super-resolution network (USRNet). Because the unfolding optimization mainly consists of iteratively solving a data subproblem (i.e., Eq. (5)) and a prior subproblem (i.e., Eq. (6)), USRNet should alternate between a data module D and a prior module P. In addition, as the solutions of the subproblems also take the hyper-parameters $\alpha_k$ and $\beta_k$ as input, respectively, a hyper-parameter module H is further introduced into USRNet. Fig. 3 illustrates the overall architecture of USRNet with $K$ iterations, where $K$ is empirically set to 8 for the speed-accuracy trade-off. Next, more details on D, P and H are provided.

Once the unfolding optimization is determined, the next step is to design the unfolding super-resolution network (USRNet). Because the unfolding optimization mainly consists of iteratively solving a data subproblem (Eq. (5)) and a prior subproblem (Eq. (6)), USRNet should alternate between a data module D and a prior module P. In addition, since the solutions of the subproblems take the hyper-parameters $\alpha_k$ and $\beta_k$ as input, a hyper-parameter module H is also introduced into USRNet. Figure 3 shows the overall architecture of USRNet with $K$ iterations, where $K$ is empirically set to 8 to balance speed and accuracy. Next, more details on D, P and H are given.

Data module D

The data module plays the role of Eq. (7) which is the closed-form solution of the data subproblem. Intuitively, it aims to find a clearer HR image which minimizes a weighted combination of the data term $\|\mathbf{y} - (\mathbf{z} \otimes \mathbf{k})\downarrow_s\|^2$ and the quadratic regularization term $\|\mathbf{z} - \mathbf{x}_{k-1}\|^2$ with trade-off hyper-parameter $\alpha_k$. Because the data term corresponds to the degradation model, the data module thus not only has the advantage of taking the scale factor $s$ and blur kernel $\mathbf{k}$ as input but also imposes a degradation constraint on the solution. Actually, it is difficult to manually design such a simple but useful multiple-input module. For brevity, Eq. (7) is rewritten as

$$\mathbf{z}_k = \mathcal{D}(\mathbf{x}_{k-1}, s, \mathbf{k}, \mathbf{y}, \alpha_k). \tag{8}$$

Note that $\mathbf{x}_0$ is initialized by interpolating $\mathbf{y}$ with scale factor $s$ via the simplest nearest neighbor interpolation. It should be noted that Eq. (8) contains no trainable parameters, which in turn results in better generalizability due to the complete decoupling between data term and prior term. For the implementation, we use PyTorch where the main FFT and inverse FFT operators can be implemented by `torch.rfft` and `torch.irfft`, respectively.

The data module, Eq. (7), aims to find a clearer HR image that minimizes a weighted combination of the data term and the quadratic regularization term.

Because the data term corresponds to the degradation model, the data module not only takes the scale factor and the blur kernel as input, but also imposes a degradation constraint on the solution. In fact, it is difficult to design such a simple but useful multi-input module by hand.

For brevity, Eq. (7) is rewritten as Eq. (8). The data module contains no trainable parameters (that is, it involves no deep learning) and generalizes better because the data term is completely decoupled from the prior term.

In the implementation, the FFT and inverse FFT can be computed with the `torch.rfft` and `torch.irfft` functions.
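The closed-form data step can be sketched in a few lines. The version below uses NumPy instead of PyTorch to stay dependency-light, handles a single-channel image, and assumes circular convolution. The names `kernel_to_otf`, `block_mean` and `data_step` are hypothetical and this is not the authors' implementation.

```python
import numpy as np

def kernel_to_otf(kernel, shape):
    """Zero-pad the blur kernel to `shape`, center it at (0, 0), and take its 2-D FFT."""
    kh, kw = kernel.shape
    padded = np.zeros(shape)
    padded[:kh, :kw] = kernel
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.fft.fft2(padded)

def block_mean(a, s):
    """Distinct block downsampler: average the s x s blocks, each of size (H/s, W/s)."""
    h, w = a.shape[0] // s, a.shape[1] // s
    return a.reshape(s, h, s, w).mean(axis=(0, 2))

def data_step(x_prev, y, kernel, s, alpha):
    """Closed-form minimizer of ||y - (z conv k) down_s||^2 + alpha * ||z - x_prev||^2."""
    Fk = kernel_to_otf(kernel, x_prev.shape)
    Fk_conj = np.conj(Fk)
    y_up = np.zeros_like(x_prev)
    y_up[::s, ::s] = y                       # s-fold upsampler: fill new entries with zeros
    d = Fk_conj * np.fft.fft2(y_up) + alpha * np.fft.fft2(x_prev)
    ratio = block_mean(Fk * d, s) / (block_mean(np.abs(Fk) ** 2, s) + alpha)
    # Block-wise multiplication: the low-resolution ratio is tiled over the s x s blocks.
    correction = Fk_conj * np.tile(ratio, (s, s))
    return np.real(np.fft.ifft2((d - correction) / alpha))

# Example: one data step for scale factor 2, starting from a nearest-neighbor upsampled y.
rng = np.random.default_rng(0)
y = rng.random((8, 8))
x0 = np.kron(y, np.ones((2, 2)))
z1 = data_step(x0, y, np.ones((3, 3)) / 9.0, s=2, alpha=0.1)
print(z1.shape)
```

The function has no trainable parameters, matching the description of the data module. Note that `torch.rfft` and `torch.irfft` were removed in PyTorch 1.8; in current releases the corresponding transforms live in the `torch.fft` module (for example `torch.fft.fft2` and `torch.fft.ifft2`).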

Figure 3. The overall architecture of the proposed USRNet with $K$ iterations. USRNet can flexibly handle the classical degradation (i.e., Eq. (1)) via a single model as it takes the LR image $\mathbf{y}$, scale factor $s$, blur kernel $\mathbf{k}$ and noise level $\sigma$ as input. Specifically, USRNet consists of three main modules, including the data module D that makes HR estimation clearer, the prior module P that makes HR estimation cleaner, and the hyper-parameter module H that controls the outputs of D and P.

Prior module P

The prior module aims to obtain a cleaner HR image $\mathbf{x}_k$ by passing $\mathbf{z}_k$ through a denoiser with noise level $\beta_k$. Inspired by [66], we propose a deep CNN denoiser that takes the noise level as input

$$\mathbf{x}_k = \mathcal{P}(\mathbf{z}_k, \beta_k). \tag{9}$$

The proposed denoiser, namely ResUNet, integrates residual blocks [21] into U-Net [45]. U-Net is widely used for image-to-image mapping, while ResNet owes its popularity to fast training and its large capacity with many residual blocks. ResUNet takes the concatenated $\mathbf{z}_k$ and noise level map as input and outputs the denoised image $\mathbf{x}_k$. By doing so, ResUNet can handle various noise levels via a single model, which significantly reduces the total number of parameters. Following the common setting of U-Net, ResUNet involves four scales, each of which has an identity skip connection between downscaling and upscaling operations. Specifically, the number of channels in each layer from the first scale to the fourth scale are set to 64, 128, 256 and 512, respectively. For the downscaling and upscaling operations, strided convolution (SConv) and transposed convolution (TConv) are adopted, respectively. Note that no activation function follows the SConv and TConv layers, or the first and the last convolutional layers. For the sake of inheriting the merits of ResNet, a group of 2 residual blocks are adopted in the downscaling and upscaling of each scale. As suggested in [36], each residual block is composed of two convolution layers with ReLU activation in the middle and an identity skip connection summed to its output.

The prior module aims to obtain a cleaner HR image by passing $\mathbf{z}_k$ through a denoiser with noise level $\beta_k$. Inspired by [66], a deep CNN denoiser that takes the noise level as input is proposed.

The prior module is a ResUNet. Details:

1) The U-Net has 4 scales;

2) Downscaling at each scale uses strided convolution (SConv), with no activation function;

3) Upscaling at each scale uses transposed convolution (TConv), with no activation function;

4) Each downscaling and upscaling stage contains two residual blocks, structured as in [2017 CVPRW] Enhanced Deep Residual Networks for Single Image Super-Resolution.

Hyper-parameter module H

The hyper-parameter module acts as a 'slide bar' to control the outputs of the data module and prior module. For example, the solution $\mathbf{z}_k$ would gradually approach $\mathbf{x}_{k-1}$ as $\alpha_k$ increases. According to the definition of $\alpha_k$ and $\beta_k$, $\alpha_k$ is determined by $\sigma$ and $\mu_k$, while $\beta_k$ depends on $\lambda$ and $\mu_k$. Although it is possible to learn a fixed $\lambda$ and $\mu_k$, we argue that a performance gain can be obtained if $\lambda$ and $\mu_k$ vary with two key elements, i.e., scale factor $s$ and noise level $\sigma$, that influence the degree of ill-posedness. Let $\boldsymbol{\alpha} = [\alpha_1, \alpha_2, \ldots, \alpha_K]$ and $\boldsymbol{\beta} = [\beta_1, \beta_2, \ldots, \beta_K]$, we use a single module to predict $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$

$$[\boldsymbol{\alpha}, \boldsymbol{\beta}] = \mathcal{H}(\sigma, s). \tag{10}$$

The hyper-parameter module consists of three fully connected layers with ReLU as the first two activation functions and Softplus [19] as the last. The number of hidden nodes in each layer is 64. Considering the fact that $\alpha_k$ and $\beta_k$ should be positive, and Eq. (7) should avoid division by extremely small $\alpha_k$, the output Softplus layer is followed by an extra addition of 1e-6. We will show how the scale factor and noise level affect the hyper-parameters in Sec. 4.4.

The hyper-parameter module consists of three fully connected layers, with ReLU as the first two activation functions and Softplus [19] as the last. The number of hidden nodes per layer is 64. Since $\alpha_k$ and $\beta_k$ should be positive, and Eq. (7) should avoid division by an extremely small $\alpha_k$, an extra 1e-6 is added to the output of the Softplus layer. Section 4.4 shows how the scale factor and noise level affect the hyper-parameters.
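The structure of the hyper-parameter module can be sketched as a small multilayer perceptron. The code below is an untrained NumPy illustration, not the authors' implementation. The input ordering `[sigma, s]`, the output size of `2 * K` with `K = 8`, the random initialization, and the names `init_hypernet` and `hypernet_forward` are all assumptions.

```python
import numpy as np

def softplus(v):
    # Numerically stable log(1 + exp(v)).
    return np.logaddexp(0.0, v)

def init_hypernet(num_iters=8, hidden=64, seed=0):
    """Random (untrained) weights for a 2 -> hidden -> hidden -> 2*num_iters network."""
    rng = np.random.default_rng(seed)
    sizes = [2, hidden, hidden, 2 * num_iters]
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def hypernet_forward(params, sigma, s):
    """Map (noise level, scale factor) to positive alpha_1..K and beta_1..K."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h = np.array([sigma, float(s)])
    h = np.maximum(h @ w1 + b1, 0.0)        # first layer, ReLU
    h = np.maximum(h @ w2 + b2, 0.0)        # second layer, ReLU
    out = softplus(h @ w3 + b3) + 1e-6      # Softplus, then the 1e-6 offset
    alpha, beta = np.split(out, 2)
    return alpha, beta

alpha, beta = hypernet_forward(init_hypernet(), sigma=2.55 / 255, s=4)
print(alpha.shape, beta.shape, bool(alpha.min() > 0))
```

Because Softplus is strictly positive and 1e-6 is added afterwards, every predicted value is at least 1e-6, so the division by $\alpha_k$ in the data step stays well defined.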

Of the three modules, D is model-based (no deep learning), while P and H are deep learning models. This is what the Introduction means when it says that the method in this paper combines the two approaches.

### End-to-end training

The end-to-end training aims to learn the trainable parameters of USRNet by minimizing a loss function over a large training data set. Thus, this section mainly describes the training data, loss function and training settings.

Following [58], we use DIV2K [3] and Flickr2K [55] as the HR training dataset. The LR images are synthesized via Eq. (1). Although USRNet focuses on SISR, it is also applicable to the case of deblurring with $s = 1$. Hence, the scale factors are chosen from {1, 2, 3, 4}. However, due to limited space, this paper does not consider the deblurring experiments. For the blur kernels, we use anisotropic Gaussian kernels as in [44, 51, 67] and motion kernels as in [5]. We fix the kernel size to 25×25. For the noise level, we set its range to [0, 25].

With regard to the loss function, we adopt the L1 loss for PSNR performance. Following [58], once the model is obtained, we further adopt a weighted combination of L1 loss, VGG perceptual loss and relativistic adversarial loss [24] with weights 1, 1 and 0.005 for perceptual quality performance. We refer to such fine-tuned model as USRGAN. As usual, USRGAN only considers scale factor 4. We do not use additional losses to constrain the intermediate outputs since the above losses work well. One possible reason is that the prior module shares parameters across iterations.

To optimize the parameters of USRNet, we adopt the Adam solver [27] with mini-batch size 128. The learning rate decays by a factor of 0.5 at fixed iteration intervals until it reaches its final value. It is worth pointing out that due to the infeasibility of parallel computing for different scale factors, each mini-batch only involves one random scale factor. For USRGAN, the learning rate is fixed. The HR patch size is the same for both USRNet and USRGAN. We train the models with PyTorch on 4 Nvidia Tesla V100 GPUs in Amazon AWS cloud. It takes about two days to obtain the USRNet model.
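A step-decay schedule of the kind described for USRNet can be sketched as follows. Only the decay factor of 0.5 comes from the text; the initial rate, the step length and the floor used as defaults here are illustrative assumptions, and clamping at a minimum (rather than stopping training) is also an assumption. The function name `step_decay_lr` is hypothetical.

```python
def step_decay_lr(iteration, base_lr=1e-4, decay=0.5, step=40_000, min_lr=3e-6):
    """Learning rate after `iteration` updates: multiplied by `decay` every `step`
    iterations and never allowed to fall below `min_lr` (all defaults are assumed)."""
    return max(base_lr * decay ** (iteration // step), min_lr)

# Print the rate at the start of the first few decay intervals.
for it in (0, 40_000, 80_000, 200_000, 400_000):
    print(it, step_decay_lr(it))
```

In a PyTorch training loop the same behavior is usually obtained with a built-in scheduler rather than a hand-written function.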

This part describes the training settings.

USRNet: Only L1 Loss is used.

USRGAN: Use L1 Loss, VGG Loss, relativistic adversarial Loss.

They also have different learning rate settings.

## Experiments

We choose the widely-used color BSD68 dataset [40, 46] to quantitatively evaluate different methods. The dataset consists of 68 images with tiny structures and fine textures and thus is challenging to improve the quantitative metrics, such as PSNR. For the sake of synthesizing the corresponding testing LR images via Eq. (1), blur kernels and noise levels should be provided. Generally, it would be helpful to employ a large variety of blur kernels and noise levels for a thorough evaluation; however, it would also give rise to a burdensome evaluation process. For this reason, as shown in Table 1, we only consider 12 representative and diverse blur kernels, including 4 isotropic Gaussian kernels with different widths (i.e., 0.7, 1.2, 1.6 and 2.0), 4 anisotropic Gaussian kernels from [67], and 4 motion blur kernels from [5, 33]. While it has been pointed out that anisotropic Gaussian kernels are enough for SISR task [44, 51], the SISR method that can handle more complex blur kernels would be a preferred choice in real applications. Therefore, it is necessary to further analyze the kernel robustness of different methods; we will thus separately report the PSNR results for each blur kernel rather than for each type of blur kernels. Although it has been pointed out that the proper blur kernel should vary with scale factor [64], we argue that the 12 blur kernels are diverse enough to cover a large kernel space. For the noise levels, we choose 2.55 (1%) and 7.65 (3%).

This paragraph describes the experimental settings: the test dataset, the blur kernels, and the noise levels.
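As a rough illustration of these settings, the sketch below builds the four isotropic Gaussian kernels and converts the noise levels from percent to the [0, 255] range. It assumes that "width" means the standard deviation in pixels and that the kernels are 25×25; both are assumptions made for the example, and `isotropic_gaussian_kernel` and `noise_sigma` are illustrative names.

```python
import numpy as np


def isotropic_gaussian_kernel(width, size=25):
    """Return a size x size isotropic Gaussian kernel that sums to 1.

    `width` is treated as the standard deviation in pixels (an assumption).
    """
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2.0 * width**2))
    return kernel / kernel.sum()


def noise_sigma(percent):
    """Convert a noise level given in percent of the 8-bit range to [0, 255] units."""
    return 255.0 * percent / 100.0


kernels = {w: isotropic_gaussian_kernel(w) for w in (0.7, 1.2, 1.6, 2.0)}
sigmas = [noise_sigma(p) for p in (1, 3)]
print(sigmas)
print({w: float(k.max()) for w, k in kernels.items()})
```

A wider kernel spreads its mass over more pixels, so its peak value is lower and the resulting LR image is blurrier.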

### PSNR results

The compared methods include RCAN [70], ZSSR [51], IKC [20] and IRCNN [65]. Specifically, RCAN is a state-of-the-art PSNR-oriented method for bicubic degradation; ZSSR is a non-blind zero-shot learning method with the ability to handle Eq. (1) for anisotropic Gaussian kernels; IKC is a blind iterative kernel correction method for isotropic Gaussian kernels; IRCNN is a non-blind deep denoiser based plug-and-play method.

Table 1. Average PSNR(dB) results of different methods for different combinations of scale factors, blur kernels and noise levels. The best two results are highlighted in red and blue colors, respectively.

Several representative SISR methods are introduced and compared in terms of PSNR.
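For reference, PSNR can be computed as in the minimal sketch below. It assumes 8-bit images with a peak value of 255 and makes no attempt to reproduce the paper's exact evaluation protocol (for example, any border cropping or channel selection).

```python
import numpy as np


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak**2 / mse))


ref = np.full((4, 4), 100.0)
est = ref + 1.0  # every pixel is off by one gray level, so MSE = 1
print(round(psnr(ref, est), 2))  # 48.13
```

Since PSNR is logarithmic in the mean squared error, halving the MSE raises it by about 3 dB, which gives a sense of scale for the differences reported in Table 1.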

Although USRNet is not designed for bicubic degradation, it is interesting to test its results by taking the approximated bicubic kernels in Fig. 2 as input. From Table 2, one can see that USRNet still performs favorably without training on the bicubic kernels.

Table 2. The average PSNR(dB) results of USRNet for bicubic degradation on commonly-used testing datasets.

Although USRNet is not designed for bicubic degradation, it is interesting to test its results when the approximated bicubic kernels in Figure 2 are used as input. As can be seen from Table 2, USRNet still performs well even though it was not trained on the bicubic kernels.

### Visual results

Figure 4. Visual results of different methods on super-resolving noise-free LR image with scale factor 4. The blur kernel is shown on the upper-right corner of the LR image. Note that RankSRGAN and our USRGAN aim for perceptual quality rather than PSNR value.

### Analysis on D and P

Because the proposed USRNet is an iterative method, it is interesting to investigate the HR estimations of data module D and prior module P in different iterations. Fig. 5 shows the results of USRNet and USRGAN in different iterations for an LR image with scale factor 4. As one can see, D and P can facilitate each other for iterative and alternating blur removal and detail recovery. Interestingly, P can also act as a detail enhancer for high-frequency recovery due to the task-specific training. In addition, P does not reduce blur kernel induced degradation, which verifies the decoupling between D and P. As a result, the end-to-end trained USRNet has a task-specific advantage over Gaussian denoiser based plug-and-play SISR. To quantitatively analyze the role of D, we have trained a USRNet model with 5 iterations; it turns out that the average PSNR value decreases by about 0.1 dB on Gaussian blur kernels and 0.3 dB on motion blur kernels. This further indicates that D aims to eliminate blur kernel induced degradation. In addition, one can see that USRGAN has similar results to USRNet in the first few iterations, but will instead recover tiny structures and fine textures in the last few iterations.

Figure 5. HR estimations in different iterations of USRNet (top row) and USRGAN (bottom row). The initial HR estimation is the nearest neighbor interpolated version of LR image. The scale factor is 4, the noise level of LR image is 2.55 (1%), the blur kernel is shown on the upper-right corner of .

Since the proposed USRNet is an iterative method, it is worth examining the HR estimates produced by data module D and prior module P at different iterations.

Figure 5 shows the results of USRNet and USRGAN at different iterations for an LR image with scale factor 4.

D and P facilitate each other through iterative, alternating blur removal and detail recovery. Interestingly, P can also act as a detail enhancer for high-frequency recovery because of its task-specific training. Furthermore, P does not reduce the degradation caused by the blur kernel, which verifies the decoupling between D and P. Therefore, the end-to-end trained USRNet has a task-specific advantage over plug-and-play SISR based on a Gaussian denoiser.

To quantitatively analyze the role of D, a USRNet model with 5 iterations is trained. The average PSNR decreases by about 0.1 dB on Gaussian blur kernels and by about 0.3 dB on motion blur kernels.

This further indicates that the goal of D is to eliminate the degradation caused by the blur kernel. In addition, USRGAN gives results similar to USRNet in the first few iterations, but recovers tiny structures and fine textures in the last few iterations.
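To make the role of D more concrete, here is a minimal NumPy sketch of a closed-form data step for the special case of scale factor 1, where the step reduces to a Wiener-like filter in the Fourier domain. It is a simplification: the data module in the paper also handles downsampling, which is omitted here. The function names, the box blur, and the value of `alpha` are assumptions chosen for illustration.

```python
import numpy as np


def kernel_to_otf(kernel, shape):
    """Zero-pad a kernel to `shape`, center it at the origin, and take its FFT."""
    kh, kw = kernel.shape
    padded = np.zeros(shape)
    padded[:kh, :kw] = kernel
    # Shift so the kernel center sits at index (0, 0) under circular indexing.
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.fft.fft2(padded)


def data_step_scale1(x_prev, y, kernel, alpha):
    """Minimize ||y - k (*) z||^2 + alpha * ||z - x_prev||^2 for scale factor 1.

    (*) denotes circular convolution, so the problem is diagonal in the
    Fourier domain and has a closed-form solution.
    """
    otf = kernel_to_otf(kernel, y.shape)
    numerator = np.conj(otf) * np.fft.fft2(y) + alpha * np.fft.fft2(x_prev)
    denominator = np.abs(otf) ** 2 + alpha
    return np.real(np.fft.ifft2(numerator / denominator))


rng = np.random.default_rng(0)
sharp = rng.random((32, 32))
kernel = np.ones((5, 5)) / 25.0  # hypothetical box blur
blurred = np.real(
    np.fft.ifft2(kernel_to_otf(kernel, sharp.shape) * np.fft.fft2(sharp))
)

# Start from the blurred image itself and apply one data step.
z = data_step_scale1(blurred, blurred, kernel, alpha=1e-3)
print(np.linalg.norm(blurred - sharp), np.linalg.norm(z - sharp))
```

The step contains no trainable parameters: it only needs the blur kernel, the LR image, the previous estimate, and the trade-off weight. This is consistent with the observation that D is responsible for removing blur while P restores detail.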

### Analysis on H

Fig. 6 shows outputs of the hyper-parameter module for different combinations of scale factor s and noise level σ. It can be observed from Fig. 6(a) that α is positively correlated with σ and varies with s. This actually accords with the definition of α_k in Sec. 3.2 and our analysis in Sec. 3.3. From Fig. 6(b), one can see that β has a decreasing tendency with the number of iterations and increases with scale factor and noise level. This implies that the noise level of HR estimation is gradually reduced across iterations and complex degradation requires a large β_k to tackle the ill-posedness. It should be pointed out that the learned hyper-parameter setting is in accordance with that of IRCNN [65]. In summary, the learned H is meaningful as it plays the proper role.

Figure 6. Outputs of the hyper-parameter module H, i.e., (a) α and (b) β, with respect to different combinations of s and σ.

Figure 6 shows the outputs of the hyper-parameter module H for different combinations of scale factor s and noise level σ. It can be observed from Figure 6(a) that α is positively correlated with σ and varies with s. This agrees with the definition of α_k in Section 3.2.

From Figure 6(b), one can see that β decreases as the number of iterations grows and increases with the scale factor and the noise level. In other words, the noise level of the HR estimate is gradually reduced across iterations, and more complex degradation requires a larger β_k.

To sum up, the learned H is meaningful because it plays its intended role.
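The trends in Figure 6 can be related to the hyper-parameter definitions. Assuming the definitions from Sec. 3.2 of the paper, where μ_k is the penalty parameter of the half-quadratic splitting at iteration k and λ is the weight of the prior term (these symbols are not restated in this section, so treat them as assumptions):

$$
\alpha_k \triangleq \mu_k \sigma^2, \qquad \beta_k \triangleq \sqrt{\lambda / \mu_k}
$$

Under these definitions, α_k grows with σ for a fixed μ_k, which matches the positive correlation in Figure 6(a). A β_k that shrinks over iterations corresponds to an increasing μ_k, that is, the prior module is asked to remove less and less noise as the estimate improves.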

### Generalizability

As mentioned earlier, the proposed method enjoys good generalizability due to the decoupling of data term and prior term. To demonstrate such an advantage, Fig. 7 shows the visual results of USRNet and USRGAN on LR image with a kernel of much larger size than training size of 25×25. It can be seen that both USRNet and USRGAN can produce visually pleasant results, which can be attributed to the trainable parameter-free data module. It is worth pointing out that USRGAN is trained on scale factor 4, while Fig. 7(b) shows its visual result on scale factor 3. This further indicates that the prior module of USRGAN can generalize to other scale factors. In summary, the proposed deep unfolding architecture has superiority in generalizability.

Figure 7. An illustration to show the generalizability of USRNet and USRGAN. The sizes of the kernels in (a) and (c) are 67×67 and 70×70, respectively. The two kernels are chosen from [41].

As mentioned earlier, this method generalizes well because the data term and the prior term are decoupled. To illustrate this advantage, Figure 7 shows the visual results of USRNet and USRGAN on LR images whose kernels are much larger than the 25×25 training size. Both USRNet and USRGAN produce visually pleasant results, which can be attributed to the data module having no trainable parameters. It is worth noting that USRGAN is trained on scale factor 4, while Figure 7(b) shows its visual result on scale factor 3. This further indicates that the prior module of USRGAN can generalize to other scale factors. In summary, the proposed deep unfolding architecture has superior generalizability.

### Real image super-resolution

Because Eq. (7) is based on the assumption of circular boundary condition, a proper boundary handling for the real LR image is generally required. We use the following three steps to do such pre-processing. First, the LR image is interpolated to the desired size. Second, the boundary handling method proposed in [38] is adopted on the interpolated image with the blur kernel. Last, the downsampled boundaries are padded to the original LR image. Fig. 8 shows the visual result of USRNet on real LR image with scale factor 4. The blur kernel is manually selected as isotropic Gaussian kernel with width 2.2 based on user preference. One can see from Fig. 8 that the proposed USRNet can reconstruct the HR image with improved visual quality.

Figure 8. Visual result of USRNet (×4) on a real LR image.

Since Eq. (7) is based on the assumption of circular boundary conditions, real LR images generally need proper boundary handling. The following three steps are used for this pre-processing. First, the LR image is interpolated to the desired size. Second, the boundary handling method proposed in [38] is applied to the interpolated image with the blur kernel. Finally, the downsampled boundaries are padded onto the original LR image. Figure 8 shows the visual result of USRNet on a real LR image with a scale factor of 4. The blur kernel is manually selected, based on user preference, as an isotropic Gaussian kernel with a width of 2.2. From Figure 8, it can be seen that the proposed USRNet can reconstruct an HR image with improved visual quality.
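The need for boundary handling follows from how FFT-based filtering treats an image: multiplication in the Fourier domain corresponds to circular convolution, so the image is implicitly periodic. The toy NumPy example below (an illustration only, not the boundary handling method of [38]) blurs a single bright pixel on the left border and shows that part of its energy wraps around to the right border.

```python
import numpy as np

image = np.zeros((8, 8))
image[4, 0] = 1.0  # a single bright pixel on the left border

kernel = np.zeros((8, 8))
kernel[0, [-1, 0, 1]] = 1.0 / 3.0  # horizontal 3-tap blur centered at the origin

# Multiplying FFTs implements circular convolution.
blurred = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(kernel)))

print(np.round(blurred[4], 3))
```

In row 4, the nonzero values appear in columns 0, 1, and 7: the last one is the wrap-around. On a real photograph, where opposite borders are unrelated, this mismatch produces artifacts near the borders unless the image is pre-processed.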

This section is included because the abstract and the introduction of the paper claim that the method works on real images; it serves to verify that claim.