Convolutional neural networks (CNNs) are becoming mainstream in computer vision. In particular, CNNs are widely used for high-level vision tasks like image classification (AlexNet*, for example). This article (and the associated tutorial) describes an example of a CNN for image super-resolution (SR), a low-level vision task, and its implementation using the Intel® Distribution for Caffe* framework and Intel® Distribution for Python*. The CNN is based on the work described in [1] and [2], which proposes a new approach to single-image SR using CNNs.
Introduction
Some modern camera sensors, present in everyday electronic devices like digital cameras, phones, and tablets, are able to produce reasonably high-resolution (HR) images and videos. The resolution in the images and videos produced by these devices is in many cases acceptable for general use.
However, there are situations where the image or video is considered low resolution (LR). Examples include the following situations:
- The device does not produce HR images or video (as in some surveillance systems).
- The objects of interest in the image or video are small compared to the size of the frame; for example, faces of people or vehicle plates located far away from the camera.
- The images are blurred or noisy.
- The application using the images or videos demands a higher resolution than the camera provides.
- Improving the resolution as a pre-processing step improves the performance of other algorithms that use the images (face detection, for example).
Super-resolution is a technique to obtain an HR image from one or several LR images. SR can be based on a single image or on several frames in a video sequence.
Single-image (or single-frame) SR uses pairs of LR and HR images to learn the mapping between them. For this purpose, image databases containing LR and HR pairs are created [3] and used as training sets. The learned mapping can then be used to predict HR details in a new image.
Multiple-frame SR, on the other hand, is based on several images of the same scene taken under slightly different conditions (such as angle, illumination, and position). This technique uses the non-redundant information present in the multiple images (or frames in an image sequence) to improve SR performance.
In this article, we will focus on a single-image SR method.
Single-Image Super-Resolution Using Convolutional Neural Networks
In this method, a training set is used to train a neural network (NN) to learn the mapping between the LR and HR images in the training set. There are many references in the literature about SR; many different techniques have been proposed and used over roughly the last 30 years. Methods using deep CNNs have been developed in the last few years. One of the first was created by Dong et al. [1], who described a three-layer CNN and named it the Super-Resolution Convolutional Neural Network (SRCNN). Their pioneering work in this area is important because, besides demonstrating that the LR-to-HR mapping can be cast as a CNN, it produced a model that is often used as a reference: new methods routinely compare their performance to the SRCNN results. The same authors later developed a modified version of their original SRCNN, named the Fast Super-Resolution Convolutional Neural Network (FSRCNN), that offers better restoration quality and runs faster [2].
In this article, we describe both the SRCNN and the FSRCNN, and, in a separate tutorial, we show an implementation of the improved FSRCNN. Both networks can be used as a basis for further experimentation with other published architectures, as well as with variations the reader might want to try. Although the FSRCNN (and other recent network architectures for SR) shows clear improvement over the SRCNN, the original SRCNN is also described here to show how this pioneering network has evolved from its inception to newer networks that use different topologies to achieve better results. In the tutorial, we implement the FSRCNN network using the Intel Distribution for Caffe deep learning framework and Intel Distribution for Python, which lets us take advantage of Intel® Xeon® processors and Intel® Xeon Phi™ processors, as well as Intel® libraries, to accelerate training and testing of this network.
Super-Resolution Convolutional Neural Network (SRCNN) Structure
The authors of the SRCNN describe their network by pointing out the equivalence of their method to the sparse-coding method [4], a widely used learning method for image SR. This is an important and educational aspect of their work, because it shows how example-based learning methods can be adapted and generalized to CNN models.
The SRCNN consists of the following operations [1]:
1. Preprocessing: Up-scales the LR image to the desired HR size.
2. Feature extraction: Extracts a set of feature maps from the up-scaled LR image.
3. Non-linear mapping: Maps the feature maps representing LR patches to HR patches.
4. Reconstruction: Produces the HR image from the HR patches.
Operations 2–4 above can each be cast as a convolutional layer in a CNN that accepts as input the preprocessed image from step 1 and outputs the HR image. The structure of this SRCNN consists of three convolutional layers (a small parameter-count helper follows the list below):
- Input Image: LR image up-sampled to the desired higher resolution, with c channels (the color components of the image)
- Conv. Layer 1: Patch extraction
  - n1 filters of size c × f1 × f1
  - Activation function: ReLU (rectified linear unit)
  - Output: n1 feature maps
  - Parameters to optimize: c × f1 × f1 × n1 weights and n1 biases
- Conv. Layer 2: Non-linear mapping
  - n2 filters of size n1 × f2 × f2
  - Activation function: ReLU
  - Output: n2 feature maps
  - Parameters to optimize: n1 × f2 × f2 × n2 weights and n2 biases
- Conv. Layer 3: Reconstruction
  - One filter of size n2 × f3 × f3
  - Activation function: Identity
  - Output: HR image
  - Parameters to optimize: n2 × f3 × f3 × c weights and c biases
- Loss Function: Mean squared error (MSE) between the N reconstructed HR images and the N original true HR images in the training set (N is the number of images in the training set)
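To make the symbolic parameter counts above concrete, the following is a small Python helper (the function name and interface are ours, for illustration only) that computes the counts for any setting of (c, f1, n1, f2, n2, f3):

```python
def srcnn_param_count(c, f1, n1, f2, n2, f3):
    """Weights and biases of the three-layer SRCNN for a given setting."""
    w1, b1 = c * f1 * f1 * n1, n1    # layer 1: patch extraction
    w2, b2 = n1 * f2 * f2 * n2, n2   # layer 2: non-linear mapping
    w3, b3 = n2 * f3 * f3 * c, c     # layer 3: reconstruction
    return w1 + w2 + w3, b1 + b2 + b3

# The reference model described next: c=1, f1=9, n1=64, f2=1, n2=32, f3=5
weights, biases = srcnn_param_count(1, 9, 64, 1, 32, 5)
print(weights, biases)  # 8032 weights (5184 + 2048 + 800) and 97 biases
```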
In their paper, the authors of the SRCNN implement and test the network using several settings that vary the number of filters. They obtain better SR performance when they increase the number of filters, at the expense of increasing the number of parameters (weights and biases) to optimize, which in turn increases the computational cost. Below is their reference model, which shows a good overall balance between accuracy and performance (Figure 1); a minimal pycaffe sketch of this model is shown after the figure caption:
- Input Image: LR single-channel image up-sampled to the desired higher resolution
- Conv. Layer 1: Patch extraction
  - 64 filters of size 1 x 9 x 9
  - Activation function: ReLU
  - Output: 64 feature maps
  - Parameters to optimize: 1 x 9 x 9 x 64 = 5184 weights and 64 biases
- Conv. Layer 2: Non-linear mapping
  - 32 filters of size 64 x 1 x 1
  - Activation function: ReLU
  - Output: 32 feature maps
  - Parameters to optimize: 64 x 1 x 1 x 32 = 2048 weights and 32 biases
- Conv. Layer 3: Reconstruction
  - 1 filter of size 32 x 5 x 5
  - Activation function: Identity
  - Output: HR image
  - Parameters to optimize: 32 x 5 x 5 x 1 = 800 weights and 1 bias
Figure 1. Structure of SRCNN showing parameters for reference model.
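The reference model can be written down directly with pycaffe's NetSpec interface. The following is a minimal sketch, not the authors' released code: the 33 x 33 sub-image size, the padding (added here so the output size matches the label; the authors train without padding and handle borders differently), and the Gaussian weight fillers are illustrative assumptions.

```python
import caffe
from caffe import layers as L

n = caffe.NetSpec()
n.data = L.Input(shape=dict(dim=[1, 1, 33, 33]))   # up-scaled LR sub-image
n.label = L.Input(shape=dict(dim=[1, 1, 33, 33]))  # ground-truth HR sub-image

# Layer 1: patch extraction, 64 filters of size 1 x 9 x 9
n.conv1 = L.Convolution(n.data, num_output=64, kernel_size=9, pad=4,
                        weight_filler=dict(type='gaussian', std=0.001))
n.relu1 = L.ReLU(n.conv1, in_place=True)

# Layer 2: non-linear mapping, 32 filters of size 64 x 1 x 1
n.conv2 = L.Convolution(n.relu1, num_output=32, kernel_size=1,
                        weight_filler=dict(type='gaussian', std=0.001))
n.relu2 = L.ReLU(n.conv2, in_place=True)

# Layer 3: reconstruction, 1 filter of size 32 x 5 x 5 (identity activation)
n.conv3 = L.Convolution(n.relu2, num_output=1, kernel_size=5, pad=2,
                        weight_filler=dict(type='gaussian', std=0.001))

# MSE loss between the reconstructed and true HR sub-images
n.loss = L.EuclideanLoss(n.conv3, n.label)

with open('srcnn_train.prototxt', 'w') as f:
    f.write(str(n.to_proto()))
```

The generated prototxt can then be referenced from a solver file in the usual Caffe training workflow.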
Fast Super-Resolution Convolutional Neural Network (FSRCNN) Structure
The authors of the SRCNN later created a new CNN that accelerates training and prediction while achieving performance comparable to or better than the SRCNN. The new FSRCNN consists of the following operations [2]:
1. Feature extraction: Extracts a set of feature maps directly from the LR image.
2. Shrinking: Reduces the dimension of the feature vectors (thus decreasing the number of parameters) by using a smaller number of filters than used for feature extraction.
3. Non-linear mapping: Maps the feature maps representing LR patches to HR patches. This step is performed using several mapping layers with a filter size smaller than the one used in the SRCNN.
4. Expanding: Increases the dimension of the feature vectors, inverting the shrinking operation in order to more accurately reconstruct the HR image.
5. Deconvolution: Produces the HR image from the HR features.
The authors explain in detail the differences between the SRCNN and the FSRCNN, but the points particularly relevant for quick implementation and experimentation (the scope of this article and the associated tutorial) are the following:
- The FSRCNN uses multiple convolutional layers for the non-linear mapping operation (instead of the single layer in the SRCNN). The number of layers can be changed (compared to the authors' version) in order to experiment; the performance and reconstruction accuracy will vary with those changes. This is also a good example of fine-tuning a CNN: keep the portion of the FSRCNN fixed up to the non-linear mapping layers, then add or change those layers to experiment with different depths for the non-linear LR-HR mapping operation.
- The input is the LR image directly. It does not need to be up-sampled to the size of the expected HR image, as in the SRCNN; the up-sampling is instead deferred to the final deconvolution layer. This is part of why this network is faster: the feature extraction stage uses far fewer parameters than in the SRCNN. A short sketch of the deconvolution size arithmetic follows this list.
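To see why the final deconvolution layer performs the up-sampling, recall that in Caffe a deconvolution layer produces an output of size stride × (input − 1) + kernel − 2 × pad. Below is a quick check, assuming the common choice of stride equal to the upscaling factor (3 here) with a 9 x 9 kernel and padding of 4; these specific values are our assumption for a 3x model:

```python
def deconv_output_size(input_size, kernel, stride, pad):
    """Spatial output size of a Caffe deconvolution layer."""
    return stride * (input_size - 1) + kernel - 2 * pad

# With stride = 3 (the upscaling factor), kernel = 9, pad = 4,
# the output is 3n - 2, i.e., approximately 3x the LR input size.
for lr_size in (11, 32, 100):
    print(lr_size, '->', deconv_output_size(lr_size, kernel=9, stride=3, pad=4))
# 11 -> 31, 32 -> 94, 100 -> 298
```

The small border mismatch (3n − 2 instead of 3n) is handled in practice by padding or cropping when comparing against the ground-truth HR image.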
As seen in Figure 2, the five operations above can be cast as a CNN using convolutional layers for operations 1–4 and a deconvolution layer for operation 5. Non-linearities are introduced via parametric rectified linear unit (PReLU) layers (described in [5]), which the authors chose for this model because of their better and more stable performance compared to rectified linear unit (ReLU) layers. See Appendix 1 for a brief description of ReLUs and PReLUs.
Figure 2. Structure of FSRCNN(56, 12, 4).
The overall best-performing model reported by the authors is FSRCNN(56, 12, 4) (Figure 2), which refers to a network with an LR feature dimension of 56 (the number of filters in both the first convolutional layer and the deconvolution layer), 12 shrinking filters (the number of filters in the middle layers of the network, which perform the mapping operation), and a mapping depth of 4 (the number of convolutional layers that implement the mapping between the LR and HR feature spaces). This is why the network looks like an hourglass: it is thick (more parameters) at the ends and thin (fewer parameters) in the middle. The overall shape of this reference model is symmetrical, and its structure is as follows:
- Input Image: LR image, single channel (no up-sampling)
- Conv. Layer 1: Feature extraction
  - 56 filters of size 1 x 5 x 5
  - Activation function: PReLU
  - Output: 56 feature maps
  - Parameters: 1 x 5 x 5 x 56 = 1400 weights and 56 biases
- Conv. Layer 2: Shrinking
  - 12 filters of size 56 x 1 x 1
  - Activation function: PReLU
  - Output: 12 feature maps
  - Parameters: 56 x 1 x 1 x 12 = 672 weights and 12 biases
- Conv. Layers 3–6: Mapping
  - 4 x 12 filters of size 12 x 3 x 3
  - Activation function: PReLU
  - Output: 12 feature maps per mapping layer
  - Parameters: 4 x 12 x 3 x 3 x 12 = 5184 weights and 48 biases
- Conv. Layer 7: Expanding
  - 56 filters of size 12 x 1 x 1
  - Activation function: PReLU
  - Output: 56 feature maps
  - Parameters: 12 x 1 x 1 x 56 = 672 weights and 56 biases
- DeConv Layer 8: Deconvolution
  - One filter of size 56 x 9 x 9
  - Activation function: Identity
  - Output: HR image
  - Parameters: 56 x 9 x 9 x 1 = 4536 weights and 1 bias
Total number of weights: 1400 + 672 + 5184 + 672 + 4536 = 12,464 (plus 173 biases and a very small number of slope parameters in the PReLU layers). A pycaffe sketch of this structure is shown below.
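The hourglass structure above can be generated programmatically. The following is a minimal pycaffe sketch of FSRCNN(56, 12, 4) for inference; the 11 x 11 input size, the deconvolution stride/pad (for an assumed upscaling factor of 3), and the omission of weight fillers are our simplifications, not the authors' exact prototxt:

```python
import caffe
from caffe import layers as L

def fsrcnn(d=56, s=12, m=4, upscale=3):
    """Sketch of FSRCNN(d, s, m) as a Caffe NetSpec (inference only)."""
    n = caffe.NetSpec()
    # LR input, not up-sampled; the 11 x 11 size is an arbitrary example
    n.data = L.Input(shape=dict(dim=[1, 1, 11, 11]))
    # Feature extraction: d filters of size 1 x 5 x 5
    n.conv1 = L.Convolution(n.data, num_output=d, kernel_size=5, pad=2)
    n.prelu1 = L.PReLU(n.conv1, in_place=True)
    # Shrinking: s filters of size d x 1 x 1
    n.conv2 = L.Convolution(n.prelu1, num_output=s, kernel_size=1)
    n.prelu2 = L.PReLU(n.conv2, in_place=True)
    # Mapping: m layers of s filters of size s x 3 x 3
    bottom = n.prelu2
    for i in range(m):
        conv = L.Convolution(bottom, num_output=s, kernel_size=3, pad=1)
        prelu = L.PReLU(conv, in_place=True)
        setattr(n, 'conv%d' % (i + 3), conv)
        setattr(n, 'prelu%d' % (i + 3), prelu)
        bottom = prelu
    # Expanding: d filters of size s x 1 x 1
    n.conv_expand = L.Convolution(bottom, num_output=d, kernel_size=1)
    n.prelu_expand = L.PReLU(n.conv_expand, in_place=True)
    # Deconvolution: one filter of size d x 9 x 9; the stride sets the
    # upscaling factor (stride=3, kernel=9, pad=4 are our assumed values)
    n.deconv = L.Deconvolution(n.prelu_expand, convolution_param=dict(
        num_output=1, kernel_size=9, stride=upscale, pad=4))
    return n

with open('fsrcnn_deploy.prototxt', 'w') as f:
    f.write(str(fsrcnn().to_proto()))
```

To experiment with a longer or shorter non-linear mapping, as suggested earlier, only the m argument needs to change; the rest of the network stays fixed.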
Figure 3 shows an example of using the trained FSRCNN on one of the test images. The protobuf file describing this network, as well as training and testing data preparation and implementation details, will be covered in the associated tutorial.
Figure 3. An example of inference using a trained FSRCNN. The left image is the original. In the center, the original image was down-sampled and blurred. The image on the right is the reconstructed HR image using this network.
Summary
This article presented an overview of two recent CNNs for single-image super-resolution. The networks we chose are representative of CNN-based methods for SR; the SRCNN, as one of the first published CNN-based methods, also offers interesting insight into how a non-CNN method (sparse coding) inspired a CNN-based one. In the tutorial, an implementation of the FSRCNN is shown using the Intel® Distribution for Caffe* framework and Intel® Distribution for Python*. This reference implementation can be used to experiment with variations of this network, as a base for implementing newer SR networks, and as a good example of fine-tuning a network. New networks with varying architectures have been published recently; they show improvements in reconstruction quality or training/inference speed [7], and some of them address the multi-frame SR problem [6]. The reader is encouraged to experiment with these new networks.
Appendix 1: Rectified Linear Units (Rectifiers)
Rectified activation units (rectifiers) are one way to introduce non-linearities into a neural network. A non-linear layer (also called an activation layer) is necessary in an NN to prevent it from collapsing into a purely linear model with limited learning capability. Other possible activation layers include, among others, the sigmoid function and the hyperbolic tangent (tanh). However, rectifiers are more computationally efficient, which improves the overall training of the CNN.
The most commonly used rectifier is the traditional rectified linear unit (ReLU), which performs the operation

$$f(x_i) = \max(0, x_i)$$

where $x_i$ is the input on the i-th channel.
Another rectifier, introduced more recently [5], is the parametric rectified linear unit (PReLU), defined as

$$f(x_i) = \max(0, x_i) + p_i \min(0, x_i)$$

which includes parameters $p_i$ controlling the slope of the line representing the negative inputs. These parameters are learned jointly with the model during the training phase. To reduce the number of parameters, the $p_i$ can be collapsed into a single learnable parameter shared by all channels.
A particular case of the PReLU is the leaky ReLU (LReLU), which is a PReLU with $p_i$ fixed to a small constant k for all input channels. A short NumPy sketch of these three rectifiers is shown below.
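For readers who prefer code to formulas, here is a minimal NumPy sketch of the three rectifiers (the function names are ours, for illustration):

```python
import numpy as np

def relu(x):
    """Traditional ReLU: f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def prelu(x, p):
    """PReLU: f(x) = max(0, x) + p * min(0, x); p is learned in training."""
    return np.maximum(0.0, x) + p * np.minimum(0.0, x)

def lrelu(x, k=0.1):
    """Leaky ReLU: a PReLU with a fixed small constant slope k."""
    return prelu(x, k)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))         # [ 0.     0.     0.     1.5  ]
print(prelu(x, 0.25))  # [-0.5   -0.125  0.     1.5  ]
print(lrelu(x))        # [-0.2   -0.05   0.     1.5  ]
```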
In Caffe, a PReLU layer can be defined in a protobuf file as:

    layer {
      name: "reluXX"
      type: "PReLU"
      bottom: "convXX"
      top: "convXX"
      prelu_param {
        channel_shared: true
      }
    }
where, in this case, the negative slopes are shared across channels. A different option is to obtain an LReLU with a fixed slope by initializing the slope with a constant filler and setting its learning rate to zero so that it is never updated:

    layer {
      name: "reluXX"
      type: "PReLU"
      bottom: "convXX"
      top: "convXX"
      # Zero learning rate keeps the slope fixed at its initial value
      param {
        lr_mult: 0
        decay_mult: 0
      }
      prelu_param {
        filler {
          type: "constant"
          value: 0.1
        }
        channel_shared: true
      }
    }
References
1. C. Dong, C. C. Loy, K. He and X. Tang, "Learning a Deep Convolutional Network for Image Super-Resolution," 2014.
2. C. Dong, C. C. Loy and X. Tang, "Accelerating the Super-Resolution Convolutional Neural Network," 2016.
3. P. B. Chopade and P. M. Patil, "Single and Multi Frame Image Super-Resolution and its Performance Analysis: A Comprehensive Survey," February 2015.
4. J. Yang, J. Wright, T. Huang and Y. Ma, "Image Super-Resolution via Sparse Representation," IEEE Transactions on Image Processing, pp. 2861-2873, 2010.
5. K. He, X. Zhang, S. Ren and J. Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," arXiv, 2015.
6. A. Greaves and H. Winter, "Multi-Frame Video Super-Resolution Using Convolutional Neural Networks," 2016.
7. J. Kim, J. K. Lee and K. M. Lee, "Accurate Image Super-Resolution Using Very Deep Convolutional Networks," 2016.