
Art’Em – Artistic Style Transfer to Virtual Reality Final Update


Art’Em is an application that uses computer vision to bring artistic style transfer to real-time speeds at VR-compatible resolutions. The program takes a feed from any source (an OpenCV webcam, the user's screen, an Android phone camera via IP Webcam, etc.) and returns a stylized image.

1. Introduction

Various tools were used to build this application.

This paper is divided into three sections, each of which explores a different way of accelerating the current process of artistic style transfer.

The first section introduces the concept of XNOR-Nets and conducts an in-depth case study of how parallelization can be performed efficiently without any approximations. While the method functions, integrating the kernels with trainable deep learning models proved not to be feasible within the time frame of this project.

The second section studies generator networks and what is in some ways ‘one-shot’ image stylization. The method works well at VR compatible resolutions but is largely limited by the fact that each network takes a long time to learn a new style.

The third section studies a technique called Adaptive Instance Normalization, which allows the user not only to stylize extremely quickly but also to switch instantly to any style of choice.

2. XNOR Networks

XNOR-nets (exclusive-NOR networks) have been explored in depth in the following articles [1] [2]. The image below summarizes how matrix multiplication operations can be replaced with simple operations (an XNOR followed by a population count), provided every matrix element is either +1 or -1. This shows great promise in terms of network speedup.
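To make the idea concrete, here is a minimal Python sketch (not the actual CUDA kernel) of how a dot product of +1/-1 vectors can be computed with bit packing and a population count; counting mismatches via XOR is equivalent to counting matches via XNOR when the bit width is fixed:

```python
def pack_bits(vec):
    """Pack a list of +1/-1 values into a single integer (1 -> bit 1, -1 -> bit 0)."""
    bits = 0
    for v in vec:
        bits = (bits << 1) | (1 if v == 1 else 0)
    return bits

def xnor_dot(a_bits, b_bits, n):
    """Dot product of two packed +/-1 vectors of length n: matches minus mismatches."""
    mismatches = bin(a_bits ^ b_bits).count("1")   # popcount of the XOR
    return n - 2 * mismatches

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
assert xnor_dot(pack_bits(a), pack_bits(b), len(a)) == sum(x * y for x, y in zip(a, b))
```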

An in-depth study of binarization showed a great loss of image semantics, which can be clearly seen in the image below. This, however, is the result of an extremely destructive method of binarization. If the network is instead trained under the constraint that every weight matrix must be a real scalar multiplied by a matrix containing only +1 and -1, it has been shown [3] that quite decent image recognition results can be obtained. The results obtained by AllenAI show that a trained XNOR AlexNet can reach a top-1 accuracy of 43.3%.

As shown in my previous technical article, creating an unoptimized XNOR general matrix multiply kernel can give speedups as great as 6 times over the cuBLAS kernel. This is simply because XNOR networks are applicable only to a very specific use case, so it is not too surprising that they are so much faster.

While this is great, the most important part of a convolutional neural network is the convolution itself, and the promise of speedup in a convolutional layer isn't as great as it is for a fully connected layer. This becomes apparent when one compares the basic methodology of the two operations.

[Source[4]]

As you can see, convolution requires us to select submatrices and multiply each of them with a kernel, whereas in matrix multiplication an entire row is multiplied with an entire column of a matrix.

The packing of bits into a data type is the most time-consuming step in an XNOR-net. So, if we imagine an MxM kernel convolving over an NxN matrix, we have to pack (N-M+1)² submatrices into a data type and then execute an XNOR followed by a population count to generate the 'convolved feature'.

For matrix multiplication, by contrast, multiplying two NxN matrices together only requires packing 2N vectors (the rows and columns, respectively) to implement this function.

Thus we lose a lot of the expected performance gains. To confirm my suspicion, I implemented a low-precision convolutional network using CUDA (Compute Unified Device Architecture) C programming. The implementation can be found here[5]. To run this code you must have a CUDA-compatible device, and your ability to work with different image dimensions also depends largely on the VRAM available to you. The network has been parallelized by utilizing the available shared memory. It currently functions only for kernels of size 4x4, but extending that will not be a big issue.
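The following is a minimal NumPy sketch, not the CUDA implementation linked above, of the idea behind the 4x4 bit-packed convolution; it assumes a square, single-channel input and illustrates why (N-M+1)² patches must be packed:

```python
import numpy as np

def binarize_and_pack4x4(patch):
    """Pack the signs of a 4x4 patch into a 16-bit integer (non-negative -> bit 1)."""
    bits = 0
    for v in patch.flatten():
        bits = (bits << 1) | (1 if v >= 0 else 0)
    return bits

def xnor_conv2d_4x4(image, kernel):
    """'Valid' convolution of a binarized 4x4 kernel over a square 2-D input."""
    n, m = image.shape[0], 4
    k_bits = binarize_and_pack4x4(kernel)
    out = np.zeros((n - m + 1, n - m + 1), dtype=np.int32)
    for i in range(n - m + 1):
        for j in range(n - m + 1):
            # One packing operation per output position: (N - M + 1)^2 in total.
            p_bits = binarize_and_pack4x4(image[i:i + m, j:j + m])
            mismatches = bin(p_bits ^ k_bits).count("1")
            out[i, j] = m * m - 2 * mismatches      # XNOR + popcount dot product
    return out

img = np.random.randn(8, 8)
kern = np.random.randn(4, 4)
print(xnor_conv2d_4x4(img, kern).shape)   # (5, 5): (N - M + 1)^2 packed patches
```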

Allocation of the binarized kernels is done before the convolution timer begins, and the benchmarking is performed using the nvprof tool. The algorithm is timed after the kernels have been binarized because a saved network would already have its kernels stored in the desired form.

A shared array of 256 unsigned ints per block is allocated, along with the variables needed for indexing. The block size is set to (16, 16) so that the shared array fits within the shared memory constraint. The binarization code then loops through the kernel depth, iterating over every channel in the input 'image'. Within each of these iterations, every thread populates its own element of the shared array with an unsigned int packing its binarized submatrix. Once this is complete, a single line of code populates the output array with the result of multiplying the kernel with the respective submatrix from the input 'image'. Note that the output array is flattened, but this is easy to handle.

One important caveat when parallelizing across channels, to make this applicable to frameworks like TensorFlow, is not to exceed the maximum grid and block size limits; doing so gives rise to memory allocation and access errors.

This is a very basic implementation: while the network has been parallelized per convolution, the channel convolutions have not been parallelized. There is scope for increasing the speeds mentioned below many-fold, but the same applies to a full-precision convolution. Even though the bitwise convolution would benefit further from such parallelization, it has been omitted from the code.

While the network does run around two times faster than a basic general convolution kernel, the speedup achieved with my methods does not grow enough to justify the loss in accuracy.

One of the other notable advantages of a binarized network is that the entire network can be stored in much less space: an entire VGG-19 model can be reduced from ~512 MB to only ~16 MB. This is great for embedded devices that do not have the space to hold such large models, and even for GPUs and CPUs, since the entire model can be loaded at once into the VRAM of less powerful hardware, leaving room for larger datasets.
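As a quick back-of-the-envelope check of the numbers above (32-bit floating-point weights reduced to 1 bit each):

```python
# Storage saving from binarization: 32-bit float weights -> 1-bit weights.
fp32_model_mb = 512                 # approximate VGG-19 size quoted above
binarized_mb = fp32_model_mb / 32   # one bit per weight instead of 32
print(binarized_mb)                 # -> 16.0 MB
```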

We must therefore move on to a better technique for this use case, one which largely preserves accuracy while still giving a significant speedup.

3. Generator networks

Adapting the research done in this paper[6], we will study the implementation of a generator stylization network and Instance Normalization.

Essentially, an optimization-based model of artistic style transfer will generally give better, more varied results but is extremely time-consuming. We will therefore train a generator neural network for every style we wish to adapt to. This gives much faster stylization at the expense of inferior quality and diversity compared to generating stylized images by optimization.

There are two steps involved in building a generator network.

Designing a generator network

This is one of the most important parts of this endeavor. A generator network must transfer style well but also be small enough to give good frame rates. If you do not want to train your own network, it is suggested to use this[7] implementation of the generator network. The network consists of three convolutional layers, followed by five residual blocks and two transposed convolutions, and ends with a final convolutional layer.
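As a rough illustration only, here is a Keras sketch of such a generator; the filter widths, kernel sizes and strides are assumptions loosely based on the referenced implementation [7], not its exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

class InstanceNorm(layers.Layer):
    """Normalize each sample independently over its spatial dimensions."""
    def call(self, x):
        mean, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)
        return (x - mean) / tf.sqrt(var + 1e-5)

def residual_block(x, filters=128):
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.ReLU()(InstanceNorm()(y))
    y = InstanceNorm()(layers.Conv2D(filters, 3, padding="same")(y))
    return layers.Add()([x, y])

def build_generator(height=256, width=256):
    inp = layers.Input((height, width, 3))
    x = inp
    # Three convolutional layers (feature extraction / downsampling).
    for filters, kernel, stride in [(32, 9, 1), (64, 3, 2), (128, 3, 2)]:
        x = layers.Conv2D(filters, kernel, strides=stride, padding="same")(x)
        x = layers.ReLU()(InstanceNorm()(x))
    # Five residual blocks.
    for _ in range(5):
        x = residual_block(x)
    # Two transposed convolutions to restore the original resolution.
    for filters in (64, 32):
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(x)
        x = layers.ReLU()(InstanceNorm()(x))
    # Final convolutional layer back to RGB.
    out = layers.Conv2D(3, 9, padding="same", activation="tanh")(x)
    return tf.keras.Model(inp, out)
```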

The generator network utilizes a method of normalization known as “Instance Normalization (IN)”. This method is different from batch normalization as it computes the statistics for each batch element independently. The illustration below (Taken from the poster made by Dmitry Ulyanov [8]) demonstrates how Batch Normalization and Instance Normalization differ. IN stands for Instance normalization, and BN stands for Batch Normalization.
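A small NumPy illustration of the difference: batch normalization shares statistics across the whole batch for each channel, while instance normalization computes them per sample and per channel:

```python
import numpy as np

def batch_norm_stats(x):
    # x has shape (batch, height, width, channels): one mean/variance per channel,
    # shared across every element of the batch.
    return x.mean(axis=(0, 1, 2)), x.var(axis=(0, 1, 2))

def instance_norm_stats(x):
    # Statistics are computed per sample and per channel, so each batch element
    # is normalized independently of the others.
    return x.mean(axis=(1, 2)), x.var(axis=(1, 2))

x = np.random.rand(8, 64, 64, 3)
print(batch_norm_stats(x)[0].shape)     # (3,)
print(instance_norm_stats(x)[0].shape)  # (8, 3)
```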

Training of generator network

Once the generator network has been set up and the MS-COCO dataset has been saved along with the VGG19 model, we need to implement a loss function.

The loss function includes a style loss, a content loss and a total variation loss. Including the total variation loss helps remove noise from the generated images.

The input image is fed to the generator network; we will call its output the generated image. This is then forwarded to a VGG-19 model, along with the content and style targets. The loss value is calculated by comparing the VGG-19 layer activations produced by the generated image, the style target and the content target.
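A hedged TensorFlow sketch of such a loss is given below; the layer grouping (a single 'content' feature map and a list of 'style' feature maps) and the loss weights are illustrative assumptions, not the exact configuration used in [6] or [9]:

```python
import tensorflow as tf

def gram_matrix(feats):
    """Channel-correlation (Gram) matrix of a (batch, h, w, c) feature map."""
    b, h, w, c = feats.shape
    f = tf.reshape(feats, (b, h * w, c))
    return tf.matmul(f, f, transpose_a=True) / tf.cast(h * w * c, tf.float32)

def total_loss(gen_feats, content_feats, style_feats, gen_image,
               content_w=1.0, style_w=5.0, tv_w=1e-4):
    # Content loss: the generated image's feature map should match the content
    # target's feature map at the chosen VGG layer.
    content_loss = tf.reduce_mean(tf.square(gen_feats["content"] - content_feats))
    # Style loss: Gram matrices should match those of the style target.
    style_loss = tf.add_n([
        tf.reduce_mean(tf.square(gram_matrix(g) - gram_matrix(s)))
        for g, s in zip(gen_feats["style"], style_feats)])
    # Total variation loss penalizes neighbouring-pixel differences (noise).
    tv_loss = tf.reduce_mean(tf.image.total_variation(gen_image))
    return content_w * content_loss + style_w * style_loss + tv_w * tv_loss
```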

For this purpose, the ADAM optimizer was utilized. However, the L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) optimization algorithm has been shown to give much better results and faster convergence for optimization-based style transfer.

The image below depicts the training process.

[Source[9]]

The GIF below is indicative of the learning process of such a generator network. The entire learning process was not captured here, but a snapshot was recorded every 50 iterations. The training was performed on the Intel® Nervana™ DevCloud. In only 900 iterations, the network learnt a lot about how to stylize an image.

After training, we can simply run the entire process through the trained image transform net.

Dataset

The dataset used to train the generator networks was the MS-COCO dataset [10], a large-scale object detection, segmentation and captioning dataset. We will not be needing any of the labels or object information; we only require a large number of images to train the network on. These will be our content images. As for our style image, a generator network can only learn one style. This will be discussed later when a new technique is introduced, but for now we are limited to one style per trained network.

With the generator network at its default configuration, we achieved speeds of around 15 frames per second at a resolution of 600x540. Upon training a smaller generator network, we achieved around 17 frames per second at full VR resolution. This metric includes using the ImageGrab feature along with the stylization; ImageGrab by default maxes out at around 30 frames per second. While this result is impressive, pruning the network led to severe quality degradation. All of this was performed on a modestly powerful laptop graphics processing unit, so we can expect at least 25 frames per second at full VR resolution on a desktop-grade GTX 1080 Ti along with a powerful Intel processor.

The stylization on the pruned network is not too great. The result below is obtained after 7000 iterations. Much better results can be expected if the training is done for longer. However the pruned model gives very high network speeds.

This network will be trained further. The corner artifacts are also undesirable; however, I have not yet been able to tackle them effectively during training.

The GIF below demonstrates the network stylizing at the default configuration. At half VR resolution, the GIF is a real-time depiction of stylization. Similar results can be expected at full VR resolution for the pruned network.

The current GUI of the application is a simple Tkinter interface which allows you to use the ImageGrab feature to stylize your screen at any resolution. Since the concept is still under development, keeping the code transparent is very important, so no proper front-end planning has been done; any such planning would have to adapt to a VR interface, and the choice of VR platform and camera feed source would also need to be taken into consideration. With the current test-phase interface, 5 preset stylization options are provided. The models must be downloaded from this link [11]. Anyone downloading the original code can easily train their own models using this [12] code base. The webcam-compatible OpenCV model has not been shared yet, as there are plans to integrate it with an external camera (a phone) via software like IP Webcam for Android.
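For illustration, a minimal capture-and-stylize loop of this kind might look as follows; stylize() is a placeholder for the trained generator's forward pass, the bounding box is arbitrary, and an OpenCV window stands in for the Tkinter front end (PIL's ImageGrab works on Windows and macOS):

```python
import numpy as np
import cv2
from PIL import ImageGrab

def stylize(frame):
    # Placeholder: in the real application this is the generator network's forward pass.
    return frame

while True:
    # Grab the current screen contents; the bounding box controls the captured resolution.
    screen = ImageGrab.grab(bbox=(0, 0, 600, 540))
    frame = cv2.cvtColor(np.array(screen), cv2.COLOR_RGB2BGR)
    cv2.imshow("Art'Em - stylized screen", stylize(frame))
    if cv2.waitKey(1) & 0xFF == ord("q"):   # press q to quit
        break
cv2.destroyAllWindows()
```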

4. Adapting to multiple styles

One of the most notable problems with the implementation above is that a new model needs to be trained for every new style we would like. On powerful hardware, it takes around 4-6 hours to train one such model. This is clearly not feasible if we want to make a large-scale application that lets people use different styles tuned to their liking.

While maintaining the earlier code for fast stylization, I attempted to work with the Adaptive Instance Normalization technique. This is a great methodology to adapt to any style instantly and deliver promising results. You can find the original implementation here[13].

The idea behind the technique of Adaptive Instance Normalization shown above is as follows:

VGG Encoder

VGG-19 is an ImageNet model. An image recognition model such as the VGG-net has several convolutional layers followed by fully connected layers. The fully connected layers 'decompose' the convolutional layer data into any number of categories. We do not need the fully connected layers, as our aim is not to classify images but to extract information about them. Since the VGG-net is a good image recognition model, we can expect every layer to contain some relevant data about the image. These layer outputs can be visualized using the image below. You can clearly see that, as the depth increases, more abstract information about the input is extracted.

Thus, if we were to pass an image through the VGG19 network without the fully connected layers, we can imagine that the 512 filters in the VGG19 conv5_1 layer would extract a lot of important information about the image.
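A small Keras sketch of such a truncated encoder is shown below; the original AdaIN implementation [13] is written in Torch, so this is only an illustration of the idea:

```python
import tensorflow as tf

# Keras ships an ImageNet-pretrained VGG-19; include_top=False drops the fully
# connected layers so only the convolutional feature extractor remains.
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
vgg.trainable = False

# 'block5_conv1' is the Keras name for the conv5_1 layer mentioned above (512 filters).
encoder = tf.keras.Model(vgg.input, vgg.get_layer("block5_conv1").output)

features = encoder(tf.random.uniform((1, 224, 224, 3)))
print(features.shape)   # (1, 14, 14, 512)
```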

More specifically, we can utilize this information to encode the 'style' information into the 'content' information. This is exactly what Adaptive Instance Normalization does.

AdaIN

The AdaIN module of the network receives the content 'information' x and the style 'information' y. It then aligns the channel-wise mean and variance of x to match those of y. This is not a trainable module, rather just a transformation of the content information with respect to the style information. You can see the exact relation in the image which represents the entire model above.
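A minimal sketch of the AdaIN operation, assuming channels-last (batch, height, width, channels) feature maps:

```python
import tensorflow as tf

def adain(content_feats, style_feats, eps=1e-5):
    """Align the channel-wise mean and standard deviation of the content features
    to those of the style features."""
    c_mean, c_var = tf.nn.moments(content_feats, axes=[1, 2], keepdims=True)
    s_mean, s_var = tf.nn.moments(style_feats, axes=[1, 2], keepdims=True)
    normalized = (content_feats - c_mean) / tf.sqrt(c_var + eps)
    return tf.sqrt(s_var + eps) * normalized + s_mean
```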

You can imagine a feature channel from the style 'information' that detects a certain style of brush strokes. This channel will have a very high activation, and those statistics will be imposed on the incoming content 'information' x. Thus our stylized encoded data is ready; we now need to decode this information into an actual image.

Decoder

The function of the decoder is essentially to decode the output of the AdaIN module. The decoder mirrors the encoder, with the pooling layers replaced by nearest-neighbour up-sampling. Normalization is not used in the decoder, because instance normalization would normalize each sample to a single style, whereas batch normalization normalizes a whole batch of samples to a single style.

Thus, this network brings the encoded data to the original size with appropriate stylization.
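A rough Keras sketch of such a decoder is given below; the number of stages and filter widths are placeholders, not the exact architecture of [13]:

```python
from tensorflow.keras import Input, Model, layers

def build_decoder(channels=512):
    """Rough mirror of the truncated VGG encoder (widths and depth are placeholders)."""
    feats = Input((None, None, channels))
    x = feats
    # Each stage: convolution + ReLU, then nearest-neighbour upsampling in place
    # of the encoder's pooling; no normalization layers anywhere.
    for filters in (512, 256, 128, 64):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.UpSampling2D(size=2, interpolation="nearest")(x)
    rgb = layers.Conv2D(3, 3, padding="same")(x)   # final RGB reconstruction
    return Model(feats, rgb)
```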

Implementation

While the pretrained Torch models available online were effective, the network could be made much, much faster with more compact networks. The content images come from the MS-COCO dataset. For the style images, the default choice was the Kaggle 'Painter by Numbers' dataset, which I was unable to download. For training purposes, I have therefore requested access to the BAM (Behance Artistic Media) dataset [15].

One of my first realizations while implementing this model was that, with the Adaptive Instance Normalization module disabled, the decoder generated very low-contrast images. Since I could not train the network further until I have the art dataset, I decided to use the PIL Image modules to increase the contrast of the decoder output until it looked natural and close to the input image. This is somewhat necessary because, if no transformation is applied to the encoded data, a perfect decoder would simply give the original image back. This post-processing gave much better stylization results, albeit occasionally a bit oversaturated.
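For reference, the post-processing amounts to something like the following PIL snippet; the enhancement factor was tuned by eye, so treat it and the file path as placeholders:

```python
from PIL import Image, ImageEnhance

def boost_contrast(img, factor=1.6):
    """Increase contrast of the decoder output; the factor is illustrative, not the tuned value."""
    return ImageEnhance.Contrast(img).enhance(factor)

# "decoder_output.png" is a placeholder path for a saved decoder output.
stylized = boost_contrast(Image.open("decoder_output.png"))
```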

I made two important observations after implementing the entire model: the stylization was only average, and the color adaptation was very poor. This might be a fault in my own implementation. However, giving extra weight to the style mean over the content mean in each channel gave me much better color adaptation.

With the pre-trained models, I was unable to replicate the quality of the results presented in the research paper, although we achieved around 8 frames per second on a mobile GTX 1070 at VR resolution; much faster results can be expected from a more powerful desktop-grade, VR-capable device. As for the quality, one cause could be that large-scale features are not properly transformed by the decoder, which would explain why I never saw any large-scale changes in the image. I hope to train a decoder that takes its input from a shallower layer of the VGG and then applies the AdaIN module. In my own model, very little of the style was transferred fully.

Further reducing the size of the network, either by creating a decoder which takes its input from the third convolutional module of the VGG network or by using an entirely new ImageNet model, could give much faster stylization speeds, perhaps even on par with the earlier approach of using a generator network.

The image below depicts the results of stylization by this method. We can see clearly that there is not a lot of change in much of the images, and there has been a slight color shift due to the contrast increase as well. However, I believe this compensates for the decoder until improvements can be made through training.

Summary and future work

The project succeeds at bringing artistic style transfer to real-time speeds and integrates it with a camera/OpenCV module. We explored two different methodologies of stylizing an image, both of which give very promising results. This framework can very easily be extended for use with VR. The project also contains a useful performance analysis of XNOR-nets and their feasibility.

There are several directions in which this work can be improved. We have yet to fully train the smaller generator network, which gives a pleasant stylization speed of 17-20 frames per second on a laptop-grade graphics processing unit. Utilizing a smaller encoder-decoder pair will also allow the user to adapt to any style of their choice. We can also expect to see more large-scale changes, which is what we wanted, by changing the layer we use for the AdaIN module. Some parameters need to be tuned to get better results from Adaptive Instance Normalization. The generator networks give very good results at a respectable frame rate.

Acknowledgements

I would like to thank the Intel Student Ambassador Program for AI which provided me with the necessary training resources on the Intel DevCloud and the technical support to be able to develop this idea. This project was supported by the Intel Early Innovator Grant.

References

[1] https://software.intel.com/en-us/blogs/2017/09/21/art-em-week-2

[2] https://software.intel.com/en-us/blogs/2017/10/02/art-em-artistic-style-transfer-to-virtual-reality-week-4-update

[3] https://github.com/allenai/XNOR-Net

[4] https://cdn-images-1.medium.com/max/800/1*ZCjPUFrB6eHPRi4eyP6aaA.gif

[5] https://github.com/akhauriyash/XNOR-convolution/blob/master/xnorconv.cu

[6] https://arxiv.org/abs/1701.02096

[7] https://github.com/lengstrom/fast-style-transfer/blob/master/src/transform.py

[8] https://dmitryulyanov.github.io/about

[9] https://cs.stanford.edu/people/jcjohns/papers/eccv16/JohnsonECCV16.pdf 

[10] http://cocodataset.org/#home

[11] https://drive.google.com/drive/folders/0B9jhaT37ydSyRk9UX0wwX3BpMzQ?usp=sharing

[12] https://github.com/lengstrom/fast-style-transfer

[13] https://github.com/xunhuang1995/AdaIN-style

[14] https://github.com/jonrei/tf-AdaIN

[15] https://bam-dataset.org/

