Ostagram: a neural-network service that combines photos and ornaments into artistic masterpieces
Styling images with neural networks: no mysticism, just math
The neural network as artist


Ever since German researchers from the University of Tübingen presented their paper in August 2015 on transferring the style of famous artists onto other photos, services monetizing this opportunity began to appear. One such service launched on the Western market, and a full copy of it appeared on the Russian one.


Although Ostagram launched back in December, it began to gain popularity on social networks only in mid-April. As of April 19, the project's VKontakte community had fewer than a thousand members.

To use the service, you need to prepare two images: a photo to be processed, and a picture with an example of the style to be superimposed on the original photo.

The service has a free version: it creates an image at a minimal resolution of up to 600 pixels along the longest side. The user receives the result of only one iteration of applying the filter to the photo.

There are two paid versions: Premium produces an image up to 700 pixels on the longest side and applies 600 iterations of neural network processing to the image (the more iterations, the more interesting and intensive the processing). One such picture will cost 50 rubles.

In the HD version, you can adjust the number of iterations: 100 will cost 50 rubles, and 1000 - 250 rubles. In this case, the image will have a resolution of up to 1200 pixels on the longest side, and it can be used for printing on canvas: Ostagram offers such a service with delivery from 1800 rubles.

In February, Ostagram's representatives announced that they would not accept image-processing requests from users "from countries with developed capitalism," but later reopened photo processing to VKontakte users from all over the world. Judging by the Ostagram code published on GitHub, it was developed by Sergey Morugin, a 30-year-old resident of Nizhny Novgorod.

TJ contacted the project's commercial director, who introduced himself as Andrey. According to him, Ostagram appeared before Instapainting but was inspired by a similar project called Vipart.

Ostagram was developed by a group of students from the Nizhny Novgorod State Technical University named after Alekseev: after initial testing on a narrow circle of friends at the end of 2015, they decided to make the project public. Initially, image processing was completely free, and the plan was to make money by selling printed paintings. According to Andrey, printing turned out to be the biggest problem: photos of people processed by a neural network rarely look pleasing to the human eye, and the end client often needs the result adjusted for a long time before it can be put on canvas, which requires large machine resources.

For image processing, Ostagram's creators wanted to use Amazon cloud servers, but after the influx of users it became clear that the costs would exceed a thousand dollars a day with minimal return on investment. Andrey, who is also an investor in the project, rented server capacity in Nizhny Novgorod instead.

The project's audience is about a thousand people a day, but on some days it reached 40 thousand thanks to traffic from foreign media outlets, which noticed the project before domestic ones did (Ostagram even managed to collaborate with European DJs). At night, when traffic is low, image processing can take 5 minutes; during the day it can take up to an hour.

Whereas earlier foreign users were deliberately limited in their access to image processing (the plan was to start monetization in Russia), Ostagram is now counting more on the Western audience.

For now, the payback prospects are uncertain. If each user paid 10 rubles for processing, then perhaps it would pay off. […]

It is very difficult to monetize in our country: our people are ready to wait a week, but they won't pay a dime for it. Europeans are more receptive to this - to paying for speed-ups and quality improvements - so the focus is shifting to that market.

Andrey, representative of Ostagram

According to Andrey, the Ostagram team is working on a new version of the site with a strong social focus: "It will look like one well-known service, but what can you do." Facebook's representatives in Russia have already shown interest in the project, but talks about a sale have not yet begun.

Examples of service work

In the feed on the Ostagram website, you can also see which images were combined to produce the final pictures: often this is even more interesting than the result itself. Filters - the pictures used as the processing effect - can be saved for future use.

Greetings, Habr! You have probably noticed that the topic of restyling photos into various artistic styles is being actively discussed on these internets of yours. Reading all these popular articles, you might think that magic is happening under the hood of those applications and that the neural network really fantasizes and redraws the image from scratch. It so happened that our team faced a similar task: as part of an internal corporate hackathon we built video stylization, since an app for photos already existed. In this post we will figure out how the network "redraws" images and go through the papers that made this possible. I recommend reading the previous post and getting familiar with the basics of convolutional neural networks before diving into this material. You will find some formulas, some code (I will give examples using Theano and Lasagne), and a lot of pictures. The post follows the chronological order in which the papers, and accordingly the ideas themselves, appeared; occasionally I will dilute it with our recent experience. And here, for your attention, is a boy from hell.


Visualizing and Understanding Convolutional Networks (28 Nov 2013)

First of all, it is worth mentioning the paper in which the authors showed that a neural network is not a black box but a fully interpretable thing (by the way, today this can be said not only about convolutional networks for computer vision). The authors set out to interpret the activations of neurons in the hidden layers; for this they used a deconvolutional neural network (deconvnet) proposed several years earlier (incidentally, by the same Zeiler and Fergus who authored this publication). A deconvolutional network is essentially the same network with convolutions and poolings, but applied in reverse order. In the original work on deconvnet, the network was used in unsupervised learning mode to generate images. This time, the authors applied it simply for a backward pass from the features obtained after a forward pass through the network back to the original image. The result is an image that can be interpreted as the signal that caused the given activation in the neurons. Naturally, the question arises: how do we make a reverse pass through a convolution and a nonlinearity? And all the more so through max-pooling, which is certainly not an invertible operation. Let's look at all three components.

Reverse ReLu

In convolutional networks, the ReLu(x) = max(0, x) activation function is often used, which makes all activations on a layer non-negative. Accordingly, when passing back through the nonlinearity, we also need to obtain non-negative results. For this, the authors suggest using the same ReLu. From the Theano architecture point of view, you need to override the gradient function of the operation (the infinitely valuable notebook is in Lasagne Recipes; from there you can get the details of what the ModifiedBackprop class is).

class ZeilerBackprop(ModifiedBackprop):
    def grad(self, inputs, out_grads):
        (inp,) = inputs
        (grd,) = out_grads
        # return (grd * (grd > 0).astype(inp.dtype),)  # explicitly rectify
        return (self.nonlinearity(grd),)               # use the given nonlinearity
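
To actually use such a class, the modified nonlinearity has to be substituted into the network. Here is a minimal sketch of how this is typically done in the Lasagne Recipes notebook mentioned above; net is assumed to be a dict of layers with the output layer under a "prob" key, so the names are placeholders rather than part of the paper:

import lasagne

relu = lasagne.nonlinearities.rectify
modified_relu = ZeilerBackprop(relu)  # ModifiedBackprop stores the nonlinearity and overrides its gradient

# find all layers that use the plain ReLu and swap in the modified one
relu_layers = [layer for layer in lasagne.layers.get_all_layers(net["prob"])
               if getattr(layer, "nonlinearity", None) is relu]
for layer in relu_layers:
    layer.nonlinearity = modified_relu  # forward result is identical, only the backward pass changes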

Reverse convolution

It is a little more complicated here, but everything is logical: it is enough to apply the transposed version of the same convolution kernel, but to the outputs of the reverse ReLu instead of the previous layer's outputs used in the forward pass. I'm afraid this is not so obvious in words, so let's look at a visualization of the procedure (you can find even more visualizations of convolutions elsewhere).


Convolution with stride = 1

Convolution with stride = 1 Reverse version

Convolution with stride = 2

Convolution with stride = 2 Reverse version

Reverse pooling

This operation (unlike the previous ones) is, generally speaking, not invertible. But we would still like to pass through the maximum somehow during the backward pass. For this, the authors suggest using a map of where the maximum was during the forward pass (max location switches). During the backward pass, the input signal is unpooled so as to approximately preserve the structure of the original signal; this is really easier to see than to describe.



Result

The visualization algorithm is extremely simple:

  1. Make a forward pass.
  2. Select the layer we are interested in.
  3. Fix the activation of one or several neurons and zero out the rest.
  4. Make the backward pass.

Each gray square in the image below corresponds to a rendering of a filter (which is used for the convolution) or of the weights of one neuron, and each color image is the part of the original image that activates the corresponding neuron. For clarity, neurons within one layer are grouped into thematic groups. In general, it suddenly turned out that the neural network learns exactly what Hubel and Wiesel described in their work on the structure of the visual system, for which they were awarded the Nobel Prize in 1981. Thanks to this paper, we got a visual representation of what a convolutional neural network learns at each layer. It is this knowledge that would later allow manipulating the content of the generated image, but that was still a long way off; the next few years were spent improving the methods of "trepanning" neural networks. In addition, the authors proposed a way to analyze how best to build the architecture of a convolutional neural network to achieve better results (they did not win ImageNet 2013, but they made it to the top; UPD: it turns out they did win, Clarifai is them).


Feature visualization


Here is an example of visualizing activations with deconvnet; today this result looks so-so, but back then it was a breakthrough.


Saliency Maps using deconvnet

Deep Inside Convolutional Networks: Visualizing Image Classification Models and Saliency Maps (19 Apr 2014)

This article is devoted to methods of visualizing the knowledge contained in a convolutional neural network. The authors propose two visualization methods, both based on gradient descent.

Class Model Visualization

So, imagine that we have a neural network trained to solve a classification problem over a certain number of classes. Let us denote by S_c(I) the activation value of the output neuron that corresponds to class c for an input image I. Then the following optimization problem gives us exactly the image that maximizes the selected class: find the image I that maximizes S_c(I), possibly with an L2 penalty on I subtracted.



This problem is easy to solve using Theano. Usually we ask the framework to take the derivative with respect to the model parameters, but this time we assume the parameters are fixed and take the derivative with respect to the input image. The following function selects the maximum value of the output layer and returns a function that computes its derivative with respect to the input image.


import theano
import theano.tensor as T
import lasagne

def compile_saliency_function(net):
    """Compiles a function to compute the saliency maps and predicted classes
    for a given minibatch of input images."""
    inp = net["input"].input_var
    outp = lasagne.layers.get_output(net["fc8"], deterministic=True)
    max_outp = T.max(outp, axis=1)
    saliency = theano.grad(max_outp.sum(), wrt=inp)
    max_class = T.argmax(outp, axis=1)
    return theano.function([inp], [saliency, max_class])

You have probably seen weird dog faces on the internet - that's DeepDream. In the original article, the authors use the following process to generate images that maximize the selected class (a minimal sketch of the loop follows the list):

  1. Initialize the initial image with zeros.
  2. Calculate the value of the derivative with respect to this image.
  3. Update the image by adding the resulting derivative image to it.
  4. Return to step 2 or exit the loop.
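
A minimal sketch of this loop, reusing compile_saliency_function from above (note that it maximizes whichever class is currently strongest rather than one fixed class; the input shape, step size and iteration count are assumptions):

import numpy as np

saliency_fn = compile_saliency_function(net)   # d(max class score)/d(input), plus the predicted class

# step 1: a zero image of the network's input size (224x224 is an assumption for VGG-like nets)
x = np.zeros((1, 3, 224, 224), dtype=np.float32)

# steps 2-4: repeatedly add the derivative of the class score to the image
for _ in range(100):                           # the iteration count is arbitrary
    grad, _ = saliency_fn(x)
    x += 0.9 * np.asarray(grad)                # plain gradient ascent; the step size is arbitrary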

The resulting images are:




What if we initialize the first image with a real photo and start the same process? But at each iteration we will choose a random class, zero out the rest and calculate the value of the derivative - then we get something like a deep dream.


Caution 60 MB


Why are there so many dog faces and eyes? It's simple: among ImageNet's 1000 classes, almost 200 are dogs, and dogs have eyes. There are also many classes in which people simply appear.

Class Saliency Extraction

If we initialize this process with a real photo, stop after the first iteration and draw the value of the derivative, we get an image which, when added to the original, increases the activation value of the selected class.
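
A hedged sketch of this single step, again reusing compile_saliency_function (img is assumed to be an already preprocessed batch of shape (1, 3, H, W)):

import numpy as np
import matplotlib.pyplot as plt

saliency_fn = compile_saliency_function(net)
saliency, max_class = saliency_fn(img)            # derivative of the top class score w.r.t. the input

# collapse the colour channels by taking the absolute maximum over them
saliency_map = np.abs(saliency[0]).max(axis=0)

plt.imshow(saliency_map, cmap="gray")
plt.title("predicted class: {}".format(max_class[0]))
plt.show()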


Saliency Maps using derivative


Again the result is "so-so". It is important to note that this is a new way of visualizing activations (nothing prevents us from fixing the activation values not on the last layer but on any layer of the network and taking the derivative with respect to the input image). The next article will combine both of the previous approaches and give us a tool for tuning style transfer, which will be described later.

Striving for Simplicity: The All Convolutional Net (13 Apr 2015)

Generally speaking, this article is not about visualization but about the fact that replacing pooling with a convolution with a large stride does not lead to a loss of quality. But as a by-product of their research, the authors proposed a new way to visualize features, which they used to analyze more precisely what the model learns. Their idea is as follows: if we simply take the derivative, then the features that were negative on the input do not pass back (because of the ReLu applied to the input), and this leads to negative values appearing in the back-propagated image. On the other hand, if you use deconvnet, then another ReLu is applied to the derivative of the ReLu - this prevents negative values from being passed back, but as you saw, the result is "so-so". What if we combine these two methods?




class GuidedBackprop(ModifiedBackprop):
    def grad(self, inputs, out_grads):
        (inp,) = inputs
        (grd,) = out_grads
        dtype = inp.dtype
        # pass the gradient back only where both the forward input and the
        # gradient itself are positive (plain backprop and deconvnet combined)
        return (grd * (inp > 0).astype(dtype) * (grd > 0).astype(dtype),)

Then you get a completely clean and interpretable image.


Saliency Maps Using Guided Backpropagation

Go deeper

Now let's think about what this gives us. Let me remind you that each convolutional layer is a function that takes a three-dimensional tensor as input and also outputs a three-dimensional tensor, perhaps of a different dimension d x w x h; depth is the number of neurons in the layer, each of which generates a feature map of size width x height.


Let's try the following experiment on a VGG-19 network:



conv1_2

You can see almost nothing, because the receptive field is very small: this is the second 3x3 convolution, so the total receptive field is 5x5. But if you zoom in, you can see that the feature is just a gradient detector.




conv3_3


conv4_3


conv5_3


pool5


Now let's imagine that instead of the maximum over a feature map we take the derivative of the sum of all its elements with respect to the input image. Then the receptive field of the group of neurons obviously covers the entire input image. For the early layers we will see bright maps, from which we conclude that these are detectors of colors, then of gradients, then of edges, and so on towards increasingly complicated patterns. The deeper the layer, the dimmer the image. This is explained by the fact that deeper layers detect more complex patterns, and a complex pattern occurs less often than a simple one, so the activation map fades. The first method is better suited for understanding layers with complex patterns, and the second one for simple ones.
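
A hedged Theano sketch of this second method: instead of a single neuron we take the derivative of the sum of one whole feature map with respect to the input image (the layer key and channel index are placeholders):

import theano
import lasagne

def compile_channel_saliency(net, layer_name, channel):
    """Derivative of the sum of one whole feature map with respect to the input image."""
    inp = net["input"].input_var
    feats = lasagne.layers.get_output(net[layer_name], deterministic=True)
    channel_sum = feats[:, channel].sum()    # sum of all elements of the chosen feature map
    grad = theano.grad(channel_sum, wrt=inp)
    return theano.function([inp], grad)

# e.g. the first feature map of a hypothetical "conv3_3" layer key:
# saliency = compile_channel_saliency(net, "conv3_3", 0)(img)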


conv1_1


conv2_2


conv4_3


You can download a more complete database of activations for several images.

A Neural Algorithm of Artistic Style (2 Sep 2015)

So, a couple of years have passed since the first successful "trepanning" of a neural network. We (in the sense of humanity) have a powerful tool in our hands that allows us to understand what a neural network learns, and also to remove what we would rather it did not learn. The authors of this paper develop a method that makes one image produce activation maps similar to those of some target image, and perhaps even more than one - this is the basis of stylization. We feed white noise to the input and, by an iterative process similar to the one in deep dream, we bring this image to one whose feature maps are similar to those of the target image.

Content loss

As already mentioned, each layer of the neural network produces a three-dimensional tensor of some dimension.




Let us denote the output of the i-th layer for an input image x as F^i(x). Then, if we minimize the weighted sum of the residuals between the layer outputs for the input image x and for some content image c that we are aiming for, we get exactly what we need. Probably. Roughly, the content loss on a chosen layer is one half of the sum of squared element-wise differences between F^i(x) and F^i(c).
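
For concreteness, here is a sketch of such a content loss in the spirit of the Lasagne style-transfer recipe, where P and X are assumed to be dictionaries of precomputed layer outputs for the content photo and for the generated image:

def content_loss(P, X, layer):
    p = P[layer]   # features of the content photo on the chosen layer
    x = X[layer]   # features of the generated image on the same layer
    # half the sum of squared residuals between the two feature tensors
    return 1. / 2 * ((x - p) ** 2).sum()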



To experiment with this paper, you can use this magic notebook; the computations there run both on the GPU and on the CPU. The GPU is used to calculate the features of the neural network and the value of the cost function. Theano gives us a function, eval_grad, that computes the gradient of the objective with respect to the input image x. This is then fed into lbfgs and an iterative process is started.
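
eval_loss and eval_grad are thin wrappers around compiled Theano functions; here is a hedged sketch of how they might be assembled (total_loss is the objective built later in the Combo Loss section, while floatX, generated_image and IMAGE_W come from the notebook):

import numpy as np
import theano
import theano.tensor as T

# symbolic gradient of the objective with respect to the generated image (a shared variable)
grad = T.grad(total_loss, generated_image)

f_loss = theano.function([], total_loss)
f_grad = theano.function([], grad)

def eval_loss(x0):
    generated_image.set_value(floatX(x0.reshape((1, 3, IMAGE_W, IMAGE_W))))
    return f_loss().astype("float64")

def eval_grad(x0):
    generated_image.set_value(floatX(x0.reshape((1, 3, IMAGE_W, IMAGE_W))))
    return np.array(f_grad()).flatten().astype("float64")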


# Initialize with a noise image
generated_image.set_value(floatX(np.random.uniform(-128, 128, (1, 3, IMAGE_W, IMAGE_W))))
x0 = generated_image.get_value().astype("float64")
xs = []
xs.append(x0)

# Optimize, saving the result periodically
for i in range(8):
    print(i)
    scipy.optimize.fmin_l_bfgs_b(eval_loss, x0.flatten(), fprime=eval_grad, maxfun=40)
    x0 = generated_image.get_value().astype("float64")
    xs.append(x0)

If we run the optimization of such a function, then we will quickly get an image similar to the target one. Now we are able to recreate images from white noise that look like some content image.


Content Loss: conv4_2



Optimization process




It is easy to see two features of the resulting image:

  • colors are lost - a result of the fact that in this specific example only the conv4_2 layer was used (in other words, its weight w was nonzero, while the weights of the other layers were zero); as you remember, it is the early layers that contain information about colors and gradient transitions, while the later ones contain information about larger details, which is exactly what we observe: the colors are lost, but the content is not;
  • some houses "floated away", that is, straight lines became slightly curved - this is because the deeper the layer, the less information it contains about the spatial position of a feature (a consequence of using convolutions and poolings).

Adding early layers immediately corrects the color situation.


Content Loss: conv1_1, conv2_1, conv4_2


Hopefully by this point, you feel like you can control what gets redrawn onto the white noise image.

Style loss

And now we get to the most interesting part: how do we convey the style? What is style? Obviously, style is not what we optimized in the Content Loss, because that contains a lot of information about the spatial positions of features. So the first thing to do is to somehow remove this information from the representations obtained on each layer.


The author suggests the following method. Take the tensor at the output of a certain layer, flatten it along the spatial coordinates and compute the covariance matrix between the feature maps. Let us denote this transformation as G. What have we actually done? We could say that we have computed how often features within the layer co-occur in pairs, or in other words, that we have approximated the distribution of features across the feature maps with a multivariate normal distribution.




Then the Style Loss is introduced as follows, where s is some style image: on each chosen layer we take the (suitably normalized) squared difference between G computed for the generated image and G computed for s, and sum these terms over the layers with weights.
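
A hedged sketch of G and of the resulting style loss, again in the spirit of the Lasagne style-transfer recipe (the 1 / (4 N^2 M^2) normalization follows the original paper; A and X are dictionaries of layer outputs as before):

import theano.tensor as T

def gram_matrix(x):
    # x has shape (1, channels, height, width); flatten the spatial coordinates
    x = x.flatten(ndim=3)
    # pairwise co-occurrence of the feature maps: essentially a (channels x channels) matrix
    return T.tensordot(x, x, axes=([2], [2]))

def style_loss(A, X, layer):
    a = A[layer]   # features of the style image s
    x = X[layer]   # features of the generated image
    G_style = gram_matrix(a)
    G_gen = gram_matrix(x)
    N = a.shape[1]                 # number of feature maps
    M = a.shape[2] * a.shape[3]    # size of each feature map
    return 1. / (4 * N ** 2 * M ** 2) * ((G_gen - G_style) ** 2).sum()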



Shall we try it for Vincent? In principle, we get something expected - noise in the style of Van Gogh; information about the spatial arrangement of features is completely lost.


Vincent




But what if we put a photograph instead of a style image? We get familiar features and familiar colors, but the spatial position is completely lost.


Photo at style loss


Surely you have wondered why we compute the covariance matrix and not something else. After all, there are many ways to aggregate features so that spatial coordinates are lost. This is indeed an open question, and if you take something very simple, the result will not change dramatically. Let's check this: instead of the covariance matrix we will compute simply the mean value of each feature map.
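
A sketch of this simplest possible variant, replacing the covariance matrix with the per-feature-map mean (same call signature as style_loss above):

def simple_style_loss(A, X, layer):
    a = A[layer]
    x = X[layer]
    # keep only one number per feature map: its mean over the spatial coordinates
    return ((a.mean(axis=[2, 3]) - x.mean(axis=[2, 3])) ** 2).sum()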




simple style loss

Combo Loss

Naturally, there is a desire to mix these two cost functions. Then we will generate from white noise an image that preserves the features of the content image (which are tied to spatial coordinates) and also contains "style" features that are not tied to spatial coordinates; that is, hopefully the details of the content image remain intact but are redrawn in the style we want.



In fact, there is also a regularizer, but we will omit it for simplicity. It remains to answer the following question: which layers (and weights) should be used for optimization? I'm afraid I have no answer to this question, and neither do the authors of the paper. They suggest using the layers below, but that does not mean at all that another combination would work worse; the search space is simply too large. The only rule that follows from understanding the model is that there is no point in taking adjacent layers, since their features will not differ much from each other, which is why one layer from each conv*_1 group is added to the style.


# Define loss function
losses = []

# content loss
losses.append(0.001 * content_loss(photo_features, gen_features, "conv4_2"))

# style loss
losses.append(0.2e6 * style_loss(art_features, gen_features, "conv1_1"))
losses.append(0.2e6 * style_loss(art_features, gen_features, "conv2_1"))
losses.append(0.2e6 * style_loss(art_features, gen_features, "conv3_1"))
losses.append(0.2e6 * style_loss(art_features, gen_features, "conv4_1"))
losses.append(0.2e6 * style_loss(art_features, gen_features, "conv5_1"))

# total variation penalty
losses.append(0.1e-7 * total_variation_loss(generated_image))

total_loss = sum(losses)

The final model can be represented as follows.




And here is the result for the houses in the style of Van Gogh.



Trying to control the process

Let's recall the previous parts: as early as two years before this paper, other scientists were researching what a neural network really learns. Armed with all these articles, you can generate visualizations of the features of different styles, different images, different resolutions and sizes, and try to figure out which layers to take with which weights. But even re-weighting the layers does not give complete control over what is happening. The problem here is more conceptual: we are optimizing the wrong function! How so, you ask? The answer is simple: this function minimizes the residual... you get the idea. But what we really want is for us to like the image. The convex combination of the content and style loss functions is not a measure of what our minds consider beautiful. It has been noticed that if you continue stylizing for too long, the cost function naturally drops lower and lower, but the aesthetic beauty of the result drops sharply.




Well, okay, there is another problem. Let's say we found a layer that extracts the features we need - say, some triangular textures. But this layer still contains many other features, for example circles, which we really do not want to see in the resulting image. Generally speaking, if one could hire a million annotators, it would be possible to visualize all the features of the style image, mark by brute force the ones we need, and include only them in the cost function. But for obvious reasons it is not that easy. What if we simply remove from the style image any circles we don't want to see in the result? Then the activation of the corresponding neurons that react to circles simply will not fire, and of course nothing of the sort will appear in the resulting picture.

It's the same with colors. Imagine a vivid image with a large number of colors. The distribution of colors will be very smeared over the entire space, and so will the distribution of colors in the resulting image, but during the optimization process the peaks that were present in the original will most likely be lost. It turned out that simply decreasing the bit depth of the color palette solves this problem: the density of most colors becomes near zero, and there are large peaks in a few areas. Thus, by manipulating the original in Photoshop, we are manipulating the features that are extracted from the image. It is easier for a person to express their wishes visually than to formulate them in the language of mathematics. For now, at least. As a result, designers and managers, armed with Photoshop and scripts for visualizing features, achieved results about three times faster than the mathematicians and programmers did.
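
A hedged sketch of that palette trick with Pillow instead of Photoshop (file names are placeholders):

from PIL import Image

# posterize the style image so that a few colours dominate its histogram
style = Image.open("style.jpg")                         # placeholder file name
posterized = style.quantize(colors=16).convert("RGB")   # 16 colours instead of millions
posterized.save("style_quantized.jpg")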


An example of manipulating the color and size of features


Or you can take a simple image as a style.



results








And here is a short video, but only with the required texture.

Texture Networks: Feed-forward Synthesis of Textures and Stylized Images (10 Mar 2016)

It would seem that we could have stopped there, if not for one nuance. The stylization algorithm above takes a very long time to run. If you take an implementation where lbfgs runs on the CPU, the process takes about five minutes. If you rewrite it so that the optimization runs on the GPU, the process takes 10-15 seconds. This is still no good. Perhaps the authors of this and the next article thought about the same thing. Both publications came out independently, 17 days apart, almost a year after the previous paper. The authors of the current article, like the authors of the next one, were working on texture generation (if you zero out the Content Loss, that is roughly what you get). They proposed optimizing not the image obtained from white noise, but a neural network that generates the stylized image.




Now, since the stylization process involves no optimization, only a forward pass needs to be done, and optimization is required only once, to train the generator network. This article uses a hierarchical generator in which each next z is larger in size than the previous one; z is sampled from noise in the case of texture generation, and from some base of images when training the stylizer. It is critical to use something other than the training part of ImageNet, since the features inside the Loss network are computed by a network trained precisely on that training part.



Perceptual Losses for Real-Time Style Transfer and Super-Resolution (27 Mar 2016)

As the name implies, the authors, who were only 17 days late with the idea of a generating network, were also busy increasing image resolution. They seem to have been inspired by the success of residual learning on the latest ImageNet.




The residual block and the conv block, respectively.



Thus, in addition to control over stylization, we now have a fast generator in our hands (thanks to these two papers, the generation time for one image is measured in tens of milliseconds).

Ending

We used the information from the reviewed papers and the authors' code as a starting point for creating another stylization app - the first app for stylizing video:



It generates something like this.


In the most ordinary photographs, numerous and not quite distinguishable entities appear - most often, for some reason, dogs. Such images began to fill the Internet in June 2015, when Google's DeepDream was launched - one of the first open services based on neural networks for image processing.

It happens approximately like this: the algorithm analyzes photographs, finds fragments in them that remind it of some familiar objects - and distorts the image in accordance with this data.

At first, the project was published as open source, and then online services built on the same principles appeared on the Internet. One of the most convenient and popular is Deep Dream Generator: processing a small photo there takes only about 15 seconds (previously, users had to wait more than an hour).

How do neural networks learn to create such images? And why, by the way, are they called that?

By their design, neural networks imitate real neural networks of a living organism, but they do it using mathematical algorithms. Once you've created a basic structure, you can train it using machine learning techniques. If we are talking about pattern recognition, then thousands of images need to be passed through the neural network. If the task of the neural network is different, then the training exercises will be different.

Algorithms for playing chess, for example, analyze chess games. In the same way, Google DeepMind's AlphaGo algorithm learned the Chinese game of go - which was seen as a breakthrough, since go is much more complex and non-linear than chess.

    You can play around with a simplified neural network model and better understand its principles.

    YouTube also has a series of easy-to-follow videos about how neural networks work.

Another popular service is Dreamscope, which can not only dream of dogs, but also imitate various painting styles. Image processing here is also very simple and fast (about 30 seconds).

Apparently, the algorithmic part of the service is a modification of the "Neural style" program, which we have already discussed.

More recently, a program has appeared that realistically colorizes black-and-white images. In previous versions, similar programs did their job far less well, and it was considered a great achievement if even 20% of people could not distinguish a real picture from a computer-colored one.

Moreover, coloring here takes only about 1 minute.

The same developer company also launched a service that recognizes different types of objects in pictures.

These services may seem like just fun entertainment, but in reality, everything is much more interesting. New technologies are entering the practice of human artists and are changing our understanding of art. People will likely have to compete with machines in the realm of creativity soon.

Teaching image recognition algorithms is a task that artificial intelligence developers have been struggling with for a long time. Therefore, programs that color old pictures and paint dogs in the sky can be considered part of a larger and more intriguing process.
