Mohd Aquib

Neural Style Transfer Using PyTorch

Neural Style Transfer is an optimization technique that takes two images, a content image and a style reference image, and blends them so that the output looks like the content image painted in the style of the style reference image.
It is a deep learning technique for generating artistic images: the structural features are extracted from the content image, while the style features are extracted from the style image.

Define content and style representations

Deep convolutional neural networks build up representations of an image layer by layer. As we move deeper into the network, the representations increasingly capture structural features rather than exact pixel values. Reconstruction from the lower layers reproduces the image almost exactly. In contrast, reconstruction from the higher layers captures the high-level content, and hence we refer to the feature responses from the higher layers as the content representation.

To extract the style representation, we build a feature space on top of the filter responses in each network layer. It consists of the correlations between the different filter responses over the spatial extent of the feature maps. The filter correlations of different layers capture the texture information of the input image. This creates images that match a given image's style on an increasing scale while discarding information about the global arrangement. This multi-scale representation is called the style representation.
This can be easily understood by the diagram given below.


The architecture above is the model proposed in the paper “A Neural Algorithm of Artistic Style”. Here we will use a pre-trained VGG-19 model for content and style reconstruction. By combining the structural information from the content representation with the texture information from the style representation, we generate an artistic image. A strong emphasis on style results in images that match the artwork's appearance, effectively giving a texturized version of it, but show hardly any of the photograph’s content. With a strong emphasis on content, one can identify the photograph, but the painting style is not as well matched. We perform gradient descent on the generated image to find an image whose feature responses match those of the original images.


You can install PyTorch by following the instructions on the official PyTorch website (pytorch.org).

Importing Libraries

import torch
import torch.nn as nn
import torchvision
import torchvision.models as models
import torchvision.transforms as transforms
import torch.optim as optim
from torchvision.utils import save_image
from PIL import Image
import matplotlib.pyplot as plt

Loading Model

We will use the VGG-19 model from torchvision.models.
VGG-19 is a convolutional neural network that is 19 layers deep. You can load a pre-trained version of the network trained on more than a million images from the ImageNet database. The pretrained network can classify images into 1000 object categories, such as a keyboard, mouse, pencil, and many animals. As a result, the network has learned rich feature representations for a wide range of images. The network has an image input size of 224-by-224.


  • The network takes a fixed-size 224 × 224 RGB image as input, i.e. an array of shape (224, 224, 3).
  • It uses 3 × 3 kernels with a stride of 1 pixel, the smallest kernel size that still captures the notion of left/right, up/down and centre.
  • Spatial padding is used to preserve the spatial resolution of the image after each convolution.
  • Max pooling is performed over 2 × 2 pixel windows with stride 2.
  • Each hidden layer is followed by a rectified linear unit (ReLU), which introduces non-linearity so that the model classifies better and is cheap to compute.
  • Three fully connected layers follow the convolutional stack: the first two have 4096 channels each, the third has 1000 channels for 1000-way ILSVRC classification, and the final layer is a softmax.
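
The effect of the padding and pooling choices listed above on the spatial size can be checked with a little arithmetic. The sketch below is an illustration only: the helper names conv_out and pool_out are made up for this post, and it assumes the standard output-size formula floor((n + 2p - k) / s) + 1 with a padding of 1 for the 3 × 3 convolutions.

```python
def conv_out(n, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution (standard output-size formula)."""
    return (n + 2 * padding - kernel) // stride + 1

def pool_out(n, window=2, stride=2):
    """Spatial size after max pooling without padding."""
    return (n - window) // stride + 1

size = 224
sizes_after_pool = []
for block in range(5):          # VGG-19 has five max-pooling stages
    size = conv_out(size)       # a 3x3 conv with stride 1 and padding 1 keeps the size
    size = pool_out(size)       # a 2x2 pool with stride 2 halves it
    sizes_after_pool.append(size)

print(sizes_after_pool)  # [112, 56, 28, 14, 7]
```

Each block contains several convolutions, but since every one of them preserves the size, applying conv_out once per block is enough for this check.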

The layers of the VGG-19 model are shown below.
[Figure: layers of the VGG-19 model]

Now let's load the model

model = models.vgg19(pretrained=True)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

This selects the GPU if one is available and falls back to the CPU otherwise. The model and the images are moved to this device later.

Feature Representations

We will define a class that returns the feature representations of the intermediate layers, since these layers act as complex feature extractors. We use the layers 'block_conv1_1', 'block_conv2_1', 'block_conv3_1', 'block_conv4_1' and 'block_conv5_1', whose index values are 0, 5, 10, 19 and 28 respectively, store the activations of these 5 convolutional layers in a list and return the list.

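class VGG(nn.Module):
    def __init__(self):
        super(VGG, self).__init__()
        # Indices of the five convolutional layers whose activations we keep
        self.req_features = ['0','5','10','19','28']
        # Keep only the layers up to and including index 28
        self.model = models.vgg19(pretrained=True).features[:29]

    def forward(self,x):
        features = []
        for layer_num,layer in enumerate(self.model):
            x = layer(x)
            # Store the activation if this is one of the required layers
            if(str(layer_num) in self.req_features):
                features.append(x)
        return features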

Image Preprocessing

Preprocessing is required to make an image suitable for the model.
We will perform the preprocessing with torchvision.transforms: resizing the image and converting it into a tensor.
We will define a function that takes the path of the image as its argument and returns the preprocessed image.

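def image_loader(path):
    # Open the image from disk with PIL
    image = Image.open(path)
    # Resize to 512x512 and convert to a tensor with values in [0, 1]
    loader = transforms.Compose([transforms.Resize((512,512)), transforms.ToTensor()])
    # Add the batch dimension and move the tensor to the selected device
    image = loader(image).unsqueeze(0)
    return image.to(device, torch.float)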

unsqueeze(0) adds an extra dimension at index 0 for the batch size.

Now, use the image_loader function to load the style and the content image from the local disk. We will use a clone of the content image as the input base image, i.e. the generated image. Since gradient descent will alter the generated image's pixel values, we pass True to requires_grad_().

original_image = image_loader('/content/mountain.jpg')
style_image = image_loader('/content/style.jpg')
generated_image = original_image.clone().requires_grad_(True)
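
A minimal sketch of why the clone is used: gradients are tracked only for the tensor on which requires_grad_(True) was called, and the clone does not share memory with the original. The random tensor and the toy loss below are placeholders, not the real images or losses.

```python
import torch

original = torch.rand(1, 3, 4, 4)                    # stand-in for the content image
generated = original.clone().requires_grad_(True)    # independent copy that tracks gradients

toy_loss = (2 * generated).sum()                     # placeholder loss
toy_loss.backward()

print(original.requires_grad)    # False: the content image stays fixed
print(generated.grad.unique())   # tensor([2.]): every pixel receives a gradient

# Updating the copy in place leaves the original untouched
with torch.no_grad():
    generated -= 0.1 * generated.grad
print(torch.equal(original, generated))  # False
```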

Loss Functions

Here we will describe two loss functions: 1. Content Loss and 2. Style Loss.
The content loss function ensures that the activations of the higher layers are similar between the content image and the generated image. The style loss function ensures that the correlations between the activations in all the layers are similar between the style image and the generated image.

Content Loss Function
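The content image and the generated image are passed through the model, and the outputs of the intermediate layers are extracted using the VGG class defined above. We then compute the squared Euclidean distance between the feature maps of the generated image and the content image. The content loss for layer l is therefore
$$L_{content}^{l} = \frac{1}{2} \sum_{i,j} \left(F^l_{ij} - P^l_{ij}\right)^2$$
where Fˡ and Pˡ are the feature maps of the generated image and the content image in layer l. The code below uses the mean instead of half the sum, which only rescales the loss.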

def calc_content_loss(gen_feat,orig_feat):
    content_l = torch.mean((gen_feat - orig_feat)**2)
    return content_l
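
As a quick sanity check, the content loss can be evaluated by hand on tiny tensors. The values below are made up for illustration; the function is the same as calc_content_loss above, repeated so that the snippet runs on its own.

```python
import torch

def calc_content_loss(gen_feat, orig_feat):
    # Mean squared difference between the two feature maps
    return torch.mean((gen_feat - orig_feat) ** 2)

gen_feat = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
orig_feat = torch.ones(2, 2)

# Differences are 0, 1, 2, 3, so the squares are 0, 1, 4, 9 and the mean is 14 / 4
loss = calc_content_loss(gen_feat, orig_feat)
print(loss.item())  # 3.5
```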

Style Loss Function
To calculate the style loss we need to compute the Gram matrix. A Gram matrix is the product of a matrix with its own transpose.

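The Gram matrix represents the correlation between each pair of filters in an intermediate representation:
$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$$

In the above equation, Gˡᵢⱼ is the inner product between the vectorized feature maps i and j in layer l, and Fˡᵢₖ is the activation of filter i at position k in layer l.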
The following diagram shows how the Gram matrix is computed in a CNN layer.
[Figure: Gram matrix computation in a CNN layer]
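
A small worked example may make the Gram matrix more concrete. The helper name gram_matrix and the 2-channel, 2 × 2 feature map are made up for illustration. The second half of the sketch shuffles the pixel positions and shows that the Gram matrix does not change, which is why the style representation discards the global arrangement of the image.

```python
import torch

def gram_matrix(feat):
    # feat has shape (1, C, H, W); flatten each channel into a row vector
    _, c, h, w = feat.shape
    flat = feat.view(c, h * w)
    return torch.mm(flat, flat.t())      # (C, C) matrix of channel inner products

feat = torch.tensor([[[[1.0, 2.0], [3.0, 4.0]],     # channel 0
                      [[1.0, 0.0], [1.0, 0.0]]]])   # channel 1
G = gram_matrix(feat)
print(G)   # rows: [30., 4.] and [4., 2.]

# Shuffle the pixel positions, using the same permutation for every channel
perm = torch.tensor([3, 1, 0, 2])
shuffled = feat.view(1, 2, 4)[:, :, perm].view(1, 2, 2, 2)
print(torch.equal(gram_matrix(shuffled), G))   # True
```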

The style loss of layer l is the squared error between the Gram matrices of the intermediate representations of the generated image and the style image:
$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left(G^l_{ij} - A^l_{ij}\right)^2$$
where Eₗ is the style loss for layer l, and Nₗ and Mₗ are the number of channels and the height times the width of the feature representation of layer l respectively. Gˡᵢⱼ and Aˡᵢⱼ are the Gram matrices of the intermediate representations of the generated image and the style image respectively.
Therefore the overall style loss is
$$L_{style} = \sum_l w_l E_l$$
Here wₗ is the weight with which each layer contributes to the total style loss.
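
The code in this post uses a plain mean over the Gram difference and gives every layer the same weight. For comparison, here is a sketch of the style loss with the normalisation by Nₗ and Mₗ and explicit per-layer weights as described above. The function names layer_style_loss and total_style_loss and the example weights are hypothetical and not part of the original code.

```python
import torch

def layer_style_loss(gen, style):
    """E_l = sum((G - A)^2) / (4 * N^2 * M^2) for one layer (batch size 1)."""
    _, n, h, w = gen.shape
    m = h * w                                   # M_l: height times width
    G = torch.mm(gen.view(n, m), gen.view(n, m).t())
    A = torch.mm(style.view(n, m), style.view(n, m).t())
    return ((G - A) ** 2).sum() / (4 * n ** 2 * m ** 2)

def total_style_loss(gen_feats, style_feats, weights):
    """Weighted sum of the per-layer losses, one weight w_l per layer."""
    return sum(w * layer_style_loss(g, s)
               for w, g, s in zip(weights, gen_feats, style_feats))

gen = torch.ones(1, 2, 2, 2)      # N = 2 channels, M = 4 positions
style = torch.zeros(1, 2, 2, 2)
# G is a 2x2 matrix of 4s and A is all zeros: 4 * 16 / (4 * 4 * 16) = 0.25
print(layer_style_loss(gen, style).item())                               # 0.25
print(total_style_loss([gen, gen], [style, style], [0.5, 0.5]).item())   # 0.25
```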

def calc_style_loss(gen,style):
    # Assumes a batch size of 1, as produced by image_loader
    batch_size,channel,height,width = gen.shape
    # Gram matrices of the generated and the style feature maps
    G = torch.mm(gen.view(channel,height*width),gen.view(channel,height*width).t())
    A = torch.mm(style.view(channel,height*width),style.view(channel,height*width).t())
    style_l = torch.mean((G-A)**2)
    return style_l

Final Loss
The final loss is defined as
$$L_{total} = \alpha L_{content} + \beta L_{style}$$
where α and β are user-defined hyperparameters. By adjusting α and β we control how much content and how much style go into the generated image.

def calculate_loss(gen_features,orig_features,style_features):
    # Accumulate the content and style losses over all selected layers
    style_loss = content_loss = 0
    for gen,con,style in zip(gen_features,orig_features,style_features):
        content_loss += calc_content_loss(gen,con)
        style_loss += calc_style_loss(gen,style)
    # alpha and beta are the global hyperparameters set before training
    total_loss = alpha*content_loss + beta*style_loss
    return total_loss


Before training, we should set our hyperparameters and optimizer.
I have chosen the Adam optimizer, but you can also try the L-BFGS optimizer via optim.LBFGS. Limited-memory BFGS (L-BFGS or LM-BFGS) is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS) using a limited amount of computer memory.
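
If you want to try L-BFGS, note that optim.LBFGS expects a closure that re-evaluates the loss each time step() is called. The sketch below shows the pattern on a toy objective; toy_loss and the random target are placeholders for calculate_loss and the real images, and the settings are not tuned.

```python
import torch
import torch.optim as optim

torch.manual_seed(0)
target = torch.rand(1, 3, 8, 8)                           # stand-in for the image to match
generated = torch.zeros(1, 3, 8, 8, requires_grad=True)   # tensor being optimized

def toy_loss(img):
    # Placeholder for calculate_loss: mean squared distance to the target
    return torch.mean((img - target) ** 2)

optimizer = optim.LBFGS([generated], lr=1.0, max_iter=20)

def closure():
    # L-BFGS may call this several times per step to re-evaluate the loss
    optimizer.zero_grad()
    loss = toy_loss(generated)
    loss.backward()
    return loss

initial_loss = toy_loss(generated).item()
for step in range(5):
    optimizer.step(closure)
final_loss = toy_loss(generated).item()

print(initial_loss, final_loss)
```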

model = VGG().to(device).eval()
epoch = 6000
lr = 0.004
alpha = 8
beta = 70
optimizer = optim.Adam([generated_image],lr=lr)

Now we iterate over the number of epochs with a for loop. In each iteration we extract the feature representations of the intermediate layers for the content, style and generated images using the model, then calculate the loss with the function defined above, i.e. calculate_loss(). We set the gradients to zero using optimizer.zero_grad(), backpropagate the loss using total_loss.backward() and update the pixels of the generated image (gradient descent) using optimizer.step().

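for i in range(epoch):
    # Extract the feature maps of the generated, content and style images
    gen_features = model(generated_image)
    orig_features = model(original_image)
    style_features = model(style_image)

    total_loss = calculate_loss(gen_features,orig_features,style_features)

    # Reset the gradients, backpropagate and update the generated image
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()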


The content and style images are as follows:
[Figure: content image]
[Figure: style image]


The output of Neural Style Transfer on the content and style images is
[Figure: generated image]


So in this blog, we learned how Neural Style Transfer works. We loaded the pre-trained VGG-19 model, preprocessed the images, defined the content and style loss functions, combined them into the total loss and finally ran the optimization to get the artistic image as output.

GitHub Link - Here
