# 3 practical examples for tricking Neural Networks using GA and FGSM

Hi! I’m Przemysław Przybyt from Profil Software, a software house located in Northern Poland, where I work as a Python developer. My interest in AI grew while I was studying reinforcement learning and computer vision. I have a strong inner need to see how things work under the hood, so I wanted to check if I could mess with some well-known image classification models such as CNNs (Convolutional Neural Networks). They are just a bunch of numbers and mathematical operations, so let’s see if we can play with that!

## Image classification

Image classification is a computer vision task that assigns a label to an image according to its visual content. It should not be confused with similar operations such as localization, object detection or segmentation. The image below shows the difference to make sure that everything is clear:

## Experiments’ description

For this article I’ve chosen two algorithms to go through. The first one is a genetic algorithm used for the One Pixel Attack which, as its name suggests, changes only a single pixel value to fool the classification model. The second one is FGSM (Fast Gradient Sign Method), which adds a small amount of noise to an image. The noise is practically invisible to humans but can change the model’s prediction.

## One Pixel Attack

When I was searching the net for ways to fool DNN (Deep Neural Network) models, I came across the very interesting concept of the One Pixel Attack, and I knew I needed to check it out. My intuition was telling me that changing only one pixel in the original image wouldn’t be enough to break all those filters and convolutional layers that do such a great job at object classification.
The only information used to manipulate the input image was the classification output (the probability for each label). To avoid a brute force search, I used a GA (Genetic Algorithm). The idea was simple:

1. Get the true label for a given image.
2. Draw a base population of changed pixels (encoded as xyrgb), where x and y are the position of a pixel and r, g and b are its color components.
3. Do GA magic (crossing, mutation, selection), taking into account the population diversity.
4. End calculations when the probability of the true label drops below 20% or after a certain number of steps without appreciable results.
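The steps above can be sketched roughly as follows. This is only a minimal illustration, not the code used in the experiment: `toy_true_label_probability` is a made-up stand-in for querying the real model, the 8x8 image, the mutation sizes and all parameter names are my assumptions, and the "random newcomers" trick is just one simple way to keep the population diverse.

```python
import math
import random

SIZE = 8  # toy image: SIZE x SIZE pixels, each an (r, g, b) tuple


def toy_true_label_probability(image):
    """Hypothetical stand-in for the model's true-label probability.

    A logistic score over pixel brightness that relies heavily on the four
    central pixels; a real attack would query the CNN here instead.
    """
    lo, hi = SIZE // 2 - 1, SIZE // 2
    logit = -18.9
    for y, row in enumerate(image):
        for x, (r, g, b) in enumerate(row):
            weight = 5.0 if lo <= x <= hi and lo <= y <= hi else 0.1
            logit += weight * (r + g + b) / (3 * 255)
    return 1.0 / (1.0 + math.exp(-logit))


def random_candidate(rng):
    # xyrgb encoding: pixel position followed by its new colour
    return (rng.randrange(SIZE), rng.randrange(SIZE),
            rng.randrange(256), rng.randrange(256), rng.randrange(256))


def apply_candidate(image, candidate):
    x, y, r, g, b = candidate
    perturbed = [list(row) for row in image]
    perturbed[y][x] = (r, g, b)
    return perturbed


def fitness(image, candidate):
    # lower is better: probability still assigned to the true label
    return toy_true_label_probability(apply_candidate(image, candidate))


def crossover(a, b, rng):
    # uniform crossover: every gene is copied from one of the two parents
    return tuple(rng.choice(pair) for pair in zip(a, b))


def mutate(candidate, rng, rate=0.3):
    limits = (SIZE, SIZE, 256, 256, 256)
    genes = list(candidate)
    for i, limit in enumerate(limits):
        if rng.random() < rate:
            step = rng.randint(-1, 1) if i < 2 else rng.randint(-40, 40)
            genes[i] = min(limit - 1, max(0, genes[i] + step))
    return tuple(genes)


def one_pixel_attack(image, pop_size=30, max_steps=200, patience=30,
                     threshold=0.2, fresh=3, seed=0):
    rng = random.Random(seed)
    population = [random_candidate(rng) for _ in range(pop_size)]
    history, stale = [], 0
    for _ in range(max_steps):
        population.sort(key=lambda c: fitness(image, c))
        best, best_prob = population[0], fitness(image, population[0])
        improved = not history or history[-1] - best_prob > 1e-3
        stale = 0 if improved else stale + 1
        history.append(best_prob)
        # stop below the threshold or after too many steps without progress
        if best_prob < threshold or stale >= patience:
            break
        parents = population[:pop_size // 2]  # selection; the best survive
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(pop_size - len(parents) - fresh)]
        # a few random newcomers keep the population diverse
        newcomers = [random_candidate(rng) for _ in range(fresh)]
        population = parents + children + newcomers
    return best, best_prob, history


image = [[(200, 200, 200)] * SIZE for _ in range(SIZE)]
best, prob, history = one_pixel_attack(image)
print("original probability:", toy_true_label_probability(image))
print("best pixel (x, y, r, g, b):", best, "-> probability:", prob)
```

Because the best half of the population always survives, the best probability can never get worse from one generation to the next. That makes the "no appreciable results" stopping rule easy to implement.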

For the experiments I used a model based on the VGG16 architecture for the CIFAR-10 dataset with pretrained weights (https://github.com/geifmany/cifar-vgg). Using pretrained weights rules out the influence of a potentially badly trained model. The sample code below gives you a kick-start for training your own models on that dataset:

``````
# cifar10 dataset preparation
from keras.datasets import cifar10
from keras.utils import to_categorical

cifar_10_categories = {
    0: 'airplane',
    1: 'automobile',
    2: 'bird',
    3: 'cat',
    4: 'deer',
    5: 'dog',
    6: 'frog',
    7: 'horse',
    8: 'ship',
    9: 'truck',
}

(X_train, y_train), (X_test, y_test) = cifar10.load_data()
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
# one-hot encode the labels (10 classes)
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
# training and evaluation goes here
...
``````

The attack worked surprisingly well: for almost 20% of the images, changing only one pixel led to misclassification.

## FGSM

Another method I found is FGSM (Fast Gradient Sign Method), which is conceptually very simple but still very effective. Without getting too deep into the technical details, the method computes, for a given image, the gradient of the loss for the true label with respect to the input pixels. This gradient tells us in which direction each pixel should change to make the true label less likely.

For an untargeted approach, the next step is just to add the sign of the gradient (-1, 0 or 1 for each pixel component) to the image to push the model away from the correct prediction. Some studies also use a parameter called epsilon, which is a multiplier for the sign value, but in this experiment we considered images represented by integer RGB values, so each step changes a pixel component by at most 1. This step can be repeated a few times to get satisfying results.
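To make the untargeted step concrete, here is a minimal sketch on a toy problem. It assumes a made-up 3-class linear softmax "model" over a 4-value image (`WEIGHTS` and all function names are hypothetical), so the gradient can be written by hand. With a real network the gradient would come from the framework instead.

```python
import math

# Hypothetical 3-class linear "model" over a 4-value image: logits = W @ x
WEIGHTS = [
    [0.010, -0.004, 0.006, 0.002],
    [-0.006, 0.008, 0.002, 0.004],
    [0.002, 0.004, -0.006, 0.008],
]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(pixels):
    logits = [sum(w * p for w, p in zip(row, pixels)) for row in WEIGHTS]
    return softmax(logits)


def loss_gradient(pixels, label):
    """Gradient of the cross-entropy loss for `label` w.r.t. each pixel.

    For softmax + cross-entropy on a linear model this is W^T (p - onehot).
    """
    probs = predict(pixels)
    errors = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    return [sum(errors[k] * WEIGHTS[k][i] for k in range(len(WEIGHTS)))
            for i in range(len(pixels))]


def sign(value):
    return (value > 0) - (value < 0)


def fgsm_step(pixels, label, epsilon=1):
    # untargeted: add the sign of the gradient to increase the loss
    grad = loss_gradient(pixels, label)
    return [min(255, max(0, p + epsilon * sign(g))) for p, g in zip(pixels, grad)]


def fgsm_attack(pixels, label, steps):
    for _ in range(steps):
        pixels = fgsm_step(pixels, label)
    return pixels


original = [200, 50, 120, 30]
adversarial = fgsm_attack(original, label=0, steps=5)
print("true-label probability before:", predict(original)[0])
print("true-label probability after: ", predict(adversarial)[0])
```

With `epsilon=1` every pixel component moves by at most one integer value per step, which matches the integer RGB setting described above.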

Another approach is a targeted attack, which differs in the way the gradient is calculated. For this type of attack the loss is computed against the target label (not the true label). The sign of its gradient is then subtracted from the image to move the classification closer to the target. Easy, isn’t it? I’ve pasted some sample code below to make it easier to understand.

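``````
# sample code that calculates the gradients and updates an image
import keras.backend as K
from keras import losses

sess = K.get_session()
...
# model, image, epsilon, base_class, target_class and num_classes
# come from the elided part above

# label the loss is computed against: the target class for a targeted
# attack, otherwise the true (base) class
target = K.one_hot(target_class if target_class is not None else base_class, num_classes)


def get_image_update_function(target_class):
    def targeted(img, delta):
        # move towards the target class: decrease the loss
        return img - epsilon * delta

    def non_targeted(img, delta):
        # move away from the true class: increase the loss
        return img + epsilon * delta

    if target_class is not None:
        return targeted
    return non_targeted


update_fun = get_image_update_function(target_class)
# calculate delta - difference noise (sign of the loss gradient w.r.t. the input)
loss = losses.categorical_crossentropy(target, model.output)
delta_op = K.sign(K.gradients(loss, model.input)[0])
delta = sess.run(delta_op, feed_dict={model.input: image})
# update image
image = update_fun(image, delta)
``````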
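In symbols, the two update rules can be written as below. The notation is my own shorthand and is not taken from the code: $x$ is the image, $f$ the model, $J$ the categorical cross-entropy loss, $y$ the true label, $t$ the target label and $\epsilon$ the step size.

```latex
\begin{aligned}
\text{untargeted:}\quad & x_{\text{adv}} = x + \epsilon \cdot \operatorname{sign}\bigl(\nabla_x J(f(x), y)\bigr) \\
\text{targeted:}\quad   & x_{\text{adv}} = x - \epsilon \cdot \operatorname{sign}\bigl(\nabla_x J(f(x), t)\bigr)
\end{aligned}
```

The untargeted rule climbs the loss of the true label, while the targeted rule descends the loss of the chosen target label.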

The model used in this experiment is ResNet18 with ImageNet weights. The sample code that loads it (using image-classifiers==0.2.2) is pasted below:

``````
# loading resnet pretrained models (224x224px, 1000 classes)
from classification_models import Classifiers
ResNet18, preprocess_input = Classifiers.get('resnet18')
resnet_dim = (224, 224)
model = ResNet18(input_shape=(*resnet_dim, 3), weights='imagenet', classes=1000)
``````

The image below presents an original and an adversarial example generated using FGSM, together with the noise generated after 2 steps of the algorithm.

Noise from the red component (white: +2, gray: 0, black: -2)

## Black-box FGSM

The previous method was an easy case where we have full information about the attacked model, but what if it is not available? One study estimates the gradient by sending a large number of queries to the target model. I took a different route and tried to fool the target model using my own model, which had a different architecture but performed a similar task. The modified images were prepared using my model (it took 7 steps to push the true label prediction below 1%) and then checked against the target model (the VGG16 CIFAR-10 model used in the previous steps). Results from this experiment are shown below:

Original and fake image obtained with the black-box approach, with probabilities from the target model.

Accumulated (r+g+b) noise generated during the 7 steps of the algorithm.

Chart showing how the prediction for the true label changes during the experiment.

These results look promising but we have to take into account that these are relatively simple tasks (classifying 32x32 pixel images), and the difficulty of fooling other models will probably grow with the complexity of the structures that are used.
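The transfer idea from this section can be sketched on a toy problem as follows. The two logistic "models" and their weights are made up for illustration (the experiment itself used CNNs), and all function names are hypothetical. The point is only that the image is perturbed using the surrogate's gradients and then evaluated on a target model whose gradients are never used.

```python
import math


def true_label_probability(weights, bias, pixels):
    """Logistic model: probability of class 1 (the "true" class here)."""
    logit = bias + sum(w * p for w, p in zip(weights, pixels))
    return 1.0 / (1.0 + math.exp(-logit))


def sign(value):
    return (value > 0) - (value < 0)


def fgsm_step(weights, bias, pixels, label=1, epsilon=1):
    # d(cross-entropy)/d(pixel_i) for a logistic model is (p - label) * w_i
    p = true_label_probability(weights, bias, pixels)
    return [min(255, max(0, px + epsilon * sign((p - label) * w)))
            for px, w in zip(pixels, weights)]


def transfer_attack(surrogate, pixels, threshold=0.01, max_steps=200):
    """Perturb `pixels` using only the surrogate model's gradients."""
    steps = 0
    while (true_label_probability(*surrogate, pixels) >= threshold
           and steps < max_steps):
        pixels = fgsm_step(*surrogate, pixels)
        steps += 1
    return pixels, steps


# Made-up (weights, bias) pairs: two different models solving a similar task
SURROGATE = ([0.08, -0.04, 0.06, -0.02], -19.0)
TARGET = ([0.06, -0.08, 0.04, 0.02], -13.0)

original = [200, 40, 150, 60]
adversarial, steps = transfer_attack(SURROGATE, original)
print("steps:", steps)
print("target before:", true_label_probability(*TARGET, original))
print("target after: ", true_label_probability(*TARGET, adversarial))
```

In this toy setup the attack transfers because the two weight vectors mostly agree in sign. How well it carries over between real networks depends on how similar the features they learned are.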

## Conclusion

The approaches presented here show that we can perturb images in a way that manipulates classification results. This is easy when we have full information about the model structure. With only limited access to the target model, finding effective perturbations becomes much harder.
The knowledge that comes from these experiments can help to defend against such attacks by extending the training set with slightly modified images.
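Extending the training set could look roughly like this. It is only a sketch under my own assumptions: images are flat lists of integer pixel values, and `random_noise` is a placeholder perturbation. In practice you would pass in a function that produces adversarial examples, for instance an FGSM step against your own model.

```python
import random


def random_noise(pixels, rng, max_change=2):
    # placeholder perturbation: shift every value by a small random amount
    return [min(255, max(0, p + rng.randint(-max_change, max_change)))
            for p in pixels]


def extend_training_set(images, labels, perturb=random_noise, copies=1, seed=0):
    """Return the original samples plus `copies` perturbed versions of each.

    Perturbed copies keep the label of the image they were made from.
    """
    rng = random.Random(seed)
    new_images, new_labels = list(images), list(labels)
    for image, label in zip(images, labels):
        for _ in range(copies):
            new_images.append(perturb(image, rng))
            new_labels.append(label)
    return new_images, new_labels


images = [[0, 128, 255], [10, 20, 30]]
labels = [3, 7]
bigger_images, bigger_labels = extend_training_set(images, labels, copies=2)
print(len(bigger_images), "training samples after extension")
```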
