In a previous post we introduced the field of adversarial machine learning and what it could mean for bringing AI systems into the real world. Now, we’ll dig a little deeper into the concept of adversarial examples and how they work.
For the purpose of illustrating adversarial examples, we’ll talk about them in the context of a deep neural network that classifies images. This means the input to the neural network is always a digital image, and the network’s output is a prediction of that image’s class.
For example, we could train a hot dog/not-hot dog image classifier.

In this case the image’s class would be one of two values: “hot dog” or “not-hot dog”, depending on whether the image contains a hot dog or not. Despite our focus on image classifiers, the exact same concepts regarding adversarial examples also apply to other modalities such as audio.
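To make this setup concrete, here's a minimal sketch of such a classifier in PyTorch. The architecture, the HotDogNet name and the 64x64 image size are our own illustrative choices rather than the network from any particular system; any image classifier with two output classes would do.

```python
import torch
import torch.nn as nn

IMG_SIZE = 64  # assume photos are resized to 64x64 RGB before classification

class HotDogNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # two output logits: index 0 = "hot dog", index 1 = "not-hot dog"
        self.classifier = nn.Linear(32 * (IMG_SIZE // 4) ** 2, 2)

    def forward(self, x):
        x = self.features(x)      # extract visual features
        x = torch.flatten(x, 1)   # flatten to one vector per image
        return self.classifier(x)

model = HotDogNet()
image = torch.rand(1, 3, IMG_SIZE, IMG_SIZE)   # stand-in for a real photo
prediction = model(image).argmax(dim=1)        # 0 = "hot dog", 1 = "not-hot dog"
```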
Imagine all possible images we can input to a neural network. We’ll visualize this set of images with the simple black square depicted below.
This conceptual square contains all possible images of a certain size (e.g. 12-megapixel images). For example, if the pixels of our images could only be black or white, the total number of images in the square would be 2^12,000,000, one for every possible combination of pixel values.
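To get a feel for how large that number is, here's a quick back-of-the-envelope check in Python, using the same simplifying assumptions as above (12 million pixels, each either black or white):

```python
import math

pixels = 12_000_000      # a 12-megapixel image
values_per_pixel = 2     # each pixel is either black or white

# Total number of distinct images = values_per_pixel ** pixels.
# The number itself is far too large to print, so we count its decimal digits:
digits = int(pixels * math.log10(values_per_pixel)) + 1
print(f"2**{pixels:,} is a number with about {digits:,} digits")  # ~3.6 million digits
```

For real photos, where each pixel can take millions of colour values, the number is larger still.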
Zooming in, every spot inside the square represents a different image. For example, one spot could contain the image of a hot dog, while another spot could contain the image of a puppy.
Let’s also zoom out and try to get a sense of all the different kinds of images the square contains, and where these images are located.
Starting at the absolute top-left of the square, the image at that location will be completely black. As we move through the square, left to right and top to bottom, every new location holds a new image that differs from the previous one by only a single pixel. Most of the images we encounter this way will not represent anything that occurs naturally; instead, they will look like white noise.
As we continue to move through this conceptual square, we will eventually encounter every possible image that can be represented by a 12-megapixel photograph. If we keep going long enough, we'll enumerate every combination of pixels and pixel values. All possible images of buildings, mountains, animals, even the person reading this blog post, are contained inside this square. A different name for this is the input space: the complete set of all different images that can be input to a neural network.
Intuitively, if two spots are very close to each other in the square, the images they contain will also be very similar. This means that during our journey through the input space, we’ll eventually encounter a specific region containing all possible images depicting hot dogs. We could represent this as follows:

Blue squares = some examples of hot dog images inside this region.
As it turns out, when we train a neural network to recognise images of a specific class (like hot dog vs not-hot dog), we’re actually training it to identify where these specific regions are located in the input space.
For example, we could train our neural network with a so-called training set of 500 images of hot dogs (conceptually represented by the 5 blue dots below). The goal of the training set is to teach the neural network to recognise hot dogs in images.
During the training process, the neural network will learn to shift the contours of a second region, called the prediction landscape, to properly fit this training set. An example prediction landscape learned by our neural network might be depicted as the orange region below. The neural network will predict the class “hot dog” for all images located inside the orange region. Because this network was trained properly, all of our training images are also located inside the orange region, meaning our network will correctly predict the class “hot dog” for all of our training images.

Looking at this prediction landscape, it's also immediately obvious why the neural network can generalise, meaning that it can correctly classify images of hot dogs it has never seen before. As previously mentioned, all images containing hot dogs are located close to each other in the input space. This means that most new hot dog images we want to classify will fall inside the prediction landscape learned by the neural network.
In the example below, the blue spots represent our training images, while the green spots are images of hot dogs the neural network has never seen before. Because these new images are also located inside the prediction landscape learned by the neural network, it will correctly predict them as containing hot dogs.
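Here's a toy version of that picture in code. The 2-D points and the simple classifier below are drastic stand-ins we chose for the real, million-dimensional input space and the neural network, but they show the same effect: unseen points that lie close to the training points fall inside the learned region.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Five "training" hot dogs (blue dots) and five non hot dogs, reduced to 2-D points.
hot_dogs     = rng.normal(loc=[2.0, 2.0],   scale=0.3, size=(5, 2))
not_hot_dogs = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(5, 2))

X = np.vstack([hot_dogs, not_hot_dogs])
y = np.array([1] * 5 + [0] * 5)  # 1 = "hot dog", 0 = "not-hot dog"

# Training shapes a "prediction landscape" over the whole plane.
model = LogisticRegression().fit(X, y)

# New, never-before-seen hot dogs (green dots) land close to the training ones,
# so they fall inside the learned region and are classified correctly.
new_hot_dogs = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(3, 2))
print(model.predict(new_hot_dogs))  # -> [1 1 1]
```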
So how does this all relate to adversarial examples? Remember that the goal of an adversarial example is to create an image that fools a neural network, but not a human. In other words, it’s an image that should still look like a hot dog to humans, but that the neural network should classify as “not-hot dog”. The intuition should become clear when we compare the actual region containing hot dogs in the input space with the prediction region learned by the neural network.
The prediction landscape doesn’t perfectly match the entire input region of hot dog images. Our neural network hasn’t learned perfectly. And how could it? After all, we only used 500 training images.
This means that if we want to create an image that fools the neural network but not a human, we need to change a specific set of pixels in a way that goes unnoticed by humans, while at the same time pushing the image outside of the prediction landscape learned by the neural network. Intuitively, this adversarial example will be located at the purple spot shown above: it still closely resembles a hot dog, yet the neural network now incorrectly predicts its class as “not-hot dog”.
To make the intuition concrete, below is an example of a regular image, the “perturbation” representing the push outside of the prediction landscape (i.e. changing a number of pixels), and the resulting adversarial example. The adversarial example looks unchanged to the human observer.
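Sketched in code, the arithmetic behind those three panels might look like the NumPy snippet below. The image and the direction of the push are random stand-ins, not the actual pictures shown, and the epsilon bound is an arbitrary choice; a real attack would compute the push from the network itself, as the techniques later in this post do.

```python
import numpy as np

rng = np.random.default_rng(0)
original = rng.random((64, 64, 3))              # pixel values in [0, 1]
push = rng.uniform(-1.0, 1.0, original.shape)   # direction of the push, per pixel

epsilon = 2 / 255  # change each pixel by less than 1% of its range
perturbation = epsilon * np.sign(push)
adversarial = np.clip(original + perturbation, 0.0, 1.0)

# The largest per-pixel change stays tiny, so the image looks identical to us,
# yet for a real attack it is enough to flip the classifier's prediction.
print(np.abs(adversarial - original).max())  # <= epsilon
```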
And this is the same example drawn on the input space:
There are many ways to create adversarial examples. We can either make large changes to a small number of pixels, or the opposite: small changes to a large number of pixels. In both cases, the change should go unnoticed by human observers, yet still be enough to change the neural network’s prediction of the image’s class. The attack can also be targeted or untargeted: the goal could be misclassification in general, or something more specific, such as causing the neural network to predict one particular wrong class.
A complete description of the mathematics behind the techniques for generating adversarial images is out of scope for this post. Instead, here is an overview of common techniques, with references for interested readers.
- L-BFGS (1): An optimisation algorithm to speed up what is essentially an iterative search of the entire input space, beginning at an original image and moving outwards until the neural network makes a wrong prediction. Although surprisingly effective, it’s still slow and computationally expensive.
- FGSM (2): This algorithm calculates the direction in the input space that appears to be the fastest route to a misclassification, starting from the original image, and changes every pixel value by a minuscule amount in that direction. It relies on the assumption that the neural network behaves approximately linearly around the input, which simplifies the mathematics enough to generate the adversarial example in a single step (a minimal code sketch follows this list).
- JSMA (3): Instead of changing many pixels by a small amount, this method does the opposite. It looks for the pixels that matter most to the neural network’s prediction (the most salient pixels). Changing only the most salient pixels has the largest impact when the goal is to push an image in a chosen direction in the input space.
- C&W attack (4): A method to increase the robustness of adversarial examples. Images are often preprocessed before they reach a neural network: they might be resized or converted to greyscale to reduce the network’s compute and memory requirements. From an attacker’s perspective, this preprocessing can inadvertently destroy the adversarial perturbation. The C&W method therefore creates images that are more confidently adversarial, so they survive these kinds of “defences”.
- Boundary attack (5): This method uses both an original image and a target image, with the goal of making the neural network misclassify the original image as belonging to the target image’s class. It works by moving the target image closer and closer to the original image, while always staying inside the prediction landscape of the target class. After enough iterations, the altered target image closely resembles the original image, while still being predicted as belonging to the target class.
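To connect the FGSM description above to something runnable, here's a minimal PyTorch sketch of that one-step attack. The function name, the epsilon value and the usage lines are illustrative assumptions; `model` can be any classifier that returns logits, such as the HotDogNet sketched earlier in this post.

```python
import torch
import torch.nn.functional as F

def fgsm(model, image, label, epsilon=2 / 255):
    """One-step FGSM: nudge every pixel in the direction that increases the loss."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)  # how wrong is the current prediction?
    loss.backward()                              # gradient of the loss w.r.t. the pixels

    # The sign of the gradient is the "fastest route to a misclassification";
    # epsilon keeps the change small enough to be invisible to a human.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

# Hypothetical usage with the classifier sketched earlier:
# adv = fgsm(model, image, label=torch.tensor([0]))  # 0 = "hot dog"
# model(adv).argmax(dim=1)                           # ideally now "not-hot dog"
```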
(1) – Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, Rob Fergus: “Intriguing properties of neural networks”, 2013
(2) – Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy: “Explaining and Harnessing Adversarial Examples”, 2014
(3) – Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, Ananthram Swami: “The Limitations of Deep Learning in Adversarial Settings”, 2015
(4) – Nicholas Carlini, David Wagner: “Towards Evaluating the Robustness of Neural Networks”, 2016
(5) – Wieland Brendel, Jonas Rauber, Matthias Bethge: “Decision-Based Adversarial Attacks: Reliable Attacks Against Black-Box Machine Learning Models”, 2017