3 techniques to defend your Machine Learning models against Adversarial attacks

Following our accounts of what adversarial machine learning means and how it works, we close this series of posts by describing what you can do to defend your machine learning models against attackers.

There are different approaches to solve this issue, and we discuss them in order of least to most effective: target concealment, data preprocessing and model improvement.

Because this post mainly contains technical recommendations, we decided to improve it with GIFs from one of the best TV shows ever made.

Your neural networks after reading this post.

Target concealment

The least effective method we’ll discuss is a well-known concept in cybersecurity: obfuscation. By masking or shielding parts of your IT system from possible attackers, you make it more difficult for them to develop attacks. The less information an attacker has, the more difficult it is to locate the weak points of an IT system.

Obfuscation in practice

Obfuscation is not a very effective defense mechanism. In practice, it will only serve to slow attackers down. For example, the source code of apps published on the Google Play store is often obfuscated. Method and variable names are changed to meaningless tokens, making the code unreadable. However, skilled reverse engineers will still be able to decompile that code and understand exactly what it is doing. It only requires more skill, effort and time.

Obfuscation applied to adversarial machine learning means that you should consider all details surrounding your machine learning model and its entire data processing chain as highly confidential. These details should not be shared outside of your organisation.

If attackers know enough details about your model, they can build a replica model and develop accurate adversarial examples before they even interact with your system. For example, if attackers know your model is built on open source libraries such as scikit-learn or PyTorch, they can quickly set up an accurate replica. The same thing applies to the data used to train the model. You might be using publicly or commercially available datasets, such as ImageNet, to which the attackers also have access.

Another aspect is the processing chain the input data travels through before it’s passed to the model. If it’s known that images are downscaled to a certain size and transformed to greyscale before they are input to the neural network, attackers can use this information to create more robust adversarial examples.

Lastly, the ability to test adversarial examples on models that have been put in production is also very valuable for developing attacks. This usually requires many requests sent to your model, and there are a number of steps that can be taken to limit this access:

  • Minimize feedback: Mainly, do not return full model scores in your responses and ensure that error messages do not reveal unnecessary information. If your model is classifying 10 separate classes, and you return detailed confidence scores for every class, this is a great help in creating targeted adversarial examples. Instead, the model should return only its highest confidence prediction, i.e. “this image is of class X”.
  • Behavioral detection: If you notice many requests to your system with highly similar but nonidentical inputs, this is an indication of an attacker developing an adversarial example.
  • Throttling queries: Make it more difficult to develop adversarial examples by slowing the speed with which requests can be made.
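As a minimal sketch of the first and last measures above, assume a hypothetical prediction service that already holds the per-class scores of the model. The names `top_label_response` and `QueryThrottle`, as well as the limits used, are our own illustration and do not come from any particular library.

```python
import time
from collections import defaultdict, deque


def top_label_response(scores):
    """Return only the winning class, never the full score vector."""
    best_class = max(scores, key=scores.get)
    return {"label": best_class}


class QueryThrottle:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests=5, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client id -> request timestamps

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        timestamps = self._history[client_id]
        # Forget requests that have fallen out of the sliding window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False
        timestamps.append(now)
        return True
```

A request handler would call `allow(client_id)` before running the model and answer with `top_label_response(scores)` instead of the raw scores. How clients are identified is left open here, and a determined attacker can of course spread requests over several identities.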

Data preprocessing

Deep neural networks never exist in isolation. Outside of lab conditions they are always part of a broader processing chain meant to improve their speed, accuracy or memory requirements.


Examples of these preprocessing steps are the following:

  • Normalisation and compression: Many neural networks that process images can only handle images of the exact same size. This means that in practice images often need to be compressed to a smaller size before they’re passed to the neural network. The image is also often normalised, meaning all of its pixel values are transformed to lie within a certain range.
  • Noise reduction: Visual distortion and noise can be removed from images by applying Gaussian blur. Each pixel in the image is replaced by a Gaussian-weighted average of its neighbours, which results in a smoother image that should contain less visual noise.
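To make these steps concrete, here is a rough sketch in plain Python that treats a greyscale image as a nested list of pixel values. The function names, the 2x2 block averaging and the 3x3 kernel are assumptions chosen for brevity. A production pipeline would normally use an image library instead.

```python
def normalise(image, max_value=255):
    """Scale integer pixel values into the range [0, 1]."""
    return [[pixel / max_value for pixel in row] for row in image]


def downscale(image, factor=2):
    """Shrink the image by averaging each factor x factor block of pixels."""
    height, width = len(image) // factor, len(image[0]) // factor
    return [
        [
            sum(
                image[r * factor + dr][c * factor + dc]
                for dr in range(factor)
                for dc in range(factor)
            ) / factor ** 2
            for c in range(width)
        ]
        for r in range(height)
    ]


# 3x3 approximation of a Gaussian kernel; the weights sum to 16.
GAUSSIAN_3X3 = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]


def gaussian_blur(image):
    """Replace each pixel by a Gaussian-weighted average of its neighbours."""
    height, width = len(image), len(image[0])
    blurred = []
    for r in range(height):
        row = []
        for c in range(width):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    # Clamp coordinates so border pixels reuse the edge value.
                    rr = min(max(r + dr, 0), height - 1)
                    cc = min(max(c + dc, 0), width - 1)
                    total += GAUSSIAN_3X3[dr + 1][dc + 1] * image[rr][cc]
            row.append(total / 16)
        blurred.append(row)
    return blurred
```

Chaining the three, for example `gaussian_blur(downscale(normalise(image)))`, gives the kind of processing chain an input typically passes through before it reaches the network.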

An interesting side effect of the image preprocessing techniques described above is that they reduce the precision of the images. We can define image precision as its spatial resolution (e.g. image height and width in number of pixels), and pixel density (how many bytes we use to represent each pixel).

It is often the precision of the data that is being exploited to generate adversarial examples. Remember from our previous post that we can either change a large number of pixels by a small amount, or a small number of pixels by a large amount, to create an adversarial example. In both cases, this will be easier with increased image precision. A bigger spatial resolution means we can change a larger number of pixels, and a higher pixel density means we can alter the pixel values to a finer degree.

If we created an adversarial example that depends on a small number of pixels being changed, these changed pixels might be lost when the image is compressed to a reduced spatial resolution, rendering it non-adversarial. Conversely, if the adversarial example depends on a large number of pixels being changed by a minute amount, these minute changes can also be lost during noise reduction, which results in a reduced colour resolution for every pixel.
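The effect of reduced precision on minute changes can be shown with a toy example. The sketch below does not use noise reduction. It swaps in a simpler stand-in, lowering the number of bits per pixel, and the pixel values are made up for the illustration.

```python
def reduce_bit_depth(image, bits=3):
    """Quantise 8-bit pixel values down to `bits` bits per pixel."""
    step = 2 ** (8 - bits)
    return [[(pixel // step) * step for pixel in row] for row in image]


clean = [[64, 128], [192, 32]]
# A perturbation of a few intensity levels per pixel (hypothetical values).
perturbed = [[66, 131], [195, 33]]

assert clean != perturbed
# After quantisation the two images are indistinguishable.
print(reduce_bit_depth(clean) == reduce_bit_depth(perturbed))
```

The perturbation disappears here because every changed pixel stays inside the same quantisation bucket as the original. A change that pushes a pixel across a bucket boundary would survive, which is one reason this defense cannot be relied upon on its own.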

As with target concealment, this technique should not be relied upon exclusively for an effective defense. If adversaries know about the specific preprocessing steps surrounding your neural network, they can take this into account when they create their adversarial examples. As we saw in our introductory post, techniques exist for creating examples that are more confidently adversarial, and as a result might be robust against these preprocessing steps.

Example of a truly effective defense mechanism

Improving the model

The final defense mechanism we’ll discuss alters the neural network itself to be more robust against adversarial examples. This is the most proactive type of defense because it requires the neural network creators to consider how their model handles adversarial examples in the design and programming phase.

Many approaches exist, and while it’s not our goal to list these exhaustively, we’ll discuss three of the most popular ones below.

Gradient masking

A different kind of mask

Gradients are a fundamental concept in machine learning. For the purposes of this blog post it’s not important to know exactly what they are. Instead, we’ll once again give you an intuitive feel for what they do.

In essence, gradients show you the best direction to move an image in the input space if we want to create an adversarial example. “Best”, in this context, meaning the fastest way to a misclassification. It makes sense to hide this information from would-be attackers.
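For readers who prefer code to intuition, the sketch below shows the idea on a deliberately tiny model: a logistic regression with made-up weights standing in for a neural network. For this model the gradient of the cross-entropy loss with respect to the input is (p - y) * w, and stepping along its sign is in the spirit of gradient-sign attacks. All names and numbers are illustrative assumptions.

```python
import math


def predict(weights, bias, x):
    """Probability of class 1 under a tiny logistic-regression 'network'."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(weights, bias, x, true_label):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = predict(weights, bias, x)
    return [(p - true_label) * w for w in weights]


def sign(value):
    return (value > 0) - (value < 0)


def gradient_sign_step(weights, bias, x, true_label, epsilon):
    """Move every input feature by epsilon in the direction that raises the loss."""
    grad = input_gradient(weights, bias, x, true_label)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


weights, bias = [2.0, -1.0], 0.0   # hypothetical trained parameters
x = [0.5, 0.2]                     # input that truly belongs to class 1
x_adv = gradient_sign_step(weights, bias, x, true_label=1, epsilon=0.5)
```

With these numbers `predict(weights, bias, x)` is above 0.5 while `predict(weights, bias, x_adv)` falls below it, so a single step along the gradient flips the prediction. Without access to the gradient, an attacker has to find that direction by trial and error.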

As its name implies, that’s exactly what the gradient masking technique does. It hides or smooths the gradients of a neural network in such a way as to make them useless to attackers. It’s based on another technique called distillation (1) that was initially created to reduce the size of neural networks.
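One ingredient of distillation is a softmax with a raised "temperature", which softens the class probabilities a network outputs. The sketch below only illustrates that softening on hypothetical logits. It is not the full training procedure described in (1).

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperatures give flatter outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [8.0, 2.0, 0.0]  # hypothetical network outputs
sharp = softmax(logits, temperature=1.0)
smooth = softmax(logits, temperature=20.0)
```

Both outputs still rank the first class highest, but `smooth` is far less peaked than `sharp`. Training on such softened outputs is what flattens the model's response to small input changes.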

Adversarial training

If neural networks can be trained to recognise the contents of an image, they might also be trained to recognise whether an image is adversarial or not. That’s the idea behind adversarial training.

Training a model in a nutshell

One way of defending against adversarial examples would be to generate them ourselves, label them as adversarial, and use them to train our neural network. The neural network will probably be able to pick up patterns in the image that indicate its adversarial nature. If we can later correctly predict an image as being adversarial, we can respond appropriately.
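The data preparation step could look roughly like the sketch below. The `perturb` function is only a placeholder that shifts pixels at random. In practice the adversarial copies would come from a real attack method, and the resulting dataset would be fed to whatever detector you train. All names and values are assumptions for the sake of the example.

```python
import random


def perturb(image, epsilon, rng):
    """Stand-in attack: shift every pixel by +/- epsilon, clipped to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.choice((-epsilon, epsilon)))) for p in image]


def build_detector_dataset(clean_images, epsilon=0.05, seed=0):
    """Pair each clean image (label 0) with a perturbed copy (label 1)."""
    rng = random.Random(seed)
    dataset = []
    for image in clean_images:
        dataset.append((image, 0))                         # genuine input
        dataset.append((perturb(image, epsilon, rng), 1))  # labelled adversarial
    rng.shuffle(dataset)
    return dataset
```

Each image is represented as a flat list of pixel values between 0 and 1, so `build_detector_dataset([[0.2, 0.5], [0.9, 0.1]])` returns four labelled samples, half genuine and half perturbed.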

While this is a very intuitive approach to building a defense, its greatest weakness is exactly the same reason why adversarial examples work in the first place: the network will only be able to accurately recognise adversarial examples from the same distribution it was trained on.

In the previous post we explored several methods of creating adversarial examples. If our neural network is only trained to recognise adversarial images generated by a subset of those methods, we simply need to generate new examples through the unused methods to bypass this defense.

Ultimately, if adversarial training allows us to filter out even a small sample of adversarial inputs, it will have been worth the effort.

Randomized dropout uncertainty measurements

This final method currently shows the most promising results in detecting adversarial examples. It’s based on another fundamental machine learning technique called dropout regularisation.

In general terms, regularisation techniques are used to prevent neural networks from simply memorising everything they see during training (also called overfitting). If a classifier is overfitted it has lost the ability to generalise, meaning it cannot correctly predict the class of any new images we show it.

Dropout is one of these regularisation techniques. During training, for every image the network processes, dropout will randomly “switch off” specific neurons. For the next image, the previous neurons are reactivated and a new set of neurons is randomly switched off. While this seems counterintuitive, it actually prevents the network from relying on specific combinations of neurons in making its prediction, effectively preventing any “memorisation” from happening.
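A bare-bones version of this switching-off step is sketched below, applied to a list of neuron activations. The rescaling of the surviving activations ("inverted dropout") is a common convention that we assume here. It is not something the description above depends on.

```python
import random


def dropout(activations, drop_probability, rng):
    """Randomly zero activations; survivors are rescaled (inverted dropout)."""
    if drop_probability <= 0.0:
        return list(activations)
    keep_probability = 1.0 - drop_probability
    return [
        a / keep_probability if rng.random() < keep_probability else 0.0
        for a in activations
    ]
```

Calling `dropout(layer_output, 0.5, random.Random())` once per training image gives every image its own random set of switched-off neurons.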

As it turns out, dropout can also be used when neural networks are put to production.

When the network is making its predictions, we could once again employ dropout. If we then present the network with the exact same input image multiple times, the dropout will cause it to return slightly different results every time. The variance of these results can then be interpreted as a measure of how certain the network is of the image’s prediction.

This is exactly what is proposed as “Bayesian neural network uncertainty” (2). Using dropout in the production phase of a neural network, the network will be more consistently certain of the predictions of non-adversarial images, as opposed to adversarial ones. This means that if an adversarial example is presented to the neural network multiple times, the dropout causes the variance of its predictions to be higher than for a genuine image. We can then choose a variance threshold above which images are labelled as adversarial.
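The mechanics can be sketched with a toy linear scorer in place of a real network. Everything below, from the function names to the number of passes, is an assumption made for illustration. It is not the implementation from (2), and a suitable threshold would have to be tuned on your own data.

```python
import random
import statistics


def stochastic_score(features, weights, drop_probability, rng):
    """One forward pass of a toy linear scorer with dropout left switched on."""
    keep_probability = 1.0 - drop_probability
    total = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep_probability:
            total += x * w / keep_probability
    return total


def dropout_variance(features, weights, drop_probability=0.5, passes=50, seed=0):
    """Variance of the score over repeated dropout passes on the same input."""
    rng = random.Random(seed)
    scores = [
        stochastic_score(features, weights, drop_probability, rng)
        for _ in range(passes)
    ]
    return statistics.pvariance(scores)


def looks_adversarial(features, weights, threshold, **kwargs):
    """Flag the input when prediction variance exceeds the chosen threshold."""
    return dropout_variance(features, weights, **kwargs) > threshold
```

With dropout disabled (`drop_probability=0.0`) every pass returns the same score and the variance is zero. The defense rests on the observation that, with dropout enabled, adversarial inputs tend to produce a larger spread than genuine ones.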



(1) – Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, Ananthram Swami: “Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks”, 2015

(2) – Reuben Feinman, Ryan R. Curtin, Saurabh Shintre, Andrew B. Gardner: “Detecting Adversarial Samples from Artifacts”, 2017


