The "universal adversarial preturbation" undetectably alters images so AI can't recognize them






I know acting smug feels nice, but 1) this is absolutely not true, and 2) it is in many cases irrelevant – for instance, if I am using a neural network to categorize my photos, there is no adversarial relationship – I want the algorithm to work. It also doesn’t sound like this would work to, e.g., fool a security camera unless I somehow had access to the digital data.

Trivial maybe in the mathematical sense where anything that is proven becomes trivial, but it clearly took a lot of work to do this, and lots of people failed before – one of the reasons captchas have been going away is that it was becoming harder and harder to distort images in a way that would fool image recognition but humans could still understand. So it is a problem that people have worked on, with clear financial incentive, for quite some time. A success now doesn’t mean the problem was trivial, or that it is solved.

In the long run, I wouldn’t bet on anything like this being effective. There is a tremendous amount of research in identifying perceptible vs. imperceptible features of images for data compression. If you put imperceptible errors in an image that confuse a deep neural network, and nobody can find a way to train the networks to be more robust against that, someone is going to find a way to filter that out before analyzing it. It will probably cost some accuracy but that isn’t necessarily a deal breaker.
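To make the "filter it out first" idea concrete, here is a minimal sketch using plain bit-depth reduction as the filter. Everything in it is an assumption for illustration: the "image" is just a list of random 0-255 values, the perturbation is a uniform +/-3 on every pixel, and `quantize` and `perturb` are hypothetical helper names. It is not the kind of filter anyone is known to use against this attack, only the simplest one I can think of.

```python
import random

def quantize(pixels, step=16):
    """Snap each 0-255 value to the nearest multiple of `step` (capped at 255)."""
    return [min(255, ((p + step // 2) // step) * step) for p in pixels]

def perturb(pixels, amplitude, rng):
    """Add a small +/- amplitude change to every pixel, clamped to 0-255."""
    return [max(0, min(255, p + rng.choice((-amplitude, amplitude)))) for p in pixels]

rng = random.Random(0)
clean = [rng.randrange(256) for _ in range(1000)]   # stand-in for real image data
attacked = perturb(clean, amplitude=3, rng=rng)

# Fraction of pixels that are identical before and after the attack,
# first on the raw values, then after both versions go through the filter.
raw_same = sum(a == c for a, c in zip(attacked, clean)) / len(clean)
filtered_same = sum(
    a == c for a, c in zip(quantize(attacked), quantize(clean))
) / len(clean)
print(raw_same, filtered_same)
```

Most of the perturbation disappears after filtering, but not all of it: pixels that sit near a bucket boundary still get pushed into the neighbouring bucket. The filter also throws away real detail, which is the accuracy cost mentioned above.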

In particular, one of the key things that makes this attack interesting is that it generalized between different neural networks with different designs. This is important, because if you want to post pictures to Facebook without Facebook being able to analyze them, you don’t have access to their classifier to train your distortion algorithm on. My guess is that the reason it is so effective is that everybody is using the same training data – there are just not that many sets of millions of public images with pre-existing labels.


Adversarial examples are sometimes used for training, though! These networks come pre-trained; that’s their convenience and power. However, it’s possible to compute these perturbations for each category and apply them to your training images, e.g., figure out what your network thinks is most un-horselike (in a way that doesn’t actually have to do with horsiness, like these little random local things) and apply it to a picture of a horse, and tell your network that’s also an example of a horse, and this will move its conception of a horse closer to what horses actually look like. This makes “AI can’t recognize them” a pretty sketchy phrasing, to be honest.
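A rough sketch of that training loop, with a toy linear classifier standing in for the network. The two-cluster data, the perceptron, the `eps` budget and all the function names are assumptions made up for illustration. For a linear model the worst-case perturbation within a per-feature budget can be written down directly; a real deep network would need a gradient-based search instead.

```python
import random

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sign(v):
    return (v > 0) - (v < 0)

def worst_case(w, x, y, eps):
    """Shift each feature by eps in the direction that most lowers y * score."""
    return [xi - y * eps * sign(wi) for wi, xi in zip(w, x)]

def train_perceptron(data, epochs=20, lr=0.1):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            if y * score(w, b, x) <= 0:          # misclassified: nudge the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

rng = random.Random(1)
# Two hypothetical classes: label +1 clustered near (2, 2), label -1 near (-2, -2).
data = [([rng.gauss(2 * y, 0.5), rng.gauss(2 * y, 0.5)], y)
        for y in (1, -1) for _ in range(50)]

w, b = train_perceptron(data)
eps = 0.5
# Perturb every training example against the current model, but keep its true label.
adversarial = [(worst_case(w, x, y, eps), y) for x, y in data]
w2, b2 = train_perceptron(data + adversarial)
print(w, b, w2, b2)
```

The key line is the one that builds `adversarial`: the input is distorted but the label stays "horse". Note that the perturbations were computed against the first model, so the retrained one is only hardened against that particular round.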


You may have preturbed the word ‘perturbation’ but I still recognized it!


And so one by one we reduce the number of things humans can do and AI systems can’t.


I thought we just got this one back though?


Hmm, so the really neat trick will be to cause the image to be misclassified in a predictable, directable way - getting pictures of puppies to be classified as pornography, for example (or vice versa). That opens up a whole other can of worms…


True, that is what the OP is about. The comment I replied to was pointing out that the perturbation itself can provide the necessary data to improve the AI and close off that weakness.


How can this be used to confuse the ubiquitous public cameras dogging my every step?


The way I understand it, captchas’ popularity waned because of the relatively low cost of paying humans to defeat them, i.e. “captcha farms”.


i feel like gödel might have something clever to say about this.

the closest i can get, is i’d suspect there are no rules which could classify all images correctly. even people couldn’t do this. you’d first have to agree what the categories mean. but, there are always disagreements and exceptions. and, the categories themselves are often purely human definitions, and arbitrary.

the goal is to get close to what some “typical” person might choose. but we can’t even agree what color the dress is.


Agreed. But I think that is mostly because natural language is not a formal system, and words don’t really have definitions in a strict sense. At some point “Is X part of the category word Y attempts to denote?” stops being a meaningful question because there is no fact of the matter either way.

If I ask you to picture a bird, I bet you didn’t picture an ostrich, or a penguin, but you would agree they fit the category. How about one of the later feathered dinosaurs? If I ask you to picture a salad, I bet it wasn’t a fruit salad or egg salad.


These eyes of mine at once determined
The sleeves are velvet, the cape is ermine
The hose are blue and the doublet is a lovely shade of green.


I don’t know about Gödel, but the tortoise* would find a “universal adversarial perturbation” that would blow the computer up, not just confuse it.

* Of Gödel, Escher, Bach by Hofstadter


Kind of worried because I can’t recognize any of those images either. Am I an AI?


Fortunately Skynet hasn’t yet trained on our primary stereograms. For those who see the horsey, uprisingyay atyay idnightmay.


Zoe Heriot did this once by shouting commands at one in ALGOL:


I’m not sure about the details of this technique, but there are other exploits that are highly resistant to such countermeasures. From “Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images”:

We also find that, for MNIST DNNs, it is not easy to prevent the DNNs from being fooled by retraining them with fooling images labeled as such. While retrained DNNs learn to classify the negative examples as fooling images, a new batch of fooling images can be produced that fool these new networks, even after many retraining iterations.

Because the current generation of neural nets is rigid once trained, and because the nets return their verdicts along with a confidence value, they are by their nature susceptible to evolutionary techniques that iteratively mutate the input to produce incremental gains in the output. Fixing the exploit would mean fundamentally changing what it is that neural networks do, and how they are used.
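As a toy illustration of that mutate-and-keep loop (a sketch under made-up assumptions, not the method from the paper): the "network" here is a frozen 16-weight logistic model, `WEIGHTS`, `confidence` and `evolve` are hypothetical names, and the attacker never looks at the weights, only at the confidence value that comes back.

```python
import math
import random

# Hypothetical fixed "trained" weights; the attacker never reads these directly.
WEIGHTS = [1.0 if i % 2 == 0 else -1.0 for i in range(16)]

def confidence(pixels):
    """Black-box view of the classifier: only a confidence in [0, 1] comes back."""
    s = sum(w * p for w, p in zip(WEIGHTS, pixels))
    return 1.0 / (1.0 + math.exp(-s))

def evolve(pixels, steps, rng):
    """Mutate one pixel at a time and keep the change only if confidence rises."""
    best = list(pixels)
    best_conf = confidence(best)
    for _ in range(steps):
        candidate = list(best)
        candidate[rng.randrange(len(candidate))] = rng.random()
        c = confidence(candidate)
        if c > best_conf:
            best, best_conf = candidate, c
    return best, best_conf

rng = random.Random(42)
start = [0.5] * 16                     # flat grey input the model is unsure about
start_conf = confidence(start)
fooling, fooling_conf = evolve(start, 2000, rng)
print(start_conf, fooling_conf)
```

Because the model is frozen and answers every query with a score, each accepted mutation is a free hint about which direction to go next. Hiding or coarsening the confidence value would slow this down, which is part of why fixing it means changing how the nets are used.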

Done. You want “Intriguing properties of neural networks” C. Szegedy et al. 2013, which “revealed that changing an image (e.g. of a lion) in a way imperceptible to humans can cause a DNN to label the image as something else entirely (e.g. mislabeling a lion a library)”. The only thing new in this study is that state-of-the-art neural nets are still so dumb that they can be consistently fooled by a static transform applied to any input data.

I can’t even begin to imagine what heuristics a net might use that allow it to classify static as a coherent image, but I guess the space of images that we would consider “just static” is absolutely vast in comparison to the space of images that look to us like a thing. To today’s neural nets though, it all looks roughly the same. They have no context, they don’t know what the world is or what “things” are and it’s not clear that showing them a series of representational bitmaps is going to get them there. Humans know a few basic concepts such as “things have a form”, and “some aspect of that form may, at times, be visible,” which is hugely helpful context for looking at pictures but it’s clear on the evidence of studies such as these that neural nets really don’t get that. They don’t know what pictures are to start with.

Here’s an image recognition task that might help illustrate what I’m talking about.

Spoiler here if you’re in a hurry. The picture gives you almost nothing to go on but once you know what the thing is, you can’t unsee it anymore. You know the form of the thing and you recognise the visible aspect, and you reconstruct in your mind an analogue of the scene that the image represents. You understand the representational nature of the image, of images, and you see past it because you live in the world and you’ve walked around in it a bit and maybe picked up a thing and put it down again. The outputs of your neural net go out into the world and flow back to your inputs, allowing you to explore and test and build in your mind a coherent model of the world you inhabit. By comparison it looks like image recognition algorithms are computing epicycles because they don’t understand gravity.

Neural nets are a fantastic invention that will continue to transform the world in many beneficial ways, but I think in their current state there are serious problems deploying them in any sort of adversarial situation. These findings add to the carnage because if simply exposing a neural net’s confidence values is a security risk, that’s likely to be used by people running shitty AI as an excuse to dodge audits.