Well, it doesn’t have anything to do with our neural structure, because this image prototype doesn’t appear in human brains; it appears in this image-generating neural network.
That said, it’s likely a local minimum around the neural net’s conception of a human and a few other factors. They generated it by asking for the opposite of Marlon Brando, and then asking for the opposite of that image.
So, what images and visual concepts does this neural network associate with Marlon Brando? Likely a bunch that we don’t have words for or couldn’t even understand, but we can make some guesses. The obvious first one is “human”. Probably also “intensity” or “anger”, just given how Brando tends to look in pictures. Probably also concepts about violence, because he’s visually associated with violent movies. Later Brando also adds concepts of things like “jowliness”, “age”, “wide mouth”, “large chin”, “hair”, and so on, but his youthful image also adds a lot.
So, you generate the opposite of those concepts and untold numbers more, and then try to flip it back. Well, those concepts are still embedded, but without context, boiled down to another layer of abstraction. It’s like running a sentence through an automated translation system and then back; you can kinda see the same ideas, but the structure isn’t there. So we get a picture of a person. They’re an adult but otherwise of indeterminate age; some aspects look old and weathered, some don’t. They’re intensely focused on the camera. They have a masculine jawline, deep eyes, and jowls. And although the surface concept of violence doesn’t come back immediately, because it wasn’t as strong an association, it’s still there, waiting to come to the surface.
I’m certainly not even close to an expert in the field, and I’m happy to admit I’m making a lot of assumptions, but that’s my best guess for why this image is so persistent across iterations, and why it’s so easy to get it to go gore-heavy.