There's an explanation on the page, but if that still has you stumped:
They combined two effects. The first is the McGurk effect, which has to do with how we hear vs. how we lipread. The sound being made is 'ba', not 'va', but because the face we see looks like it's saying 'va', we tend to hear that. Here's a clearer example of just that effect:
The second effect is called the Thatcher effect, and it has to do with how well programmed we are to see faces correctly. In the original effect, Margaret Thatcher's face was displayed upside down, but her lips and eyes in the photo were flipped so that they stayed right side up (the way they normally look). In this position, the photo looks normal, but turn the face upright and suddenly it's Monster Maggie!
The new illusion combines the two effects, with lips out of sync with the upright face. The idea presented is that with the face fully inverted, the McGurk effect is still strong (we still hear 'va'), but when the lips are blatantly wrong and the face is upright, the effect is reduced (we should hear 'ba'). It didn't work as well as intended on me, and I think the rotation of the image was unnecessary. A better display might have been several different faces, shown sequentially, all just flipping between the upside-down and upright positions.