That works if we are selecting people at random from the population, but we aren’t. We are selecting people who are fit to be judges, and we are determining that using a metric that is prejudiced against #s. We have three @s and two #s because, even though @s and #s have equal ability, we systematically rate the ability of #s lower.
If @s A, B, C, D, E and #s Z, Y, X, W, V have fitness of 10, 9, 8, 7 and 6 respectively, but we devalue #s by one, then the top five people by our assessment are A, B, C, Z and Y. But our assessment of Z and Y was inaccurate: each is one point better than we rated them. Among the people left over, we are more likely to get the best person by choosing from the #s than from the @s, and the average remaining # (7) is better than the average remaining @ (6.5).
If we needed to hire a sixth person, we would think that D and X are equal (both rated 7), but they aren’t: X is actually an 8. If we were forced to pick a #, we would actually get the equitable result.
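To check the arithmetic in the @/# example, here is a minimal Python sketch. It assumes the bias is exactly the one-point penalty described above and that the metric is otherwise accurate; the names `PENALTY`, `apparent` and `mean` are illustrative helpers I made up, not anything from a real tool.

```python
# Sketch of the @/# example. Assumption: every # is underrated by exactly
# one point and the metric is otherwise accurate.
PENALTY = 1

true_fitness = {
    "A": 10, "B": 9, "C": 8, "D": 7, "E": 6,  # the @s
    "Z": 10, "Y": 9, "X": 8, "W": 7, "V": 6,  # the #s
}
HASHES = {"Z", "Y", "X", "W", "V"}


def apparent(name):
    """Fitness as the prejudiced metric reports it."""
    return true_fitness[name] - (PENALTY if name in HASHES else 0)


def mean(values):
    return sum(values) / len(values)


# Rank everyone by apparent fitness (name only breaks ties, for a stable order).
ranked = sorted(true_fitness, key=lambda n: (-apparent(n), n))
hired, remaining = ranked[:5], ranked[5:]

# Average *true* fitness of the people left in each group.
avg_left_at = mean([true_fitness[n] for n in remaining if n not in HASHES])
avg_left_hash = mean([true_fitness[n] for n in remaining if n in HASHES])

print("hired:", sorted(hired))                                   # ['A', 'B', 'C', 'Y', 'Z']
print("D vs X as rated:", apparent("D"), apparent("X"))          # 7 7
print("D vs X in truth:", true_fitness["D"], true_fitness["X"])  # 7 8
print("average remaining @:", avg_left_at)                       # 6.5
print("average remaining #:", avg_left_hash)                     # 7.0
```

The top five come out the same whichever way ties are broken, since both 9s and both 8s make the cut; the penalty only shows up in who is left over.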
Reality is more complicated. If 51% of judges are men and 49% are women, we might reasonably say that’s probably natural variance. If it’s 2/3 and 1/3, then, unless you posit an actual difference in ability, being forced to hire a woman will result in a better candidate than hiring a man (or than continuing to hire using the same prejudiced metric).
Basically, your example is exactly reversed. There are 50 men and 50 women who are suitable to be judges, but 30 of the men are already judges and only 20 of the women are. That means that we are selecting from a pool of 30 women and 20 men, so the expected result is that the best candidate is a woman.
If women are filtered out of the process at an earlier stage, that doesn’t change the result. If they were filtered out because of prejudice, that still leaves the remaining women better than the remaining men. It looks as if we have 30 male judges from an apparent pool of 60 and 20 female judges from an apparent pool of 40. But in reality we have 30 male judges drawn from the entire population of men and 20 female judges drawn from the entire population of women. If there are 10,000 men and 10,000 women, then the next appointment is either the best of the remaining 9,970 men or the best of the remaining 9,980 women. If we can actually tell who the best are, that is the same as choosing between the man ranked 9,970th from the bottom and the woman ranked 9,980th from the bottom. Advantage women again. If the lack of female candidates was caused by prejudice, then we should expect the next female candidate to be better than the next male candidate.
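The 10,000-person argument can be sketched in Python under one loud assumption: men and women have identical ability distributions, modelled here as the scores 1 to 10,000 (higher is better) in each group. The `next_candidate` helper is hypothetical and assumes we can rank people perfectly.

```python
# Assumption: both groups have identical ability distributions, modelled as
# the scores 1..10,000 (higher is better) in each group.
N = 10_000
men = list(range(1, N + 1))
women = list(range(1, N + 1))


def next_candidate(pool, already_appointed):
    """Hypothetical helper: the best person left after the top
    `already_appointed` have been made judges, assuming perfect ranking."""
    ranked = sorted(pool, reverse=True)
    return ranked[already_appointed]


next_man = next_candidate(men, 30)      # 31st-best man
next_woman = next_candidate(women, 20)  # 21st-best woman
print(next_man, next_woman)             # 9970 9980
```

The two pools are identical here; only the number already appointed differs, so the gap comes entirely from how far down each pool we have already gone.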
The problem, as I said above, is that when sexist people encounter an affirmative action program, they don’t necessarily hire the next best candidate. In this woman’s case, it sounds as if they might have used nepotism. Now, if nepotism is the standard, then as long as there is a little merit-based thinking in there somewhere, affirmative action still works. If, however, the normal hiring process is much more merit-based and nepotism was used as a shortcut by a person who was too sexist to actually evaluate the talent of one woman against another (because they are still all women), then the problem is that person’s inability to evaluate candidates for the position. That “token woman” may be an actual token woman, chosen merely for her sex and not for her ability. If that is the case, it is because the person who made that decision is an idiot. The reality ought to be, and in most cases probably is, that the “token woman” is better than the men who would have been hired had she not been chosen for being a woman.