I have a few thoughts on that:
-
I come from a research background where we purposely include things like race and gender in our models because we want to see whether people are treated differently along those lines (e.g., wage gap, job discrimination), so it’s second nature for me to include them. You can’t test for racial or gender disparities without including race and gender information.
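To make that concrete, here’s a minimal sketch of what I mean, using entirely synthetic data and made-up coefficients: if you want to estimate something like a wage gap, the gender variable has to be in the regression, otherwise there’s no coefficient to look at.

```python
# A minimal sketch (synthetic data, invented numbers) of the research-style
# approach: include the protected attribute so the disparity can be estimated.
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical workforce: years of experience and a gender indicator (1 = woman).
experience = rng.uniform(0, 30, n)
is_woman = rng.integers(0, 2, n)

# Simulate a wage gap: same return to experience, but a flat penalty for women.
wage = 30_000 + 1_500 * experience - 4_000 * is_woman + rng.normal(0, 5_000, n)

# OLS with an intercept, experience, and the gender indicator.
X = np.column_stack([np.ones(n), experience, is_woman])
coef, *_ = np.linalg.lstsq(X, wage, rcond=None)

print(f"estimated gender coefficient: {coef[2]:,.0f}")  # roughly -4,000
```

The specific numbers don’t matter; the point is that the disparity only becomes measurable because the variable is in the model.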
-
The goal of machine learning is accurate prediction. A variable with predictive power is likely to end up in the model. You don’t build models based on a utopian vision of what you wish did or didn’t have predictive power, but on what actually does.
For example, one of the ways society treats genders unequally is that some non-work family responsibilities unfairly fall on women (e.g., taking children to doctor appointments, staying home with the kid when the kid is sick, picking the kid up from school). You can say this is very unfair. It is! But you cannot say it doesn’t exist. It does! This unfair thing exists. And since it does exist, you see real differences in the data between genders regarding hours worked, overtime, time off, shift scheduling, etc. Therefore, if you want your model to better predict these variables for some set of workers, you should include gender, because it has predictive power (even if you wish it didn’t). Maybe you’re trying to model the likelihood of getting arrested. Your inclusion of race as a variable doesn’t mean you think one race is more likely to be criminal. Rather, maybe you think one race is more likely to be unfairly targeted by the police. Either way, race probably has some predictive power in the model.
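Here’s a rough sketch of the predictive-power point (again, everything is synthetic and the numbers are invented): if the unfair pattern is real, a model that includes the variable simply predicts better than one that pretends it isn’t there.

```python
# A rough sketch (all numbers invented): an unfair pattern in the world shows
# up in the data, so a model that uses the variable has lower error than one
# that leaves it out.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

tenure = rng.uniform(0, 20, n)
is_woman = rng.integers(0, 2, n)

# Simulate the unfair reality: caregiving burden falls on women, so their
# recorded overtime hours are lower on average, independent of tenure.
overtime = 5 + 0.3 * tenure - 2.0 * is_woman + rng.normal(0, 1.5, n)

def rmse(X, y):
    """Fit OLS and return in-sample root-mean-squared error."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sqrt(np.mean((y - X @ coef) ** 2))

ones = np.ones(n)
without_gender = rmse(np.column_stack([ones, tenure]), overtime)
with_gender = rmse(np.column_stack([ones, tenure, is_woman]), overtime)

print(f"RMSE without gender: {without_gender:.2f}")  # noticeably worse
print(f"RMSE with gender:    {with_gender:.2f}")     # close to the noise level (1.5)
```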
-
This is somewhat related to the above. Sometimes you don’t have data for some piece of information that you’d like. Maybe you don’t know someone’s income, but that’s important to your model. But you do have their ZIP code. Since wealth levels often cluster in American cities (rich neighborhoods, poor neighborhoods), you may be able to use ZIP code as a proxy for income. That is, it’s likely correlated with the variable you really want, so you can use it to capture some of the statistical effects of the variable you don’t have. Many US cities also show geographical clustering by race. This is often a holdover from the days of redlining and other discriminatory practices. The Bay Area, Chicago, Dallas, LA, etc. are all metropolitan areas with pretty striking geographical segregation. So if race is correlated with geography, and geography is correlated with wealth, school quality, environmental factors, etc., then race becomes a proxy variable. If there’s a concern about children being exposed to lead, asbestos, etc., and I’m tasked with building a model to predict who is most at risk (and thus should have more resources directed at them), race is probably something I’d include. The old public schools in the inner city are probably more likely to still have lead paint, asbestos, etc. than the schools in the wealthy suburbs, and there’s probably going to be a racial disparity in who attends which school.
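A toy illustration of the proxy idea, with completely made-up ZIP codes and numbers: if income clusters by neighborhood and race clusters by neighborhood, then knowing someone’s ZIP code tells you something about both, even if neither variable appears anywhere in your data.

```python
# A toy illustration (all numbers invented) of the proxy-variable point:
# if income clusters by ZIP code and race clusters by ZIP code, then ZIP
# carries information about both.
import numpy as np

rng = np.random.default_rng(2)
n_zips, per_zip = 50, 200

# Each hypothetical ZIP gets a wealth level and a (correlated) minority share.
zip_wealth = rng.normal(60_000, 25_000, n_zips)
minority_share = np.clip(0.8 - (zip_wealth - 30_000) / 120_000, 0.05, 0.95)

zip_id = np.repeat(np.arange(n_zips), per_zip)
income = zip_wealth[zip_id] + rng.normal(0, 10_000, zip_id.size)
is_minority = rng.random(zip_id.size) < minority_share[zip_id]

# "Use ZIP as a proxy": stand in each person's unknown income with their
# ZIP's average income and see how much of the signal that recovers.
zip_avg_income = np.array([income[zip_id == z].mean() for z in range(n_zips)])
proxy = zip_avg_income[zip_id]

print("corr(ZIP-average income, true income):", round(np.corrcoef(proxy, income)[0, 1], 2))
print("corr(ZIP-average income, minority):   ", round(np.corrcoef(proxy, is_minority)[0, 1], 2))
```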
-
That brings me to your final point:
“How about an algorithm that determines a person’s loan interest rate based on their employment status, income, age, and address?”
My guess is that you’d get the algorithm spitting out different interest rates for people in different neighborhoods. As mentioned above, neighborhoods in America are often very racially segregated. You’d have minorities saying, “Hey, why do you charge us higher interest rates? Is it because we’re black?” And what’s your answer going to be? “No, race isn’t even a variable! It’s not because you’re black! It’s because of where you live.” And you’d get the response, “Oh, so it’s not because I’m black, but because I live in a predominantly black neighborhood?” Do you think that explanation is going to silence the critique that the algorithm is racist? The problem here is similar to the one exploited by the GOP’s voter suppression laws. They do things like look up which forms of ID are correlated with which groups of people, then allow gun license IDs but not government-issued public assistance IDs. The point is, outcomes can be discriminatory even if you never explicitly invoke race, gender, etc. Differences by race, gender, and the like are so interwoven into the fabric of society that there is no way to purge those effects from any data-driven process.
ALL of the above amount to the same underlying fact: American society itself contains many unjust inequalities and forms of racial and gender discrimination. Even if you ignore “race” as a variable, you’ll still pick up this inequality. Say you’re modeling students applying for college. Maybe you leave the model race-blind, but you include “quality of high school.” So kids from School X are more likely to get accepted than kids from School Y. Guess what? To an outside observer, you’re going to see a lot of white kids getting acceptance letters and a lot of minority kids being denied, because schools themselves are often racially segregated.
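Here’s roughly what that looks like, again with invented data: an admissions score that never sees race, only GPA and school quality, still admits the two groups at very different rates, because school quality is correlated with race.

```python
# A rough sketch (synthetic data, invented numbers) of "race-blind but not
# outcome-blind": the model never sees race, yet acceptance rates differ by
# race because school quality is correlated with race.
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Hypothetical segregation: minority students are far more likely to attend
# the lower-resourced school (school_quality = 0).
is_minority = rng.random(n) < 0.3
school_quality = np.where(is_minority, rng.random(n) < 0.2, rng.random(n) < 0.7).astype(float)

gpa = np.clip(rng.normal(3.0 + 0.3 * school_quality, 0.4), 0.0, 4.0)

# A deliberately race-blind "admissions model": a simple score on GPA and
# school quality, admitting everyone above a cutoff.
score = 0.7 * gpa + 0.8 * school_quality
admitted = score > np.quantile(score, 0.6)   # admit the top 40%

for label, mask in [("minority", is_minority), ("non-minority", ~is_minority)]:
    print(f"{label:>12} admit rate: {admitted[mask].mean():.0%}")
```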
Then, on top of this, there’s capitalism. Imagine one bank uses a model that excludes race, along with every variable it worries might inadvertently produce racially discriminatory outcomes because it’s correlated with race (neighborhood, education level, school quality, etc.), even though this costs the model predictive power. Another bank uses a model that maximizes predictive ability regardless of the politics of the model. Which bank is going to do better financially? The one with the model with higher predictive power. That bank will have more success, grow larger, faster, etc.
The point is, for various reasons, if society itself is racist, then the model will be racist. If you want the model to be less racist, you have to either change society, or ignore reality. If you ignore reality, you’ll lose to a competing model that doesn’t. The reality-based model will beat your utopian-vision model in predictive power.
I keep thinking about a commercial that’s been on TV recently where a pirate captain’s parrot keeps repeating aloud, in front of the crew, all the bad things the captain says about them. The parrot is just repeating the captain’s private remarks. The captain gets very mad at the parrot. But is it really the parrot’s fault? It’s just parroting back the captain’s own words.
My final thought: I’m not “defending” machine learning. I think the perpetuation of existing discrimination through machine learning is a real problem. Machine learning just parrots society. If society is racist, the machines will be racist. As a result, maybe the machines shouldn’t make decisions for us until the society they reflect is the society we want to perpetuate. That’s a very valid critique. I can get on board with that. But too many people misunderstand what’s going on and instead blame the machines or those who program the machines, and/or think we can just make tweaks to the programs to avoid this problem. We probably can’t do that. The discrimination is too deeply interwoven into the fabric of society, from where you live, to what school you go to, to which types of jobs you pursue, to which health risks you’re exposed to, to which police tactics you’re subjected to. If you stick neighborhood, income, schooling, etc. into a principal component analysis algorithm to find the “common thread” between those variables, I would not be surprised if that “common thread” is statistically indistinguishable from race. Essentially ALL data is inherently tainted by institutionalized discrimination.
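If you want to see that last thought experiment in miniature (synthetic data, invented effect sizes): take a few “neutral” variables that are each shaped by the same segregation pattern, run PCA on them, and the first principal component ends up tracking race closely.

```python
# A back-of-the-envelope version (entirely synthetic) of the PCA thought
# experiment: several "neutral" variables, each shaped by segregation, share a
# first principal component that correlates strongly with race.
import numpy as np

rng = np.random.default_rng(4)
n = 10_000

is_minority = (rng.random(n) < 0.3).astype(float)

# Hypothetical variables, each partly driven by the same segregation pattern.
neighborhood_income = 70_000 - 25_000 * is_minority + rng.normal(0, 15_000, n)
school_quality      = 0.7 - 0.3 * is_minority + rng.normal(0, 0.15, n)
lead_exposure       = 0.1 + 0.2 * is_minority + rng.normal(0, 0.08, n)

# Standardize the variables, then take the first principal component via SVD.
X = np.column_stack([neighborhood_income, school_quality, lead_exposure])
X = (X - X.mean(axis=0)) / X.std(axis=0)
_, _, vt = np.linalg.svd(X, full_matrices=False)
pc1 = X @ vt[0]

print("corr(first principal component, race):",
      round(abs(np.corrcoef(pc1, is_minority)[0, 1]), 2))
```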