Big Data Ethics: racially biased training data versus machine learning


#1



#2

Rather than calling for algorithmic transparency, we need to call for data transparency, methodological transparency, and sampling transparency.

We?


#3

GIGO indeed


#4

Any sampling person should try different sample sizes, and if the 0.5%, the 1%, the 2% and the 5% disagree, then there’s a problem. Also, we have enough computation now. Very rare is the dataset that cannot be analyzed with a 100% sample, unless we are talking genomes (edit: or tcp dumps). Sure, you’re going to develop on a sample, but when you’re done building the first draft, everyone is always itching to run it against the full dataset. I know I am. And so is everyone I know. Nobody I know goes, “meh, 1% is fine, ignore the rest.”
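For anyone who wants to try the "do the sample sizes agree?" check, here is a minimal sketch using only the standard library. The data is synthetic, and the names `sample_means`, `samples_disagree`, and the 5% tolerance are assumptions made for illustration, not anything from the post.

```python
import random
import statistics

def sample_means(values, fractions=(0.005, 0.01, 0.02, 0.05), seed=0):
    """Return {fraction: mean of a simple random sample of that fraction}."""
    rng = random.Random(seed)
    results = {}
    for frac in fractions:
        k = max(1, int(len(values) * frac))  # always draw at least one row
        results[frac] = statistics.fmean(rng.sample(values, k))
    return results

def samples_disagree(estimates, full_value, rel_tol=0.05):
    """True if any sample estimate is more than rel_tol away from the full-data value."""
    return any(
        abs(est - full_value) > rel_tol * abs(full_value)
        for est in estimates.values()
    )

# Synthetic stand-in for a real dataset (assumption: one numeric column).
data_rng = random.Random(42)
data = [data_rng.gauss(100.0, 15.0) for _ in range(100_000)]

full_mean = statistics.fmean(data)      # the "100% sample"
estimates = sample_means(data)
for frac, est in estimates.items():
    print(f"{frac:.1%} sample mean: {est:.2f}")
print("full mean:", round(full_mean, 2))
print("problem?", samples_disagree(estimates, full_mean))
```

On well-behaved data like this the fractions should land close to the full-data value. A real dataset with heavy tails or rare subgroups is where the 0.5% and 5% samples start to diverge, which is the warning sign described above.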


#5

Fascinatingly enough it’s tangentially relevant.


#6

data-analysis was used to identify risk-factors for long-term…

Those first two dashes are unnecessary. It’s “data analysis” and “risk factors”. There’s a dash between “long” and “term” because “long-term” forms an adjective.


#7

Sorry – erring on the side of read-ability.


#8

Wait, what are you quoting?

EDIT: nvm, they were quoting OP, who isn’t me.


#9

I’ve been ramping up projects using Amazon Mechanical Turk, and one of the aspects that interests me is the profiles of the turkers (basically, what technologies they use) versus, say, data I can get from Akamai. Cause I gotta use this data in a machine learning environment, so it has got to be right.


#10

Who else?


#11

Are they similar?


#12

I’m mining device data, and Mechanical Turk gives me the finer-grained data that I need. But Akamai, being a CDN, has about 10,000x more general data points. (I’m doing a web browser study.)

So given that I need fine-grained detail, how closely do the turkers match Akamai’s distribution? Cause it has to be close, or the training set will be weirdly biased.
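One rough way to put a number on "how close" is the total variation distance between the two browser-share distributions. This is a sketch under assumptions: the counts below are made up, the category names are hypothetical, and total variation distance is just one reasonable choice of measure, not something named in the thread.

```python
from collections import Counter

def shares(counts):
    """Convert raw counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}

def total_variation_distance(p_counts, q_counts):
    """Half the summed absolute difference in shares: 0 = identical, 1 = disjoint."""
    p, q = shares(p_counts), shares(q_counts)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Made-up counts for illustration only; the large set is 10,000x the small one.
mturk = Counter({"chrome": 620, "firefox": 250, "safari": 90, "other": 40})
akamai = Counter({"chrome": 5_500_000, "firefox": 2_000_000,
                  "safari": 1_800_000, "other": 700_000})

tvd = total_variation_distance(mturk, akamai)
print(f"total variation distance: {tvd:.3f}")
```

A small distance suggests the fine-grained sample roughly tracks the big one. A large distance is a hint that the training set needs reweighting or more collection before it goes anywhere near a model.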


#13

Good god, man. Hopefully whiskey is involved.


#14

I’ve probably mentioned it here on another machine intelligence post, but I was working on training computers to read and rate college entrance essays years ago…as with a lot of college entrance essays, we pose a dilemma that needs to be analyzed, we look to see how they write, we look for analytical skills, and we ask how this affects them personally.

Our software only focused on the writing aspect, and our goal was to have a tool that was objective and not based around the biases of the raters. UNFORTUNATELY, we captured said biases in software. Depending on which specific names appeared in the personal portion of the essay, the software would produce positive or negative scoring differences according to how far the name deviated from the norm. John or Jane would be a ‘neutral’ name. Tyrone, to pick a stereotypical name, is one we’d find only rarely outside of African Americans, and it would cause the score to be negative. Same with Hispanic names, with the exception of Jesus, which the software learned to correlate positively, but only if it appeared near God. Asian names, though we only found a few that triggered it due to the variance, would be positive.

One of the tasks I took was to ameliorate this issue, either by retraining the model (which meant going back through 3 years of work) or through building a NEW model that only identified names and neutralized them before they went into the model.

Given that this was designed for college entrance…this was a MAJOR problem. Others working with us didn’t see it as a problem…they were looking for a quick sale to college testing corporations and didn’t want to focus on ‘edge cases’.


#15

Curse of Dimensionality, Overfitting, and Overtrained models are the bane of ML.

Make your training data as simple as possible, but no simpler :smiley:


#16

Insert face palm here.

HOW were the prejudices of the human raters captured in the software? Wouldn’t it have been overly complex to have created the original software in such a way that it distinguished between names based on ethnicity and then applied a value judgement to them, for example? I mean, two steps instead of one…why would they have done that, originally?


#17

The original algorithms were designed by a man in the late '60s who waited 30 years to actually bring them to life. It wasn’t traditional machine intelligence; however, we used modern techniques to shape what he had done – and he was REASONABLY successful with what he had attempted to do. And unfortunately, screwing with the original algorithm in and of itself was off the table as it was his baby…and by this time, Alzheimer’s was taking its toll, which made him very protective of this. HOWEVER, finding ways to shape the data before it made it into the system was something we could do.

Let’s just say, when I took over as the lead at my university, we were trying to transition code from supercomputers (you know, the things that are slower than your phone these days) to commodity 486s. Twenty years later, a lot of what we didn’t know back then is pretty obvious.

Either way, the SOFTWARE wasn’t programmed to capture things like ethnicity…it was an emergent feature that arose out of human raters evaluating several thousand papers on several metrics. It was one of the reasons that half the team was made up of programmers and the other half psychologists. Half the time we’d be trying to analyze why the code was giving one result or another. Sometimes a single word would change every rating, and this was NOT how the software was supposed to perform. Again, the constraint was to get the code to a level where the consortium could sell it…before our leader succumbed to dementia…


#18

Thank you for the detailed response. If the algorithm was using the raters’ evaluations from the 1960s, I guess that could explain how something in use in recent times was codified to be explicitly prejudiced that way. It’s just hard to imagine that someone actually coded “if Tyrone, then score is less than or equal to C-”, you know?


#19

The ALGORITHM was designed in the '60s. The actual model was built on transcribed essays from the '90s and 2000s. There is no “if Tyrone, then score is less than or equal to C-”…it picked up the subtle score differences between essays rated with different words, phrases, and grammar rules. We had rules that were hard coded, but they were generally grammar…the rest of the rules emerged from the model based around ratings. For instance…a good deal of the essays came from public schools that participated in a writing program (which I will decline to name because IRB rules seemed to have gone out the window with a lot of this! It was over a decade ago and I left the university to pursue a different career for a while, but I’m back again!).

The bigger problems came from the source of the essays…public schools that had been in the program because it gave them money and were largely minority…and private schools where private tutors were a norm. One of the keys we used was word usage and what words correlate to higher scores. Names weren’t a part of the dictionary and were automatically classified as a generic. From here, the algorithm looked at what words were being used, and the types of words used affected the score. Could we have changed this? Yes. However, it was part of the base algorithm that the guy who created this felt was integral for a system that was intended to evolve with the times. So we worked around it…figuring out how we could identify and classify names (some of which we had never seen before) as names – and not some generic construct – and substitute the ‘word’ that was most neutral instead.
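To make the workaround concrete, here is a toy sketch of "identify names and swap in a neutral token before scoring". It is not the project's actual code. The `KNOWN_NAMES` list, the tiny `VOCABULARY`, the `PERSON` token, and the capitalization heuristic for unseen names are all assumptions made up for illustration.

```python
import re

# Hypothetical lookup list and stand-in for the scoring model's dictionary.
KNOWN_NAMES = {"john", "jane", "tyrone", "jesus"}
VOCABULARY = {"my", "friend", "and", "i", "went", "to", "the", "library",
              "helped", "me", "study"}
NEUTRAL_TOKEN = "PERSON"  # assumed placeholder for "the most neutral word"

# Split text into runs of letters and runs of everything else.
TOKEN_RE = re.compile(r"[A-Za-z]+|[^A-Za-z]+")

def neutralize_names(text):
    """Replace known names, and unknown capitalized mid-sentence words, with a token."""
    out = []
    sentence_start = True
    for tok in TOKEN_RE.findall(text):
        if tok[0].isalpha():
            lower = tok.lower()
            unseen_name = (not sentence_start and tok[0].isupper()
                           and lower not in VOCABULARY)
            out.append(NEUTRAL_TOKEN if lower in KNOWN_NAMES or unseen_name else tok)
            sentence_start = False
        else:
            out.append(tok)
            if any(ch in ".!?" for ch in tok):
                sentence_start = True  # next word begins a new sentence
    return "".join(out)

essay = "Tyrone helped me study. My friend Marisol and I went to the library."
print(neutralize_names(essay))
```

The heuristic is deliberately crude: it misses unseen names at the start of a sentence and will flag any capitalized word missing from the vocabulary, such as place names. It only shows where in the pipeline the substitution would sit.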

It was a learning exercise for me…the folks that I worked with had invested a LONG time in their way of doing things. It was the project that got me to switch out of my computer science program for a neuro / psych program…if I’d known then what I know today…and it was my project…things would have been completely different. But you find a lot of folks using this sort of technology. Our software, in one form or another, is still in use in a few areas (we also used it for identifying psychological traits within the subjects)…in this area we were fighting against latent semantic analysis and it sort of won this war, but even it turned out to be completely wrong given what is known a decade later.


#20

and @chgoliz, a step back. machine learning is perceived by many to be egalitarian, but in reality it reflects unconscious (or conscious) biases in the data set. so what we call ‘features’, such as name extraction/grammatical errors/fallacies come back to bite us in the same way that classification via zip code does.

and believe me, i am in the thick of it. I get to decide which features are important, which need to be scrubbed, which need to be normalized, which should be hashed, how many vectors are important, and which are necessary conditions. and i’m a damn dropout :smiley: As they say, No Pressure. I barely even understand what I am saying.
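For readers unfamiliar with those verbs, here is a minimal sketch of scrubbing, normalizing, and hashing on a single record. The field names, the value range, and the bucket count are all hypothetical, and the choices of which field gets which treatment are illustrative only.

```python
import hashlib

def hash_bucket(value, n_buckets=16):
    """Map a string to a stable bucket index (the 'hashing trick' in miniature)."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def min_max(x, lo, hi):
    """Rescale x from [lo, hi] to [0, 1]; returns 0.0 for a degenerate range."""
    return (x - lo) / (hi - lo) if hi > lo else 0.0

def prepare(record, drop=("name", "zip_code"), numeric_ranges=None, hashed=("browser",)):
    """Scrub dropped fields, normalize numeric ones, hash categorical ones."""
    numeric_ranges = numeric_ranges or {}
    out = {}
    for key, value in record.items():
        if key in drop:
            continue  # scrubbed: never reaches the model
        if key in numeric_ranges:
            lo, hi = numeric_ranges[key]
            out[key] = min_max(value, lo, hi)
        elif key in hashed:
            out[key] = hash_bucket(str(value))
        else:
            out[key] = value
    return out

# Hypothetical record; zip code is dropped because it can act as a proxy feature.
record = {"name": "Jane", "zip_code": "60601", "essay_words": 450, "browser": "firefox"}
features = prepare(record, numeric_ranges={"essay_words": (0, 1000)})
print(features)
```

Dropping a field does not remove bias that leaks in through correlated features, so this is a starting point for the decisions described above, not a fix.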