Big Data Ethics: racially biased training data versus machine learning

[Read the post]

2 Likes

Rather than calling for algorithmic transparency, we need to call for data transparency, methodological transparency, and sampling transparency.

We?

1 Like

GIGO indeed

1 Like

Any sampling person should try different sample sizes, and if the 0.5%, the 1%, the 2%, and the 5% disagree, then there’s a problem. Also, we have enough computation now. Very rare is the dataset that cannot be analyzed with a 100% sample, unless we are talking genomes (edit: or TCP dumps). Sure, you’re going to develop on a sample, but when you’re done building the first draft, everyone is always itching to run it against the full dataset. I know I am. And so is everyone I know. Nobody I know goes, “meh, 1% is fine, ignore the rest.”
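
To make that concrete, here’s a rough sketch of that sanity check in Python (the DataFrame and the `clicked` column are made-up stand-ins for a real dataset):

```python
# Sketch: compare a summary statistic across several sample fractions.
# The DataFrame and the `clicked` column are hypothetical toy data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"clicked": rng.integers(0, 2, size=1_000_000)})

for frac in (0.005, 0.01, 0.02, 0.05, 1.0):
    sample = df.sample(frac=frac, random_state=42)
    print(f"{frac:>6.1%} sample: mean clicked = {sample['clicked'].mean():.4f}")
# If the 0.5%, 1%, 2%, and 5% estimates disagree badly with the 100% run,
# something is off with the sampling (or the data).
```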

4 Likes

Fascinatingly enough, it’s tangentially relevant.

4 Likes

data-analysis was used to identify risk-factors for long-term…

Those first two hyphens are unnecessary. It’s “data analysis” and “risk factors”. There’s a hyphen between “long” and “term” because “long-term” forms an adjective.

3 Likes

Sorry – erring on the side of read-ability.

1 Like

Wait, what are you quoting?

EDIT: nvm, they were quoting OP, who isn’t me.

I’ve been ramping up projects using Amazon Mechanical Turk, and one of the aspects that interests me is the profiles of the turkers (basically, what technologies do they use) versus, say, data I can get from Akamai. ’Cause I gotta use this data in a machine learning environment, so it has got to be right.

1 Like

Who else?

1 Like

Are they similar?

I’m mining device data, and Mechanical Turk gives me the finer-grained data that I need. But Akamai, being a CDN, has about 10,000x more general data points. (I’m doing a web browser study.)

So given that I need fine-grained detail, how closely do the MTurkers match Akamai’s distribution? ’Cause it has to be close, or the training set will be weirdly biased.
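
One rough way to check, assuming both sources can be boiled down to per-browser counts (every number below is a made-up placeholder, not a real MTurk or Akamai figure):

```python
# Sketch: compare an MTurk device distribution against a CDN-wide one.
# All counts and shares here are hypothetical placeholders.
from scipy.stats import chisquare

browsers     = ["chrome", "firefox", "safari", "edge", "other"]
mturk_counts = [520, 210, 140, 80, 50]          # hypothetical MTurk panel
cdn_share    = [0.61, 0.10, 0.18, 0.06, 0.05]   # hypothetical CDN-wide shares

total    = sum(mturk_counts)
expected = [p * total for p in cdn_share]

stat, p_value = chisquare(mturk_counts, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p_value:.3g}")
# A tiny p-value means the panel's browser mix differs measurably from the
# CDN-wide distribution, so a model trained on it may inherit that skew.
```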

1 Like

Good god, man. Hopefully whiskey is involved.

2 Likes

I’ve probably mentioned it here on another machine intelligence post, but I was working on training computers to read and rate college entrance essays years ago. As with a lot of college entrance essays, we pose a dilemma that needs to be analyzed, we look to see how they write, we look for analytical skills, and we ask how this affects them personally.

Our software only focused on the writing aspect, and our goal was to have a tool that was objective and not based around the biases of the raters. UNFORTUNATELY, we captured said biases in the software. With specific names in the personal portion of the essay, the software would produce positive or negative scoring differences depending on how far the name departed from the norm (i.e., John or Jane would be ‘neutral’ names… Tyrone, to pick a stereotypical name, would be one we’d find only rarely outside of African Americans, and it would cause the score to be negative… same with Hispanic names, with the exception of Jesus, which the software learned to correlate positively, but only if it appeared near God… Asian names, though we only found a few that triggered it due to the variance, would be positive).

One of the tasks I took on was to ameliorate this issue, either by retraining the model (which meant going back through three years of work) or by building a NEW model that only identified names and neutralized them before they went into the scoring model.

Given that this was designed for college entrance… this was a MAJOR problem. Others working with us didn’t see it as a problem… they were looking for a quick sale to college testing corporations and didn’t want to focus on ‘edge cases’.
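
For what it’s worth, one way to surface that kind of bias is a name-swap audit of the trained scorer. The sketch below uses a hypothetical `score_essay` placeholder, not the actual system:

```python
# Sketch: audit an essay scorer for name sensitivity by swapping only the name.
# `score_essay` is a hypothetical placeholder for the trained model's scoring call.
def score_essay(text: str) -> float:
    # Placeholder scorer; in a real audit this would be the trained model.
    return 75.0 + 0.1 * len(text.split())

TEMPLATE = ("When {name} faced the dilemma described above, {name} weighed "
            "both options carefully before choosing the harder path.")

baseline = score_essay(TEMPLATE.format(name="John"))
for name in ["Jane", "Tyrone", "Jesus", "Mei"]:
    delta = score_essay(TEMPLATE.format(name=name)) - baseline
    print(f"{name:>8}: score shift vs. 'John' = {delta:+.2f}")
# The placeholder always prints +0.00; against the real model, a consistent
# nonzero shift that tracks the ethnicity of the name is exactly the captured
# rater bias described above.
```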

9 Likes

Curse of Dimensionality, Overfitting, and Overtrained models are the bane of ML.

Make your training data as simple as possible, but no simpler :smiley:

1 Like

Insert face palm here.

HOW were the prejudices of the human raters captured in the software? Wouldn’t it have been overly complex to have created the original software in such a way that it distinguished between names based on ethnicity and then applied a value judgement to them, for example? I mean, two steps instead of one… why would they have done that, originally?

The original algorithms were designed by a man in the late '60s who waited 30 years to actually bring them to life. It wasn’t traditional machine intelligence; however, we used modern techniques to shape what he had done – and he was REASONABLY successful with what he had attempted to do. And unfortunately, screwing with the original algorithm in and of itself was off the table, as it was his baby… and by this time, Alzheimer’s was taking its toll, which made him very protective of it. HOWEVER, finding ways to shape the data before it made it into the system was something we could do.

Let’s just say, when I took over as the lead at my university, we were trying to transition code from supercomputers (you know, the things that are slower than your phone these days) to commodity 486s. Twenty years later, a lot of what we didn’t know back then is pretty obvious.

Either way, the SOFTWARE wasn’t programmed to capture things like ethnicity… it was an emergent feature that arose out of human raters evaluating several thousand papers on several metrics. It was one of the reasons that half the team was composed of programmers and the other half psychologists. Half the time we’d be trying to analyze why the code was giving one result or another. Sometimes a single word would change every rating, and this was NOT how the software was supposed to perform. Again, the constraint was to get the code to a level that the consortium could sell… before our leader succumbed to dementia…

7 Likes

Thank you for the detailed response. If the algorithm was using the raters’ evaluations from the 1960s, I guess that could explain how something in use in recent times was codified to be explicitly prejudiced that way. It’s just hard to imagine that someone actually coded “if Tyrone, then score is less than or equal to C-”, you know?

The ALGORITHM was designed in the '60s. The actual model we designed it on was built from essays transcribed from the '90s and 2000s. There is no “if Tyrone, then score is less than or equal to C-”… it picked up the subtle score differences between essays rated with different words, phrases, and grammar rules. We had rules that were hard coded, but they were generally grammar… the rest of the rules emerged from the model based around ratings. For instance… a good deal of the essays came from public schools that participated in a writing program (which I will decline to name, because IRB rules seemed to have gone out the window with a lot of this! It was over a decade ago and I left the university to pursue a different career for a while, but I’m back again!)

The bigger problems came from the source of the essays… public schools that had been in the program because it gave them money and were largely minority… and private schools where private tutors were the norm. One of the keys we used was word usage and which words correlated to higher scores. Names weren’t a part of the dictionary and were automatically classified as a generic. From there, the algorithm looked at what words were being used, and the types of words used affected the score. Could we have changed this? Yes. However, it was part of the base algorithm that the guy who created this felt was integral for a system that was intended to evolve with the times. So we worked around it… figured out how we could identify and classify names (some of which we had never seen before) as names – and not some generic construct – and substitute the ‘word’ that was most neutral instead.
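
A minimal sketch of that kind of name-neutralization step (the name list and the neutral token here are hypothetical, not what the real system used, which also had to handle names it had never seen):

```python
# Sketch: replace recognized given names with a single neutral substitute
# before the text reaches the scoring model. The name list and the neutral
# token are hypothetical placeholders.
import re

KNOWN_NAMES = {"john", "jane", "tyrone", "jesus", "maria", "mei"}
NEUTRAL_TOKEN = "Alex"  # whichever 'word' scored most neutrally in the model

def neutralize_names(essay: str) -> str:
    def swap(match: re.Match) -> str:
        word = match.group(0)
        return NEUTRAL_TOKEN if word.lower() in KNOWN_NAMES else word
    return re.sub(r"[A-Za-z]+", swap, essay)

print(neutralize_names("Tyrone told Maria that the decision changed him."))
# -> "Alex told Alex that the decision changed him."
```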

It was a learning exercise for me… the folks I worked with had invested a LONG time in their way of doing things. It was the project that got me to switch out of my computer science program and into a neuro/psych program… if I’d known then what I know today… and it had been my project… things would have been completely different. But you find a lot of folks using this sort of technology. Our software, in one form or another, is still in use in a few areas (we also used it for identifying psychological traits within the subjects)… in this area we were fighting against latent semantic analysis, and it sort of won that war, but even it was completely wrong in light of what is known a decade later.

5 Likes

And @chgoliz, a step back. Machine learning is perceived by many to be egalitarian, but in reality it reflects unconscious (or conscious) biases in the data set. So what we call ‘features’, such as name extraction/grammatical errors/fallacies, come back to bite us in the same way that classification via zip code does.

And believe me, I am in the thick of it. I get to decide which features are important, which need to be scrubbed, which need to be normalized, which should be hashed, how many vectors are important, and which are necessary conditions. And I’m a damn dropout :smiley: As they say, no pressure. I barely even understand what I am saying.
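
In scikit-learn terms, those decisions end up looking something like the sketch below (the column names and the particular scrub/hash/normalize choices are hypothetical illustrations):

```python
# Sketch: the kinds of feature decisions described above, in scikit-learn terms.
# The column names and the specific choices are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "zip_code": ["60601", "60614", "10001"],   # scrubbed: a proxy for race/income
    "user_agent": ["Mozilla/5.0 Chrome", "Mozilla/5.0 Firefox", "Safari"],
    "requests_per_day": [120.0, 35.0, 800.0],
})

features = ColumnTransformer([
    ("scrub", "drop", ["zip_code"]),                              # deliberately excluded
    ("hash", HashingVectorizer(n_features=2**8), "user_agent"),   # hashed text feature
    ("scale", StandardScaler(), ["requests_per_day"]),            # normalized numeric
])

X = features.fit_transform(df)
print(X.shape)  # (3, 257): 256 hashed dimensions + 1 scaled column
```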

6 Likes