Rather than calling for algorithmic transparency, we need to call for data transparency, methodological transparency, and sampling transparency.
We?
GIGO indeed
Anyone doing sampling should try different sample sizes, and if the 0.5%, the 1%, the 2% and the 5% disagree, then there's a problem. Also, we have enough computation now. Very rare is the dataset that cannot be analyzed with a 100% sample, unless we are talking genomes (edit: or tcp dumps). Sure, you're going to develop on a sample, but when you're done building the first draft, everyone is always itching to run it against the full dataset. I know I am. And so is everyone I know. Nobody I know goes, "meh, 1% is fine, ignore the rest."
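To make that concrete, here's roughly the kind of sanity check I mean. Just a sketch, assuming a pandas DataFrame `df` and a numeric column of interest (both stand-ins for whatever you're actually measuring):

```python
import pandas as pd

def compare_sample_fractions(df, column, fractions=(0.005, 0.01, 0.02, 0.05), seed=42):
    """Compare an estimate at several sample fractions against the full pass."""
    results = {1.0: df[column].mean()}  # the 100% "ground truth" estimate
    for frac in fractions:
        sample = df.sample(frac=frac, random_state=seed)
        results[frac] = sample[column].mean()
    return results

# If the 0.5%, 1%, 2% and 5% estimates disagree with each other (or with
# the full pass), the sampling scheme -- not the data -- is the problem.
```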
Fascinatingly enough it's tangentially relevant.
data-analysis was used to identify risk-factors for long-term…
Those first two dashes are unnecessary. It's "data analysis" and "risk factors". There's a dash between "long" and "term" because "long-term" forms an adjective.
Sorry, erring on the side of read-ability.
Wait, what are you quoting?
EDIT: nvm, they were quoting OP, who isn't me.
I've been ramping up projects using Amazon Mechanical Turk, and one of the aspects that interests me is the profiles of the turkers (basically, what technologies do they use) versus, say, data I can get from Akamai. 'Cause I gotta use this data in a machine learning environment, so it has got to be right.
Who else?
Are they similar?
I'm mining device data, and Mechanical Turk gives me the finer-grained data that I need. But Akamai, being a CDN, has about 10,000x more general data points. (I'm doing a web browser study.)
So given that I need fine-grained detail, how closely do the turkers match Akamai's distribution? 'Cause it has to be close, or the training set will be weirdly biased.
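Roughly what I plan to do to check that, sketched out below. The browser categories and counts are made up placeholders, not the real MTurk or Akamai numbers:

```python
import numpy as np
from scipy.stats import chisquare

browsers = ["chrome", "firefox", "safari", "ie", "other"]
mturk_counts = np.array([520, 210, 140, 90, 40])          # hypothetical MTurk sample
akamai_share = np.array([0.45, 0.18, 0.20, 0.12, 0.05])   # hypothetical CDN-wide shares

# Chi-squared goodness of fit: are the MTurk counts plausible draws
# from the CDN-wide distribution?
expected = akamai_share * mturk_counts.sum()
stat, pvalue = chisquare(mturk_counts, f_exp=expected)
print(f"chi2={stat:.1f}, p={pvalue:.3f}")

# Total variation distance gives a direct "how skewed is my training set"
# number: 0 = identical distributions, 1 = completely disjoint.
tvd = 0.5 * np.abs(mturk_counts / mturk_counts.sum() - akamai_share).sum()
print(f"TVD={tvd:.3f}")
```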
Good god, man. Hopefully whiskey is involved.
I've probably mentioned it here on another machine intelligence post, but I was working on training computers to read and rate college entrance essays years ago… As with a lot of college entrance essays, we pose a dilemma that needs to be analyzed, we look to see how they write, we look for analytical skills, and we ask how this affects them personally.
Our software only focused on the writing aspect, and our goal was to have a tool that was objective and not based around the biases of the raters. UNFORTUNATELY, we captured said biases in software. Depending on the specific names used within the personal portion of the essay, the software would produce positive or negative scoring differences based on how far the name deviated from the norm (i.e., John or Jane would be a "neutral" name… Tyrone, to pick a stereotypical name, would be one we'd find only rarely outside of African Americans, and it would cause the score to be negative… same with Hispanic names, with the exception of Jesus, though the software learned to correlate it positively, but only if near "God"… Asian names, though we only found a few that triggered it due to the variance, would be positive).
One of the tasks I took on was to ameliorate this issue, either by retraining the model (which meant going back through 3 years of work) or by building a NEW model that only identified names and neutralized them before they went into the model.
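Conceptually, the second option amounts to something like the sketch below. This is just an illustration, not what we actually shipped; the tiny name list and the PERSON token are stand-ins for the real dictionary and the neutral substitute:

```python
import re

NEUTRAL_TOKEN = "PERSON"
KNOWN_NAMES = {"john", "jane", "tyrone", "jesus", "maria"}  # tiny demo list, not the real one

def neutralize_names(essay_text):
    def replace(match):
        word = match.group(0)
        return NEUTRAL_TOKEN if word.lower() in KNOWN_NAMES else word
    # Only touch capitalized tokens, a crude proxy for proper nouns.
    return re.sub(r"\b[A-Z][a-z]+\b", replace, essay_text)

print(neutralize_names("Tyrone wrote about how Jane helped him."))
# -> "PERSON wrote about how PERSON helped him."
```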
Given that this was designed for college entrance… this was a MAJOR problem. Others working with us didn't see it as a problem… they were looking for a quick sale to college testing corporations and didn't want to focus on "edge cases".
Curse of Dimensionality, Overfitting, and Overtrained models are the bane of ML.
Make your training data as simple as possible, but no simpler
Insert face palm here.
HOW were the prejudices of the human raters captured in the software? Wouldn't it have been overly complex to have created the original software in such a way that it distinguished between names based on ethnicity and then applied a value judgement to them, for example? I mean, two steps instead of one… why would they have done that, originally?
The original algorithms were designed by a man in the late '60s who waited 30 years to actually bring them to life. It wasn't traditional machine intelligence; however, we used modern techniques to shape what he had done, and he was REASONABLY successful with what he had attempted to do. And unfortunately, screwing with the original algorithm in and of itself was off the table, as it was his baby… and by this time, Alzheimer's was taking its toll, which made him very protective of it. HOWEVER, finding ways to shape the data before it made it into the system was something we could do.
Let's just say, when I took over as the lead at my university, we were trying to transition code from supercomputers (you know, the things that are slower than your phone these days) to commodity 486s. Twenty years later, a lot of what we didn't know back then is pretty obvious.
Either way, the SOFTWARE wasn't programmed to capture things like ethnicity… it was an emergent feature that arose out of human raters evaluating several thousand papers on several metrics. It was one of the reasons that half the team was comprised of programmers and the other half psychologists. Half the time we'd be trying to analyze why the code was giving one result or another. Sometimes a single word would change every rating, and this was NOT how the software was supposed to perform. Again, the constraint was to get the code to a level that the consortium could sell it… before our leader succumbed to dementia…
Thank you for the detailed response. If the algorithm was using the raters' evaluations from the 1960s, I guess that could explain how something in use in recent times was codified to be explicitly prejudiced that way. It's just hard to imagine that someone actually coded "if Tyrone, then score is less than or equal to C-", you know?
The ALGORITHM was designed in the '60s. The actual model was built on essays that were transcribed from the '90s and 2000s. There is no "if Tyrone, then score is less than or equal to C-"… it picked up the subtle score differences between essays rated with different words, phrases, and grammar rules. We had rules that were hard-coded, but they were generally grammar… the rest of the rules emerged from the model based around the ratings. For instance… a good deal of the essays came from public schools that participated in a writing program (which I will decline to name, because IRB rules seemed to have gone out the window with a lot of this! It was over a decade ago and I left the university to pursue a different career for a while, but I'm back again!)
The bigger problems came from the source of the essays… public schools that had been in the program because it gave them money, and were largely minority… and private schools where private tutors were the norm. One of the keys we used was word usage and which words correlate to higher scores. Names weren't a part of the dictionary and were automatically classified as a generic. From there, the algorithm looked at what words were being used, and the types of words used affected the score. Could we have changed this? Yes. However, it was part of the base algorithm that the guy who created it felt was integral for a system that was intended to evolve with the times. So we worked around it… figuring out how we could identify and classify names (some of which we had never seen before) as names, not some generic construct, and substitute the "word" that was most neutral instead.
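To illustrate how rater bias can ride in on vocabulary alone, here's a toy version of the "which words correlate to higher scores" idea. The essays and ratings are invented for the sake of the example, nothing like the real corpus:

```python
from collections import defaultdict

rated_essays = [
    ("my tutor helped me revise the thesis thoroughly", 5),
    ("the dilemma forced me to weigh both outcomes", 5),
    ("me and my friends talked about the problem", 2),
    ("we was trying to figure out what to do", 2),
]

score_sum = defaultdict(float)
count = defaultdict(int)
for text, score in rated_essays:
    for word in text.split():
        score_sum[word] += score
        count[word] += 1

word_weight = {w: score_sum[w] / count[w] for w in score_sum}
# "tutor" ends up with a high weight not because it signals good writing,
# but because the raters scored essays that mention tutors higher.
print(sorted(word_weight.items(), key=lambda kv: -kv[1])[:5])
```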
It was a learning exercise for me… the folks that I worked with had invested a LONG time in their way of doing things. It was the project that got me to switch out of my computer science program for a neuro / psych program… if I'd known then what I know today… and it had been my project… things would have been completely different. But you find a lot of folks using this sort of technology. Our software, in one form or another, is still in use in a few areas (we also used it for identifying psychological traits within the subjects)… in this area we were fighting against latent semantic analysis, and it sort of won this war, but even it was completely wrong given what is known a decade later.
and @anon67050589, a step back. machine learning is perceived by many to be egalitarian, but in reality it reflects unconscious (or conscious) biases in the data set. so what we call "features", such as name extraction/grammatical errors/fallacies, come back to bite us in the same way that classification via zip code does.
and believe me, i am in the thick of it. I get to decide which features are important, which need to be scrubbed, which need to be normalized, which should be hashed, how many vectors are important, and which are necessary conditions. and i'm a damn dropout. As they say, No Pressure. I barely even understand what I am saying.
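for what it's worth, those decisions look something like this in scikit-learn terms. the column names are invented, and which columns get scrubbed is exactly the human judgment call i'm describing:

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.preprocessing import StandardScaler

numeric_cols = ["session_length", "requests_per_minute"]  # normalize these
hashed_col = "user_agent"                                  # hash the high-cardinality string field
# zip_code and full_name are deliberately left out: scrubbed as bias proxies.

preprocess = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), numeric_cols),
        ("hash", HashingVectorizer(n_features=2**12), hashed_col),
    ],
    remainder="drop",  # anything not listed above never reaches the model
)
```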