Another aspect that creeps in with ML is the original data collection, and the mixing/merging of various public-domain datasets.
For example, the race/ethnicity field in Medicare is notoriously underpopulated and inaccurate, not to mention that it conflates race and ethnicity in a single variable.
Data wonks at CMS tried to remedy the incompleteness with name-search algorithms in the early 2000s, trying to match known names to categorize people post-hoc, such as assuming Gonzales indicates Hispanic ethnicity. But what about race? And what about a name like Lee? Or Law or Sing or Song? And how about that old tradition of women taking their husbands' surnames? That doesn't change a woman's racial/ethnic identity.
So, activities like this, while seeming like worthwhile endeavors, are riddled with problems. Not to mention, they haven't addressed the fundamental issue: why is the race/ethnicity field in Medicare unreliable and missing in the first place? How was that information supposed to be collected? What failed? Why didn't people volunteer it? Were they concerned that they might not receive the benefit if they let on about their cultural and genetic origins? Was the information collected by an agent, whose presence in the process injected a racial bias? Were the forms processed by biased people? Etc. You can speculate endlessly about why the data sucks so bad.
I’m bringing this up because we are often trying to solve a problem with an algorithm or fast thinking after the fact. But it might have been such a piss-poor data collection in the first place that no amount of fancy programming is going to suddenly create a Nobel-prize-winning breakthrough and change the nature of GIGO. Garbage data is garbage. Sometimes, we can deduce signal from noise, but that’s often a reduction, or a stripping away of static, not an imputation.
And then, when we use that bad data, assuming it to be correct, we have lost that conversation about data quality and we have propagated an error that we might not ever be aware of.
Love ML, it’s the shizz. But we have to be careful with it.
Degrees are no indication of intelligence & resourcefulness. I know some really kickass middle-aged programmers and thinkers who don’t have an undergrad degree and don’t want one because they are doing fine. This world is all about wits and wiles anyways, so recognize it and move forward confidently. You’re good in my book.
Now see, THIS is why I love BB! Real information discussed by multiple people who know what they’re talking about.
How perfect an example this is, that the skills/knowledge of multiple disciplines need to be part of every step, or else it really is a case of the weakest link making the whole chain untrustworthy.
I can’t even fathom how using such disparate input (essays from wealthy private school students vs those from poor inner city public school students…seriously? I mean…SERIOUSLY?) was ever green-lighted. But it’s all downhill from there. How do you create an algorithm to work with corrupted data? The more you do it right, the worse it is.
Doesn’t this problem persist no matter how you present the data, because the data are prejudiced? If you want a machine to learn to play chess, it can objectively know whether it won or lost the game. But if you want a machine to grade papers, you have to tell it whether it graded them correctly or not. All of the biases of the people giving it that information are automatically taken in by the machine. If we had objective rules for grammar and style it would be different, but English doesn’t really have those, and if we did, we wouldn’t actually need a learning machine to solve the problem for us any more than we’d need one to objectively grade arithmetic.
@clifyt, @japhroaig, Any thoughts? I can’t see any way for machine learning to create more egalitarian outcomes when it is learning from us.
The fact of the matter was, this was SUPPOSED to be a generic algorithm. The goal was to have papers from all walks of life, because this was intended to be real-world data. One of the problems of the original IQ tests was that they were solely normalized on upper-middle-class college students at elite universities. Certain questions such as “If Johnny were on his yacht and fell over starboard, while doing 15 knots, how long would it take to turn around and pick him up”…ok, a hyperbolic example, but questions were similar, and students of ‘lower’ upbringing had no clue what those words meant. So the goal is to find neutral questions – and in test design we have multiple levels of people, from statisticians, to psychometricians, to content knowledge specialists…where in the past it would be one guy (generally a guy…white…older…higher socioeconomic status) that did everything.
And this was one of the things we trained humans on. I spent as much time with the English / Psych staff training humans to improve their interrater reliability as I did machines. If we had 4 people rate an essay and several gave the paper a 5 and another gave a 2…we wanted to know why. And then we’d workshop the paper with the group that rated it and try to see what stuck out and fix the human element. And we’d come to a solution. Occasionally the solution was that one of the raters was unreliable and didn’t make it to the next round.
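(The post doesn't say which agreement statistic was used, so this is purely an illustration: Cohen's kappa is one common way to put a number on interrater reliability, and it makes the odd rater out jump off the page. Made-up scores, obviously.)

```python
# Hypothetical example: four raters' scores (on a 1-6 scale) for the same ten essays.
# Cohen's kappa compares two raters at a time, so check every pair.
from itertools import combinations

from sklearn.metrics import cohen_kappa_score

ratings = {
    "rater_a": [5, 4, 6, 3, 5, 2, 4, 5, 3, 6],
    "rater_b": [5, 4, 5, 3, 5, 2, 4, 5, 3, 6],
    "rater_c": [5, 3, 6, 3, 4, 2, 4, 5, 3, 5],
    "rater_d": [2, 4, 6, 3, 5, 2, 4, 2, 3, 6],  # the outlier worth workshopping
}

for (name1, scores1), (name2, scores2) in combinations(ratings.items(), 2):
    # weights="quadratic" penalizes a 5-vs-2 disagreement more than a 5-vs-4 one
    kappa = cohen_kappa_score(scores1, scores2, weights="quadratic")
    print(f"{name1} vs {name2}: kappa = {kappa:.2f}")
```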
But to reduce biases in things like names…we didn’t just neutralize / scrub the content for the computers, we ended up scrubbing the content for the raters as well. We’d edit the names / places in the essays and resubmit them to the humans…which is when the subtle biases came into play. The same essay that once referenced Miguel, now referencing Martin, would score a point higher (given that it was a 6-point scale, this was significant). Same in reverse. When we spoke about a lot of these things without bluntly accusing folks of racism, oftentimes they thought they were doing students a favor: lower scores meant students were more likely to be placed into remedial courses, and the truth of the matter is, in MOST colleges, underrepresented minorities of any ilk do better by going into more remedial courses in the first semester – which then translates to higher retention rates – which then translates to those students doing just as well as the other students in upper-level courses. Given that one of the folks from The Bell Curve infamy was involved, that didn’t look good, but in fact he had ALWAYS maintained that one of the problems for minorities in college has been that they were underserved in their primary education for one reason or another – it wasn’t a matter of race or genetics, it was a problem with how society dealt with these kids.
So…our approach was to take subjective ratings and make them as objective as possible in the humans – and then reduce the bias in the humans – while trying to take the bias out of the computer for past damage. Blah blah blah blah blah blah blah…
I’m starting to remember why I got out of this field !!!
Yeah, it sounds antiquated. Now, with algorithms like Latent Dirichlet Allocation, you can build up a model based solely on the content of the corpus, rather than trying to apply an external, and therefore biased, standard against a corpus.
So, say, to measure “complexity of thought” (among many possible applications) you could deduce which papers produce the richest autocorrelation of the most esoteric words available in English. And then go check the few that rise to the top to see if those papers actually make sense and aren’t technobabble or nonsense.
Then, you wouldn’t be dependent on proper nouns, or other subjective interpretations until the very end, the checking step that the humans do.
But we didn’t have LDA or the other stuff until fairly recently. And it takes effort to implement: you can’t just download the LDA app and dump in a bunch of text. It takes some setup.
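To give a sense of what that setup looks like, here's a bare-bones sketch with toy documents using the gensim library (which comes up below); it's an illustration of the steps, not anything from a real grading pipeline:

```python
# Minimal sketch of the setup LDA needs before you can "dump in a bunch of text".
# Toy documents only; a real corpus needs thousands of documents and real cleaning.
from gensim import corpora, models
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

documents = [
    "The essay develops a coherent argument with varied sentence structure.",
    "Short sentences. Simple words. Little elaboration of the main idea.",
    "An intricate, digressive meditation that nonetheless sustains its thesis.",
]

# 1. Tokenize, lowercase, and drop stopwords.
texts = [
    [token for token in simple_preprocess(doc) if token not in STOPWORDS]
    for doc in documents
]

# 2. Build a dictionary (token -> id) and a bag-of-words corpus.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# 3. Train the LDA model on the corpus itself -- no external standard applied.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10, random_state=1)

for topic_id, words in lda.print_topics():
    print(topic_id, words)
```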
Interesting…I was trying to see if any of the folks still involved had our demo up, and I was in the first reference…I’m almost tempted to try this out to see how it compares to what we had worked on (it’s not our work, but it does reference my team…I’m wondering if I should edit this, as I prefer semi-anonymity).
That said…our focus wasn’t on content knowledge, but actual writing skills. Which was the better-written work, Jabberwocky or the Gettysburg Address? Our demo had both in there. Objectively, it is Jabberwocky. Far more complexity of sentence, and far more skilled as a piece of writing. The Gettysburg Address was far more important because of the subject and who said it than because of the writing. It was writing better orated than written. I mean, my writing is atrocious for the most part, but more often than not, it is an analogue for conversation and nothing more.
Skip the content and let people use technobabble if you are actually rating for writing skill and not conflating the subject matter with it. Then again, this has been the argument all along…Landauer’s LSA was FAR more focused on keyword analysis in practice than we were, in an attempt to discern meaning from the essays as opposed to simply writing ability. I always thought these two schools of thought shouldn’t be competing, as they were focusing on two different end results. And back to blah blah blah, as I feel that I’m trying hard not to dip my own feet into technobabble.
It’s super freakin hard not to devolve into technobabble when discussing this stuff. I used gensim in Python to do an analysis of job postings. It took some work getting stuff ready to go into it, but the results were pretty solid. Still working on the project though… with all the other crap going on.
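Roughly, the payoff step looks like this: train the model, then ask which topics each posting loads on and which postings resemble each other in topic space. The postings below are invented, and this is a sketch of the general gensim pattern, not the actual project code:

```python
# Rough sketch of the "job postings" idea: train LDA, then compare postings
# by their topic mix rather than by raw keywords. Postings here are invented.
from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess

postings = [
    "Senior Python developer, machine learning pipelines, cloud deployment.",
    "Data engineer to build ETL pipelines and maintain cloud infrastructure.",
    "Office manager to coordinate schedules, vendors, and front-desk staff.",
]

texts = [simple_preprocess(p) for p in postings]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20, random_state=1)

# Cosine similarity in topic space: which postings look alike to the model?
index = similarities.MatrixSimilarity(lda[corpus], num_features=lda.num_topics)
query = lda[dictionary.doc2bow(simple_preprocess("python data pipelines in the cloud"))]
print(list(index[query]))  # one similarity score per posting
```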
Luckily, most of these algorithms only use this as one part of the rubric. It might take it down a notch…or more likely folks like Hemingway would happen to use mostly neutral words which will neither help nor hurt.
BTW Check this out…I’ve been meaning to pull the trigger on the desktop app for my own writing.
So, I’m not really the ML guy at work, but I work veeery closely with them. I am their data source guy. But if there are other promising techniques I’d love to hear about them. We have two PhDs creating the models and classifiers, but (oh yes, I’m getting meta) they themselves may have biases :D.
I already know we overtrained at least one set, and are missing vast swaths of coverage we desperately need.
BTW, I’m a fan of Faulkner, so I tried a few lines through that, just for shits and giggles.
As an example, here’s the opening sentence from Absalom, Absalom (121 simple words, complex structure):
From a little after two oclock until almost sundown of the long still hot weary dead September afternoon they sat in what Miss Coldfield still called the office because her father had called it that – a dim hot airless room with the blinds all closed and fastened for forty-three summers because when she was a girl someone had believed that light and moving air carried heat and that dark was always cooler, and which (as the sun shone fuller and fuller on that side of the house) became latticed with yellow slashes filled with dust motes which Quentin thought of as being of the dead old dried paint itself blown inward from the scaling blinds as wind might have blown them.
Result: “1 of 1 sentences is very hard to read.” The entire thing was highlighted red.
If anyone’s interested, there is only one verb in that grammatically correct sentence: sat.
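If anyone wants to poke at that flag programmatically, here's a rough sketch using the textstat Python package as a stand-in (I don't know what formula the Hemingway app actually applies); paste in the full sentence from above to get the real numbers:

```python
# Hedged sketch: score the Faulkner sentence with standard readability formulas.
# textstat is not what the Hemingway app uses; it's just a convenient stand-in.
import textstat

sentence = (
    "From a little after two oclock until almost sundown of the long still hot "
    "weary dead September afternoon they sat in what Miss Coldfield still called "
    "the office because her father had called it that..."  # truncated; use the full sentence
)

print("Flesch reading ease:  ", textstat.flesch_reading_ease(sentence))   # lower = harder
print("Flesch-Kincaid grade: ", textstat.flesch_kincaid_grade(sentence))  # higher = harder
print("Word count:           ", textstat.lexicon_count(sentence))
```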