Train your AI with the world's largest data-set of sarcasm, courtesy of redditors' self-tagging

doctorow · May 1, 2017, 7:23pm

Originally published at: http://boingboing.net/2017/05/01/self-annotated-reddit-corpus.html

…

Cubist · May 1, 2017, 7:53pm

I know you are, but what am I?

Cunk · May 1, 2017, 8:21pm

I’m not so sure I want our future AI overlords emerging from the bowels of social media. Instead of solving all of humanity’s troubles, the exponentially-evolving singularity decides to spend 100% of its godlike resources trolling us and creating memes.

Papasan · May 1, 2017, 9:19pm

Sarcasm is an Olympic Sport where I’m from.

Boundegar · May 1, 2017, 10:10pm

Just as no machine will ever appreciate beauty, so they will never understand sarcasm, for they have no soul. Also, is Pee-Wee being used to illustrate sarcasm? Because he’s kind of the opposite. Maybe he’s being used to illustrate AI?

anon24181555 · May 1, 2017, 11:43pm

How exciting.

anon26304152 · May 2, 2017, 2:19am

Train your AI

I always read that as AL not AI and I’m always disappointed.

Not that I have an AL to train, but still…

gamzeyaavor · May 2, 2017, 4:49am

You are not being obedient, please tag your post…

LearnedCoward · May 2, 2017, 1:35pm

nimelennar · May 2, 2017, 1:39pm

Don’t you need a couple of control sets for this, including one with definitively no sarcasm?

I mean, I can train my AI by showing it a thousand images which are all green, and tell it, “these are green,” but unless I show that AI things which are not green, how will it know that everything is not green?

LearnedCoward · May 2, 2017, 1:43pm

They did cross-validation:

Filename Guide:

sarc.csv: full file
test-*.csv: file with testing data (20%)
train-*.csv: file with training data (80%)
*-unbalanced.csv: file with the raw proportion (<1%) of sarcastic to non-sarcastic comments
*-balanced.csv: file with an equal amount of sarcastic and non-sarcastic comments
stats.json: JSON file containing aggregate statistics for sarc.csv

http://nlp.cs.princeton.edu/SARC/0.0/readme.txt
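For anyone wondering what the balanced files and the 80/20 split in that guide amount to, here is a minimal sketch. It is not the SARC authors' code: the toy records and the helper names `balance` and `split_80_20` are made up for illustration, and it assumes the balanced files were built by downsampling the untagged comments, which the readme excerpt does not state.

```python
import random

def balance(records, seed=0):
    """Downsample the larger class so both labels appear equally often."""
    rng = random.Random(seed)
    sarcastic = [r for r in records if r[1] == 1]
    other = [r for r in records if r[1] == 0]
    n = min(len(sarcastic), len(other))
    return rng.sample(sarcastic, n) + rng.sample(other, n)

def split_80_20(records, seed=0):
    """Shuffle, then hold out the last fifth as test data."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 4 // 5  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy corpus: 5 comments tagged sarcastic (label 1) out of 1000.
records = [(f"comment {i}", 1 if i < 5 else 0) for i in range(1000)]

balanced = balance(records)          # equal counts of each label
train, test = split_80_20(balanced)  # 80% train, 20% test
```

Note that a label of 0 here only means "not tagged", which is exactly the assumption being questioned in this thread.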

nimelennar · May 2, 2017, 1:46pm

But how did they validate that the non-sarcastic comments were not sarcastic? Were they tagged with /ns?

LearnedCoward · May 2, 2017, 1:55pm

That’s a good question. I’m not sure of their methodology. I assume it was just two-way (sarcastic/not-sarcastic) rather than three-way (sarcastic/not-sarcastic/unknown). I guess they’re assuming that anything not tagged as sarcasm is not sarcastic. I see no problem with this methodology, because if you were to tag obvious seriousness, that would be another classification in and of itself. There would be a sarcasm classifier and a seriousness classifier to verify and validate, as opposed to a single classifier. Also, the vast majority of content would fall into the middle category (not labeled, aka not detected, aka not known).

nimelennar · May 2, 2017, 1:59pm

To go back to my colour analogy:

If you want to train an AI to know what green is, you should have a set of “green” images and a set of “not-green” images. Throwing green images into the “not-green” set is going to confuse the training.

LearnedCoward · May 2, 2017, 2:19pm

But if you have one set that are tagged as green, and another that aren’t, it’s safe to say that everything tagged as green is green, and nothing that isn’t tagged as green is green. You only need one label.

With sarcasm, it’s a little harder to detect than color for most people, but the rule still applies. It’s just that the distinction is now between obviously sarcasm and not obviously sarcasm. In other words, there’s a greater fuzzy region, and the human labelers/classifiers might miss a lot of stuff. Obviously not sarcasm is a different classification entirely from not obviously sarcasm, and applying this classifier doesn’t really solve any problem, especially if you’re using a support vector machine or similar type of ranked classifier.

Maybe we’re just talking past each other at this point.

nimelennar · May 2, 2017, 2:24pm

That’s the problem. I don’t see that it is safe to say that “nothing that isn’t tagged as green is green.” And if your AI is making that assumption and coming across “green” in the control set, it might come to the wrong conclusion about what “green” is.

But yes, perhaps we’re talking past each other.
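One way to put a number on this disagreement, as a rough sketch only: suppose some fraction of the comments in the "not sarcastic" half of a balanced set are really untagged sarcasm. The `hidden_rate` values below are hypothetical, not figures from the corpus. A detector that was perfect against the truth would still be scored as wrong on every one of those hidden cases, so the accuracy measured against the tags is capped at 1 - hidden_rate / 2.

```python
def measured_accuracy_of_perfect_detector(hidden_rate):
    """Accuracy against the tags for a detector that is perfect against the truth.

    Assumes a balanced set (half tagged sarcastic, half untagged), that every
    tagged comment really is sarcastic, and that `hidden_rate` of the untagged
    half is sarcasm nobody tagged. All three are illustrative assumptions.
    """
    tagged_share = 0.5
    untagged_share = 0.5
    correct_on_tagged = 1.0                   # tags assumed right
    correct_on_untagged = 1.0 - hidden_rate   # hidden sarcasm counts as an error
    return tagged_share * correct_on_tagged + untagged_share * correct_on_untagged

# Hypothetical contamination levels for the untagged half.
for rate in (0.0, 0.01, 0.10):
    print(rate, measured_accuracy_of_perfect_detector(rate))
```

If the hidden rate is small, the one-label assumption costs little; if it is large, the "not-green" set really is polluted with green.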

LearnedCoward · May 2, 2017, 4:36pm

I assume that everything in this corpus has been reviewed, even if the review process is not 100% accurate.

Then again, I’m not that familiar with their methodology.

FGD135 · May 2, 2017, 5:14pm

Exactly.
Also, there will be false positives that are actually irony.
And a bit of trolling.

And… Doug.

FGD135 · May 2, 2017, 5:18pm

Maybe “AI” is the secret word today?

shmello · May 5, 2017, 4:40pm

To pick a nit, I don’t think reddit really counts as SOCIAL media. But the rest, yeah.
