Train your AI with the world's largest data-set of sarcasm, courtesy of redditors' self-tagging


#1

Originally published at: http://boingboing.net/2017/05/01/self-annotated-reddit-corpus.html


#2

I know you are, but what am I?


#3

I’m not so sure I want our future AI overlords emerging from the bowels of social media. Instead of solving all of humanity’s troubles the exponentially-evolving singularity decides to spend 100% of its godlike resources trolling us and creating memes.


#4

Sarcasm is an Olympic Sport where I’m from.


#5

Just as no machine will ever appreciate beauty, so they will never understand sarcasm, for they have no soul. Also, is Pee-Wee being used to illustrate sarcasm? Because he’s kind of the opposite. Maybe he’s being used to illustrate AI?


#6

How exciting.


#7

Train your AI

I always read that as AL not AI and I’m always disappointed.

Not that I have an AL to train, but still…


#8

You are not being obedient, please tag your post…


#10

Don’t you need a couple of control sets for this, including one with definitively no sarcasm?

I mean, I can train my AI by showing it a thousand images which are all green, and tell it, “these are green,” but unless I show that AI things which are not green, how will it know that everything is not green?


#11

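They included non-sarcastic comments and held out a test set (going by the readme, a plain 80/20 train/test split rather than cross-validation proper):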

Filename Guide:

  • sarc.csv: full file
  • test-*.csv: file with testing data (20%)
  • train-*.csv: file with training data (80%)
  • *-unbalanced.csv: file with the raw proportion (<1%) of sarcastic to non-sarcastic comments
  • *-balanced.csv: file with an equal amount of sarcastic and non-sarcastic comments
  • stats.json: JSON file containing aggregate statistics for sarc.csv

http://nlp.cs.princeton.edu/SARC/0.0/readme.txt
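For anyone wondering what the balanced and train/test files amount to in practice, here is a minimal sketch, assuming the data is already loaded as (text, label) pairs with label 1 for tagged-sarcastic and 0 otherwise. The helper names `balance` and `split`, the toy data, and the downsampling approach are my own assumptions for illustration; the readme does not say how the authors actually built their files.

```python
import random

def balance(rows, seed=0):
    """Downsample the larger class so both labels appear equally often.

    Hypothetical helper: one common way to get a balanced file,
    not necessarily how the corpus authors did it.
    """
    rng = random.Random(seed)
    sarcastic = [r for r in rows if r[1] == 1]
    other = [r for r in rows if r[1] == 0]
    n = min(len(sarcastic), len(other))
    balanced = rng.sample(sarcastic, n) + rng.sample(other, n)
    rng.shuffle(balanced)
    return balanced

def split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy data (made up): 10 tagged comments among 2000, i.e. under 1%.
rows = [(f"comment {i}", 1 if i < 10 else 0) for i in range(2000)]

balanced = balance(rows)
train, test = split(balanced)
print(len(balanced), len(train), len(test))
```

The point is that the "not sarcastic" half of a balanced file is simply whatever was not tagged, sampled down to match the tagged half.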


#12

But how did they validate that the non-sarcastic comments were not sarcastic? Were they tagged with /ns?


#13

That’s a good question. I’m not sure of their methodology. I assume it was a two-way split (sarcastic/not-sarcastic) rather than a three-way one (sarcastic/not-sarcastic/unknown). I guess they’re treating anything not tagged as sarcasm as not sarcastic. I see no problem with this methodology, because if you were to tag obvious seriousness, that would be another classification in and of itself. There would then be a sarcasm classifier and a seriousness classifier to verify and validate, as opposed to a single classifier. Also, the vast majority of content would fall into the middle category (not labeled, aka not detected, aka not known).


#14

To go back to my colour analogy:

If you want to train an AI to know what green is, you should have a set of “green” images and a set of “not-green” images. Throwing green images into the “not-green” set is going to confuse the training.
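To put a rough number on how much "green" ends up in the "not-green" set, here is a back-of-the-envelope sketch. It assumes two quantities nobody in this thread actually knows: the true share of comments that are sarcastic (`sarcasm_rate`) and the share of sarcastic comments whose authors bother to tag them (`tag_rate`). Both names and all values below are hypothetical.

```python
def contamination(sarcasm_rate, tag_rate):
    """Fraction of untagged comments that are actually sarcastic.

    sarcasm_rate: assumed true share of sarcastic comments (hypothetical).
    tag_rate: assumed share of sarcastic comments that get tagged (hypothetical).
    """
    untagged_sarcastic = sarcasm_rate * (1 - tag_rate)
    untagged_total = 1 - sarcasm_rate * tag_rate
    return untagged_sarcastic / untagged_total

# Made-up scenario: 5% of comments are sarcastic, with varying tagging habits.
for tag_rate in (0.5, 0.1, 0.01):
    print(tag_rate, round(contamination(0.05, tag_rate), 4))
```

Under these assumptions the share of mislabeled negatives can never exceed the true sarcasm rate, so how much this confuses the training depends mostly on how common untagged sarcasm really is.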


#15

But if you have one set that are tagged as green, and another that aren’t, it’s safe to say that everything tagged as green is green, and nothing that isn’t tagged as green is green. You only need one label.

With sarcasm, it’s a little harder to detect than color for most people, but the rule still applies. It’s just that the distinction is now between obviously sarcasm and not obviously sarcasm. In other words, there’s a greater fuzzy region, and the human labelers/classifiers might miss a lot of stuff. Obviously not sarcasm is a different classification entirely from not obviously sarcasm, and applying this classifier doesn’t really solve any problem, especially if you’re using a support vector machine or similar type of ranked classifier.

Maybe we’re just talking past each other at this point.


#16

That’s the problem. I don’t see that it is safe to say that “nothing that isn’t tagged as green is green.” And if your AI is making that assumption and coming across “green” in the control set, it might come to the wrong conclusion about what “green” is.

But yes, perhaps we’re talking past each other.


#17

I assume that everything in this corpus has been reviewed, even if the review process is not 100% accurate.

Then again, I’m not that familiar with their methodology.


#18

Exactly.
Also, there will be false positives that are actually irony.
And a bit of trolling.

And… Doug.


#19

Maybe “AI” is the secret word today?


#20

To pick a nit, I don’t think reddit really counts as SOCIAL media. But the rest, yeah.