Your voice-to-text speech is recorded and sent to strangers

Both the Mechanical Turk and the people listening to this data are just the mechanism for interpreting it. What’s worrying people here is how the data that ends up being interpreted this way gets collected in the first place. It’s the combination of the frontend and the backend that’s the problem.

From the linked article:

“I’ve heard more text-to-speech sexting than I care to. (You’ve never heard something sexy until you’ve heard a guy with a slight Indian accent slowly enunciate “I want to have sex with you” to his texting app)”

Somehow, I doubt those are command words.

Ok, that’s just rude. If you “pause” the Voice & Audio activity, it “may”

limit or disable features such as using “Ok Google” to start a voice search and reduce the accuracy of speech recognition across Google products that use your voice.

But…

Note that this setting does not affect storage of information by Google products (like Voice) that can be used to store your audio or voice inputs. Google may also continue to collect and store audio data in an anonymized way.

So (if I’m reading that right), you can’t opt out of the data collection… you can only opt out of that data collection actually being helpful to you.

I wonder what terms the other manufacturers mentioned in this story have.

Apples and oranges. He was describing speech-to-text (though he had it backwards, as text-to-speech). That’s something you have to activate, usually by clicking a microphone icon. Then, the application will attempt to convert your speech to text.

Again, your TV isn’t spying on what you’re saying. Until you say the cue words, it’s not paying much attention to you, and is not transmitting your voice anywhere.

How would the police know that you were driving rather than a passenger? While some conversations might give that information away through context, that’s likely to be the exception rather than the rule.

Except you don’t actually know that. What a smart TV does depends on the make, model and software revision. There is no universal rule that prevents a smart TV from listening to you, perhaps even for reasons you haven’t thought of, for various analytical purposes, or to track who’s watching, or who knows. Once you click through the EULA (or even if you don’t), all bets are off.

[quote=“art_carnage, post:23, topic:52784”]
Apples and oranges.[/quote]

If you’re only nitpicking the single, parenthetical side statement in the article that mentions TVs, then OK, fair point.

If we’re being that nitpicky, though, then 1) the TV’s EULA doesn’t restrict it to what you described, 2) even if it were restricted to that in practice, it would only be until the TV thinks you might have said a cue word, and 3) if the TV’s recognized your cue word(s), why does it need to transmit your voice anywhere in the first place, especially when they explicitly say it will be transmitted to an (unnamed) third party?

(edit) I’ll give Samsung some credit… it looks like their privacy statement’s changed since Cory’s earlier article, and it now includes the name of the third party (Nuance Communications, Inc.) as well as a statement that collection happens “only when you make a specific search request […] by clicking the activation button”. That’s a lot better than what was previously quoted.

Because the Holy Cloud is, at this moment, somewhat better at speaker-agnostic speech recognition than offline solutions.

Yes, but that seems redundant if we’re already talking about a situation where the offline solution identified it. But if we were only talking about using processing in the cloud to handle the recognition of nonstandardized commands then I for one wouldn’t have a problem with it.

It occurs to me that we’re talking about two distinct issues:

  • the EULAs on certain TVs reserve the right to transmit voice information to third parties
  • voice-to-text content is evidently being farmed out for additional manual processing without explicit permission

For me, at least, it’s the manual processing part that’s the biggest problem, and the first issue mainly becomes a problem if it intersects with the second one, or if there’s storage involved.

The offline solution is specialized for identifying one single word/phrase, with allowance for false positives. If it identifies a positive, it sends the audio that follows to the Cloud and waits for the result of the analysis.
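Roughly, the flow being described looks like this (a minimal Python sketch; the wake phrase, function names, and the stand-in "cloud" call are all invented for illustration, and a real device works on audio frames rather than strings):

```python
# Toy sketch of "cheap local hotword check, heavy lifting in the cloud".
# Only the audio following a (suspected) wake word ever leaves the device.

WAKE_WORD = "ok widget"  # hypothetical cue phrase

def local_hotword_check(frame: str) -> bool:
    """On-device detector: knows only the one phrase, tolerates false positives."""
    return WAKE_WORD in frame.lower()

def send_to_cloud(clip: str) -> str:
    """Stand-in for the remote recognizer; in reality this is a network round-trip."""
    return f"cloud recognized: {clip!r}"

def listen_loop(frames: list[str]) -> None:
    for i, frame in enumerate(frames):
        if local_hotword_check(frame):
            following = " ".join(frames[i + 1 : i + 4])  # capture what comes next
            print(send_to_cloud(following))              # nothing else is transmitted

listen_loop(["idle chatter", "OK widget", "what's the weather", "tomorrow", "more chatter"])
```

The argument in this thread is really about what happens inside that send_to_cloud step: whether the clip gets stored, and who else ends up hearing it.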

I believe the cloud does analysis of all commands.

The manual processing is there for faster self-learning of the processing engine. People will generally be way too lazy to give permission for each transaction that goes wrong, so the engine won’t get enough feedback to know when/where it sucks.

“People will be too lazy to willingly let us do what we want with this data, so we’ll just do it anyways” is a horrible, horrible position. Seems like it’s been all too common, lately.

If you don’t do it, your competitor who does it will have a system that works better and is improved faster.

Still thinking there’s a choice?

(For the record, I’d prefer local processing, without sending the data out at all.)

Yup. There’s plenty of options to get more explicit permission from the people whose private communications are being passed on to strangers. Making absolutely no attempt other than a few lines in an EULA no one reads is just laziness.

Heck, even a not-so-obvious way for someone to go in later and choose what data they don’t want manually processed would be a vast improvement.

oh, it’s very efficient. rather than trying to estimate the proportion of languages and dialects of your users and finding representative samples of each one, you can just snoop. this is almost guaranteed to be an unbiased sample of usage, since it’s exactly that. this saves them a $million or so right off the bat by not doing a formal survey, and they get much better results too. if they want to script focus groups for each dialect in addition to that, they still can.

however, it’s not really necessary in many cases. language is easier to recognize than to script, so they can get by with less effort that way. the level of complexity of siri queries is pretty low.

keep in mind that much of the problem with speech recognition is in things like natural cadence, multiple voices, background noise, and bandwidth compression. snooping in fact gets a lot more than scripting could provide.

this says nothing about the ethics of it, but it’s probably quite efficient.

Since no one seems to have mentioned this app yet…

I did some research into voice-to-text apps, including reading the EULAs, and found Dragon Dictation’s EULA the most straightforward with the tightest boundaries. There may be others I didn’t know to check, but FWIW Dragon Dictation seems a lot less privacy-intrusive than the biggies like Apple and Google.

You seem to assume that “scripting” is the only alternative to snooping. So what is “scripting”?

You don’t seem to be encountering the same gorram repeated errors that I am, because many of them are tied to dialect differences.

For example, the speech-to-text I tried to use hardly ever recognizes the word “a,” either substituting “A,” or “hey.” I know that in some dialects, “a” and “hey” might sound alike, but in mine they do not, and writing “hey” for “a” causes an awful lot of extra errors. And it doesn’t recognize the phrase “or on,” substituting an ableist slur I’d never use.

Let’s face it, the same sound might be associated with (a) one word in a dialect that has strong rs and hs, avoided any þ-d or þ-f mergers, went through several vowel mergers, and went through the split between the various CURE vowels, and with (b) another word in a dialect that often drops rs and hs, went through þ-d or þ-f mergers, avoided several vowel mergers, and avoided the split between the various CURE vowels.

Allowing the user to choose among many dialects, instead of just one of four broad choices that each crosscut important sound shifts, would mean the software wouldn’t have as much trouble recognizing the right word.
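As a toy illustration (the dialect labels, "phoneme" spellings and lexicons below are all invented, and a real recognizer uses acoustic models rather than lookup tables):

```python
# The same sound can map to different words depending on which mergers a dialect has,
# so picking the right dialect model narrows the candidate words.

LEXICON_BY_DIALECT = {
    "unmerged": {"kot": "cot", "kawt": "caught", "hart": "heart"},  # keeps the cot/caught split
    "merged":   {"kot": "cot OR caught", "hart": "heart"},          # the two vowels fell together
}

def recognize(phonemes: str, dialect: str) -> str:
    return LEXICON_BY_DIALECT[dialect].get(phonemes, "<unknown>")

print(recognize("kot", "unmerged"))   # -> cot (unambiguous in this dialect)
print(recognize("kawt", "unmerged"))  # -> caught
print(recognize("kot", "merged"))     # -> cot OR caught (needs context to resolve)
```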

well, “scripting” refers just to what you mentioned: having volunteers read prepared texts. i honestly don’t think about speech recognition much. i don’t encounter your problems because i don’t use speech recognition and, at least in its current form, hopefully never will. it would be much more likely for me to develop a speech recognizer than to use one.

“Allowing the user to choose one of several dialects”: but before that, you need to train the recognizer to recognize those dialects. by “efficiency” i was thinking about the cheapest way to do that, from data-gathering to implementation to testing.

I suppose that cheaply getting every single sentence wrong might meet a certain standard of efficiency…

I am disabled with arm injuries, so could use speech-to-text, but can’t use the speech-to-text I’ve tried, even with a good external mic.

Sure, maybe in 20-30 years, but computational linguistics researchers will laugh at the idea of it being any sooner than that. Human speech is difficult to compute at best, even for relatively ‘clear’ accents. Beyond simply understanding what you said, there’s what you MEANT to say - for instance Their, They’re, and There, which all sound essentially identical and require grammatical context to get right.

Then do that for every major language, and collect your big fat check.
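The grammatical-context point can be shown with a toy scorer (the bigram numbers below are made up and real systems use far larger language models, but the idea is the same):

```python
# Their / they're / there sound alike, so a recognizer scores the surrounding words
# to pick the most plausible spelling. Probabilities here are invented for the example.

BIGRAM_SCORE = {
    ("they're", "going"): 0.6,
    ("their", "going"):   0.1,
    ("there", "going"):   0.1,
    ("their", "car"):     0.7,
    ("over", "there"):    0.8,
}

def pick_homophone(prev_word: str, next_word: str) -> str:
    candidates = ["their", "they're", "there"]
    def score(word: str) -> float:
        return (BIGRAM_SCORE.get((prev_word, word), 0.01)
                + BIGRAM_SCORE.get((word, next_word), 0.01))
    return max(candidates, key=score)

print(pick_homophone("think", "going"))  # -> they're ("... I think they're going ...")
print(pick_homophone("washed", "car"))   # -> their   ("... washed their car ...")
```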

really, this probably wouldn’t be helped by scripting.

i wouldn’t be surprised if the speech recognizers on phones were tuned toward recognizing statistically-likely key phrases which people are likely to search for. they’re probably not really intended for general-purpose transcription, especially for stop words and articles. (yes, these are different tasks, and you can even, to some extent, intentionally improve performance at one by sacrificing the other.)

i see your problem, but in the cold logic of capitalism, you’re just not the target market for these things. they’re there mostly to “help” people spend money, not as a prosthesis. :-/

But I’m not talking about phones. I can use a speakerphone on a good day, but I can’t use smartphones. And speech-to-text software is advertised for dictation and as an accessibility feature.