New Amazon patent application reveals "solution" to missed Alexa instructions: always on recording

Originally published at:

We were gifted an Alexa for Christmas. First thing we asked her: “Alexa, are you spying on us?”

She says she’s not.


Me: Who needs a 55 gallon drum of Clearglide lubricant? Alexa, what is the weather?

Alexa: Ordering 55 gallon drum of Clearglide lubricant on your Amazon card.

As long as it’s in the buffer you bought it. No take-backs.


Alexa: Delete yourself.


I’m sorry, Garymon, but I can’t do that.


Alexa: Sing Daisy.

This might tickle…


How does recording the information before a trigger word, so you can decode it later, not meet the ‘obvious’ test?

Not just obvious to a practitioner in the field, but in this case obvious to goddamn everyone


Not to be contrarian here, but I’m not quite sure I see the big deal.

Assuming you’re ok with Alexa’s always-on listening to begin with, then there’s necessarily a certain amount of “recording” going on anyway. That’s just how audio digitization and speech recognition work. Even if it’s just a single frame, that’s at least 1/44,000th of a second or so worth of “recording”. Without knowing the internals of Alexa’s wake word model, I’d guess there’s probably already more than that being buffered in various chunks of its silicon at any given time. Maybe a lot more if the data is streamed over the network.

And yet, I don’t think most people would really call those transiently buffered bits “recording”. It’s clearly just a reasonable technical implementation detail.

Of course, there are obviously limits to how much is reasonable. The question is, how much is too much, and how much is required for this algorithm to recognize phrases like, “Order me a 55 gallon drum of lubricant, Alexa,”?

Answer: well, probably not 6 months worth. That would be an awfully long comma*. To implement this particular feature**, we’re only talking about a look-back of maybe 5 or 10 seconds. Not more than a minute. And that still seems pretty transient to me.
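
For a rough sense of scale, here is a back-of-the-envelope sketch of how much memory a look-back like that would take. The audio format (16 kHz, mono, 16-bit PCM) is my assumption, picked because it is common for speech; I have no idea what Alexa actually uses, and `buffer_bytes` is just a made-up helper name.

```python
def buffer_bytes(seconds, sample_rate_hz=16_000, bytes_per_sample=2, channels=1):
    """Bytes of uncompressed PCM audio needed to hold `seconds` of look-back.

    The defaults (16 kHz, 16-bit, mono) are assumptions for illustration only.
    """
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


# Print the buffer size for a few plausible look-back windows.
for seconds in (5, 10, 60):
    print(f"{seconds:>2} s -> {buffer_bytes(seconds) / 1024:.1f} KiB")
```

Under those assumptions, 10 seconds is 320,000 bytes and a full minute is still under 2 MB, which is the kind of thing that lives in RAM and gets overwritten, not archived.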

Calling it “always on recording” makes it sound scary, but that’s like the music industry calling the bit-shuffling your computer needs to do to play a streaming audio track “copying”. It’s not. It’s still just a little bit of buffering.

* That’s actually the more interesting part of the patent: it’s not about the buffer, but about inferring the presence of a command from certain (language and speaker-specific) patterns of speech pauses and tone changes and so forth. The buffering part is definitely obvious, what exactly to do with it arguably less so.

** I’m not saying Amazon doesn’t stockpile recordings of all your Alexa data for other reasons. I’m just saying that’s not something you can infer from the presence of this feature.

For all experiments I run, the data acquisition records constantly into a rolling buffer, and if there is a trigger anywhere in the sequence it saves and processes the data from the start of the buffer. It’s a really standard approach to data acquisition and processing.
Hopefully, even the USPTO recognizes that previous use over decades by millions of people means that it shouldn’t be patentable.
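
A minimal sketch of that rolling-buffer-plus-trigger pattern, for anyone who hasn’t seen it. The class name, the capacity, and the threshold trigger are all invented for illustration; real acquisition code works on blocks of samples from hardware, not a Python list.

```python
from collections import deque


class PreTriggerBuffer:
    """Keeps only the most recent `capacity` samples; hands them over on a trigger."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item once it is full.
        self._buf = deque(maxlen=capacity)

    def push(self, sample, is_trigger):
        """Store a sample; if it trips the trigger, return the buffered history."""
        self._buf.append(sample)
        if is_trigger(sample):
            captured = list(self._buf)  # everything from the start of the buffer
            self._buf.clear()
            return captured
        return None


# Hypothetical usage: trigger on any sample above 50, keep 4 samples of history.
buf = PreTriggerBuffer(capacity=4)
captures = []
for sample in [1, 2, 3, 4, 5, 99, 6, 7]:
    out = buf.push(sample, lambda x: x > 50)
    if out is not None:
        captures.append(out)
```

Anything older than the buffer length is gone by the time the trigger fires, which is the whole point of the design.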

Oh, so new opportunities…


that’s “always-on” recording.

Obviously patent lawyers don’t watch science fiction.

I’m not sure an appearance in sci-fi qualifies as prior art. Star Trek imagined an artificial intelligence with a highly competent natural language speech interface 50 years ago, but that’s not the same as having actually invented one.

The problem domain isn’t like buttons to place items in an online shopping cart. The implementation is hard here, not the concept. Saying, “Oh, so all they did was program a computer to capture audio and recognize the words. What’s the big deal?” is glossing over some rather important bits.

And again, if you actually click the link, it’s pretty clear the buffering part is more or less incidental. What the patent is really about is what you do with the buffer: picking out a command phrase from an audio stream based on patterns of pauses and tone changes and stuff. Do you know how to solve that trivially? I don’t.

(Which is why it’s a travesty that they don’t really have to tell us how they plan to do it either, of course. Patent apps like this should be required to include source code, model weights, ASIC layouts, etc., not just a generic, lawyerized description.)

I don’t think their solution is patent worthy.

Lots of security cameras do something similar. They are always recording to a circular buffer and when motion is detected in the video stream, they save the clip with motion along with the previous x seconds because they aren’t always able to detect the start of motion or motion at the edges of the frame.

Well, maybe not. But the point is it’s not NOT patentable solely on the basis that it uses a circular buffer.

Note that what they’re doing/want to do is quite different from just “previous x seconds”. It’s looking back in a buffer, yes, but then it’s actually analyzing the contents, in order to see whether/where it’s connected to a successive utterance, and constitutes a command. That’s what the bulk of the application is about.

An analogy to security cameras would be not only capturing some previous frames, but then going a step further. For example, analyzing the footage to make some kind of cursory machine-learning based evaluation on whether to send it to you: determining whether the entire clip actually contained suspicious activity, or just the mailman or a flock of birds or something, and based on that determination, cutting together just the relevant portions for review (not a fixed buffer).

That might or might not be patentable either, but it certainly isn’t what most off the shelf cameras are already doing (yet).

I think that’s pretty much exactly what they are doing. A low power IR sensor (or other kind of motion sensor) detects a change and then more expensive frame-by-frame analysis is done to determine how much of the buffer to save.

Either way, I don’t think it should be patentable. It’s pretty obvious to somebody that works with stuff like this.

Most of the hip ones - like Nest, say - just use the image sensor and run frame comparison algorithms, of varying sophistication, to detect motion. That’s why they prepend extra footage in the alert - the noise/event threshold usually won’t be cleared until after the motion is well underway. It’s not clear to me that any of them actually do go back and try to determine the precise beginning of the alerted event (partly because the need to do that is sort of obviated by the fact that it’s a security camera, so you naturally have a multi-day buffer and an interface for scrubbing around in the stream).

Some of Google’s more recent “here’s a video cut I made with highlights of your birthday party” AI stuff might be closer to the mark – that involves categorizing different types of activities, timestamping their starts and ends, and deciding what’s “important”. But that is clearly a complex set of problems to solve, however “obvious” it seems to you.

I think we’ll have to agree to disagree on the (potential) patentability of that sort of thing. Saying it’s “obvious to somebody that works with stuff like this” is conflating “obviously something you’d want to do” with “obvious how to do it exactly”. That difference is potentially pretty large in this case.

Skepticism of the patent system is generally well deserved, but there are also actual advances happening in areas like speech recognition (and video analysis) right now. Granted, a lot of it is just more GPU cycles and bigger ML models being thrown at the problems, but there are also some legitimately novel algorithms and inventions mixed in there. It isn’t necessarily easy to see which is which from a cursory examination.

They draw a green box around the area containing the frame differences. That feels like it would qualify to me.

My camera is a cheap one and is at least 5 years old so none of this is particularly new.

Sure, it’s probably just drawing loose boxes around change regions, maybe with a bit of averaging. That’s easy.
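
To show how easy, here is a rough sketch of the box-drawing part. It assumes grayscale frames given as nested lists of brightness values and an arbitrary threshold of 30; the function name is made up, and this is not what any particular camera actually runs.

```python
def change_bounding_box(prev_frame, curr_frame, threshold=30):
    """Return (top, left, bottom, right) around pixels that changed, or None.

    Frames are equal-sized lists of rows of grayscale values (an assumption
    for this sketch). A pixel counts as changed if its brightness moved by
    more than `threshold` between the two frames.
    """
    changed = [
        (r, c)
        for r, (prev_row, curr_row) in enumerate(zip(prev_frame, curr_frame))
        for c, (p, q) in enumerate(zip(prev_row, curr_row))
        if abs(q - p) > threshold
    ]
    if not changed:
        return None
    rows = [r for r, _ in changed]
    cols = [c for _, c in changed]
    return (min(rows), min(cols), max(rows), max(cols))


# Hypothetical 4x5 frames: two pixels brighten between frames.
prev = [[0] * 5 for _ in range(4)]
curr = [row[:] for row in prev]
curr[1][2] = 200
curr[2][3] = 200
box = change_bounding_box(prev, curr)
```

Note that it only compares two frames at a single instant. It says nothing about when the motion started or what was moving.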

The question is, is it doing it temporally? Temporally and accurately? Suppose the motion detection trips as the mailman walks up to your door – does it then trace that back and send you a precisely trimmed clip starting from the moment the mailman’s foot first entered the frame? What if, a few seconds before that, a pigeon (otherwise too small to clear the detection threshold) had walked across the frame. Will it know the difference between “pigeon motion” and “mailman motion” and still only send you a clip of the mailman, or will it send you a clip that opens with 10 irrelevant seconds of pigeon waddling, because that’s what it found when it went back with a finer-toothed motion algorithm?

And, okay, if it is doing all that, are you really sure that’s not a genuinely hard problem that’s being solved, at least the first time it was done?

I’m just not sure you’re understanding the domain.

What Alexa needs to do is take in a complicated sound environment, like:

“…I know, right! Why would anyone need a giant vat of lube like that? Hang on a sec. Set a timer for 5 minutes, Alexa. And add parsley to the shopping list. Dinner’s going to be ready in 5 minutes, kids!”

And when triggered, go back and analyze it to annotate what’s going on into something like “[talking on phone] [ALEXA COMMAND] [wake word] [FOLLOWUP COMMAND] [talking to kids]” so it can separate the relevant parts - the commands - and ignore the rest. It does that by looking for some of the cues you or I would - pauses, shifts in tone, volume and prosody, etc.

That’s not just motion detection. That’s an actual hard problem (several of them, really) that speech recognition has struggled with for, well, ever. Is solving those problems patent-worthy? Maybe, maybe not. But it’s not especially trivial.
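
To make the distinction concrete, here is a toy sketch of the one cue that is easy: splitting a buffer on pauses. It assumes you already have a per-frame loudness value, and the threshold and minimum pause length are numbers I made up. It is not the method in the application, just an illustration of the first step.

```python
def split_on_pauses(energies, silence_threshold=0.1, min_pause_frames=3):
    """Return (start, end) frame ranges (end exclusive) of speech between pauses.

    `energies` is a list of per-frame loudness values. A pause is a run of at
    least `min_pause_frames` frames at or below `silence_threshold`; shorter
    quiet gaps stay inside the current segment.
    """
    segments = []
    start = None    # frame index where the current segment began
    quiet_run = 0   # consecutive quiet frames seen inside the current segment
    for i, energy in enumerate(energies):
        if energy > silence_threshold:
            if start is None:
                start = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_pause_frames:
                # Close the segment just before the pause began.
                segments.append((start, i - quiet_run + 1))
                start = None
                quiet_run = 0
    if start is not None:
        # Buffer ended mid-segment; trim any trailing quiet frames.
        segments.append((start, len(energies) - quiet_run))
    return segments


# Hypothetical loudness trace: speech, a long pause, then speech with a short gap.
trace = [0.5, 0.6, 0.0, 0.0, 0.0, 0.7, 0.8, 0.0, 0.9, 0.0, 0.0]
segments = split_on_pauses(trace)
```

That gets you chunks. Deciding which chunk is a command to Alexa, which is the phone call, and which is shouting at the kids is the part this does nothing for, and that is the part the application is about.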