“So many notes are playing that it merges together to become a staccato droning noise”
I had relationships that sounded like that.
Pipe that to a player piano.
Needs more cowbell.
I wonder if there’s something like the McGurk Effect at work here, where we hear the lyrics because we know they should be there, and we hear something close enough to them for our brains to fill in the blanks. Or to put it another way, would someone who has never heard the original be able to fill in any of the words after hearing this version?
This phenomenon works on video as well as audio.
A few years ago, I built a TV set called SatanVision. It was made of a 96 x 128 pixel array of really dim red LED blocks that I got off eBay for cheap.
The curious thing is that when I was watching an old B&W gangster movie on it, I could see the sheen in the sharkskin suits. Via red LEDs.
That’s one strange wavelet transform.
Just what I came here to say!
Put it on a player piano and see if we can still “hear” the words.
Interesting, yes, thanks for the link. The “words” were right at the edge of comprehensibility to me.
WAV -> Fast Fourier Transform (frequency domain) -> MIDI
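That pipeline can be sketched in a few lines of numpy. This is a minimal illustration, not whatever tool the poster used: it picks the single loudest FFT bin per frame and maps it to the nearest MIDI note number (a real converter tracks many notes at once, which is exactly why the result gets so dense). The function name, frame size, and window choice are all my own assumptions.

```python
import numpy as np

SR = 44100  # assumed sample rate

def frame_to_midi(frame, sr=SR):
    """Pick the loudest FFT bin in one audio frame and map it to a MIDI note."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    spectrum[0] = 0.0                           # ignore the DC bin
    freq = np.argmax(spectrum) * sr / len(frame)
    # MIDI note 69 = A4 = 440 Hz; 12 semitones per octave
    return int(round(69 + 12 * np.log2(freq / 440.0)))

# A 440 Hz sine frame should land on MIDI note 69 (A4)
t = np.arange(4096) / SR
print(frame_to_midi(np.sin(2 * np.pi * 440 * t)))  # -> 69
```

Run that per frame, emit a note-on for each detected pitch, and you have the crudest possible audio-to-MIDI converter.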
Speaking of words, there was a gizmo made in the late seventies called Speak & Spell. It had a very early voice synthesizer made by TI. It didn’t sample; rather, it had an analog circuit that would synthesize waveforms that sounded like words.
I got to play with one in a retail shop where I worked. I found that if I pressed the same button many times in a row, the sounds it made sounded less and less like words and more and more like noises.
Nope, the Speak and Spell had a real DSP! A ROM on the board contained digitized samples for individual phonemes. Speech was encoded as a series of phonemes and modifiers, which made text-to-speech rather easy.
Yep! When the voice samples are played in arbitrary order, that’s how you get the famous circuit-bent speak&spell sound:
Famous and funky!
Goodness but that was painful to listen to
When I first got Studio Vision Pro in the late 90s, I used its audio-to-MIDI function to do stuff like this.
What has made the process more accurate is partial tracking algorithms which allow for each instrumental or vocal sound to be converted to MIDI separately, instead of one big overlapping FFT lump.
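The idea behind partial tracking can be sketched with plain numpy: find the strongest spectral peaks in each frame, then greedily link each peak to the nearest-frequency peak in the previous frame so that each sustained tone becomes its own track instead of dissolving into one overlapping lump. This is a toy version under my own assumptions (names, peak count, and the 50 Hz jump limit are mine); real partial trackers interpolate peak frequencies and handle births and deaths of partials far more carefully.

```python
import numpy as np

def spectral_peaks(frame, sr, n_peaks=5):
    """Return the frequencies of the strongest local maxima in one frame's spectrum."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    # local maxima: bins larger than both neighbours
    locs = [i for i in range(1, len(mag) - 1) if mag[i] > mag[i - 1] and mag[i] > mag[i + 1]]
    locs.sort(key=lambda i: mag[i], reverse=True)
    return sorted(i * sr / len(frame) for i in locs[:n_peaks])

def track_partials(frames, sr, max_jump=50.0):
    """Greedily link each frame's peaks to the nearest peak in the previous frame."""
    tracks = []  # each track: list of (frame_index, freq)
    for n, frame in enumerate(frames):
        for f in spectral_peaks(frame, sr):
            # tracks still "alive" are those that ended in the previous frame
            live = [t for t in tracks if t[-1][0] == n - 1]
            best = min(live, key=lambda t: abs(t[-1][1] - f), default=None)
            if best is not None and abs(best[-1][1] - f) < max_jump:
                best.append((n, f))   # continue the closest active partial
            else:
                tracks.append([(n, f)])  # otherwise start a new partial
    return tracks
```

Each resulting track can then be converted to MIDI on its own, which is what keeps separate voices from smearing into one chord per FFT frame.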
Regular speech would also sound like a heap of noise without a big wet mushy brain to push it into a pattern. Is there really a big difference between this piano noise and say, a heavily distorted voice, or a voice heard weakly through noise (wind in leaves, rushing water, etc)?
Sure, there is the significant difference that the underlying pattern existed in the first place, rather than wholly as an artifact of the listener’s perceptions.
There have been psychoacoustic studies involving how much data can be stripped from a signal and still have it be intelligible. There is a threshold where somebody who has heard the full-detail version can hear the content of the reduced copy easily, while somebody who hasn’t heard it before cannot.
As interesting as they may be, such studies are probably also to blame for the worsening audio quality of mobile phones. How much data can I afford to lose? Can I pay extra to get it back?