I'm supposed to be recording music this week but I've got caught up with enthusiasm for sound programming, as my new music will be vocal and I'd like some vocal processing options. I write my music on my own software, which is brilliant because it's cheap AND it gives you the freedom and power to add new bits as needed. Today I wrote a vocoder, which turned out to be quite simple: it's a bank of band-pass filters that split the voice into different frequency bands, then boost a second sound wave (the music) in each of those bands by an amount proportional to the volume coming out of the matching voice filter.
It has sixteen filters. I wondered what would happen if I changed the destination frequencies but kept the source ones the same. I got some interesting results when boosting the mid range but didn't experiment further. I must try more of that one day...
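For anyone who wants to try the idea, here is a minimal sketch of a channel vocoder in plain Python. It is not my actual code, just one way the description above could be written down. The filter design (a standard biquad band-pass), the envelope smoothing, the band spacing and every name in it, including the `dest_shift` parameter for moving the destination frequencies, are assumptions made for illustration.

```python
import math

def bandpass_coeffs(centre_hz, sample_rate, q=4.0):
    """Biquad band-pass coefficients (unity gain at the centre frequency)."""
    w0 = 2.0 * math.pi * centre_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a

def biquad(signal, coeffs):
    """Run a list of samples through one biquad filter."""
    (b0, b1, b2), (a1, a2) = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in signal:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def envelope(signal, smoothing=0.995):
    """Follow the volume of a signal: rectify, then smooth."""
    level = 0.0
    out = []
    for x in signal:
        level = smoothing * level + (1.0 - smoothing) * abs(x)
        out.append(level)
    return out

def vocode(voice, carrier, sample_rate, bands=16,
           low_hz=100.0, high_hz=6000.0, dest_shift=1.0):
    """Shape the carrier (music) with the per-band volume of the voice.

    dest_shift scales the destination band frequencies while the source
    bands stay put. Keep high_hz * dest_shift below half the sample rate.
    """
    n = min(len(voice), len(carrier))
    out = [0.0] * n
    for i in range(bands):
        # Bands spaced evenly on a logarithmic scale (an assumption).
        centre = low_hz * (high_hz / low_hz) ** (i / (bands - 1))
        env = envelope(biquad(voice[:n], bandpass_coeffs(centre, sample_rate)))
        band = biquad(carrier[:n],
                      bandpass_coeffs(centre * dest_shift, sample_rate))
        for j in range(n):
            out[j] += band[j] * env[j]
    return out
```

Calling `vocode(voice, music, 44100)` gives the ordinary effect, and something like `dest_shift=1.5` moves the destination bands up while the voice is still analysed at the original frequencies.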
My next task though is to pitch correct a vocal sample. Playing a sample at any pitch is easy, so the hard part is determining the root frequency of the voice. I started by pondering (and scratching my head) over Discrete Fourier Transforms for a couple of days. This bamboozling maths eventually made sense; from what I understand, there are basically two ways to represent a complex waveform: as amplitude over time, and as a sum of sine and cosine waves of different frequencies. Both represent the same wave, just in two forms, the time way and the frequency way ("domain" mwahahahaha). This made me wonder if the universe behaves like this, as it sparked off a connection with waves and particles; odd how waves are seen as timeless and infinite but particles are discrete and finite, yet they are the same thing seen in different ways. It made me think that the Fourier transform has an important role in physics, and that any equation for the universe must be two equations, both different, that say the same thing in two ways. After all, everything is one wave; the whole universe is a complex wave.
I'll leave that prediction to history and get back to my speech test. So, I need to work out the frequency of a bit of singing. I've split the sound into lots of sections. After lots of reading it seems there are several ways to identify pitch. The frequency way had appeal because I'd not done it before; that basically divides the sample into its component frequencies, so a pure note would show up as a clear spike. The thing is, the voice isn't a pure note anyway, so it's not necessarily easier than looking at the normal time-based sample.
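To make the "same wave, two forms" idea concrete, here is a small sketch of a textbook Discrete Fourier Transform and its inverse in plain Python. It is the slow, direct version rather than a fast one, and the function names are just ones picked for this example. A wave goes in, a list of frequency bins comes out, and the inverse turns those bins back into the original samples.

```python
import cmath
import math

def dft(samples):
    """Time domain -> frequency domain (one complex number per bin)."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def inverse_dft(spectrum):
    """Frequency domain -> time domain (real part only)."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def loudest_bin(spectrum):
    """Index of the strongest bin, ignoring bin 0 and the mirrored upper half."""
    half = len(spectrum) // 2
    return max(range(1, half), key=lambda k: abs(spectrum[k]))

# A test wave: 5 cycles of a sine plus a quieter 12-cycle cosine.
n = 64
wave = [math.sin(2 * math.pi * 5 * t / n)
        + 0.3 * math.cos(2 * math.pi * 12 * t / n) for t in range(n)]
spectrum = dft(wave)
rebuilt = inverse_dft(spectrum)
```

Here `loudest_bin(spectrum)` picks out the 5-cycle component as the spike, and `rebuilt` matches `wave` apart from rounding error. A sung note would show several spikes (the harmonics) rather than one, which is why the frequency view isn't automatically easier.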
So I think I'll start with normal sample data. Speech/singing isn't always pitched anyway; lots of hissy sounds like "sh", "tch" and "sss" are really white noise and could be ignored. I'll need some A.I. to guess a good pitch, or even decide that there isn't a pitch and discard it.
Some principles mused...
A singing voice might be off but it's not going to be very off so it's more likely to be close to the desired pitch, or it's more likely to be at the previously sung pitch (rarely will a song jump up two octaves). You could go further and guess that a pitch will be musically pleasant, guessing that a tune is probably in the right key. This is the start of artificial intelligence routines...
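As a sketch of the "probably in the right key" guess, the snippet below snaps a detected frequency to the nearest note of a scale. It assumes equal temperament with A at 440 Hz and uses C major as the example key; the function names and the scale constant are made up for this illustration, and it ignores the "close to the previous pitch" guess entirely.

```python
import math

# Pitch classes of C major, counted in semitones above C (an example key).
C_MAJOR = {0, 2, 4, 5, 7, 9, 11}

def hz_to_midi(hz):
    """Frequency to (fractional) MIDI note number, with A4 = 440 Hz = note 69."""
    return 69.0 + 12.0 * math.log2(hz / 440.0)

def midi_to_hz(note):
    """MIDI note number back to frequency."""
    return 440.0 * 2.0 ** ((note - 69.0) / 12.0)

def snap_to_key(hz, scale=C_MAJOR):
    """Return the frequency of the nearest note that belongs to the scale."""
    midi = hz_to_midi(hz)
    # A few semitones either side is enough: gaps in a major scale are
    # never wider than two semitones.
    nearby = [n for n in range(int(midi) - 2, int(midi) + 4) if n % 12 in scale]
    return midi_to_hz(min(nearby, key=lambda n: abs(n - midi)))
```

A slightly sharp A, say `snap_to_key(450.0)`, comes back as 440 Hz. The ratio between the snapped and detected frequencies is then the amount by which to speed up or slow down the sample.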
So, I've got a few starting strategies. At best I want to tell apart two ideals, a sine wave and white noise, and if it's a sine wave then determine the frequency. What options are there to detect either?
A sine wave flows smoothly; white noise jerks about. I thought of a fish flowing along the wave, heading up or down for the next hillock or trough. The fish can't turn as rapidly as the sound wave. If the fish stays close to the target path then it's a smooth wave; if it's always turning or miles away then it's more likely to be random noise. Perhaps the fish could be rewarded with accumulated points when it's close to the wave, but those fall away if the fish becomes sad at being unable to track the wave (or turns rapidly; the pain of the anxious fish means white noise psychosis). So, a volume tracker of some sort is one strategy.
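Here is one rough way the fish could be written, as a sketch rather than a finished detector. The fish has a position and a speed, and its speed may only change by a limited amount per sample (the `max_turn` value, which is a guess). Instead of reward points it simply reports how far away from the wave it was on average; all names are invented for this example.

```python
import math
import random

def fish_tracking_error(samples, max_turn=0.02):
    """Average distance between a slow-turning follower and the wave.

    A small result means the wave is smooth enough to follow; a large
    result means the fish could not keep up, which suggests noise.
    """
    position = samples[0]
    velocity = 0.0
    total_error = 0.0
    for target in samples[1:]:
        wanted_velocity = target - position          # speed that would land on the wave
        change = wanted_velocity - velocity
        change = max(-max_turn, min(max_turn, change))  # the fish can only turn so fast
        velocity += change
        position += velocity
        total_error += abs(target - position)
    return total_error / (len(samples) - 1)

sample_rate = 44100
sine = [math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(4410)]
rng = random.Random(0)  # fixed seed so the noise is repeatable
noise = [rng.uniform(-1.0, 1.0) for _ in range(4410)]

sine_error = fish_tracking_error(sine)
noise_error = fish_tracking_error(noise)
```

With these settings the fish locks onto the 220 Hz sine within a few samples, so `sine_error` is tiny, while `noise_error` stays large. The right `max_turn` depends on the highest pitch and loudest level you expect, so a real version would need the volume normalised first.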
Another is to pick random points every so often... if that fish headed for waypoints every so many samples, then on a sine wave the results would still be easy going; a quantized sine wave would still appear smooth, a noisy one still random. Does that help?
And frequency. I thought I could track zero crossings, the times that the wave crosses above or below zero, then compare the gaps between them for similarity. This would be a great indicator on perfect source data: on an ideal sine wave the periods would be identical, and on white noise they would be totally random. So a reward number for regularity would be good, together with a proposed period.
(EDIT: What if a sound wave had a good repeating pattern rather than simple repeats? I expect a few natural sounds might do that, or do they?)
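The zero-crossing idea can be sketched like this. It measures the gaps between upward zero crossings, proposes a frequency from the average gap, and gives a regularity number between 0 and 1. The regularity formula (one over one plus the relative spread of the gaps) and the function names are assumptions chosen for illustration, not a standard measure.

```python
import math
import random
import statistics

def zero_crossing_periods(samples):
    """Gaps, in samples, between successive upward zero crossings."""
    crossings = []
    for i in range(1, len(samples)):
        if samples[i - 1] < 0.0 <= samples[i]:
            # Interpolate linearly to place the crossing between two samples.
            frac = -samples[i - 1] / (samples[i] - samples[i - 1])
            crossings.append(i - 1 + frac)
    return [b - a for a, b in zip(crossings, crossings[1:])]

def estimate_pitch(samples, sample_rate):
    """Return (proposed frequency in Hz or None, regularity from 0 to 1)."""
    periods = zero_crossing_periods(samples)
    if len(periods) < 2:
        return None, 0.0
    mean_period = statistics.mean(periods)
    spread = statistics.pstdev(periods) / mean_period
    regularity = 1.0 / (1.0 + spread)  # 1.0 means every gap was identical
    return sample_rate / mean_period, regularity

sample_rate = 44100
sine = [math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(4410)]
rng = random.Random(0)  # fixed seed so the noise is repeatable
noise = [rng.uniform(-1.0, 1.0) for _ in range(4410)]

sine_hz, sine_regularity = estimate_pitch(sine, sample_rate)
noise_hz, noise_regularity = estimate_pitch(noise, sample_rate)
```

On the pure sine the proposed frequency lands very close to 220 Hz with a regularity near 1, and the noise scores clearly lower. A wave with a repeating pattern inside each cycle (as the EDIT above wonders about) would cross zero several times per cycle and fool this simple version.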
Looking at different windows of samples might be a problem too; chances are there will be a few different pitches in any one window. An ideal size to analyse must be sought: one short enough to contain only one sound/syllable, but long enough to give a good indication of periodicity for low notes. Perhaps window lengths that match the tempo of the song would be an idea. It's an idea that only a musician and never a mathematician would pick, as power-of-two window sizes are best for fast Fourier transforms, and a beat at 120 B.P.M. isn't one. (I wonder why 44,100 Hz was chosen as the CD frequency and not 65,536 Hz for example? Maybe it's something to do with the laser of a compact disc, or a transistor or something.)
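The arithmetic behind the window-size worry is easy to check. This sketch works out how many samples a beat lasts, the nearest power of two, and the shortest window that still holds a couple of cycles of a low note. The function names, and the choice of two cycles as "enough", are assumptions for illustration.

```python
import math

def samples_per_beat(bpm, sample_rate=44100):
    """Length of one beat in samples."""
    return sample_rate * 60.0 / bpm

def nearest_power_of_two(n):
    """Closest power of two to n (for n of at least 1)."""
    lower = 2 ** math.floor(math.log2(n))
    upper = lower * 2
    return lower if n - lower <= upper - n else upper

def min_window_for_pitch(lowest_hz, sample_rate=44100, periods=2):
    """Smallest window, in samples, holding the given number of cycles."""
    return math.ceil(periods * sample_rate / lowest_hz)

beat = samples_per_beat(120)                # 22050.0 samples at CD rate
fft_friendly = nearest_power_of_two(beat)   # 16384
low_note_window = min_window_for_pitch(80)  # 1103 samples for two cycles of 80 Hz
```

So a beat at 120 B.P.M. is 22,050 samples, which sits awkwardly between 16,384 and 32,768, while two cycles of an 80 Hz note need only about 1,100 samples. That leaves plenty of room for windows much shorter than a beat.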
Well, I'll give it a go tomorrow. Please excuse my meandering thoughts but I wanted to type them out and I thought that this might be useful to other signal processing people in future. There's no conclusion to this blog post ... well, not tonight :) Happy dreams of sound fish to you!