As Auto-Tune is making its way into everything nowadays, public awareness of the process is rising. The recent Time article about Auto-Tune and its creator is a good read, but it oversimplifies the principles behind the algorithm.
In the article, Andy Hildebrand’s background in seismic analysis is presented as the key to his later work with pitch correction. Hildebrand apparently used autocorrelation for seismic mapping, and the article treats that experience as the seed of Auto-Tune:
He was debating the next chapter of his life at a dinner party when a guest challenged him to invent a box that would allow her to sing in tune. After he tinkered with autocorrelation for a few months, Auto-Tune was born in late 1996.
What the article fails to mention is that autocorrelation has been used for pitch detection since at least the 1970s. Rabiner and Schafer’s book “Digital Processing of Speech Signals,” published in 1978, describes the process in detail. Eventide pitch shifters have used autocorrelation since the 1980s to perform splice detection and pitch correction.
In addition, the basic pitch shifting method used by Hildebrand was described by Keith Lent in a Computer Music Journal article in 1989. The idea is to chop the vocal signal into small windowed grains, each holding a single period of the input signal, and then to spit the grains out at a rate corresponding to the new pitch. This can be viewed as a form of pitch synchronous granular synthesis, where both the grains and the grain rate are determined by analysis of an incoming signal. Lent’s technique has been used by most of the formant-preserving pitch shifters, including algorithms from IVL, Digitech and Eventide. The technique was independently developed at France Telecom, and is often referred to as PSOLA (Pitch Synchronous Overlap and Add).
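For readers who like to see the idea in code, here is a minimal sketch of a Lent/PSOLA-style shifter in Python. It is not Lent’s published algorithm or anyone’s product code: it assumes a constant, known f0, places analysis marks at fixed intervals, and skips the crossfade and normalization care a real implementation needs; the function name and parameters are mine.

```python
import numpy as np

def lent_style_shift(x, sr, f0, ratio):
    """Toy PSOLA/Lent-style pitch shifter: cut two-period, Hann-windowed
    grains at the analysis pitch marks, then overlap-add them at the grain
    spacing that corresponds to the new pitch.  Assumes a constant f0."""
    period = int(round(sr / f0))                    # analysis pitch period (samples)
    synth_hop = max(1, int(round(period / ratio)))  # output grain spacing = new period
    grain_len = 2 * period
    win = np.hanning(grain_len)

    # analysis pitch marks: here simply every `period` samples; a real
    # implementation places them with a pitch detector / peak picker
    marks = np.arange(0, len(x) - grain_len, period)

    out = np.zeros(len(x) + grain_len)
    t = 0
    while t + grain_len <= len(x):
        m = marks[np.argmin(np.abs(marks - t))]     # nearest analysis grain in time
        out[t:t + grain_len] += win * x[m:m + grain_len]
        t += synth_hop                              # emit grains at the new pitch rate
    # denser (or sparser) grain overlap changes the gain by roughly `ratio`
    return out[:len(x)] * (synth_hop / period)
```

Shifting up packs the grains closer together (some grains get reused), shifting down spreads them apart (some get skipped); either way each grain is an untouched slice of the original waveform, which is why the spectral envelope – the formants – tends to ride along unchanged.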
The key to Hildebrand’s innovation is how he combined Lent’s pitch shifting method with the robust pitch detection that autocorrelation provides. Lent’s original paper used a simple time-domain method for determining the input periodicity, which resulted in audible distortion for certain input signals. I have written Lent-style pitch shifters before, and the pitch detection algorithm is critical in avoiding octave jumps, unnaturally hoarse voices, and metallic sibilants. My code had all of those problems, although my boss at the time was able to fix many of the issues. Hildebrand’s patent describes how he uses sample rate reduction and some clever mathematical tricks to create a robust pitch detector that runs much faster than standard autocorrelation.
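To give a flavor of what the detection side involves, here is a bare-bones autocorrelation pitch detector with crude sample-rate reduction. This is the generic textbook approach, not the math in Hildebrand’s patent; the decimation factor, frequency limits, and peak picking are illustrative choices of mine.

```python
import numpy as np

def detect_f0(frame, sr, fmin=60.0, fmax=800.0, decim=4):
    """Generic autocorrelation pitch detector, sped up by crude sample-rate
    reduction: block-average by `decim`, autocorrelate, and pick the
    strongest peak within the plausible lag range.  The frame should span
    at least a couple of the longest expected pitch periods."""
    # cheap decimation: average each block of `decim` samples
    n = len(frame) - len(frame) % decim
    x = frame[:n].reshape(-1, decim).mean(axis=1)
    x = x - x.mean()

    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0 .. len(x)-1
    ac = ac / (ac[0] + 1e-12)                           # normalize lag 0 to 1

    sr_d = sr / decim
    lo = max(1, int(sr_d / fmax))                       # shortest allowed period
    hi = min(len(ac) - 1, int(sr_d / fmin))             # longest allowed period
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr_d / lag                                   # estimated f0 in Hz

# e.g. a 40 ms frame of a 220 Hz tone at 44.1 kHz comes back near 220 Hz:
# sr = 44100; t = np.arange(int(0.04 * sr)) / sr
# print(detect_f0(np.sin(2 * np.pi * 220 * t), sr))
```

Decimating by four shrinks the frame by four, so the quadratic-cost autocorrelation gets roughly sixteen times cheaper, which is the general motivation for sample rate reduction, though the patent layers further tricks on top of it.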
So, if you are watching The Backyardigans, and the overly pitch corrected vocals drive you crazy, don’t just blame Andy Hildebrand – blame Keith Lent.
EDIT, 2016: OK, I was wrong. Andy Hildebrand contacted me via email with the following, which is reprinted with his permission:
No, I don’t use the Lent algorithm: way too imprecise. You are close on the detection algorithm. The math in the patent is absolutely precise to what I do. But that math is used continuously to track pitch as well. I always know exactly what the pitch is. I run a simple rate converter from that point and when I have to repeat a cycle (going sharper) or delete a cycle (going flatter) I can because I know exactly what the period is at every instant.
Apologies to Andy Hildebrand for the misinformation about the Lent algorithm, and for misspelling his name as “Hildebrant.”
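Out of curiosity, here is a toy, offline illustration of the cycle repeat/delete idea that the email describes – emphatically not Hildebrand’s algorithm or anything from his patent. It assumes a constant, known integer period, uses a naive linear-interpolation resampler as the rate converter, and the function names are mine; the point is only to show why knowing the exact period lets you repeat or drop whole cycles without an obvious seam.

```python
import numpy as np

def repeat_delete_stretch(x, period, stretch):
    """Change duration by `stretch` while keeping pitch: copy one period at
    a time, but occasionally repeat a period (stretch > 1) or skip one
    (stretch < 1).  Splices land on period boundaries, so a truly periodic
    waveform stays phase-continuous across every repeat or delete."""
    out = []
    n_out = 0
    while True:
        # input position we'd like, quantized to a whole-period boundary
        read = int(round(n_out / stretch / period)) * period
        if read + period > len(x):
            break
        out.append(x[read:read + period])
        n_out += period
    return np.concatenate(out) if out else np.zeros(0)

def toy_pitch_shift(x, period, ratio):
    """Pitch-shift by `ratio` (>1 = sharper): rate-convert the signal, then
    restore the original duration with whole-period repeats/deletes."""
    # naive linear-interpolation rate converter (stand-in for a real one)
    t = np.arange(0, len(x) - 1, ratio)
    resampled = np.interp(t, np.arange(len(x)), x)
    new_period = max(1, int(round(period / ratio)))
    return repeat_delete_stretch(resampled, new_period, ratio)

# quick check with a synthetic sawtooth whose period is exactly 100 samples
if __name__ == "__main__":
    n = np.arange(40000)
    saw = (n % 100) / 100.0 - 0.5
    up = toy_pitch_shift(saw, period=100, ratio=1.25)  # up roughly a major third
    print(len(saw), len(up))                           # durations stay about equal
```

With ratio = 1.25 the resampled period divides evenly, so every splice lands exactly on a cycle boundary; with real, drifting pitch the detector has to keep the period estimate current at every instant, which is exactly the point of the email above.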