How Does Speech Synthesis Work?

Humans have been building talking machines for centuries—or at least trying to. Inventor Wolfgang von Kempelen nearly got there with bellows and tubes back in the 18th century. Bell Labs legend Homer Dudley succeeded in the 1930s. His “Voder” manipulated raw electronic sound to produce recognizable spoken language—but it required a highly trained operator and would have been useless for an iPhone.

When we talk about speech synthesis today, we usually mean one technology in particular: text to speech (TTS). This voice-modeling software translates written language into speech audio files, allowing Alexa to keep coming up with new things to say. So how does speech synthesis work in the era of AI voice assistants and smart speakers?

A few technologies do the trick. One common approach to TTS is called unit selection synthesis (USS). A USS engine stitches chunks of recorded speech together into new utterances. But to minimize audible pitch differences at the seams, the professional voice talent must record hours of speech in a fairly neutral, unvarying style. As a result, USS voices sound less natural than neural voices, and there is no flexibility to synthesize more expressive or emotional speaking styles without doubling or tripling the amount of recorded speech.

Instead, let’s look at neural text to speech, which uses machine learning to produce more lifelike results. Here are the basic steps a neural TTS engine uses to speak:

1. Linguistic Pre-Processing…

…in which the TTS software converts written language into a detailed pronunciation guide.

First and foremost, the TTS engine needs to understand how to pronounce the text. That requires translation into a phonetic transcription, a pronunciation guide with words represented as phonemes. (Phonemes are the building blocks of spoken words. For instance, “cat” is made up of three phonemes: the /k/ sound represented by the letter “c,” the short vowel /a/ represented by the letter “a,” and the /t/ at the end.)

The TTS engine matches combinations of letters to their corresponding phonemes to build this phonetic transcription. The system also consults pre-programmed rules, which are especially important for numerals and dates: before it can break “1920” down into its constituent sounds, for instance, the system needs to decide whether it means “one thousand nine hundred twenty” or “nineteen twenty.”

In addition to phonemes, the TTS engine identifies stresses: syllables with a slightly raised pitch, some extra volume, and/or an incrementally longer duration, like the “but” in “butter.” At the end of linguistic pre-processing, the text has been converted into a string of stressed and unstressed phonemes. That’s the input for the neural networks to come.
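The steps above can be sketched in a few lines of Python. The mini-lexicon and the year rule below are illustrative stand-ins, not a real TTS front end; production systems use large pronunciation dictionaries plus learned grapheme-to-phoneme models.

```python
# Toy linguistic pre-processing: normalize numerals, then look up phonemes.
# The lexicon uses ARPAbet-style symbols, with 1 marking a stressed vowel
# and 0 an unstressed one.
LEXICON = {
    "cat": ["K", "AE1", "T"],
    "butter": ["B", "AH1", "T", "ER0"],
    "nineteen": ["N", "AY1", "N", "T", "IY1", "N"],
    "twenty": ["T", "W", "EH1", "N", "T", "IY0"],
}
TENS = {"19": "nineteen", "20": "twenty"}  # just enough to read "1920"

def normalize(token):
    """Expand a four-digit numeral as a year, e.g. "1920" -> nineteen twenty."""
    if token.isdigit() and len(token) == 4:
        return [TENS[token[:2]], TENS[token[2:]]]
    return [token.lower()]

def to_phonemes(text):
    """Convert text to a flat string of stressed and unstressed phonemes."""
    phonemes = []
    for raw in text.split():
        for word in normalize(raw):
            phonemes.extend(LEXICON[word])
    return phonemes

print(to_phonemes("1920"))  # "nineteen twenty" as stressed/unstressed phonemes
```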

2. Sequence-to-Sequence Processing…

…in which a deep neural network (DNN) translates text into numbers that represent sound.

The sequence-to-sequence network is software that translates your prepared script into a two-dimensional mathematical model of sound called a spectrogram. At its simplest, a spectrogram is a Cartesian plane in which the X axis represents time and the Y axis represents frequency.
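To make the idea concrete, here is a minimal magnitude spectrogram computed with NumPy. The frame and hop sizes are arbitrary illustrative choices: the waveform is sliced into overlapping windowed frames, and each frame’s magnitude FFT becomes one row of frequency values.

```python
import numpy as np

def spectrogram(wave, frame_len=256, hop=128):
    """Magnitude spectrogram: rows are time frames, columns frequency bins."""
    window = np.hanning(frame_len)
    frames = np.array([wave[i:i + frame_len] * window
                       for i in range(0, len(wave) - frame_len + 1, hop)])
    return np.abs(np.fft.rfft(frames, axis=1))

sr = 8000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)  # one second of a 440 Hz tone
S = spectrogram(wave)
peak_hz = S.mean(axis=0).argmax() * sr / 256
print(S.shape, peak_hz)  # the energy peak sits near 440 Hz
```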

The system generates these spectrograms by consulting training data. The neural network has already processed recordings of a human speaker. It has broken down those recordings into phoneme models (plus lots of other parts, but let’s keep this simple). So it has an idea of what the spectrograms for a given speaker look like. When it encounters a new text, the network maps each speech element to a training-derived spectrogram. Long story short: The sequence-to-sequence network matches phonetic transcriptions to spectrogram representations inferred from the original training data.

What does the spectrogram do?

The spectrogram contains numerical values for each frame, or temporal snapshot, of the represented sound, and the TTS engine needs these numbers to build a voice audio file. Essentially, the sequence-to-sequence model maps text onto spectrograms, translating language into numbers. Those numbers describe the precise acoustic characteristics of the training speaker’s voice, as if that speaker were saying the words in the phonetic transcription.
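As a toy stand-in for the sequence-to-sequence network, imagine a lookup table that maps each phoneme to a stored spectrogram snippet and concatenates the snippets along the time axis. The per-phoneme frame counts and the 80 frequency bins below are assumptions for illustration; a real neural model predicts every frame jointly, accounting for context, duration, and prosody.

```python
import numpy as np

N_BINS = 80  # frequency bins per frame (a common, assumed choice)
rng = np.random.default_rng(0)

# Random placeholders standing in for spectrogram snippets derived
# from a speaker's training recordings.
TEMPLATES = {
    "K": rng.random((3, N_BINS)),    # 3 time frames for this consonant
    "AE1": rng.random((8, N_BINS)),  # stressed vowels get more frames
    "T": rng.random((2, N_BINS)),
}

def phonemes_to_spectrogram(phonemes):
    """Concatenate per-phoneme snippets into one spectrogram."""
    return np.concatenate([TEMPLATES[p] for p in phonemes], axis=0)

S = phonemes_to_spectrogram(["K", "AE1", "T"])  # "cat"
print(S.shape)  # (13, 80): 13 time frames by 80 frequency bins
```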

3. Audio File Production with the Vocoder…

…in which the spectrogram is converted into a playable audio file.

We’ve translated text into phonemes and phonemes into spectrograms and spectrograms into numbers: How do you turn those numbers into sound? The answer is another type of neural network called a vocoder. Its job is to translate the numerical data from the spectrogram into a playable audio file.

The vocoder is trained on the same audio data used to create the sequence-to-sequence model. That training teaches the vocoder to predict the best mapping from spectrogram data to a digital audio signal. Once the vocoder has performed its translation, the system delivers its final output: an audio file, synthesized speech in its consumable form.
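Neural vocoders such as WaveNet or HiFi-GAN are too involved to sketch here, but the classical Griffin-Lim algorithm solves the same inversion problem without any learning: it iteratively guesses the phase information that a magnitude spectrogram discards until a consistent waveform emerges. The frame sizes below are illustrative assumptions.

```python
import numpy as np

def stft(wave, frame_len=256, hop=128):
    """Complex spectrogram of overlapping Hann-windowed frames."""
    window = np.hanning(frame_len)
    frames = np.array([wave[i:i + frame_len] * window
                       for i in range(0, len(wave) - frame_len + 1, hop)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, frame_len=256, hop=128):
    """Overlap-add inverse of stft(), with window-squared normalization."""
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    window = np.hanning(frame_len)
    out = np.zeros((len(frames) - 1) * hop + frame_len)
    norm = np.zeros_like(out)
    for k, frame in enumerate(frames):
        out[k * hop:k * hop + frame_len] += frame * window
        norm[k * hop:k * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_iter=30):
    """Recover a waveform whose STFT magnitude matches `magnitude`."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        wave = istft(magnitude * phase)
        phase = np.exp(1j * np.angle(stft(wave)))
    return istft(magnitude * phase)

# Round trip: tone -> magnitude spectrogram -> reconstructed audio.
sr = 8000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
magnitude = np.abs(stft(tone))
audio = griffin_lim(magnitude, n_iter=10)
```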

That’s a highly simplified picture of how speech synthesis works, of course. Dig deeper by learning about the ReadSpeaker VoiceLab, which constructs bespoke DNN voices for brands and creators.

Start a Conversation

Question? Suggestions? Get in touch with us today. We look forward to hearing from you.

Contact ReadSpeaker AI