Text-to-speech (TTS) technology is changing the way we interact with our machines. It speaks to us from our smart speakers, virtual assistants, and voice bots. Combined with a range of smart technologies—automatic speech recognition (ASR), natural language understanding (NLU), dialog management, and natural language generation (NLG), at minimum—TTS lets us issue commands and get responses entirely through speech. The resulting voice user interfaces turn computing into a more human experience.
But to really transform personal computers into personable computers, robotic TTS voices won’t do. Thankfully, artificial intelligence (AI) allows us to create synthetic speech that’s barely distinguishable from the real thing. This AI-powered TTS is called neural text to speech. It’s how the ReadSpeaker VoiceLab crafts custom TTS voices for brands and creators. And thanks to AI, neural TTS is more natural, expressive, and welcoming than ever.
If you’ve ever mistaken machine-generated speech for a human speaker, neural TTS is probably the reason why. Here’s what it is, what it can do, and why it’s important for your business.
What Makes Text to Speech “Neural”?
In a nutshell, neural TTS is a form of machine speech built with neural networks. A neural network is a type of computer architecture modeled on the human brain. Your brain processes data through unbelievably complex webs of electrochemical connections between nerve cells, or neurons. As these connective pathways develop through repetition, they require less effort to activate. We call that “learning.”
Neural networks loosely mimic this action. They’re clusters of processing units—artificial neurons—that classify input data and transmit it to other artificial neurons. By setting parameters for desired results, then processing large datasets, neural networks learn to map optimal paths from neuron to neuron, input to output. Unlike traditional computing, you don’t write the rules for a neural network; there’s no “If A, then B.” Rather, the network derives the rules from the training data. It’s a form of machine learning that’s been applied to everything from image recognition to picking winning stocks.
But not all neural networks are deep neural networks (DNNs), the technology ReadSpeaker’s VoiceLab uses to produce more lifelike machine speech. We call a neural network “deep” when it consists of three or more processing layers:
The input layer initially classifies data, passing it through one or more “hidden” layers. These hidden layers further refine the signal, sorting it into more and more complex classifications. Finally, the output layer delivers the result: labeling an image correctly, for instance, predicting a stock fluctuation, or producing an audio signal that sounds uncannily like human speech.
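The layered structure described above can be sketched in a few lines of code. This is a minimal, illustrative forward pass through a small network (input layer, two hidden layers, output layer), not ReadSpeaker’s production architecture; the layer sizes and random weights are placeholders standing in for parameters a real system would learn from data.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)  # a common hidden-layer activation

def forward(x, weights, biases):
    """Pass an input vector through each layer in turn."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(W @ x + b)    # hidden layers refine the signal
    W, b = weights[-1], biases[-1]
    return W @ x + b           # output layer produces the result

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]           # input, two hidden layers, output
weights = [rng.standard_normal((m, n)) * 0.1
           for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

y = forward(rng.standard_normal(4), weights, biases)
print(y.shape)  # (2,)
```

In training, the weights would be adjusted repeatedly until the outputs match the desired results, which is the “learning” the article describes: the rules are derived from data rather than written by hand.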
Neural TTS Models: Duration, Pitch, and Acoustic Predictions
To create a neural TTS voice, we train DNN models on recordings of human speech. The resulting synthetic voice will sound like the input data—the source speaker—which is why neural TTS is often called voice cloning. But it takes multiple DNNs working in concert to pull off this imitation act. In fact, neural TTS voices require at least three distinct DNN models, which combine to create the overall voice reproduction:
- The acoustic model reproduces the timbre of the speaker’s voice, the color or texture that listeners identify as belonging to that speaker.
- The pitch model predicts the range of tones in the speech—not just how high or low the TTS voice will be, but also the variance in tone from one phoneme to the next.
- The duration model predicts how long the voice should hold each phoneme. It helps the TTS engine pronounce the word “feet” rather than “fffeet,” for instance.
The pitch and duration models predict what are known as prosodic parameters. That’s because they determine prosody: non-phonetic properties of speech like intonation, rhythm, and breaks. Meanwhile, the acoustic model predicts acoustic parameters, which capture the speaker’s voice timbre and the phonetic properties of speech. Today, we can combine these models for increasingly lifelike TTS voices with faster production times—and that’s just one of the capabilities DNNs bring to the field of machine speech.
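The division of labor among the three models can be sketched as a pipeline: durations first, then a pitch contour over the resulting frames, then acoustic frames colored by the speaker’s timbre. The functions below are hypothetical stand-ins with toy values, not trained predictors; a real system would learn each mapping from recorded speech.

```python
import numpy as np

def predict_duration(phonemes):
    """Duration model: how many frames to hold each phoneme."""
    return {"f": 3, "ee": 8, "t": 2}  # toy values, not real predictions

def predict_pitch(phonemes, durations):
    """Pitch model: a fundamental-frequency value per frame."""
    n_frames = sum(durations.values())
    return [120.0 + 5.0 * i for i in range(n_frames)]  # toy rising contour

def predict_acoustics(phonemes, durations, pitch):
    """Acoustic model: spectral frames capturing the speaker's timbre."""
    return np.zeros((len(pitch), 80))  # e.g. an 80-band mel spectrogram

phonemes = ["f", "ee", "t"]
durations = predict_duration(phonemes)      # prosodic parameter
pitch = predict_pitch(phonemes, durations)  # prosodic parameter
mel = predict_acoustics(phonemes, durations, pitch)  # acoustic parameters
print(mel.shape)  # (13, 80)
```

Note how the duration output (3 + 8 + 2 = 13 frames) flows into the pitch model, and both flow into the acoustic model; a separate component (a vocoder) would then turn the spectral frames into an audio waveform.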
New Possibilities for Neural TTS Technology
The most obvious advantage of neural TTS is that it sounds better. In a 2016 study, participants rated DNN-based TTS systems as more natural than other types of TTS—and DNN technology has come a long way since 2016. But neural text to speech is also leading to unexpected TTS-production techniques that simultaneously reduce costs and improve quality.
That’s important for brands. TTS allows you to engage consumers through voice-first channels like smart speakers, virtual assistants, and interactive voice response (IVR) systems. Here are a few ways DNN-based TTS makes these experiences better for brands and consumers alike.
Prosody Transfer
Say you like the sound of one TTS voice but the speaking style of another. Prosody transfer makes it possible to get the best of both. As long as the two voices are compatible—meaning they’re in the same language, and they aren’t too far apart in pitch range—we can combine the prosody from one voice with the sound of another. For brands, prosody transfer makes it possible to give a custom branded TTS voice more expressive range without starting from scratch for each new speaking style.
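Because prosody (pitch and duration) and timbre are predicted by separate models, the combination can be sketched as routing one voice’s prosodic output into another voice’s acoustic model. Everything below is illustrative: the `Voice` class, its toy formulas, and the voice names are invented for this sketch, not ReadSpeaker’s implementation.

```python
import numpy as np

class Voice:
    """Hypothetical stand-in for a trained set of per-voice models."""
    def __init__(self, name, base_pitch, timbre):
        self.name, self.base_pitch, self.timbre = name, base_pitch, timbre

    def prosody(self, phonemes):
        """Pitch contour and frame counts (the prosodic parameters)."""
        pitch = [self.base_pitch + 10.0 * i for i in range(len(phonemes))]
        durations = [5] * len(phonemes)
        return pitch, durations

    def acoustics(self, pitch, durations):
        """Spectral frames colored by this speaker's timbre."""
        frames = sum(durations)
        base = np.full((frames, 80), self.timbre)
        return base + np.array(pitch).repeat(durations)[:, None]

expressive = Voice("expressive", base_pitch=180.0, timbre=0.3)
branded = Voice("branded", base_pitch=170.0, timbre=0.7)

# Prosody transfer: the expressive voice's prosody, the branded voice's sound.
pitch, durations = expressive.prosody(["h", "e", "l", "o"])
mel = branded.acoustics(pitch, durations)
print(mel.shape)  # (20, 80)
```

The compatibility conditions in the text map directly onto this sketch: if the two voices’ pitch ranges are too far apart, the borrowed contour falls outside what the acoustic model saw in training.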
Transfer Learning and Speaker Adaptation
An advanced machine learning technique called transfer learning reduces the amount of training data required to produce a new neural TTS voice. Large datasets from existing TTS voices fill in the learning gaps left by shorter new voice recordings. While several hours of voice recordings remain the ideal for training voice models, speaker adaptation allows us to emulate a new voice even when only shorter recordings are available. In other words, we can train these multi-voice models faster, with less original training data, and still produce lifelike TTS voices. This will help drive down costs and expand access to original, branded text-to-speech personas.
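The core idea of transfer learning can be shown with a deliberately tiny model: train on a large “source speaker” dataset, then adapt to a new target from the pretrained weights instead of from scratch, so far less new data is needed. This is a toy linear model standing in for a DNN; the datasets, weights, and step counts are invented for illustration.

```python
import numpy as np

def train(X, y, w=None, lr=0.1, steps=200):
    """Gradient-descent training of a linear model y ≈ X @ w."""
    if w is None:
        w = np.zeros(X.shape[1])  # training from scratch
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])  # the "source speaker"

# Large dataset: enough to learn the mapping from scratch.
X_big = rng.standard_normal((1000, 3))
w_pretrained = train(X_big, X_big @ true_w)

# New "speaker": only a little data, but we warm-start from the
# pretrained weights rather than zeros (speaker adaptation).
new_w = true_w + 0.1  # a nearby target voice
X_small = rng.standard_normal((20, 3))
w_adapted = train(X_small, X_small @ new_w, w=w_pretrained.copy(), steps=50)
```

The adapted model only has to learn the small difference between the source and target, which is why a short recording session can suffice once a strong pretrained model exists.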
“Emotional” Speaking Styles
Training data determines the sound of every TTS voice. If you record three hours of someone speaking angrily, with large pitch variances and high intensity, you’ll end up with an “angry” TTS voice. With traditional text to speech, you needed a good 25 hours of recorded data to produce a decent voice—and that voice had to be relatively neutral in expression. With DNN models, you can get terrific results by training models on just a few hours of recorded speech—and even less with speaker adaptation.
These advances allow the ReadSpeaker VoiceLab to record three to five hours of a single speaking style or affective mood, then another hour or so of the same speaker performing in styles that suggest different moods. (Lacking these recordings, you could always find a more expressive TTS voice and use prosody transfer to mimic the performance.) That allows us to create voices with emotional variation, adjustable at the point of production through our proprietary voice-text markup language (VTML). So you can produce an enthusiastic TTS message and an apologetic one, all with the same recognizable TTS voice, and all through the same TTS production engine.
Combine this capability with conversational AI to create automated chatbots, IVR systems, and virtual assistants that adjust their speaking tone to match the mood of the person they’re talking to. That, in turn, improves the customer experience through fully automated voice channels. Maybe that’s why, as of 2019, more than 90% of companies were investing in voice. By creating a more natural audio experience, neural text to speech is helping to power this shift toward voice marketing strategies.