The Internet of Voice has arrived, and it’s changing the way we interact with our devices. Siri points out your next turn in an unfamiliar town. Google Assistant scours the internet for directions on grilling salmon, and reads them to you while you work. The voicebot at the other end of the customer service line gets you results, without waiting or push-button menus. Call it the age of conversational computing—and the computer’s end of these conversations comes courtesy of a digital technology called text to speech, or TTS for short.
In this article, we’ll define text to speech, walk you through the history of this continually developing field, and point out where you can experience synthetic speech in your day-to-day life—or even use it to achieve mission-critical business goals. Here’s your definitive introduction to TTS technology, starting with a fundamental question:
What does “text to speech” mean?
Defining Text to Speech (TTS)
Text-to-speech technology is software that takes text as an input and produces audible speech as an output. In other words, it goes from text to speech, making TTS one of the more aptly named technologies of the digital revolution. A TTS system includes the software that predicts the best possible pronunciation of any given text. It also bundles in the program that produces voice sound waves; that’s called a vocoder.
Text to speech is a particularly multidisciplinary field, requiring detailed knowledge in a variety of sciences. If you wanted to build a TTS system from scratch, you’d have to study the following subjects:
- Linguistics, the scientific study of language. In order to synthesize coherent speech, TTS systems need a way to recognize how written language is pronounced by a human speaker. That requires knowledge of linguistics, down to the level of the phoneme—the individual units of sound that, combined, make up speech, such as the /k/ sound at the start of cat. To achieve truly lifelike TTS, the system also needs to predict appropriate prosody—the elements of speech beyond individual phonemes, such as stress, pauses, and intonation.
- Audio signal processing, the creation and manipulation of digital representations of sound. Audio (speech) signals are electronic representations of sound waves. The speech signal is represented digitally as a sequence of numbers. In the context of TTS, speech scientists use different feature representations that describe discrete aspects of the speech signal, making it possible to train AI models to generate new speech.
- Artificial intelligence, especially deep learning, a type of machine learning built on a computing architecture called a deep neural network (DNN). A neural network is a computational model loosely inspired by the human brain: a layered web of simple processing units, each of which transforms its input before passing the result on to the next layer. During training, a DNN adjusts the connections between those units until it produces accurate results. This architecture packs a lot of computing power, making it ideal for handling the huge number of variables required for high-quality speech synthesis.
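To make the signal-processing bullet concrete, here's a minimal Python sketch (an illustration, not production DSP code): a waveform is digitally represented as a plain sequence of numbers, and a simple per-frame feature, short-time energy, is computed from it—the kind of compact per-frame representation speech scientists use to describe a signal.

```python
import math

SAMPLE_RATE = 16000  # samples per second, a common rate for speech audio

def synthesize_tone(freq_hz, duration_s, amplitude=0.5):
    """Digitally represent a sound wave as a sequence of numbers (samples)."""
    n_samples = int(SAMPLE_RATE * duration_s)
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE)
            for t in range(n_samples)]

def frame_energy(signal, frame_size=400):  # 400 samples = 25 ms at 16 kHz
    """A toy per-frame feature: mean squared amplitude of each frame."""
    return [sum(s * s for s in signal[i:i + frame_size]) / frame_size
            for i in range(0, len(signal) - frame_size + 1, frame_size)]

signal = synthesize_tone(220.0, 0.1)   # 100 ms stand-in for a voiced sound
features = frame_energy(signal)        # one energy value per 25 ms frame
print(len(signal), len(features))
```

Real systems use far richer representations (spectral features, pitch tracks, and so on), but the principle is the same: reduce the raw sample sequence to descriptive numbers a model can learn from.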
The speech scientists at ReadSpeaker AI’s VoiceLab conduct research and practice in all these fields, continually pushing TTS technology forward. These researchers produce lifelike TTS voices for brands and creators, allowing companies to set themselves apart across the Internet of Voice, whether that’s on a smartphone, through smart speakers, or on a voice-enabled mobile app. In fact, TTS voices are emerging in an ever-expanding range of devices, and for a growing number of uses.
TTS Technology: How Is It Used?
When ReadSpeaker AI first began synthesizing speech in 1999, TTS was primarily used as an accessibility tool. Text to speech makes written content across platforms available to people with visual impairments, low literacy, cognitive disabilities, and other barriers to access. And while accessibility remains a core value of ReadSpeaker AI’s solutions, the rise of voice computing has led to an ever-growing range of applications for TTS across devices. Here are just a few of the powerful use cases for TTS in today’s voice-first world:
- Conversational interactive voice response (IVR) systems, as in customer service call centers
- Voice commerce applications, such as shopping on an Amazon Alexa device
- Voice guidance and navigation tools, like GPS mapping apps
- Smart home devices and other voice-enabled Internet of Things (IoT) tools
- Independent virtual assistants like Apple’s Siri, but for your own brand
- Experiential marketing and advertising solutions, like interactive voice ads on music streaming services, or branded smart speaker apps
- Video game development, with dynamic runtime TTS for accessibility features, scene prototyping, and AI non-player characters
- Company training and marketing videos that allow creators to change voice-overs without tracking down original voice talent for ongoing recording sessions
Chances are, you’ve already experienced TTS through some or all of these examples. If you run a business, you might have even helped produce a voice-first device or experience. Given this broad usage, it’s safe to say TTS is here to stay. But it isn’t exactly a new technology.
Types of TTS Technology, Then and Now
Mechanical attempts at synthetic speech date back to the 18th century. Electrical synthetic speech has been around since Homer Dudley’s Voder of the 1930s. But the first system to go straight from text to speech in the English language arrived in 1968, and was designed by Noriko Umeda and a team from Japan’s Electrotechnical Laboratory.
Since then, researchers have come up with a cascade of new TTS technologies, each of which operates in its own distinct way. You may ask, “How does text to speech work?” The answer depends on which TTS technology you’re using. Here’s a brief overview of the dominant forms of TTS, past and present, from the earliest experiments to the latest AI capabilities.
Formant Synthesis and Articulatory Synthesis
Early TTS systems used rule-based technologies such as formant synthesis and articulatory synthesis, which achieved similar results through slightly different strategies. Pioneering researchers recorded a speaker and extracted acoustic parameters from that recorded speech: formants (the resonant frequencies that define speech sounds) in formant synthesis, and manner of articulation (nasal, plosive, vowel, etc.) in articulatory synthesis. Then they programmed rules that recreated those parameters with a digital audio signal. The resulting speech was quite robotic, because these approaches necessarily abstract away much of the variation found in human speech—things like pitch variation and stress—since they only allow programmers to write rules for a few parameters at a time. But formant synthesis isn't just a historical novelty: it's still used in the open-source TTS synthesizer eSpeak NG, which synthesizes speech for NVDA, one of the leading free screen readers for Windows.
Diphone Synthesis
The next big development in TTS technology was diphone synthesis, which researchers began developing in the 1970s and which remained in popular use around the turn of the millennium. Diphone synthesis creates machine speech by blending diphones: single units that combine a phoneme with the transition into the next phoneme—not just the /k/ in the word cat, but the /k/ plus half of the following /ae/ sound. Researchers record between 3,000 and 5,000 individual diphones, which the system sews together into a coherent utterance.
Diphone synthesis TTS technology also includes software models that predict the appropriate duration and pitch of each diphone for a given input. Layering these predictions on top of the diphone inventory, the system pastes diphone signals together, then processes the combined signal to correct pitch and duration. The end result is more natural-sounding synthetic speech than formant synthesis creates—but it's still far from perfect, and listeners can easily tell a human speaker apart from this synthetic speech.
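The concatenation step can be sketched in a few lines of Python. The inventory below is hypothetical and tiny (real systems record thousands of diphones from a human speaker), but it shows the core idea: look up the recorded fragment spanning each phoneme pair and splice the fragments together.

```python
# Toy sketch of diphone concatenation. Each "waveform" is just a short
# list of samples standing in for a recorded diphone: one phoneme plus
# the transition into the next. The inventory and values are made up.
DIPHONE_INVENTORY = {
    ("sil", "k"): [0.0, 0.1, 0.3],   # silence into /k/
    ("k", "ae"):  [0.4, 0.6, 0.5],   # /k/ into /ae/
    ("ae", "t"):  [0.5, 0.2, 0.1],   # /ae/ into /t/
    ("t", "sil"): [0.1, 0.0, 0.0],   # /t/ into silence
}

def synthesize(phonemes):
    """Concatenate the diphone fragments spanning each phoneme pair."""
    pairs = zip(["sil"] + phonemes, phonemes + ["sil"])
    signal = []
    for pair in pairs:
        signal.extend(DIPHONE_INVENTORY[pair])
    return signal

# "cat" as a phoneme sequence; a real engine would then adjust the
# pitch and duration of the pasted-together signal, as described above.
speech = synthesize(["k", "ae", "t"])
print(len(speech))
```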
Unit Selection Synthesis
By the 1990s, a new form of TTS technology was taking over: unit selection synthesis, which is still ideal for low-footprint TTS engines today. Where diphone synthesis added appropriate duration and pitch through a second processing system, unit selection synthesis omits that step: It starts with a large database of recorded speech—around 20 hours or more—and selects the sound fragments that already have the duration and pitch the text input requires for natural-sounding speech.
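The selection step is classically framed as a search that balances two costs: a target cost (how well a candidate recorded unit matches the desired sound) and a join cost (how smoothly adjacent units splice together). The Python sketch below is a schematic of that idea with hypothetical candidate names and made-up costs, not any particular engine's implementation.

```python
def select_units(candidates, target_cost, join_cost):
    """Viterbi-style search for the cheapest sequence of recorded units.

    candidates: list of lists, one list of candidate unit names per position.
    """
    # best[i][c] = (lowest total cost of a path ending at unit c, predecessor)
    best = [{c: (target_cost(0, c), None) for c in candidates[0]}]
    for i in range(1, len(candidates)):
        layer = {}
        for c in candidates[i]:
            prev = min(candidates[i - 1],
                       key=lambda p: best[i - 1][p][0] + join_cost(p, c))
            layer[c] = (best[i - 1][prev][0] + join_cost(prev, c)
                        + target_cost(i, c), prev)
        best.append(layer)
    # Trace the cheapest path back from the final position
    unit = min(best[-1], key=lambda c: best[-1][c][0])
    path = [unit]
    for i in range(len(candidates) - 1, 0, -1):
        unit = best[i][unit][1]
        path.append(unit)
    return list(reversed(path))

# Hypothetical per-unit mismatch scores (lower = closer to the target
# pitch and duration) and a join cost that rewards units recorded
# consecutively, which splice together for free.
TARGET = {"k1": 1.0, "k2": 0.2, "ae1": 0.3, "ae2": 0.9, "t1": 0.1}
target_cost = lambda i, c: TARGET[c]
join_cost = lambda p, c: 0.0 if (p, c) in {("k2", "ae1"), ("ae1", "t1")} else 0.5

path = select_units([["k1", "k2"], ["ae1", "ae2"], ["t1"]],
                    target_cost, join_cost)
print(path)
```

Because the chosen units already carry natural pitch and duration, the engine needs far less signal modification than diphone synthesis did—which is exactly the advantage described above.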
Unit selection synthesis provides human-like speech without a lot of signal modification, but it’s still identifiably artificial. Meanwhile, throughout all these decades of development, computer processing power and available data storage were making rapid gains. The stage was set for the next era in TTS technology, which, like so much of our current era of computing, relies on artificial intelligence to perform incredible feats of prediction.
Neural Text to Speech
Remember the deep neural networks we mentioned earlier? That's the technology that drives today's advances in TTS, and it's key to the lifelike results that are now possible. Like its predecessors, neural TTS starts with voice recordings. That's one input. The other is text: the written script your source voice talent used to create those recordings. Feed these inputs into a deep neural network and it will learn the best possible mapping between a given piece of text and the associated acoustic features.
Once the model is trained, it can predict realistic acoustic features for new texts: paired with a vocoder trained on the same data, a neural TTS system can produce speech that's remarkably similar to the source voice talent's when given virtually any new text. That similarity between source and output is why neural TTS is sometimes called "voice cloning."
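As a heavily simplified illustration of that training idea, the toy below fits a single linear unit to hypothetical pairs of text-derived and acoustic feature values by gradient descent. A real neural TTS model does the same thing at vastly greater scale, with deep networks and rich feature representations on both sides.

```python
# Toy illustration of learning a text-to-acoustic-feature mapping.
# Each pair is (text-derived feature, acoustic feature observed in the
# recordings); the values are invented for demonstration.
pairs = [(0.0, 0.1), (1.0, 2.1), (2.0, 4.0), (3.0, 6.1)]

w, b = 0.0, 0.0                 # model parameters (a single linear unit)
lr = 0.05                        # learning rate
for _ in range(2000):            # gradient-descent training loop
    for x, y in pairs:
        err = (w * x + b) - y    # prediction error on this example
        w -= lr * err * x        # gradient of squared error w.r.t. w
        b -= lr * err            # ...and w.r.t. b

# The trained model now predicts a feature value for unseen input
print(round(w * 4.0 + b, 2))
```

The model never saw the input 4.0 during training, yet it generalizes from the examples—the same property that lets a trained neural TTS voice speak text its source talent never recorded.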
There are all sorts of signal processing tricks you can use to alter the resulting synthetic voice so that it’s not exactly like the source speaker; the key fact to remember is that the best AI-generated TTS voices still start with a human speaker—and TTS technology is only getting more human. Current research is leading to TTS voices that speak with emotional expression, single voices in multiple languages, and ever-more lifelike audio quality. Hear some examples of these capabilities.