Voice interface technology is everywhere—maybe even in your own home. Digital voice assistants like Alexa, Siri, and Google Assistant control more than 3 billion devices, a figure that’s expected to more than double by 2023. These familiar personas are the public-facing side of popular voice user interfaces, or VUI.
But VUI isn’t just for smart speakers. This technology is improving crucial business processes, from hands-free control over manufacturing lines to booking a boardroom on the go. Here’s what corporate decision-makers need to know about voice interface technology: what it is, what it does, and how it can help meet business goals.
Voice User Interface: A Definition
In computer science, the user interface is the hardware and software that allows a person to interact with the machine. It may include a keyboard, mouse, or touchscreen along with the software that generates the on-screen elements you click, drag, or type into.
Personal computers from the early 1980s were controlled with a text-only interface. Users had to type highly specific text commands to get the machine to do anything. Graphical user interfaces (GUI)—like the one on Apple’s 1984 game-changer, the Macintosh—replaced those exacting command prompts with visual icons users could manipulate with a mouse, creating the desktop image we still use today.
Like the GUI and the command-line interface before it, VUI gives users a new way to issue commands to digital devices—this time without needing a screen, keyboard, or mouse. In short, VUI is technology that allows people to interact with digital devices through the medium of the voice.
Elements of the Voice User Interface
A pure voice user interface accepts inputs and provides outputs using only the spoken word. Contrast this with a bimodal user interface, which combines voice interaction with another medium, like text on a screen. One example of a bimodal user interface is a smart TV that allows you to turn down the volume with voice commands; it’s voice-enabled, but it will still show your volume bar diminishing graphically on the screen.
For now, we’ll limit the discussion to end-to-end VUI, a system that accepts spoken commands and responds to those commands using machine-generated speech. In a pure-speech VUI, three distinct technologies come together to create an increasingly natural interaction between people and their tools:
- Automated speech recognition (ASR). The VUI’s first task is to transcribe the spoken command into a machine-readable format, typically text. In the early days of VUI’s growth—around the mid-2000s—ASR was limited to a prescribed list of commands, and early speech-to-text engines were easily confused by variations in the speaker’s speed, tone, and accent. That’s no longer the case, as we’ll discuss in the third item on this list.
- Text to speech (TTS). A voice-enabled device will translate a spoken command into text, execute that command, and prepare a response—a scripted text reply. A TTS engine translates this text into synthetic speech to complete the interaction loop with the user. There’s a wide variance in the quality of TTS even in today’s voice user interfaces, ranging from robotic and affectless to natural, warm, and lifelike, as in ReadSpeaker’s solutions.
- Artificial intelligence (AI). Early VUI wasn’t easy to use. It tripped up on the subtle variances in accents and dialects from one speaker to the next. The pre-scripted TTS responses were buzzy and inhuman, even hard to understand. Artificial intelligence is helping to solve these problems. Deep neural networks learn from actual human speech, improving recognition over time. The AI-driven layer that interprets what a transcribed command actually means is called natural language understanding (NLU), and it’s what allows Alexa to recognize that “play my favorite playlist” and “let’s listen to some music” mean the same thing. On the TTS side, deep learning leads to voice models that reproduce the subtle variations in user language to create much more human-like speech, even mirroring the user’s dialect when appropriate. This is called natural language generation (NLG).
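The three stages above can be sketched as a single control loop. The following is a minimal, illustrative Python sketch, not a real VUI: the ASR and TTS steps are stubbed out (a production system would call real engines), and the keyword-matching `understand` function stands in for a genuine NLU model. All function and variable names here are hypothetical.

```python
def recognize_speech(audio: bytes) -> str:
    """ASR stub: a real engine would transcribe audio into text."""
    return audio.decode("utf-8")  # pretend the "audio" is already text

# Toy NLU: map many phrasings to one intent, the way Alexa maps
# "play my favorite playlist" and "let's listen to some music"
# to the same music-playing intent.
INTENT_KEYWORDS = {
    "play_music": {"play", "playlist", "music", "listen"},
    "set_timer": {"timer", "alarm", "remind"},
}

def understand(text: str) -> str:
    """Return the first intent whose keywords overlap the utterance."""
    words = set(text.lower().split())
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return "unknown"

# Pre-scripted text responses, one per intent.
RESPONSES = {
    "play_music": "Playing your favorite playlist.",
    "set_timer": "Timer set.",
    "unknown": "Sorry, I didn't catch that.",
}

def synthesize(text: str) -> bytes:
    """TTS stub: a real engine would render this text as audio."""
    return text.encode("utf-8")

def handle_utterance(audio: bytes) -> bytes:
    text = recognize_speech(audio)   # 1. ASR: speech to text
    intent = understand(text)        # 2. NLU: text to intent
    reply = RESPONSES[intent]        # 3. prepare a scripted reply
    return synthesize(reply)         # 4. TTS: text back to speech
```

Note how two different phrasings, `b"play my favorite playlist"` and `b"let's listen to some music"`, both flow through `handle_utterance` to the same response, which is the essential job NLU performs in a real voice assistant.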
While artificial intelligence is revolutionizing both automated speech recognition and text-to-speech engines, ASR and TTS remain very different technologies. When user interface providers design for voice, they typically need at least two partners: a company that builds ASR systems and another that specializes in TTS.
Looking for a TTS provider for a custom VUI? Check out our customer testimonials to find out what it’s like to work with ReadSpeaker.
A Brief History of Voice Interface Technology
Voice user interfaces didn’t become a household technology until Apple released its voice assistant Siri on the iPhone 4S in 2011. But the roots of VUI stretch back much further, with ASR and TTS each following their own trajectories.
The International Computer Science Institute dates machine speech recognition back to 1952, when Bell Labs introduced a device called Audrey. Audrey could recognize the spoken digits zero through nine with up to 99% accuracy, but its digit-only vocabulary limited it to tasks like verbally dialing telephone numbers. It also cost a fortune and occupied a six-foot relay rack. Audrey was no consumer product, but it provided proof of concept.
A decade later, at the 1962 World’s Fair, IBM unveiled the “Shoebox,” a machine that could understand 16 English words. In 1971, the U.S. Defense Advanced Research Projects Agency (DARPA) began funding work on Harpy, the first voice recognition device to reach a vocabulary of over 1,000 words. Still, through the 1970s and 1980s, ASR remained firmly outside the consumer space.
That finally changed in 1990, when a company called Dragon Systems released a limited consumer ASR program. Seven years later, Dragon began selling the first widely available speech recognition software that could follow full sentences: Dragon NaturallySpeaking. Doctors still use an updated version of this product to take hands-free medical notes.
By the 2010s, advances in natural language understanding led to the first generation of voice assistants, as well as IBM’s Watson system, which famously competed on Jeopardy! in 2011. Today, NLU allows voice recognition systems to understand subtle differences in spoken language, creating more natural interaction between devices and the people who use them.
Synthetic voice technology goes back even further than ASR. In an appearance on the Alpha Voice podcast, ReadSpeaker’s Niclas Bergström outlines the history of TTS, starting with a 1779 synthetic voice machine made of reeds and resonators.
Bell Labs began experimenting with electronic voice synthesizers in the late 1920s, eventually leading to engineer Homer Dudley’s invention of the first fully functional speech-generating machine, the Voder, a decade later.
The first true text-to-speech system emerged in Japan in 1968, Bergström says. The 1970s saw an explosion in TTS technology, with major commercial systems like Texas Instruments’ Speak & Spell and Ray Kurzweil’s line of reading machines for people with visual impairments.
By the 1990s, text-to-speech powered the growth of interactive voice response (IVR), the automated phone systems still in use today.
In 1999, ReadSpeaker was founded, soon becoming the first company to introduce TTS on cloud-computing systems. This innovation made it easy for developers designing for voice to incorporate TTS into otherwise unrelated software and, later, mobile apps. Today, ReadSpeaker drives TTS technology forward with pioneering use of deep neural networks—technology that’s continually making VUI more dynamic and user-friendly. Here are a few ways businesses are using VUI to drive value today.
Voice User Interface Examples from Today’s Businesses
While the most familiar VUIs belong to cell phones and smart speakers, companies are using voice interface technology to streamline collaboration, expand branding opportunities, create better user experiences for their customers, and more. Here are a few voice user interface examples from the field:
- Manufacturers are using VUI to control production lines, engaging with the local industrial internet of things without putting down their tools.
- Teachers use VUI devices in the classroom, where they answer student questions, provide instant definitions and facts, and even help with language education.
- In the healthcare field, medical professionals enjoy hands-free control of dictation devices, simplifying the creation of medical records.
- Adding a VUI to server-based computer systems allows employees to schedule meeting rooms, change appointments, and record notes within a closed, secure system—and without touching a computer terminal.
- Companies are providing enterprise-ready voice assistant services. For instance, Synqq is a smart note-taking app that uses NLU to record meetings and highlights the important moments, like discussion of action items.
- Conversational AI platforms like MindMeld provide a starting point for companies looking to implement VUI in their own unique customer service systems.
As these examples suggest, businesses use VUI in two general ways: in the office, to streamline internal processes, and in their products, to create a better user experience. In either application, a unique, branded voice can strengthen recognition, loyalty, and engagement between the company and the listener. Find out how ReadSpeaker powers VUIs and other text-to-speech applications here.
Do You Need Neural Text-To-Speech for a Voice User Interface?
ReadSpeaker Custom Voices are virtually indistinguishable from a human speaker. They’re custom-built to match your brand. And they’re available in more than 30 languages, with more on the way. Voice interfaces don’t provide an opportunity for traditional visual identifiers like logos and color schemes. That leaves the voice itself to do a lot of the work of differentiating one brand from the next—and ReadSpeaker can help.
Whether you choose a custom-branded voice or an off-the-shelf creation, ReadSpeaker’s TTS services are ideal for anyone designing for voice user interfaces. Our solutions operate on the cloud, through your server, or even offline, through a standalone device. All ReadSpeaker TTS solutions were built by teams of engineers, linguists, and deep neural networks—and we’ve been doing this work since 1999. Contact us today to discuss how we can help you design and implement voice interface technology for your mission-critical systems.