Implementing a custom, branded voice assistant for your product, service, or app begins with setting goals and deciding what you want your brand to sound like. Skipping this process or taking shortcuts by not getting enough people involved can lead to a brand mismatch once the voice assistant has been implemented. Selecting the voice talent and choosing a Text-to-Speech (TTS) voice that matches the attributes and tone of voice users expect to hear from your brand is a critical step toward making a lasting impression with your voice assistant.
When you’re ready to determine which voice will represent your brand, get as many different stakeholders as possible involved to achieve consensus around the type of voice you want to implement. During that crucial stage of the process, get consensus on these four key questions:
- Who’s talking? Define your voice to match your brand
- What’s your voice assistant’s personality? Determine tone and accent
- Where will it be used? Make decisions based on the environment and context
- How does it perform? Test quantitatively and qualitatively
1. Carefully define your TTS to match your brand identity
Whether you plan to use an existing TTS voice, a custom voice, neural TTS, or concatenated TTS, it’s a good idea to start by defining the identity of your ideal voice. The voice you choose is going to represent your brand, so it’s important to spend time on this with internal stakeholders. This should ideally include a discussion within your team, market research about how your customers perceive your brand, and additional research about the ideal age, gender, and desired accent of your voice assistant.
The voice you choose is going to represent your brand, so it’s important to spend time on this with internal stakeholders.
Spending time upfront to define who your voice assistant will be is critical to its eventual success. Don’t let the default TTS voice that comes built into some platforms define who you are as a company. You’ll risk sounding like everyone else and creating a mismatch between how you want to be perceived and how users actually perceive your brand.
A good example of a discrepancy between identity and voice is Lost Voice Guy, also known as Lee Ridley. He’s a comedian from the UK who lost his ability to speak at a young age. He now uses a speech-enabled tablet PC in his act. The default voice he uses has a BBC-like British accent, whereas Lee is a down-to-earth guy from the North of England. He uses this mismatch to comedic effect, and it works really well. Lee’s application of a voice mismatch is unique. Unless you’re attempting comedy, you’ll want the voice identity to match your brand.
An example of a great voice match is the voice the BBC gave Beeb, their new voice assistant. The voice identity was carefully crafted after months of research. As a result, they decided to break away from the prestigious Southern British accent traditionally associated with the BBC and gave the app a Northern British accent instead. They also decided to use a male voice as opposed to a female voice to avoid perpetuating gender stereotypes.
2. It’s not always what the TTS says that matters, but how it’s said
How you say something can be as important as what you say. Whether it’s putting emphasis on a specific word or changing the intonation of a full sentence, it can change how speech is perceived. Therefore, it may be necessary for your TTS voice to adapt according to the context of your users. For example, adopting an apologetic tone when correcting a mistake made by the voice assistant can alter the user experience from a purely negative one to something more palatable.
A cheerful voice isn’t always the right default. It may sound good on paper, until you hear it announce a flight delay or bad weather. Some TTS voices support markup that changes the speaking style to better match the context of the conversation. As TTS continues to advance, we may see voice assistants adapt their speaking style automatically based on some form of semantic analysis or user context.
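As a sketch of what that markup can look like: several providers expose speaking styles through vendor-specific SSML extensions. The element name (`mstts:express-as`) and the style names below follow one vendor’s convention and are assumptions for illustration; check your TTS provider’s documentation for the actual syntax and supported styles.

```python
# Sketch: wrap text in SSML that requests a speaking style.
# The <mstts:express-as> element and style names ("empathetic",
# "cheerful") follow one vendor's SSML extension and vary by provider.
def styled_ssml(text: str, style: str) -> str:
    """Return an SSML document asking the engine for a given speaking style."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts">'
        f'<mstts:express-as style="{style}">{text}</mstts:express-as>'
        "</speak>"
    )

# An apologetic correction vs. a cheerful greeting, from the same voice:
apology = styled_ssml("Sorry, I misheard you. Could you repeat that?", "empathetic")
greeting = styled_ssml("Good morning! How can I help?", "cheerful")
```

The application code stays the same either way; only the style attribute changes, which is what makes context-dependent delivery practical to build.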
Adopting an apologetic tone when correcting a mistake made by the voice assistant can alter the user experience from a purely negative one to something more palatable.
Today, most TTS engines come with a pre-processing module that converts things like times, dates, and telephone numbers into standardized speech. Proper nouns like artists’ names, company names, or product names can sometimes be tricky—especially when the same word can have multiple pronunciations or is spelled in an unexpected way.
In addition, abbreviations, acronyms, and initialisms that are common to your product or service may require training your TTS. For example, if your voice assistant announced your bank balance as “dollar sign one two seven,” you’d find that really strange. Make sure that you choose a TTS that allows for flexibility and customization via a markup language and a user-defined lexicon.
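To make the pre-processing step concrete, here is a deliberately toy sketch of the kind of text normalization a TTS front end performs on currency: expanding a whole-dollar amount into words. Real engines cover far more cases (decimals, dates, locales), and you would normally control this through markup or a lexicon rather than application code; the function names here are illustrative.

```python
import re

# Toy text normalization: expand "$127" into spoken words, the way a TTS
# front end's pre-processing stage might. Handles whole-dollar amounts
# from $0 to $999 only; real engines cover far more cases and locales.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty",
        "seventy", "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")
    return ONES[n // 100] + " hundred" + (
        " " + number_to_words(n % 100) if n % 100 else "")

def normalize_currency(text: str) -> str:
    """Replace "$N" amounts with their spoken form."""
    return re.sub(r"\$(\d{1,3})\b",
                  lambda m: number_to_words(int(m.group(1))) + " dollars",
                  text)

print(normalize_currency("Your balance is $127."))
# -> Your balance is one hundred twenty-seven dollars.
```

Even this tiny sketch shows why normalization is hard to get right generically: a good engine must decide from context whether “$127” is a balance, a price, or part of a product name, which is why a customizable lexicon matters.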
3. Choose a TTS that best meets your needs
An important distinction to make when choosing TTS is the type of technology used to develop the voice.
Currently, there are two options: neural TTS and concatenated TTS. Which one you use will depend on your budget, voice requirements, and your users’ environment. While neural TTS is more humanlike, natural, and pleasant than concatenated TTS, it can only run in a cloud environment and it’s a lot more expensive. Most TTS providers charge up to four times more for neural TTS than for concatenated TTS. If voice quality is one of your top goals and you have the budget, the cloud connectivity, and the required CPU capacity, then neural TTS is the better option.
While neural TTS is more humanlike, natural, and pleasant than concatenated TTS, it can only run in a cloud environment and it’s a lot more expensive.
If, however, your voice assistant will be running in a noisy environment, neural TTS can be less intelligible, which can be problematic for some users and use cases. Although costs for neural TTS are likely to go down over time, there are advantages to using a less-expensive concatenated TTS that can be more easily embedded in products without cloud connectivity and is more intelligible in noisy environments.
4. Balance TTS accuracy with pleasantness
Evaluating TTS options includes measuring accuracy quantitatively as well as qualitatively gauging the emotional response a voice evokes. You can test the accuracy of TTS by seeing how it handles addresses, names, numbers, foreign words, or homographs. Evaluating the effectiveness of the voice can be highly subjective because people react emotionally to voices—whether they realize it or not. We can be more forgiving of a pleasant-sounding voice that makes lots of mistakes than an unpleasant voice with impeccable pronunciation skills.
Mean opinion score (MOS) tests, rated on a scale of 1–5, are typically used to evaluate voices as objectively as possible. These tests can be conducted internally or outsourced. In either case, it’s important to choose the criteria carefully to include things like naturalness, pleasantness, and intelligibility. As the technology gets more natural and intelligible, the emotional connection users have to your voice assistant will become much more important. Ensure that the test subjects are representative of the users who are going to interact with your voice assistant to create the best match possible.
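Tallying such a test is straightforward: each listener rates each sample from 1 to 5, and you compare per-criterion means across candidate voices. The sketch below computes a mean plus a simple normal-approximation 95% confidence interval; the criterion names and ratings are illustrative sample data, not real results.

```python
import math
from statistics import mean, stdev

# Minimal MOS tally: ratings are 1-5 listener scores, grouped by
# evaluation criterion (names here are illustrative). Returns the mean
# and the half-width of a normal-approximation 95% confidence interval.
def mos(ratings):
    m = mean(ratings)
    half_width = 1.96 * stdev(ratings) / math.sqrt(len(ratings))
    return m, half_width

# Hypothetical ratings for one candidate voice from eight listeners:
results = {
    "naturalness":     [4, 5, 4, 3, 5, 4, 4, 5],
    "pleasantness":    [5, 4, 4, 4, 5, 5, 4, 4],
    "intelligibility": [3, 4, 4, 3, 4, 3, 4, 4],
}

for criterion, ratings in results.items():
    m, hw = mos(ratings)
    print(f"{criterion}: MOS {m:.2f} ± {hw:.2f}")
```

Reporting the confidence interval alongside the mean matters with small listener panels: two voices whose intervals overlap heavily may not be meaningfully different, which is a signal to recruit more representative listeners before deciding.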
When you’re working with internal stakeholders to reach agreement about the sound of your branded voice, keep in mind that the goal is to find a voice that will be a positive extension of your brand and your ambassador in the world at large.
As the technology gets more natural and intelligible, the emotional connection users have to your voice assistant will become much more important.
If you find a voice in a TTS catalogue that meets your needs, go ahead and use it. Just be aware that you may end up sounding like everyone else and lose the opportunity to communicate your unique brand identity and differentiate yourself from the competition. On the other hand, a custom vocal identity can personalize your brand in a market soon to be filled with other branded voices.
SoundHound Inc. has all the tools and expertise needed to create unique VUIs and a vocal brand identity. Explore Houndify’s independent voice AI platform at Houndify.com and register for a free account. Want to learn more? Talk to us about how we can help you bring your voice strategy to life.
Andrew Richards is director of business development at SoundHound Inc., based in France. He’s been working in the voice tech space for almost 20 years and has spent more than 15 years working with text-to-speech technology.