Audio in Video Games: Text to Speech and the AI NPC


Audio in video games has come a long way from the bleeps, bloops, and four-channel themes of the Nintendo Entertainment System’s “Super Mario Bros.” Like other consoles of the 1980s, the original NES used 8-bit sound chips to generate music and in-game sound effects. That audio quality remains charming; it even spawned an electronic music genre, chiptune, that’s thriving to this day. But compared to the immersive soundscapes of film and television, early audio design for games was…a bit limited.

In the 1990s, console makers made the leap to optical media. The extra storage available on CD-ROMs left room for CD-quality audio recordings, and background music began to rival symphonic film scores. Between the ’90s and today, the development of 3D audio design led to increasingly immersive soundscapes, alerting players to off-screen events and helping to define in-game space.

So what’s next for audio in video games?

The most exciting development has ramifications that go far beyond the player’s ears. It promises deeper, more realistic, and more dynamic virtual experiences, particularly for open-world games and RPGs, and it’s made possible by advances in artificial intelligence (AI) and text-to-speech (TTS) integrations.

Neural TTS for AI Audio in Video Games

Developments in AI are leading to in-game non-player characters (NPCs) free from the constraints of pre-scripted conversation trees. The audio component? Getting those characters to speak their dynamic responses out loud and in real time. That’s where TTS comes into play, but not all TTS engines are ready for the technical requirements of video game developers.

This isn’t speculative; AI NPCs are already in development. Based on the player’s questions or statements, they use natural language generation (NLG) software to come up with a fresh, relevant response. Essentially, these characters are AI chatbots—and, unlike characters who stay on-script, only text to speech can give these bots a voice.

Voice actors remain the gold standard of video game character speech. But when the characters themselves (or at least the AI models behind them) come up with new lines on the spot, pre-recorded speech isn’t an option. To create the next generation of immersive AI NPCs for open-world games, developers need to leverage TTS. They also need a TTS plug-in embedded in the game engine, so speech is generated at runtime without the delay of a network round trip.

Generating Dynamic TTS Audio Without Latency

The latency issue is a major barrier to deploying AI NPCs in today’s games. In the video below, you can see there’s a multi-second delay between the player’s question and the AI NPC’s response. That’s because both the NLG and the TTS services are cloud-based, integrated into the game engine via API. The game has to send requests out to the NLG and TTS services and wait for their replies before it can play the audio, so the player is left waiting for the NPC’s response.

Developers can clear the audio hurdle by using a TTS game engine plug-in from ReadSpeaker AI. This TTS software integrates directly into game engines and generates audio on the user’s device, so speech doesn’t wait on a round trip to a cloud TTS service. It’s dynamic TTS at runtime: near-instant video game audio for dynamically generated dialogue.
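To make the latency argument concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes the stages run one after another, so their delays simply add up. Every name and number in it is a hypothetical placeholder chosen for illustration; none of it is a measurement or part of any ReadSpeaker AI API. It also assumes the NLG service stays in the cloud and only the TTS stage moves onto the player's device.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Stage:
    """One step between the player's question and the NPC's spoken reply."""
    name: str
    network_ms: int  # round-trip network time (0 for on-device stages)
    compute_ms: int  # time spent generating the text or the audio


def time_to_first_audio(stages: list[Stage]) -> int:
    """Stages run sequentially, so the total delay is the sum of all parts."""
    return sum(s.network_ms + s.compute_ms for s in stages)


# All figures below are illustrative assumptions, not measurements.
cloud_pipeline = [
    Stage("cloud NLG", network_ms=150, compute_ms=800),
    Stage("cloud TTS", network_ms=150, compute_ms=600),
]
embedded_tts_pipeline = [
    Stage("cloud NLG", network_ms=150, compute_ms=800),
    Stage("on-device TTS", network_ms=0, compute_ms=200),
]

for label, pipeline in [
    ("cloud NLG + cloud TTS", cloud_pipeline),
    ("cloud NLG + on-device TTS", embedded_tts_pipeline),
]:
    print(f"{label}: {time_to_first_audio(pipeline)} ms")
```

With these placeholder figures, moving speech synthesis onto the device removes one network round trip and shortens the wait, but the text-generation step still contributes its own delay. Real numbers depend on the network, the models, and the hardware.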

Of course, TTS is already familiar to game developers. They use it to prototype game dialogue, faster and at a lower cost than re-recording voice actor dialogue again and again. Text to speech is also a key accessibility tool, used to bring such games as The Last of Us Part II to broader, more diverse audiences. But with the emerging generation of runtime TTS game engine plug-ins, we’re looking at something new. Voicebot NPCs are poised to take audio in video games, and the experiences it creates, to a higher level than ever before.

