Voice AI for Telephony: A Developer’s Guide

Alexey Aylarov

While Voice AI agents offer powerful applications for businesses, integrating them into current telephony systems presents numerous hurdles. These guidelines can help.


Despite the rise of support applications and chatbots, traditional telephony remains a vital foundation for customer interactions. Now, voice AI is being introduced into call centers to further enhance these exchanges. 

However, this advancement also presents developers with a fresh set of challenges. Chief among them is the complex task of connecting advanced AI layers with established telecom networks. Because large language models evolve and update continuously, voice AI infrastructure must be designed from the outset for easy adaptability. While much uncertainty surrounds this transition, one truth stands clear: the inherent challenges of integrating AI with telephony should not be underestimated.

Voice AI agents provide a multitude of valuable functions for businesses. They are instrumental in scheduling, rescheduling, and canceling customer appointments. Furthermore, they can effectively triage incoming calls, directing them appropriately to human agents. Voice AI is even capable of managing estimated times of arrival (ETAs), coordinating deliveries, and arranging candidate interviews.

Businesses should anticipate from the start that they will need to modify components of their voice AI pipeline and select systems that offer maximum flexibility. Even with this foresight, developers continue to encounter additional problems.

Why Telephony Remains a Hurdle for Developers

Many people mistakenly believe that a voice AI agent is simply ChatGPT with a voice interface, embedded directly into the phone system to handle and route calls. This perception is far from accurate. Voice AI agents demand a comprehensive infrastructure, comprising multiple essential components that empower the underlying large language model (LLM) to function effectively in real-world scenarios.

Large Language Models (LLMs): These are the core intelligence behind any AI calling system, tasked with understanding user intent, mapping out conversational steps, and formulating replies to ensure smooth communication between the caller and the AI agent. 

Speech-to-Text (STT): This vital technology converts spoken audio from the caller into written text, a necessary step for any analytical processing to occur.

Text-to-Speech (TTS): The inverse of STT, this component synthesizes the AI agent’s textual responses into natural-sounding speech for the caller. 

Turn-taking: To maintain a natural, human-like conversation with an AI, effective turn-taking is crucial. This involves features like voice activity detection and barge-in policies, allowing for fluid exchanges. 

Telephony Gateway: This bridging device is responsible for converting between PSTN, SIP, and WebRTC protocols, while also managing signal processing and media streams.
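The components above can be pictured as a simple per-turn loop. The sketch below is illustrative only; the `stt`, `llm`, and `tts` objects and their method names are hypothetical placeholders, not any particular vendor's API.

```python
def handle_turn(audio_frames, stt, llm, tts, history):
    """Process one caller turn: audio in, synthesized reply audio out.
    Assumes hypothetical stt/llm/tts components with the shown interfaces."""
    # 1. Speech-to-text: transcribe the caller's audio.
    transcript = stt.transcribe(audio_frames)
    history.append({"role": "user", "content": transcript})

    # 2. LLM: map intent to a textual reply, given the dialog so far.
    reply = llm.complete(history)
    history.append({"role": "assistant", "content": reply})

    # 3. Text-to-speech: synthesize audio for playback via the gateway.
    return tts.synthesize(reply)
```

In a real system, turn-taking logic (voice activity detection, barge-in) decides when `handle_turn` fires, and the telephony gateway carries the audio in and out.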

These elements interlock within a sophisticated telephony infrastructure, which nonetheless has its inherent limitations. Businesses must navigate those limitations, along with local telecom carriers and their own compliance mandates, requirements, and constraints. Consequently, communication networks invariably consist of a diverse mix of vendors and technologies, and enterprises must stay adaptable as they integrate new components with their existing setups.

This flexibility is particularly important for voice AI applications, which come with some of the most rigorous technical demands. Application developers should aim to orchestrate voice AI-specific elements while ensuring seamless interoperability with legacy systems. 

The Technical Realities

Developers confront a series of complex technical challenges when integrating voice AI into telecommunication networks. Advancing with the development of a voice AI agent—one that reliably performs in a production environment—requires thoroughly understanding these issues and engineering robust solutions.

Tackling Latency

Latency is a persistent concern that can undermine any effective voice AI system. Noticeable gaps or silences before an agent responds are a significant deterrent for callers, often leading them to believe the agent is absent or the technology is malfunctioning. 

The International Telecommunication Union (ITU) recommends a mouth-to-ear latency of under 400 milliseconds for natural conversational flow. "Mouth-to-ear" refers to the time it takes for spoken words to reach the listener. Humans typically take a few hundred milliseconds to begin a response, so an AI system that aims to mimic human interaction must generate its reply within a similarly tight window. The reply then makes its own network journey back to the caller. In total, the entire round trip needs to complete within approximately one second; otherwise, the conversation feels disjointed. Most voice AI systems are nearing this benchmark, and ongoing advances in technology and methodology promise further improvements.
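To make the roughly one-second turn budget concrete, here is a back-of-the-envelope breakdown. The per-stage numbers are illustrative assumptions for the sketch, not measurements of any real system.

```python
# Illustrative response-latency budget for one voice AI turn.
# Stage timings are assumptions, not benchmarks of any vendor.
budget_ms = {
    "network_uplink": 50,     # caller audio to the platform
    "stt_finalize": 200,      # endpointing + final transcript
    "llm_first_token": 300,   # time to first LLM token
    "tts_first_audio": 150,   # time to first synthesized chunk
    "network_downlink": 50,   # synthesized audio back to the caller
}

total = sum(budget_ms.values())
print(f"total: {total} ms")  # 750 ms, inside the ~1 s turn target
for stage, ms in budget_ms.items():
    print(f"{stage:>18}: {ms:4d} ms ({ms / total:.0%})")
```

A budget like this makes it obvious where optimization pays off: the LLM's time to first token usually dominates, which is why streaming architectures matter so much.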

Latency can be the deciding factor for the success or failure of real-time AI systems. We’ve witnessed this problem in healthcare scenarios where latency combined with insufficient language support has caused issues. For instance, an Australian startup sought to use an AI caller to check on elderly Cantonese-speaking patients—a seemingly ideal application of the technology. However, high latencies due to reliance on US-based voice AI infrastructure, coupled with a lack of Cantonese TTS, resulted in an unnatural and frustrating experience.

Solutions for latency issues often involve engineering refinements. Developers should strive to minimize latency at every stage of the development process. This necessitates real-time, end-to-end data flows—meaning data streams both in and out concurrently, rather than waiting for the LLM to complete its full text output before passing it to the TTS for synthesis. 
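One common form of this streaming refinement is to forward LLM output to TTS sentence by sentence rather than waiting for the full completion. A minimal sketch, assuming the LLM yields text tokens as a generator and `synthesize` is a hypothetical TTS callable:

```python
def stream_to_tts(token_stream, synthesize, terminators=".!?"):
    """Flush buffered tokens to TTS at sentence boundaries so the
    first audio chunk can play before the LLM finishes generating."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush as soon as a sentence ends; TTS starts on this chunk
        # while the LLM keeps producing the rest of the reply.
        if buffer and buffer[-1] in terminators:
            synthesize(buffer.strip())
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        synthesize(buffer.strip())
```

The caller hears the first sentence while later sentences are still being generated, cutting perceived latency without changing the model itself.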

Closely monitoring extended delays during calls is also crucial; it lets the system inject a response when necessary, minimizing awkward pauses or silences. Another strategy is maintaining a continuous stream of communication with the user: instead of letting the line fall silent and prompting concern, notify callers proactively if a delay is anticipated. Subtle background noise can similarly reassure callers that their query is still being processed during any temporary pauses.
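The filler strategy described above can be sketched as a simple watchdog: if the reply takes longer than a threshold, play a hedging phrase so the caller never hears dead air. The function names and threshold here are illustrative assumptions.

```python
import threading

def with_filler(produce_reply, play, filler="One moment, please.",
                threshold_s=2.0):
    """Run produce_reply(); if it exceeds threshold_s, play a filler
    line first so the line never falls silent."""
    done = threading.Event()

    def filler_if_slow():
        # Fires only if the reply hasn't arrived within the threshold.
        if not done.wait(timeout=threshold_s):
            play(filler)

    watchdog = threading.Thread(target=filler_if_slow)
    watchdog.start()
    reply = produce_reply()  # e.g. the STT -> LLM -> TTS turn
    done.set()               # cancel the filler if we were fast enough
    watchdog.join()
    play(reply)
    return reply
```

A production system would tune the threshold per stage and rotate filler phrases, but the shape is the same: measure the gap, and fill it before the caller notices.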

Addressing Impersonal AI

Another challenge for voice AI systems is the risk of sounding monotonous and impersonal, leaving callers with the impression they’ve interacted with a generic AI. To combat this, specialized third-party Text-to-Speech (TTS) systems are available. By offering a broader array of voice options, these services help to inject more variety and maintain a human-like touch. 

The diversity within the field means that solutions for voice AI-telephony integration take many forms. Streaming TTS can reduce latency, and some vendors also provide a wide selection of voices, enabling businesses to choose one that aligns with their brand and requirements. Companies that already possess a recognizable brand voice can clone it and integrate it into their voice AI system; having a distinct voice speak directly to customers over the phone can be a potent asset. Others should have the flexibility to audition a variety of voices and find one that best fits their brand identity.

Integrating with Telephony Systems

A further challenge involves integrating your AI agent with existing telephony systems, particularly within contact centers and enterprise infrastructure. These environments are often themselves composed of a mix of systems from various vendors. While the SIP standard governs much of traditional telephony, it doesn’t guarantee complete interoperability. Indeed, older systems often have fixed or limited configurations, meaning new systems must be highly adaptable. 

In this context, choosing an experienced vendor that knows how to operate across diverse environments and systems is a wise decision. It is also worth confirming that the vendor provides robust debugging tools and the support needed to address any unforeseen issues that may arise.

Network quality can fluctuate significantly across different countries, especially in rapidly developing regions such as Latin America. For instance, we’ve observed unreliable SIP interconnections from Mexico, forcing customers to route calls through the US, which introduces unnecessary latency. Conversely, substantial investments in Brazil’s infrastructure in recent years have improved service not only within the country but also across the wider region. Ideally, your CPaaS (communications platform as a service) provider will have established carrier relationships in numerous countries, enabling them to optimize traffic in all circumstances.   

Five Key Strategies for Building Effective Real-Time Voice AI

To condense the insights above, I’ve compiled five essential tips for creating a real-time voice AI that truly performs. 

1. Define User Needs and Constraints: Begin by thoroughly understanding the user's requirements and limitations. It's equally vital to consider latency tolerance, supported languages, and geographical coverage, along with other critical factors such as key performance indicators (KPIs) and compliance scope. 

2. Choose Your Comms Integration and Media Path Wisely: Carefully consider your strategy regarding voice versus messaging. If you opt for voice, meticulously plan your architecture, particularly concerning CPaaS, trunks, transfers, and DTMF (dual-tone multi-frequency) signaling.

3. Ensure a Robust, Compatible Real-Time AI Pipeline: No voice AI system is complete without a robust, compatible real-time AI pipeline. Start by selecting an LLM; the underlying model will dictate your voice system's behavior, influencing latency, compliance, tone, and much else. Clarity on voices and pipelines from the outset empowers businesses to build an effective voice AI. 

4. Facilitate Deep Integration with Existing Systems: This is another crucial piece of the puzzle, allowing the technology to access and leverage important information and context about the caller, such as names and account details. Unnatural memory lapses from the bot are a significant drawback. A well-integrated system can help prevent common pitfalls (like latency, missing barge-in capabilities, or hallucinations) and make your voice AI feel genuinely responsive.

5. Prioritize Productionization: This is mission-critical for all telephony applications. It’s essential for call centers, real-time gaming, trading systems, and, crucially, for your voice agent, which you’ve painstakingly built with the aim of operating flawlessly on every single phone call. Properly constructed infrastructure enables the bot to manage word error rates, latency, and autoscaling effectively.
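Of the metrics listed in point 5, word error rate is straightforward to track in-house with a standard edit-distance calculation over words. A minimal sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words,
    computed with the classic Levenshtein dynamic program over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Logging WER against a sample of human-transcribed calls, alongside latency percentiles and autoscaling metrics, gives the production feedback loop the article argues for.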

Voice AI agents are constantly evolving, representing an iterative technology accompanied by a unique set of challenges. I’ll conclude with some advice for future-proofing your voice AI and telecom stack amidst this ongoing evolution.

The Future Landscape for Real-Time Voice AI

One critical piece of advice is to stay ahead of the curve regarding LLM and speech vendors. Assume these components are not static; instead, anticipate the need to swap them out to keep pace with technological advancements. Don’t fall behind; ensure your platform allows for flexible mixing and matching. 
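One practical way to keep those components swappable is to code against thin interfaces rather than any vendor SDK directly, so replacing a speech or model provider is a construction-time change. The class and protocol names below are a hypothetical sketch, not an established library.

```python
from typing import Callable, Protocol

class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class VoicePipeline:
    """Depends only on the protocols above, so swapping a vendor
    is a one-line change where the pipeline is constructed."""
    def __init__(self, stt: SpeechToText, tts: TextToSpeech):
        self.stt = stt
        self.tts = tts

    def respond(self, audio: bytes, reply_fn: Callable[[str], str]) -> bytes:
        text = self.stt.transcribe(audio)
        return self.tts.synthesize(reply_fn(text))
```

Each real vendor then gets a small adapter implementing the protocol, and the rest of the application never touches vendor-specific code.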

More broadly, avoid being caught off guard by technological evolutions. By anticipating improvements in speech and AI quality and performance, rather than merely reacting to them, you’ll be able to quickly implement enhancements as they emerge. Even if a particular approach is yielding benefits today, don’t cling to it indefinitely, or a superior strategy emerging tomorrow might bypass you entirely.

It’s also worth noting that the global reach of voice AI presents both challenges and opportunities. Many voice AI orchestration platforms, particularly those based in the San Francisco Bay Area, cater primarily to US users. While this works locally, companies with a more international customer base have an advantage: they have already confronted challenges that many US-focused companies have yet to encounter. 

For example, international operations often face significant latency issues, as voice AI data centers might be geographically distant (or solely US-based) and telecom carriers less reliable. This gives international providers an edge, as their global footprint often translates into robust carrier relationships and extensive partnerships within the voice AI ecosystem.

Ultimately, it will only be a matter of years before the next generation of voice applications far surpasses what we observe today. In fact, the integration could become so seamless that distinguishing between AI agents and human agents in state-of-the-art systems will be nearly impossible. This advancement should accelerate call centers in replacing their legacy IVR (interactive voice response) systems with voice AI. Similarly, it should motivate developers and stakeholders to build AI-driven call workflows that are robust enough for real-world deployment.

New Tech Forum provides a platform for technology leaders—including vendors and other external contributors—to delve into and discuss emerging enterprise technology with unparalleled depth and scope. The selection process is subjective, based on our identification of technologies we deem important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing materials for publication and retains the right to edit all contributed content. Please direct all inquiries to doug_dineley@foundryco.com.
