Voice-to-Voice Agentic AI: Poised to Uncomplicate Self-Service

Rube Goldberg, the venerated engineer, sculptor, and cartoonist, died in 1970. He published his last cartoon in 1964 but continued to create paintings and sculptures that captured his lifelong preoccupation: complicated machines that perform simple tasks.

I'm sure it is a coincidence that, at about the same time, engineers at AT&T (along with some startups specializing in the obscure discipline of voice synthesis) began developing the first interactive voice response (IVR) systems. Touchtone phones had debuted in 1964 and were just starting to find mass adoption. Minicomputers, in the form of DEC's PDP-8, appeared in 1965. By the mid-1970s they had become the workhorses for call processing, voice processing, and business applications that supported voicemail, automated call distributors (ACDs), and all the precursors to modern contact centers and collaboration infrastructures.

In Rube Goldberg fashion, last century's engineers addressed the first-order challenges of understanding caller intents and responding with relevant information or actions by employing crude technologies that appear humorously unsuitable. The results were IVR systems whose prompts those of us of a certain age can recite by heart:

  • "If you know your party's extension, enter it now, otherwise..."
  • "Press 1 for Sales."
  • "Press 2 for Service."
  • "Press 3 for Billing." or
  • "Press 0 for the operator or all other reasons."

Touchtone phones were replacing rotary dialing as digital switching systems replaced clunky electro-mechanical beasts with names like Strowger or step-by-step switches, which used motors and pulleys to replace switchboard operators at scale. However, the era also marked the beginning of an AI winter, as academic papers by the likes of Marvin Minsky and Seymour Papert documented the limitations of existing single-layer neural networks.

It took another 20 years for the visionaries who defined voice-based self-service to begin speech-enabling IVRs. Their efforts yielded modest results. By 2000, callers could "say or press 1." By 2010, they could speak in full sentences but most often found themselves repeating a single word: "Agent."

To overcome those limitations, large companies employed staffs of voice user interface (VUI) designers and developers who could build dialogue models that speech-enabled specific tasks. By 2015, companies in industries with high volumes of repetitive calls (travel and hospitality, financial services, utilities, and government agencies, for instance) employed speech-enabled IVRs to contain as many calls as possible.

Automated assistants, chatbots, and voicebots all shared similar objectives: to recognize customer intent and match it with the right answers or resources. Both endeavors (intent recognition and matching) were labor-intensive, brittle, and in constant need of monitoring and refinement. The resemblance to a Rube Goldberg cartoon was unintentional but very real.

GenAI and Agentic AI Offer a Solution

It hasn't taken long for providers of cutting-edge frontier models (cue OpenAI and Google) to offer real-time access to generative AI resources that understand spoken input, detect the intent of utterances, and deliver very human-sounding responses. OpenAI's demonstration of GPT-4o earlier this year provided a vivid example. A voicebot could be interrupted. It could laugh. It even sounded sarcastic occasionally. Flavors of Google's Gemini, like Notebook LM (which transforms knowledge articles into podcasts in which two very different personas discuss a topic for 10 minutes), might feel like parlor tricks today, but they are steadily changing the expectations (and the reality) of what genAI-powered voice assistants can do on behalf of customers.

The genAI aspect means that responses are generated by LLM-based resources. The agentic aspect refers to the fact that the bot can act autonomously: it figures out what it needs to do to complete a task and carries out the steps necessary to accomplish the caller's objectives, as the sketch below illustrates.
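For the technically curious, here is a minimal sketch of that agentic loop. Every name in it (`transcribe`, `plan_next_step`, `lookup_booking`, `speak`, the `TOOLS` registry) is a hypothetical placeholder standing in for real ASR, LLM, business-system, and TTS services, not any vendor's actual API:

```python
# Minimal sketch of an agentic voice-assistant loop.
# All functions below are illustrative stand-ins, not a real vendor API.

def transcribe(audio: bytes) -> str:
    """Speech-to-text stand-in; a real system would call an ASR service."""
    return "I need to change my flight to Tuesday"

def plan_next_step(goal: str, history: list) -> dict:
    """LLM planner stand-in: decide which tool (if any) to invoke next."""
    if not history:
        return {"tool": "lookup_booking", "args": {"query": goal}}
    return {"tool": None, "reply": "Done. Your flight is moved to Tuesday."}

def lookup_booking(query: str) -> str:
    """Business-system stand-in, e.g., a reservations API call."""
    return "booking #1234 found"

TOOLS = {"lookup_booking": lookup_booking}

def speak(text: str) -> None:
    """Text-to-speech stand-in; prints instead of synthesizing audio."""
    print(f"BOT SAYS: {text}")

def handle_call(audio: bytes) -> None:
    goal = transcribe(audio)
    history = []
    while True:
        step = plan_next_step(goal, history)
        if step["tool"] is None:  # planner decided the task is complete
            speak(step["reply"])
            return
        result = TOOLS[step["tool"]](**step["args"])
        history.append((step["tool"], result))

handle_call(b"<caller audio>")
```

The point of the loop is the inversion it represents: the LLM-based planner, not a hand-built dialogue tree, decides what happens next, which is exactly what distinguishes agentic bots from the VUI-era designs described above.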

Precautions must be taken in high-volume, customer-facing instances to prevent so-called hallucinations and ensure privacy. But the personnel who had been responsible for monitoring and tweaking speech-enabled IVRs or voicebots built with legacy development tools are very well suited to the training, monitoring, and ongoing tweaking of the new generation of voice assistants and agents. No spring-loaded mechanical arms or bowls of wet spaghetti required (you can look those up in the compendium of Rube Goldberg illustrations).


Dan Miller is founder of Opus Research.