AI Streamlines Path to Human Customer Service Agents

January 15, 2026

Brian Roberts

New research reveals that open-source, ChatGPT-style AI systems can potentially use natural language to connect callers with the right person in a call center, bypassing the need to navigate frustrating and frequently changing menu systems that often feel deliberately obstructive.

Reaching a live agent can be a frustrating ordeal, as callers must slowly work through multiple-choice options, often unsure which selection fits their specific issue. When none do, savvy users often develop tricks and workarounds to reach a human representative and escape 'option hell'. For many, this experience feels adversarial and user-unfriendly.

It's no surprise that call centers are a primary target for AI augmentation or replacement. Despite some experts urging caution, automating call centers remains low-hanging fruit for tech headlines and a promising area for AI-driven innovation that can deliver an unusually quick return on investment.

Closed Shop

However, open-source principles and publicly available data are seldom applied in this domain, and for good reason. Companies automating their customer response systems have little incentive to share the data, methodologies, or corporate intellectual property that underpins their competitive advantage.

Sharing such resources would erode their market edge. More critically, as AI systems can be prone to leaking sensitive information, it also carries significant legal risks.

This has led to several well-funded companies developing AI-powered call center systems independently, inevitably duplicating some efforts. It has also fueled a surge in B2B startups and established players aiming to meet the growing demand for AI-driven customer service capabilities.

A PolyAI voice assistant initiates a customer service call for the fictional 'Augusta Lawn Care', leveraging extensive training conversations to automate responses within existing call center infrastructure. Source

Furthermore, the drive to eliminate the frustration of complex call-center menus has spurred research efforts. However, most findings are not published on arXiv or other open platforms, reflecting the typically proprietary nature of Interactive Voice Response (IVR) development.

Consequently, research, data, and business intelligence related to AI automation in customer service are closely guarded. Very few open-source alternatives exist, and even where they do, it is doubtful that they could be used with legally secure data.

Local Call

Against this backdrop, a new paper from Colombia is a welcome attempt to bring IVR development slightly out of its corporate vault. The concise study, titled Beyond IVR Touch-Tones: Customer Intent Routing using LLMs, comes from a researcher at Universidad Distrital Francisco José de Caldas in Bogotá. It claims to be the first non-proprietary project using Large Language Models (LLMs) to create a functional schema for a Customer Intent Routing (CIR) system.

Rather than using real call data or proprietary menu structures, the project generates all components from scratch using three AI models: one to design a realistic call center menu, another to simulate hundreds of caller complaints, and a third to function as the chatbot, routing these queries to the correct destination.

The outcome is a fully synthetic yet convincing test environment featuring a fictional telecommunications company and 920 distinct user queries. This setup allows the experiment to explore how well current AI interprets vague, unstructured speech and directs callers appropriately, all while avoiding legal risks.

Tests show the system can accurately map free-form caller complaints to the correct destination with up to 89.13% accuracy, particularly when provided with 'flattened' menu options instead of detailed descriptions.

The study also found that the AI made more errors when callers used casual or varied language. However, some mistakes occurred not because the AI misunderstood, but because the phone menu itself was confusing.

Examples of customer interactions shared as part of the new project. Source

The project's data has been made publicly available.

Method

The tripartite approach began with a model creating a detailed phone menu for a fictional telecom company. A second model generated unique caller messages—some straightforward, others rephrased or more casual—to simulate realistic speech patterns. A total of 920 examples were generated.

The third model was tasked with connecting each caller to the right department based solely on the message and a version of the menu. This framework made the experiment fully reproducible without needing real call data or exposing customer information.

The three systems chosen for the tripartite approach. Source: https://arxiv.org/pdf/2510.21715

The models used were gpt-3.5-turbo, gpt-4o-mini, and gpt-4.1-mini, respectively.

To simulate an authentic customer service setting, a complex phone menu needed to be synthesized from scratch. Due to a lack of relevant datasets, the gpt-3.5-turbo model was prompted to generate a complete, multi-branch structure for a fictional telecom provider.

Each branch represented service areas like billing, technical support, account management, and new services, complete with realistic sub-options and varying depths. Two menu versions were created for testing: a plain text hierarchy mimicking a human-readable format, and a list of endpoints with corresponding button sequences.

This enabled testing on both a detailed and a simplified version of the routing problem:

Two versions of the phone menu were provided to the AI: a detailed text hierarchy, and a simplified list of direct menu options, to compare how well each format supported routing callers to the right place.
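As a minimal sketch of how the simplified format could be derived from the detailed one, assume the hierarchical menu is stored as a nested dictionary keyed by button digit. The menu contents, the dictionary layout, and the `flatten_menu` name are illustrative assumptions and are not taken from the paper.

```python
# Hypothetical menu for illustration only; the paper's actual menu differs.
MENU = {
    "1": {"label": "Billing", "children": {
        "1": {"label": "Check balance"},
        "2": {"label": "Dispute a charge"},
    }},
    "2": {"label": "Technical support", "children": {
        "1": {"label": "Internet", "children": {
            "1": {"label": "Slow connection"},
            "2": {"label": "No connection"},
        }},
    }},
}


def flatten_menu(menu, prefix=()):
    """Return (button_path, endpoint_label) pairs for every leaf of the tree."""
    endpoints = []
    for digit, node in menu.items():
        path = prefix + (digit,)
        children = node.get("children")
        if children:
            # Internal node: keep descending, carrying the digits pressed so far.
            endpoints.extend(flatten_menu(children, path))
        else:
            # Leaf node: a final destination a caller can be routed to.
            endpoints.append(("-".join(path), node["label"]))
    return endpoints


FLAT = flatten_menu(MENU)
for path, label in FLAT:
    print(f"{path}: {label}")
```

Each printed line pairs a button sequence with its destination, which is the kind of compact list the routing model would see in place of the full tree.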

To generate the test caller messages, a second language model produced a set of original complaints or requests, with ten unique examples per menu endpoint.

Each was then rephrased into several variations to reflect the diverse ways real people express their issues, incorporating changes in length, tone, and even minor errors or filler words.

The resulting 920 messages were crafted to test the system's accuracy and simulate the unpredictability of natural conversation.

In the third stage, the final model's ability to map each message to the correct menu destination was tested using the two different IVR presentation formats.

For the first version, the AI received a full descriptive outline of the phone tree. For the second, it saw only a list of final destinations with their button sequences.

The aim was to determine if a simplified menu would help the model route calls more effectively. In both cases, the system processed one message at a time and was instructed to return only the path, with no extra text, to enable automated scoring.
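The path-only instruction is what makes automated scoring possible: a reply is either a well-formed button sequence or it is not. The sketch below illustrates that idea under stated assumptions. The prompt wording, the function names, and the rule that a malformed reply counts as an error are all hypothetical, since the article does not reproduce the paper's prompt or scoring code.

```python
import re

# Matches a button sequence such as "1", "1-2" or "2-1-3" and nothing else.
PATH_PATTERN = re.compile(r"\d+(?:-\d+)*")


def build_prompt(menu_text, caller_message):
    """Assemble a routing prompt; the wording is illustrative, not the paper's."""
    return (
        "You route callers for a phone menu.\n"
        f"Menu:\n{menu_text}\n\n"
        f"Caller message: {caller_message}\n"
        "Reply with the button path only (for example 1-2-3), with no extra text."
    )


def parse_path(reply):
    """Return the path if the reply is exactly one path, otherwise None."""
    candidate = reply.strip()
    return candidate if PATH_PATTERN.fullmatch(candidate) else None


def is_correct(reply, expected_path):
    """Exact-match scoring: a malformed reply is treated as a routing error."""
    return parse_path(reply) == expected_path
```

With this design, a reply such as `Press 1-2-3` would be scored as wrong even though it contains the right digits, which keeps the scoring strict and fully mechanical.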

Isolation

To prevent test result contamination, each model was kept isolated. The first model drafted the phone menu, but it was finalized manually to remain unfamiliar to the other systems.

The caller messages were generated separately by gpt-4o-mini, using only endpoint names without access to the menu structure. Finally, gpt-4.1-mini, which performed the routing, only had access to the menu text and incoming messages, having no role in creating them.

Metrics

Two standard tools were used to evaluate the routing system's performance: accuracy, defined as the percentage of cases where the model returned the exact correct path (e.g., 1-2-3), and confusion matrices, generated* to pinpoint where errors occurred. Evaluations were conducted in Python using the pandas and scikit-learn libraries.
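To make the two measures concrete, here is a small dependency-free sketch. It uses only the Python standard library rather than pandas and scikit-learn (where `accuracy_score` and `confusion_matrix` in `sklearn.metrics` cover the same ground), and the labels are made up for illustration; they are not results from the paper.

```python
from collections import Counter


def accuracy(expected, predicted):
    """Fraction of cases where the predicted path exactly equals the expected one."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(e == p for e, p in zip(expected, predicted))
    return hits / len(expected)


def confusion_counts(expected, predicted):
    """Count (expected, predicted) pairs; pairs that differ are routing errors."""
    return Counter(zip(expected, predicted))


# Made-up labels for illustration only.
expected = ["1-1", "1-1", "1-2", "2-1-1", "2-1-1"]
predicted = ["1-1", "1-2", "1-2", "2-1-1", "1-1"]

acc = accuracy(expected, predicted)
matrix = confusion_counts(expected, predicted)
# Keep only the off-diagonal cells, i.e. where the model chose the wrong path.
errors = {pair: n for pair, n in matrix.items() if pair[0] != pair[1]}

print(f"accuracy = {acc:.2%}")
print(errors)
```

The off-diagonal entries show which destinations are mistaken for which, which is how a confusion matrix can reveal that two menu options overlap rather than that the model simply failed.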

Results

Testing revealed that the model's accuracy significantly depended on menu presentation. With a flattened list of menu paths, the system achieved 89.13% accuracy on the simpler dataset, compared to 81.30% with the full descriptive menu.

Routing accuracy for the third (LLM3) model, across different prompt formats and dataset types, indicating that flattened menu paths consistently outperformed hierarchical descriptions, and that accuracy declined slightly when inputs were augmented with paraphrased or informal language.

This trend continued with the larger, more linguistically diverse dataset, where the flattened version again performed better, scoring 86.52% versus 77.07% for the descriptive format.

The paper notes these results suggest simpler, list-based prompts helped the model match queries more reliably than lengthy hierarchical descriptions.

Accuracy also dipped slightly when paraphrased and informal caller messages were introduced, indicating that while increased variety enhanced realism, it also made classification more difficult.
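The size of these effects can be read directly off the four reported figures. The short calculation below is only arithmetic on the numbers quoted above; the `simple` and `diverse` labels are shorthand introduced here for the two datasets and are not the paper's terminology.

```python
# Accuracy figures (percent) as quoted in this article.
ACCURACY = {
    ("simple", "flattened"): 89.13,
    ("simple", "descriptive"): 81.30,
    ("diverse", "flattened"): 86.52,
    ("diverse", "descriptive"): 77.07,
}


def gap(a, b):
    """Difference in percentage points between two reported accuracies."""
    return round(ACCURACY[a] - ACCURACY[b], 2)


# Advantage of the flattened format on each dataset.
format_gap_simple = gap(("simple", "flattened"), ("simple", "descriptive"))
format_gap_diverse = gap(("diverse", "flattened"), ("diverse", "descriptive"))

# Accuracy lost by each format when moving to the more diverse dataset.
drop_flattened = gap(("simple", "flattened"), ("diverse", "flattened"))
drop_descriptive = gap(("simple", "descriptive"), ("diverse", "descriptive"))

print(format_gap_simple, format_gap_diverse)  # 7.83 9.45
print(drop_flattened, drop_descriptive)       # 2.61 4.23
```

In percentage points, the flattened format leads by 7.83 on the simpler dataset and 9.45 on the more diverse one, and it loses less accuracy under paraphrasing (2.61 points against 4.23 for the descriptive format).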

The paper concludes:

‘Our results demonstrate that LLMs route customer intents more accurately when provided with flattened IVR paths (up to 89.13%) compared to verbose menu descriptions (as low as 77.07%). This suggests that concise, structured prompts reduce noise and are better suited for routing tasks.

‘This supports the idea that clarity and brevity improve LLM performance in classification scenarios.

‘Furthermore, converting menus into flattened paths is a straightforward, automatable process for real-world deployment.’

Conclusion

It is encouraging to see open research emerging in a field typically characterized by secrecy and exclusivity. A key question remains: will future systems require 'framing' architectures to contextualize LLMs, or will models simply need access to locally available business intelligence, eliminating the need for companies to share data with third parties?

Ultimately, the core design principles explored here seem likely to be naturally adopted by future AI systems, even beyond customer service, without requiring specific adaptation for that use case.

* Please consult the source paper for these details.

First published Wednesday, October 29, 2025
