New $1.5B Router Model Hits 93% Accuracy, Eliminating Expensive Retraining Costs
Researchers at Katanemo Labs have unveiled Arch-Router, an advanced routing model and framework engineered to intelligently direct user queries to the most appropriate large language model (LLM).
For companies developing products that leverage multiple LLMs, Arch-Rolver tackles a central dilemma: how to automatically route each request to the ideal model for the task, without depending on inflexible logic or expensive retraining whenever updates are needed.
The challenges of LLM routing
As the variety of available LLMs expands, developers are shifting from single-model configurations to multi-model architectures that utilize the distinct capabilities of different models for specialized functions—such as generating code, summarizing text, or editing images.
LLM routing has become an essential technique for constructing and running such systems, serving as an intelligent traffic director that guides each user query to the model best suited to handle it.
Current routing approaches generally fit into two main groups: task-based routing, which assigns queries according to predefined task categories, and performance-based routing, which seeks the best trade-off between expense and output quality.
However, task-based systems often falter when user intent is ambiguous or changes over the course of a conversation—especially in multi-turn dialogues. Performance-based routing, meanwhile, tends to prioritize static benchmark results, frequently overlooking actual user preferences and adapting slowly to new models without costly retraining.
As the researchers at Katanemo Labs state in their paper, a deeper issue is that “existing routing methods have practical limitations in real-world applications. Most are optimized for benchmark performance but ignore human preferences, which are guided by subjective evaluation criteria.”
The team emphasizes the importance of routing systems that “reflect subjective human judgments, provide greater transparency, and remain easily adjustable as both models and applications evolve.”
A new framework for preference-aligned routing
To overcome these issues, the researchers developed a “preference-aligned routing” framework that matches incoming queries to routing rules based on custom user preferences.
In this system, users define their routing policies using natural language through a two-tier “Domain-Action Taxonomy.” This structure reflects how people naturally describe tasks: starting with a broad category—the Domain, such as “legal” or “finance”—and drilling down to a specific task—the Action, like “summarization” or “coding.”
Each policy is then mapped to a preferred model, empowering developers to base routing choices on practical requirements rather than benchmark metrics alone. According to the paper, “This taxonomy acts as a mental model to help users create well-defined, structured routing policies.”
The routing procedure operates in two phases. First, a preference-aligned router model evaluates the user’s query alongside all available policies and picks the best-fitting one. Second, a mapping function connects the selected policy to its assigned LLM.
Because the logic for selecting a model is separated from the policy definition, developers can add, remove, or update models just by editing the routing rules—without retraining or changing the router. This separation enables the necessary flexibility for production environments, where models and applications are constantly changing.

Preference-aligned routing framework Source: arXiv The policy selection is powered by Arch-Router, a compact 1.5-billion-parameter language model optimized for preference-aware routing. Arch-Router takes the user query and the full list of policy descriptions as input, then outputs the identifier of the most suitable policy.
Since policies are included in the input, the system can adjust to new or updated routes during inference through in-context learning—no retraining required. This generative strategy enables Arch-Router to leverage its pre-trained understanding to interpret the meaning of both the query and the policies, and to analyze complete conversation histories in one go.
One common worry with including lengthy policy lists in a prompt is the risk of higher latency. However, the team built Arch-Router for high efficiency. “Even with extensive routing policies, we can expand Arch-Router’s context window with very little effect on latency,” says Salman Paracha, co-author of the paper and Founder/CEO of Katanemo Labs. He points out that latency is mainly determined by output length, and Arch-Router only outputs a short policy name—such as “image_editing” or “document_creation.”
Arch-Router in action
To create Arch-Router, the team fine-tuned a 1.5B parameter variant of the Qwen 2.5 model using a carefully assembled dataset of 43,000 examples. They then benchmarked it against leading proprietary models from OpenAI, Anthropic, and Google across four public datasets designed to test conversational AI systems.
The findings indicate that Arch-Router achieved the top overall routing score of 93.17%, outperforming all other models—including top-tier proprietary ones—by an average of 7.71%. The model’s edge became more apparent in longer conversations, showcasing its superior ability to maintain context across multiple exchanges.

Arch-Router vs other models Source: arXiv In real-world use, this methodology is already being applied in multiple settings, notes Paracha. For instance, in open-source coding platforms, developers rely on Arch-Router to guide different parts of their workflow—like “code design,” “code understanding,” and “code generation”—to the LLMs most effective for each step. Similarly, organizations can route document creation tasks to a model such as Claude 3.7 Sonnet while sending image editing requests to Gemini 2.5 Pro.
The system is also well-suited “for personal assistants across various fields, where users perform a range of activities from summarizing text to answering factual queries,” Paracha explained, adding that “in such situations, Arch-Router helps product teams consolidate and improve the user’s overall experience.”
This framework is built into Arch, Katanemo Labs’ AI-native proxy server for agents, which supports the implementation of granular traffic management rules. For example, when adding a new LLM, a team can route a small percentage of traffic under a certain policy to the new model, validate its performance using internal analytics, and then confidently shift all traffic over. The company is also working to integrate its tools with evaluation platforms to make this workflow even smoother for corporate developers.
At its core, the objective is to help organizations move beyond disconnected AI implementations. “Arch-Router—and the Arch platform overall—enables developers and businesses to evolve from fragmented LLM usage to a unified, policy-governed system,” Paracha states. “When users perform a wide range of tasks, our platform converts that diversity of tasks and models into a cohesive experience, making the final product feel seamless and intuitive.”
Related article
Satya Nadella ready to exploit new OpenAI deal
On Wednesday, a Wall Street analyst asked Microsoft CEO Satya Nadella directly how the revised OpenAI partnership would affect the company’s financials.Nadella described the new agreement as a win for everyone. “We feel good about our partnership wit
OpenAI outlines AI economy with public wealth funds, robot taxes, and four-day week
As governments struggle to manage the economic impact of superintelligent machines, OpenAI has released a set of policy proposals outlining how wealth and work could be reshaped in an "intelligence age." The ideas blend traditional left-leaning mecha
Google rolls out Gemini in Chrome to India
On Wednesday, Google announced it is expanding Gemini integration for Chrome to new regions, including India, Canada, and New Zealand. This rollout allows desktop users to access Gemini via a sidebar, where they can ask Google’s AI chatbot about on-s
Related Special Topic Recommendations
Comments (1)
0/500
Researchers at Katanemo Labs have unveiled Arch-Router, an advanced routing model and framework engineered to intelligently direct user queries to the most appropriate large language model (LLM).
For companies developing products that leverage multiple LLMs, Arch-Rolver tackles a central dilemma: how to automatically route each request to the ideal model for the task, without depending on inflexible logic or expensive retraining whenever updates are needed.
The challenges of LLM routing
As the variety of available LLMs expands, developers are shifting from single-model configurations to multi-model architectures that utilize the distinct capabilities of different models for specialized functions—such as generating code, summarizing text, or editing images.
LLM routing has become an essential technique for constructing and running such systems, serving as an intelligent traffic director that guides each user query to the model best suited to handle it.
Current routing approaches generally fit into two main groups: task-based routing, which assigns queries according to predefined task categories, and performance-based routing, which seeks the best trade-off between expense and output quality.
However, task-based systems often falter when user intent is ambiguous or changes over the course of a conversation—especially in multi-turn dialogues. Performance-based routing, meanwhile, tends to prioritize static benchmark results, frequently overlooking actual user preferences and adapting slowly to new models without costly retraining.
As the researchers at Katanemo Labs state in their paper, a deeper issue is that “existing routing methods have practical limitations in real-world applications. Most are optimized for benchmark performance but ignore human preferences, which are guided by subjective evaluation criteria.”
The team emphasizes the importance of routing systems that “reflect subjective human judgments, provide greater transparency, and remain easily adjustable as both models and applications evolve.”
A new framework for preference-aligned routing
To overcome these issues, the researchers developed a “preference-aligned routing” framework that matches incoming queries to routing rules based on custom user preferences.
In this system, users define their routing policies using natural language through a two-tier “Domain-Action Taxonomy.” This structure reflects how people naturally describe tasks: starting with a broad category—the Domain, such as “legal” or “finance”—and drilling down to a specific task—the Action, like “summarization” or “coding.”
Each policy is then mapped to a preferred model, empowering developers to base routing choices on practical requirements rather than benchmark metrics alone. According to the paper, “This taxonomy acts as a mental model to help users create well-defined, structured routing policies.”
The routing procedure operates in two phases. First, a preference-aligned router model evaluates the user’s query alongside all available policies and picks the best-fitting one. Second, a mapping function connects the selected policy to its assigned LLM.
Because the logic for selecting a model is separated from the policy definition, developers can add, remove, or update models just by editing the routing rules—without retraining or changing the router. This separation enables the necessary flexibility for production environments, where models and applications are constantly changing.

The policy selection is powered by Arch-Router, a compact 1.5-billion-parameter language model optimized for preference-aware routing. Arch-Router takes the user query and the full list of policy descriptions as input, then outputs the identifier of the most suitable policy.
Since policies are included in the input, the system can adjust to new or updated routes during inference through in-context learning—no retraining required. This generative strategy enables Arch-Router to leverage its pre-trained understanding to interpret the meaning of both the query and the policies, and to analyze complete conversation histories in one go.
One common worry with including lengthy policy lists in a prompt is the risk of higher latency. However, the team built Arch-Router for high efficiency. “Even with extensive routing policies, we can expand Arch-Router’s context window with very little effect on latency,” says Salman Paracha, co-author of the paper and Founder/CEO of Katanemo Labs. He points out that latency is mainly determined by output length, and Arch-Router only outputs a short policy name—such as “image_editing” or “document_creation.”
Arch-Router in action
To create Arch-Router, the team fine-tuned a 1.5B parameter variant of the Qwen 2.5 model using a carefully assembled dataset of 43,000 examples. They then benchmarked it against leading proprietary models from OpenAI, Anthropic, and Google across four public datasets designed to test conversational AI systems.
The findings indicate that Arch-Router achieved the top overall routing score of 93.17%, outperforming all other models—including top-tier proprietary ones—by an average of 7.71%. The model’s edge became more apparent in longer conversations, showcasing its superior ability to maintain context across multiple exchanges.

In real-world use, this methodology is already being applied in multiple settings, notes Paracha. For instance, in open-source coding platforms, developers rely on Arch-Router to guide different parts of their workflow—like “code design,” “code understanding,” and “code generation”—to the LLMs most effective for each step. Similarly, organizations can route document creation tasks to a model such as Claude 3.7 Sonnet while sending image editing requests to Gemini 2.5 Pro.
The system is also well-suited “for personal assistants across various fields, where users perform a range of activities from summarizing text to answering factual queries,” Paracha explained, adding that “in such situations, Arch-Router helps product teams consolidate and improve the user’s overall experience.”
This framework is built into Arch, Katanemo Labs’ AI-native proxy server for agents, which supports the implementation of granular traffic management rules. For example, when adding a new LLM, a team can route a small percentage of traffic under a certain policy to the new model, validate its performance using internal analytics, and then confidently shift all traffic over. The company is also working to integrate its tools with evaluation platforms to make this workflow even smoother for corporate developers.
At its core, the objective is to help organizations move beyond disconnected AI implementations. “Arch-Router—and the Arch platform overall—enables developers and businesses to evolve from fragmented LLM usage to a unified, policy-governed system,” Paracha states. “When users perform a wide range of tasks, our platform converts that diversity of tasks and models into a cohesive experience, making the final product feel seamless and intuitive.”
Satya Nadella ready to exploit new OpenAI deal
On Wednesday, a Wall Street analyst asked Microsoft CEO Satya Nadella directly how the revised OpenAI partnership would affect the company’s financials.Nadella described the new agreement as a win for everyone. “We feel good about our partnership wit
OpenAI outlines AI economy with public wealth funds, robot taxes, and four-day week
As governments struggle to manage the economic impact of superintelligent machines, OpenAI has released a set of policy proposals outlining how wealth and work could be reshaped in an "intelligence age." The ideas blend traditional left-leaning mecha
Google rolls out Gemini in Chrome to India
On Wednesday, Google announced it is expanding Gemini integration for Chrome to new regions, including India, Canada, and New Zealand. This rollout allows desktop users to access Gemini via a sidebar, where they can ask Google’s AI chatbot about on-s





Home






