Papers & Talks

Research, talks, and position papers from the vLLM Semantic Router project.

Research Publications

POSITION PAPER

vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models

Authors: vLLM Semantic Router Team
Venue: arXiv Technical Report
We introduce vLLM Semantic Router, a signal-driven decision routing framework for Mixture-of-Modality deployments that composes heterogeneous signals into deployment-specific routing policies across cost, privacy, latency, and safety constraints.
2026 · Paper
Tags: vLLM, Semantic Router
RESEARCH PUBLICATION

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen
Venue: arXiv Technical Report
We formalize the visual confused deputy as a security failure mode in computer-using agents and introduce a dual-channel guardrail that independently checks click targets and action reasoning before execution.
2026 · Paper
Tags: vLLM, Semantic Router
RESEARCH PUBLICATION

Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference

Authors: Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, Xue Liu
Venue: arXiv Technical Report
We introduce Outcome-Aware Tool Selection (OATS), an offline embedding refinement method that improves semantic-router tool ranking under single-digit millisecond CPU budgets without adding serving-time model inference.
2026 · Paper
Tags: vLLM, Semantic Router
RESEARCH PUBLICATION

Adaptive Vision-Language Model Routing for Computer Use Agents

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen
Venue: arXiv Technical Report
We propose Adaptive VLM Routing (AVR), which estimates action difficulty and routes computer-use agent steps to the cheapest model that still satisfies a target reliability threshold.
2026 · Paper
Tags: vLLM, Semantic Router
RESEARCH PUBLICATION

98× Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen
Venue: arXiv Technical Report
We combine Flash Attention, prompt compression, and near-streaming body processing to cut routing latency from seconds to tens of milliseconds while keeping the router lightweight enough to share hardware with serving.
2026 · Paper
Tags: vLLM, Semantic Router
RESEARCH PUBLICATION

inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference

Authors: Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu
Venue: arXiv Technical Report
We present a queueing-theory-grounded fleet planner and discrete-event simulator for sizing multi-pool LLM GPU fleets against P99 TTFT targets, without requiring hardware profiling runs up front.
2026 · Paper
Tags: vLLM, Semantic Router
RESEARCH PUBLICATION

FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism

Authors: Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu
Venue: arXiv Technical Report
We derive the minimum-cost two-pool LLM fleet directly from the workload CDF and P99 TTFT target, then use Compress-and-Route to make the optimal boundary deployable in practice.
2026 · Paper
Tags: vLLM, Semantic Router
RESEARCH PUBLICATION

When to Reason: Semantic Router for vLLM

Authors: Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, Huamin Chen
Venue: NeurIPS - MLForSys
We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial.
2025 · Paper
Tags: vLLM, Semantic Router
RESEARCH PUBLICATION

Category-Aware Semantic Caching for Heterogeneous LLM Workloads

Authors: Chen Wang, Xunzhuo Liu, Yue Zhu, Alaa Youssef, Priya Nagpurkar, Huamin Chen
We present category-aware semantic caching, where similarity thresholds, TTLs, and quotas vary by query category, with a hybrid architecture that separates in-memory HNSW search from external document storage.
2025 · Paper
Tags: vLLM, Semantic Router
RESEARCH PUBLICATION

Semantic Inference Routing Protocol (SIRP)

Authors: Huamin Chen, Luay Jalil
Venue: Internet Engineering Task Force (IETF)
This document specifies the Semantic Inference Routing Protocol (SIRP), a framework for content-level classification and semantic routing in AI inference systems.
2025 · Paper
Tags: vLLM, Semantic Router
RESEARCH PUBLICATION

Multi-Provider Extensions for Agentic AI Inference APIs

Authors: H. Chen, L. Jalil, N. Cocker
Venue: Internet Engineering Task Force (IETF) - Network Management Research Group
This document specifies multi-provider extensions for agentic AI inference APIs. Published: 20 October 2025. Intended Status: Informational. Expires: 23 April 2026.
2025 · Paper
Tags: vLLM, Semantic Router

Conference Presentations

CONFERENCE PRESENTATION

Intelligent LLM Routing: A New Paradigm for Multi-Model AI Orchestration in Kubernetes

Speakers: Chen Wang, Huamin Chen
Venue: KubeCon NA 2025
This research-driven talk introduces a novel architecture paradigm that complements recent advances in intelligent inference routing for large language models.
Tags: vLLM, Semantic Router
CONFERENCE PRESENTATION

vLLM Semantic Router: Unlock the Power of Intelligent Routing

Speakers: Xunzhuo Liu
Venue: vLLM Meetup Beijing
A deep dive into vLLM Semantic Router capabilities, demonstrating how intelligent routing can unlock new possibilities for efficient LLM inference.
Tags: vLLM, Semantic Router
CONFERENCE PRESENTATION

AI-Powered vLLM Semantic Router

Speakers: Huamin Chen
Venue: vLLM Office Hours
An overview of AI-powered features in vLLM Semantic Router, showcasing the latest developments and community contributions.
Tags: vLLM, Semantic Router