Kubernetes Operator
The Semantic Router Operator provides a Kubernetes-native way to deploy and manage vLLM Semantic Router instances using Custom Resource Definitions (CRDs). It simplifies deployment, configuration, and lifecycle management across Kubernetes and OpenShift platforms.
Features
- 🚀 Declarative Deployment: Define semantic router instances using Kubernetes CRDs
- 🔄 Automatic Configuration: Generates and manages ConfigMaps for semantic router configuration
- 📦 Persistent Storage: Manages PVCs for ML model storage with automatic lifecycle
- 🔐 Platform Detection: Automatically detects and configures for OpenShift or standard Kubernetes
- 📊 Built-in Observability: Metrics, tracing, and monitoring support out of the box
- 🎯 Production Features: HPA, ingress, service mesh integration, and pod disruption budgets
- 🛡️ Secure by Default: Drops all capabilities, prevents privilege escalation
Prerequisites
- Kubernetes 1.24+ or OpenShift 4.12+
- kubectl or oc CLI configured
- Cluster admin access (for CRD installation)
Installation
Option 1: Using Kustomize (Standard Kubernetes)
# Clone the repository
git clone https://github.com/vllm-project/semantic-router
cd semantic-router/deploy/operator
# Install CRDs
make install
# Deploy the operator
make deploy IMG=ghcr.io/vllm-project/semantic-router-operator:latest
Verify the operator is running:
kubectl get pods -n semantic-router-operator-system
Option 2: Using OLM (OpenShift)
For OpenShift deployments using Operator Lifecycle Manager:
cd semantic-router/deploy/operator
# Build and push to your registry (Quay, internal registry, etc.)
podman login quay.io
make podman-build IMG=quay.io/<your-org>/semantic-router-operator:latest
make podman-push IMG=quay.io/<your-org>/semantic-router-operator:latest
# Deploy using OLM
make openshift-deploy
Deploy Your First Router
Quick Start with Sample Configurations
Choose a pre-configured sample based on your infrastructure:
# Simple standalone deployment with KServe backend
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_simple.yaml
# Full-featured OpenShift deployment with Routes
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_openshift.yaml
# Gateway integration mode (Istio/Envoy Gateway)
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_gateway.yaml
# Llama Stack backend discovery
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_llamastack.yaml
# Redis cache backend for production caching
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_redis_cache.yaml
# Milvus cache backend for large-scale deployments
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_milvus_cache.yaml
# Hybrid cache backend for optimal performance
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_hybrid_cache.yaml
# mmBERT 2D Matryoshka embeddings with layer early exit
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_mmbert.yaml
# Complexity-aware routing for intelligent model selection
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/main/deploy/operator/config/samples/vllm.ai_v1alpha1_semanticrouter_complexity.yaml
Custom Configuration
Create a my-router.yaml file:
apiVersion: vllm.ai/v1alpha1
kind: SemanticRouter
metadata:
name: my-router
namespace: default
spec:
replicas: 2
image:
repository: ghcr.io/vllm-project/semantic-router/extproc
tag: latest
# Configure vLLM backend endpoints
vllmEndpoints:
# KServe InferenceService (RHOAI 3.x)
- name: llama3-8b-endpoint
model: llama3-8b
reasoningFamily: qwen3
loras:
- name: computer-science-expert
description: Adapter for advanced computer science prompts
backend:
type: kserve
inferenceServiceName: llama-3-8b
weight: 1
resources:
limits:
memory: "7Gi"
cpu: "2"
requests:
memory: "3Gi"
cpu: "1"
persistence:
enabled: true
size: 10Gi
storageClassName: "standard"
config:
providers:
defaults:
default_model: llama3-8b
default_reasoning_effort: medium
reasoning_families:
qwen3:
type: chat_template_kwargs
parameter: enable_thinking
models:
- name: llama3-8b
provider_model_id: llama3-8b
backend_refs:
- name: llama3-8b-endpoint
endpoint: llama-3-8b-predictor.default.svc.cluster.local:80
protocol: http
routing:
modelCards:
- name: llama3-8b
modality: text
capabilities: ["chat", "reasoning"]
decisions:
- name: default-route
description: Catch-all route
priority: 100
rules:
operator: AND
conditions: []
modelRefs:
- model: llama3-8b
use_reasoning: false
global:
stores:
semantic_cache:
enabled: true
backend_type: memory
max_entries: 1000
ttl_seconds: 3600
integrations:
tools:
enabled: true
top_k: 3
similarity_threshold: 0.2
model_catalog:
system:
prompt_guard: models/mmbert32k-jailbreak-detector-merged
modules:
prompt_guard:
enabled: true
model_ref: prompt_guard
threshold: 0.7
toolsDb:
- tool:
type: "function"
function:
name: "get_weather"
description: "Get weather information for a location"
parameters:
type: "object"
properties:
location:
type: "string"
description: "City and state, e.g. San Francisco, CA"
required: ["location"]
description: "Weather information tool"
category: "weather"
tags: ["weather", "temperature"]
Apply the configuration:
kubectl apply -f my-router.yaml
spec.config should use the same canonical providers/routing/global layout as local config.yaml. spec.vllmEndpoints remains the Kubernetes adapter for discovering backends and served-model aliases; the operator translates it into canonical providers.models[].backend_refs[] and routing.modelCards entries, including optional loras, when rendering the runtime config.
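To make the translation concrete, the my-router example above could render to a canonical fragment along these lines (a hedged sketch; exact field ordering, defaults, and LoRA placement are determined by the operator):

```yaml
# Sketch of the canonical runtime config the operator could render from
# the my-router example; field ordering and defaults are illustrative.
providers:
  models:
    - name: llama3-8b
      provider_model_id: llama3-8b
      backend_refs:
        - name: llama3-8b-endpoint
          # Resolved from the kserve backend ({inferenceServiceName}-predictor)
          endpoint: llama-3-8b-predictor.default.svc.cluster.local:80
          protocol: http
routing:
  modelCards:
    - name: llama3-8b
      modality: text
      # LoRA adapters carried over from spec.vllmEndpoints[].loras
      loras:
        - name: computer-science-expert
          description: Adapter for advanced computer science prompts
```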
Advanced Features
Embedding Models Configuration
The operator supports three high-performance embedding models for semantic understanding and caching. You can configure these models to optimize for your specific use case.
Available Embedding Models
- Qwen3-Embedding (1024 dimensions, 32K context)
  - Best for: High-quality semantic understanding with long context
  - Use case: Complex queries, research documents, detailed analysis
- EmbeddingGemma (768 dimensions, 8K context)
  - Best for: Fast performance with good accuracy
  - Use case: Real-time applications, high-throughput scenarios
- mmBERT 2D Matryoshka (64-768 dimensions, multilingual)
  - Best for: Adaptive performance with layer early exit
  - Use case: Multilingual deployments, flexible quality/speed trade-offs
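EmbeddingGemma can be selected the same way as in the Qwen3 and mmBERT examples below. Note that the gemma_model_path key, model path, and embedding_model identifier here are assumptions by analogy with the documented fields, so verify them against the operator's CRD reference:

```yaml
spec:
  config:
    global:
      model_catalog:
        embeddings:
          semantic:
            # Field name is an assumption, by analogy with
            # qwen3_model_path and mmbert_model_path.
            gemma_model_path: "models/embeddinggemma"
            use_cpu: true
      stores:
        semantic_cache:
          enabled: true
          backend_type: "memory"
          embedding_model: "gemma"  # identifier is an assumption
```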
Example: mmBERT with Layer Early Exit
spec:
config:
global:
model_catalog:
embeddings:
semantic:
mmbert_model_path: "models/mom-embedding-ultra"
use_cpu: true
embedding_config:
model_type: "mmbert"
# Layer early exit: balance speed vs accuracy
# Layer 3: ~7x speedup (fast, good for high-volume queries)
# Layer 6: ~3.6x speedup (balanced - recommended)
# Layer 11: ~2x speedup (higher accuracy)
# Layer 22: full model (maximum accuracy)
target_layer: 6
# Dimension reduction for faster similarity search
# Options: 64, 128, 256, 512, 768
target_dimension: 256
preload_embeddings: true
enable_soft_matching: true
top_k: 1
min_score_threshold: "0.5"
stores:
semantic_cache:
enabled: true
backend_type: "memory"
embedding_model: "mmbert"
similarity_threshold: "0.85"
max_entries: 5000
ttl_seconds: 7200
See the mmBERT sample configuration for a complete example.
Example: Qwen3 with Redis Cache
spec:
config:
global:
model_catalog:
embeddings:
semantic:
qwen3_model_path: "models/qwen3-embedding"
use_cpu: true
stores:
semantic_cache:
enabled: true
backend_type: "redis"
embedding_model: "qwen3"
redis:
connection:
host: redis.cache-backends.svc.cluster.local
port: 6379
index:
vector_field:
dimension: 1024 # Qwen3 dimension
See the Redis cache sample configuration for a complete example.
Complexity-Aware Routing
Route queries to different models based on complexity classification: simple queries go to fast models, while complex queries go to powerful models.
Example Configuration
spec:
# Configure multiple backends with different capabilities
vllmEndpoints:
- name: llama-8b-fast
model: llama3-8b
reasoningFamily: qwen3
backend:
type: kserve
inferenceServiceName: llama-3-8b
weight: 2 # Prefer for simple queries
- name: llama-70b-reasoning
model: llama3-70b
reasoningFamily: deepseek
backend:
type: kserve
inferenceServiceName: llama-3-70b
weight: 1 # Use for complex queries
config:
# Define complexity rules
complexity_rules:
# Rule 1: Code complexity
- name: "code-complexity"
description: "Classify coding tasks by complexity"
threshold: "0.3" # Lower threshold works better for embedding-based similarity
# Examples of complex coding tasks
hard:
candidates:
- "Implement a distributed lock manager with leader election"
- "Design a database migration system with rollback support"
- "Create a compiler optimization pass for loop unrolling"
# Examples of simple coding tasks
easy:
candidates:
- "Write a function to reverse a string"
- "Create a class to represent a rectangle"
- "Implement a simple counter with increment/decrement"
# Rule 2: Reasoning complexity
- name: "reasoning-complexity"
description: "Classify reasoning and problem-solving tasks"
threshold: "0.3" # Lower threshold works better for embedding-based similarity
hard:
candidates:
- "Analyze the geopolitical implications of renewable energy adoption"
- "Evaluate the ethical considerations of AI in healthcare"
- "Design a multi-stage marketing strategy for a new product launch"
easy:
candidates:
- "What is the capital of France?"
- "How many days are in a week?"
- "Name three common pets"
# Rule 3: Domain-specific complexity with conditional application
- name: "medical-complexity"
description: "Classify medical queries (only for medical domain)"
threshold: "0.3" # Lower threshold works better for embedding-based similarity
hard:
candidates:
- "Differential diagnosis for chest pain with dyspnea"
- "Treatment protocol for multi-drug resistant tuberculosis"
easy:
candidates:
- "What is the normal body temperature?"
- "What are common symptoms of a cold?"
# Only apply this rule if domain signal indicates medical domain
composer:
operator: "AND"
conditions:
- type: "domain"
name: "medical"
How it works:
- Incoming query is compared against the hard and easy candidate examples
- Similarity scores determine complexity classification
- Output signals: {rule-name}:hard, {rule-name}:easy, or {rule-name}:medium
- Router uses signals to select the appropriate backend model
- Composer allows conditional rule application based on other signals
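Assuming complexity signals can be referenced in decision conditions the same way domain signals are (the type: "complexity" condition and signal name below are assumptions, not documented schema), a decision that sends hard coding tasks to a larger backend could look like:

```yaml
config:
  routing:
    decisions:
      - name: complex-code-route
        description: Send hard coding tasks to the larger model
        priority: 10
        rules:
          operator: AND
          conditions:
            # Condition type and name are assumptions: the documented
            # example uses type: "domain", and complexity rules emit
            # {rule-name}:hard / {rule-name}:easy signals.
            - type: "complexity"
              name: "code-complexity:hard"
        modelRefs:
          - model: llama3-70b
            use_reasoning: true
```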
See the complexity routing sample configuration for a complete example.
Verify Deployment
# Check the SemanticRouter resource
kubectl get semanticrouter my-router
# Check created resources
kubectl get deployment,service,configmap -l app.kubernetes.io/instance=my-router
# View status
kubectl describe semanticrouter my-router
# View logs
kubectl logs -f deployment/my-router
Expected output:
NAME PHASE REPLICAS READY AGE
semanticrouter.vllm.ai/my-router Running 2 2 5m
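The printed columns map to fields in the resource's status subresource. A sketch of what kubectl get semanticrouter my-router -o yaml might show (field names are inferred from the columns above, so treat them as assumptions):

```yaml
status:
  phase: Running        # PHASE column
  replicas: 2           # REPLICAS column
  readyReplicas: 2      # READY column (field name is an assumption)
```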
Backend Discovery Types
The operator supports three types of backend discovery for connecting semantic router to vLLM model servers. Choose the type that matches your infrastructure.
KServe InferenceService Discovery
For RHOAI 3.x or standalone KServe deployments. The operator automatically discovers the predictor service created by KServe.
spec:
vllmEndpoints:
- name: llama3-8b-endpoint
model: llama3-8b
reasoningFamily: qwen3
backend:
type: kserve
inferenceServiceName: llama-3-8b # InferenceService in same namespace
weight: 1
When to use:
- Running on Red Hat OpenShift AI (RHOAI) 3.x
- Using KServe for model serving
- Want automatic service discovery
How it works:
- Discovers the predictor service: {inferenceServiceName}-predictor
- Uses port 8443 (KServe default HTTPS port)
- Works in the same namespace as the SemanticRouter
Llama Stack Service Discovery
Discovers Llama Stack deployments using Kubernetes label selectors.
spec:
vllmEndpoints:
- name: llama-405b-endpoint
model: llama-3.3-70b-instruct
reasoningFamily: gpt
backend:
type: llamastack
discoveryLabels:
app: llama-stack
model: llama-3.3-70b
weight: 1
When to use:
- Using Meta's Llama Stack for model serving
- Multiple Llama Stack services with different models
- Want label-based service discovery
How it works:
- Lists services matching the label selector
- Uses the first matching service if multiple are found
- Extracts port from service definition
Direct Kubernetes Service
Direct connection to any Kubernetes service (vLLM, TGI, etc.).
spec:
vllmEndpoints:
- name: custom-vllm-endpoint
model: deepseek-r1-distill-qwen-7b
reasoningFamily: deepseek
backend:
type: service
service:
name: vllm-deepseek
namespace: vllm-serving # Can reference service in another namespace
port: 8000
weight: 1
When to use:
- Direct vLLM deployments
- Custom model servers with OpenAI-compatible API
- Cross-namespace service references
- Maximum control over service endpoints
How it works:
- Connects to specified service directly
- No discovery - uses explicit configuration
- Supports cross-namespace references
Multiple Backends
You can configure multiple backends with load balancing weights:
spec:
vllmEndpoints:
# KServe backend
- name: llama3-8b
model: llama3-8b
reasoningFamily: qwen3
backend:
type: kserve
inferenceServiceName: llama-3-8b
weight: 2 # Higher weight = more traffic
# Direct service backend
- name: qwen-7b
model: qwen2.5-7b
reasoningFamily: qwen3
backend:
type: service
service:
name: vllm-qwen
port: 8000
weight: 1
Deployment Modes
The operator supports two deployment modes with different architectures.