Runtime¶
The production execution layer — connects routing decisions to LLM providers with caching, batching, and full observability.
Once Core determines where to route a query, Runtime ensures it runs reliably, efficiently, and observably at scale.
```python
from stratarouter import Router
from stratarouter_runtime import CoreRuntimeBridge, RuntimeConfig

core = Router()  # configured with routes and an embedding index (see Usage Examples)
config = RuntimeConfig(cache_enabled=True, batch_enabled=True)
bridge = CoreRuntimeBridge(config)

async def handle(query: str, user_id: str):
    embedding = await get_embedding(query)  # your embedding function
    decision = core.route(query, embedding)
    result = await bridge.execute(decision, context={"user_id": user_id})
    return result.response
```
Key Features¶
Core-Runtime Bridge
Translates routing decisions into execution plans with policy validation and context enrichment.
Execution Engine
Sandboxed execution with exponential backoff retry, circuit breakers, and configurable timeouts.
Provider Clients
Unified interface for OpenAI, Anthropic, Google, Azure, Cohere, and local models with automatic failover.
Semantic Caching
Matches semantically similar queries rather than exact strings, achieving 85%+ cache hit rates in production and 70–80% LLM cost reduction.
Batch Processing
Automatic deduplication and batching. 3–5x throughput gain, 40–60% cost reduction with zero code changes.
Full Observability
Prometheus metrics, OpenTelemetry traces, and structured JSON logs — full visibility from query to response.
Architecture¶
```mermaid
graph TB
    subgraph "Application Layer"
        A[Your Application]
    end
    subgraph "Runtime System"
        B[Core-Runtime Bridge]
        C[Execution Engine]
        D[Cache Layer]
        E[Batch Coordinator]
        F[State Manager]
        G[Provider Clients]
    end
    subgraph "Infrastructure"
        H[(PostgreSQL)]
        I[Redis]
        J[Prometheus]
        K[OpenTelemetry]
    end
    subgraph "LLM Providers"
        L[OpenAI] --- M[Anthropic]
        N[Google] --- O[Local]
    end
    A --> B
    B --> C
    C --> D & E & F & G
    D --> I
    F --> H
    C --> J & K
    G --> L & M & N & O
    style B fill:#4A9EFF
    style C fill:#00C853
    style D fill:#FFC107
    style G fill:#FF5252
```
Performance¶
Latency Breakdown (P99)¶
| Component | Time | Share |
|---|---|---|
| Core Routing | 1.2ms | 2% |
| Bridge + Policy | 0.5ms | 1% |
| Cache Lookup | 2.0ms | 4% |
| Provider Call | 45.0ms | 90% |
| Post-processing | 1.3ms | 3% |
| Total | ~50ms | 100% |
With cache hit: ~4ms (12.5x faster)
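The hit and miss latencies above combine into an expected per-request latency as a simple weighted average. A minimal sketch (the 4 ms and 50 ms figures come from the tables above; the 85% hit rate is the production figure quoted earlier):

```python
def expected_latency_ms(hit_rate: float, hit_ms: float = 4.0, miss_ms: float = 50.0) -> float:
    """Expected per-request latency: weighted average of the hit and miss paths."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms

# At an 85% hit rate: 0.85 * 4 + 0.15 * 50 = 10.9 ms expected latency
print(round(expected_latency_ms(0.85), 1))
```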
Throughput vs Configuration¶
| Configuration | Throughput | P99 Latency | Cache Hit Rate |
|---|---|---|---|
| No caching | 200 req/s | 250ms | 0% |
| With caching | 1,500 req/s | 50ms | 85% |
| Caching + batching | 5,000 req/s | 100ms | 85% |
Cost Savings¶
```text
# Without caching
1M requests × $0.002/request = $2,000/month

# With an 85% cache hit rate (only the 15% of misses reach a provider)
1M requests × 15% × $0.002 = $300/month

# Monthly savings: $1,700 (85% reduction)
```
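The arithmetic above generalizes to any request volume, per-request cost, and hit rate. A small helper sketching the model (the key assumption: cache hits cost nothing, only misses reach a paid provider):

```python
def monthly_cost(requests: int, cost_per_request: float, hit_rate: float = 0.0) -> float:
    """Provider spend per month; cache hits are served for free, misses are billed."""
    return requests * (1 - hit_rate) * cost_per_request

baseline = monthly_cost(1_000_000, 0.002)          # $2,000 without caching
with_cache = monthly_cost(1_000_000, 0.002, 0.85)  # $300 at an 85% hit rate
print(f"savings: ${baseline - with_cache:,.0f}/month")
```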
Supported Providers¶
| Provider | Models | Streaming | Embeddings |
|---|---|---|---|
| OpenAI | GPT-4o, GPT-5, all series | Yes | Yes |
| Anthropic | Claude 4.5, Claude 3 series | Yes | No |
| Google | Gemini 3.1, Vertex AI | Yes | Yes |
| Cohere | Command, Embed | Yes | Yes |
| Azure OpenAI | All GPT models | Yes | Yes |
| Local (Ollama, vLLM) | Any model | Yes | Yes |
Usage Examples¶
Basic Execution¶
```python
from stratarouter import Router
from stratarouter_runtime import CoreRuntimeBridge, RuntimeConfig

core = Router()
core.add_routes(routes)        # your route definitions
core.build_index(embeddings)   # embeddings for the route descriptions

config = RuntimeConfig(
    cache_enabled=True,
    batch_enabled=True,
    execution_timeout=60,
)
bridge = CoreRuntimeBridge(config)

async def handle_query(query: str, user_id: str):
    embedding = await get_embedding(query)  # your embedding function
    route_result = core.route(query, embedding)
    result = await bridge.execute(route_result, context={"user_id": user_id})
    return result.response
```
Semantic Caching¶
```python
config = RuntimeConfig(
    cache_enabled=True,
    cache_backend="redis",
    cache_ttl=3600,
    cache_similarity_threshold=0.95,
)
bridge = CoreRuntimeBridge(config)

result1 = await bridge.execute(...)  # cache miss, ~50ms
result2 = await bridge.execute(...)  # cache hit, ~4ms
print(result2.cache_hit)  # True
```
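Semantic caches of this kind typically compare query embeddings by cosine similarity against the configured threshold, so a paraphrased query can hit an earlier answer. A minimal illustration of that matching rule (the `is_cache_hit` helper and the toy embeddings are illustrative, not StrataRouter API):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

THRESHOLD = 0.95  # mirrors cache_similarity_threshold above

def is_cache_hit(query_emb: list[float], cached_emb: list[float]) -> bool:
    return cosine_similarity(query_emb, cached_emb) >= THRESHOLD

# A near-identical embedding clears the threshold; an unrelated one does not.
print(is_cache_hit([1.0, 0.0], [0.99, 0.05]))  # True
print(is_cache_hit([1.0, 0.0], [0.0, 1.0]))    # False
```

Raising the threshold trades hit rate for stricter matching; lowering it risks serving a cached answer to a genuinely different question.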
Batch Processing¶
```python
import asyncio

config = RuntimeConfig(
    batch_enabled=True,
    batch_window_ms=50,
    batch_max_size=32,
    batch_similarity_threshold=0.98,
)
bridge = CoreRuntimeBridge(config)

# Concurrent requests are automatically batched and deduplicated
results = await asyncio.gather(*[
    bridge.execute(decision1, ...),
    bridge.execute(decision2, ...),
    bridge.execute(decision3, ...),
])
```
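Deduplication of this kind generally works by letting requests that arrive within the batch window share a single in-flight result. A rough sketch of the idea using exact-match keys (StrataRouter additionally matches by similarity threshold; `execute_deduped` and `fake_llm` are illustrative names, not the library's API):

```python
import asyncio

_inflight: dict[str, asyncio.Future] = {}

async def execute_deduped(key: str, run):
    """If an identical request is already in flight, await its result instead of re-running."""
    if key in _inflight:
        return await _inflight[key]
    fut = asyncio.get_running_loop().create_future()
    _inflight[key] = fut
    try:
        result = await run()
        fut.set_result(result)  # wake every waiter with the shared result
        return result
    finally:
        _inflight.pop(key, None)

async def main():
    calls = 0
    async def fake_llm():
        nonlocal calls
        calls += 1            # counts how many times the "provider" is actually hit
        await asyncio.sleep(0.01)
        return "answer"
    results = await asyncio.gather(*[
        execute_deduped("same-query", fake_llm) for _ in range(3)
    ])
    return calls, results

calls, results = asyncio.run(main())
print(calls, results)  # one provider call serves all three requests
```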
Multi-Provider with Fallback¶
```python
from stratarouter_runtime import LLMClientRegistry

registry = LLMClientRegistry()
registry.register("openai", OpenAIClient(api_key=...))
registry.register("anthropic", AnthropicClient(api_key=...))
registry.register("local", LocalClient(endpoint=...))

result = await registry.complete(
    primary="openai",
    fallback=["anthropic", "local"],
    request=...,
)
```
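The failover behavior behind a registry like this can be sketched as an ordered try/except loop: attempt the primary, fall through to each fallback on error, and surface the last failure only if every provider fails. A simplified, synchronous illustration (all names here are hypothetical, not the actual implementation):

```python
class ProviderError(Exception):
    """Raised when a provider call fails (rate limit, outage, timeout)."""

def failing_provider(req: str) -> str:
    raise ProviderError("rate limited")

def complete_with_fallback(clients: dict, order: list[str], request: str) -> str:
    """Try providers in priority order; return the first success, raise if all fail."""
    last_err = None
    for name in order:
        try:
            return clients[name](request)
        except ProviderError as err:
            last_err = err  # remember the failure and move to the next provider
    raise last_err if last_err else ProviderError("no providers configured")

clients = {
    "openai": failing_provider,                  # primary is down
    "anthropic": lambda req: f"anthropic: {req}",  # fallback answers
}
print(complete_with_fallback(clients, ["openai", "anthropic"], "hello"))
```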
Configuration Reference¶
from stratarouter_runtime import RuntimeConfig
```python
config = RuntimeConfig(
    # Execution
    execution_timeout=60,               # seconds
    max_retries=3,
    retry_delay_ms=100,
    # Caching
    cache_enabled=True,
    cache_backend="redis",              # "redis" or "memory"
    cache_ttl=3600,
    cache_similarity_threshold=0.95,
    # Batching
    batch_enabled=True,
    batch_window_ms=50,
    batch_max_size=32,
    batch_similarity_threshold=0.98,
    # State
    state_backend="postgresql",
    checkpoint_interval=10,
    # Observability
    metrics_enabled=True,
    tracing_enabled=True,
    log_level="info",
)
```
Next Steps¶
- **Core-Runtime Bridge**: how routing decisions are translated into execution plans.
- **Caching System**: tune semantic cache hit rates and reduce LLM costs.
- **Provider Clients**: configure and manage LLM provider connections.
- **Observability**: Prometheus metrics, OpenTelemetry tracing, and structured logs.
- **Production Deployment**: deploy to Docker, Kubernetes, or your cloud provider.
- **Monitoring Guide**: set up dashboards, alerts, and SLO tracking.