Runtime

The production execution layer — connects routing decisions to LLM providers with caching, batching, and full observability.

Once Core determines where to route a query, Runtime ensures it runs reliably, efficiently, and observably at scale.

from stratarouter import Router
from stratarouter_runtime import CoreRuntimeBridge, RuntimeConfig

config = RuntimeConfig(cache_enabled=True, batch_enabled=True)
bridge = CoreRuntimeBridge(config)

async def handle(query: str, user_id: str):
    embedding = await get_embedding(query)      # your embedding function
    decision  = core.route(query, embedding)    # core: a configured stratarouter Router
    result    = await bridge.execute(decision, context={"user_id": user_id})
    return result.response

Key Features

Core-Runtime Bridge
Translates routing decisions into execution plans with policy validation and context enrichment.
Execution Engine
Sandboxed execution with exponential backoff retry, circuit breakers, and configurable timeouts.
Provider Clients
Unified interface for OpenAI, Anthropic, Google, Azure, Cohere, and local models with automatic failover.
Semantic Caching
Matches queries by semantic similarity rather than exact text, reaching 85%+ cache hit rates and 70–80% LLM cost reduction in production.
Batch Processing
Automatic deduplication and batching. 3–5x throughput gain, 40–60% cost reduction with zero code changes.
Full Observability
Prometheus metrics, OpenTelemetry traces, and structured JSON logs — full visibility from query to response.
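The retry behavior the Execution Engine describes can be sketched in a few lines. This is an illustrative model, not the actual Runtime implementation; the delay constants mirror the `max_retries` and `retry_delay_ms` settings from the configuration reference below, and the failure threshold is an assumption.

```python
def backoff_schedule(max_retries: int = 3, base_delay_ms: int = 100) -> list[int]:
    """Exponential backoff delays between retries: 100ms, 200ms, 400ms, ..."""
    return [base_delay_ms * (2 ** attempt) for attempt in range(max_retries)]

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; callers skip the provider while open."""
    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, success: bool) -> None:
        # Any success closes the breaker; failures accumulate toward the threshold.
        self.failures = 0 if success else self.failures + 1
```

A circuit breaker sits in front of each provider client, so a provider that keeps failing is taken out of rotation instead of burning through retries.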

Architecture

graph TB
    subgraph "Application Layer"
        A[Your Application]
    end
    subgraph "Runtime System"
        B[Core-Runtime Bridge]
        C[Execution Engine]
        D[Cache Layer]
        E[Batch Coordinator]
        F[State Manager]
        G[Provider Clients]
    end
    subgraph "Infrastructure"
        H[(PostgreSQL)]
        I[Redis]
        J[Prometheus]
        K[OpenTelemetry]
    end
    subgraph "LLM Providers"
        L[OpenAI] --- M[Anthropic]
        N[Google] --- O[Local]
    end
    A --> B
    B --> C
    C --> D & E & F & G
    D --> I
    F --> H
    C --> J & K
    G --> L & M & N & O
    style B fill:#4A9EFF
    style C fill:#00C853
    style D fill:#FFC107
    style G fill:#FF5252

Performance

Latency Breakdown (P99)

Component        Time     Share
Core Routing     1.2ms    2%
Bridge + Policy  0.5ms    1%
Cache Lookup     2.0ms    4%
Provider Call    45.0ms   90%
Post-processing  1.3ms    3%
Total            ~50ms    100%

With cache hit: ~4ms (12.5x faster)

Throughput vs Configuration

Configuration       Throughput    P99 Latency   Cache Hit Rate
No caching          200 req/s     250ms         0%
With caching        1,500 req/s   50ms          85%
Caching + batching  5,000 req/s   100ms         85%

Cost Savings

# Without caching
1M requests × $0.002/request = $2,000/month

# With 85% cache hit rate
1M requests × 15% miss rate × $0.002 = $300/month

# Monthly savings: $1,700 (85% reduction)
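The arithmetic above reduces to a one-line cost model: only cache misses reach the provider. A quick sketch (the function name is illustrative):

```python
def monthly_cost(requests: int, cost_per_request: float, cache_hit_rate: float = 0.0) -> float:
    """Only the cache-miss fraction of requests incurs provider cost."""
    return requests * (1 - cache_hit_rate) * cost_per_request

baseline = monthly_cost(1_000_000, 0.002)         # 2000.0
cached   = monthly_cost(1_000_000, 0.002, 0.85)   # ~300
savings  = baseline - cached                      # ~1700
```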

Supported Providers

Provider              Models                       Streaming   Embeddings
OpenAI                GPT-4o, GPT-5, all series    Yes         Yes
Anthropic             Claude 4.5, Claude 3 series  Yes         No
Google                Gemini 3.1, Vertex AI        Yes         Yes
Cohere                Command, Embed               Yes         Yes
Azure OpenAI          All GPT models               Yes         Yes
Local (Ollama, vLLM)  Any model                    Yes         Yes
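A unified interface across these providers can be modeled as a small structural protocol. The names `LLMClient`, `complete`, and `FakeClient` below are illustrative assumptions, not the actual stratarouter_runtime API:

```python
from typing import Protocol

class LLMClient(Protocol):
    """Minimal shape every provider client conforms to (illustrative)."""
    def complete(self, prompt: str) -> str: ...
    def supports_streaming(self) -> bool: ...

class FakeClient:
    """Stand-in provider used here only to show the interface in action."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"
    def supports_streaming(self) -> bool:
        return True

def run(client: LLMClient, prompt: str) -> str:
    # Any conforming client is interchangeable at the call site,
    # which is what makes automatic failover possible.
    return client.complete(prompt)
```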

Usage Examples

Basic Execution

from stratarouter import Router
from stratarouter_runtime import CoreRuntimeBridge, RuntimeConfig

core = Router()
core.add_routes(routes)
core.build_index(embeddings)

config = RuntimeConfig(
    cache_enabled=True,
    batch_enabled=True,
    execution_timeout=60
)
bridge = CoreRuntimeBridge(config)

async def handle_query(query: str, user_id: str):
    embedding    = await get_embedding(query)
    route_result = core.route(query, embedding)
    result       = await bridge.execute(route_result, context={"user_id": user_id})
    return result.response

Semantic Caching

config = RuntimeConfig(
    cache_enabled=True,
    cache_backend="redis",
    cache_ttl=3600,
    cache_similarity_threshold=0.95
)
bridge = CoreRuntimeBridge(config)

result1 = await bridge.execute(...)  # cache miss, ~50ms
result2 = await bridge.execute(...)  # cache hit,  ~4ms
print(result2.cache_hit)   # True

Batch Processing

import asyncio

config = RuntimeConfig(
    batch_enabled=True,
    batch_window_ms=50,
    batch_max_size=32,
    batch_similarity=0.98
)
bridge = CoreRuntimeBridge(config)

# Concurrent requests are automatically batched and deduplicated
results = await asyncio.gather(*[
    bridge.execute(decision1, ...),
    bridge.execute(decision2, ...),
    bridge.execute(decision3, ...),
])
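The deduplication half of batching can be illustrated with a coordinator that shares one in-flight future among identical concurrent requests. The keying-by-prompt scheme and the `sleep` stand-in for the provider call are assumptions for the sketch:

```python
import asyncio

class Deduplicator:
    """Identical concurrent requests share a single in-flight provider call."""
    def __init__(self):
        self.inflight: dict[str, asyncio.Future] = {}
        self.calls = 0  # actual provider calls made

    async def execute(self, prompt: str) -> str:
        if prompt in self.inflight:
            # Piggyback on the pending call instead of issuing a duplicate.
            return await self.inflight[prompt]
        fut = asyncio.get_running_loop().create_future()
        self.inflight[prompt] = fut
        try:
            self.calls += 1
            await asyncio.sleep(0.01)  # stand-in for the provider call
            result = f"response to {prompt!r}"
            fut.set_result(result)
            return result
        finally:
            del self.inflight[prompt]
```

Three concurrent requests where two are identical would result in only two provider calls, which is where the "zero code changes" cost reduction comes from.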

Multi-Provider with Fallback

from stratarouter_runtime import LLMClientRegistry

registry = LLMClientRegistry()
registry.register("openai",    OpenAIClient(api_key=...))
registry.register("anthropic", AnthropicClient(api_key=...))
registry.register("local",     LocalClient(endpoint=...))

result = await registry.complete(
    primary="openai",
    fallback=["anthropic", "local"],
    request=...
)
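The fallback chain amounts to trying providers in order until one succeeds. A sketch of that loop, using plain async callables in place of real clients (not the actual `LLMClientRegistry` internals):

```python
import asyncio
from typing import Awaitable, Callable

AsyncClient = Callable[[str], Awaitable[str]]

async def complete_with_fallback(
    clients: dict[str, AsyncClient],
    order: list[str],
    prompt: str,
) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, response) from the first success."""
    last_error: Exception | None = None
    for name in order:
        try:
            return name, await clients[name](prompt)
        except Exception as exc:  # a failed provider triggers the next fallback
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```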

Configuration Reference

from stratarouter_runtime import RuntimeConfig

config = RuntimeConfig(
    # Execution
    execution_timeout=60,       # seconds
    max_retries=3,
    retry_delay_ms=100,

    # Caching
    cache_enabled=True,
    cache_backend="redis",      # "redis" or "memory"
    cache_ttl=3600,
    cache_similarity_threshold=0.95,

    # Batching
    batch_enabled=True,
    batch_window_ms=50,
    batch_max_size=32,
    batch_similarity_threshold=0.98,

    # State
    state_backend="postgresql",
    checkpoint_interval=10,

    # Observability
    metrics_enabled=True,
    tracing_enabled=True,
    log_level="info"
)

Next Steps