Runtime¶
The production execution layer — connects routing decisions to LLM providers with caching, batching, and full observability.
Once Core determines where to route a query, Runtime ensures it runs reliably, efficiently, and observably at scale.
```python
from stratarouter import Router
from stratarouter_runtime import CoreRuntimeBridge, RuntimeConfig

core = Router()  # configured with routes and an embedding index (see Usage Examples)
config = RuntimeConfig(cache_enabled=True, batch_enabled=True)
bridge = CoreRuntimeBridge(config)

async def handle(query: str, user_id: str):
    embedding = await get_embedding(query)  # your embedding function
    decision = core.route(query, embedding)
    result = await bridge.execute(decision, context={"user_id": user_id})
    return result.response
```
Key Features¶
Core-Runtime Bridge
Translates routing decisions into execution plans with policy validation and context enrichment.
Execution Engine
Sandboxed execution with exponential backoff retry, circuit breakers, and configurable timeouts.
Provider Clients
Unified interface for OpenAI, Anthropic, Google, Azure, Cohere, and local models with automatic failover.
Semantic Caching
Matches semantically similar queries rather than exact strings, achieving 85%+ cache hit rates in production and 70–80% LLM cost reduction.
Batch Processing
Automatic deduplication and batching. 3–5x throughput gain, 40–60% cost reduction with zero code changes.
Full Observability
Prometheus metrics, OpenTelemetry traces, and structured JSON logs — full visibility from query to response.
Architecture¶
```mermaid
graph TB
    subgraph "Application Layer"
        A[Your Application]
    end
    subgraph "Runtime System"
        B[Core-Runtime Bridge]
        C[Execution Engine]
        D[Cache Layer]
        E[Batch Coordinator]
        F[State Manager]
        G[Provider Clients]
    end
    subgraph "Infrastructure"
        H[(PostgreSQL)]
        I[Redis]
        J[Prometheus]
        K[OpenTelemetry]
    end
    subgraph "LLM Providers"
        L[OpenAI] --- M[Anthropic]
        N[Google] --- O[Local]
    end
    A --> B
    B --> C
    C --> D & E & F & G
    D --> I
    F --> H
    C --> J & K
    G --> L & M & N & O
    style B fill:#4A9EFF
    style C fill:#00C853
    style D fill:#FFC107
    style G fill:#FF5252
```
Performance¶
Latency Breakdown (P99)¶
| Component | Time | Share |
|---|---|---|
| Core Routing | 1.2ms | 2% |
| Bridge + Policy | 0.5ms | 1% |
| Cache Lookup | 2.0ms | 4% |
| Provider Call | 45.0ms | 90% |
| Post-processing | 1.3ms | 3% |
| Total | ~50ms | 100% |
With cache hit: ~4ms (12.5x faster)
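The hit and miss latencies above combine into an expected per-request latency as a simple weighted average. A minimal sketch (the 4 ms and 50 ms figures come from the tables above; the 85% hit rate is the production figure quoted earlier):

```python
def expected_latency_ms(hit_rate: float, hit_ms: float = 4.0, miss_ms: float = 50.0) -> float:
    """Expected per-request latency: weighted average of the hit and miss paths."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms

# At an 85% hit rate: 0.85 * 4 + 0.15 * 50 = 10.9 ms expected latency
print(round(expected_latency_ms(0.85), 1))
```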
Throughput vs Configuration¶
| Configuration | Throughput | P99 Latency | Cache Hit Rate |
|---|---|---|---|
| No caching | 200 req/s | 250ms | 0% |
| With caching | 1,500 req/s | 50ms | 85% |
| Caching + batching | 5,000 req/s | 100ms | 85% |
Cost Savings¶
```text
# Without caching
1M requests × $0.002/request = $2,000/month

# With an 85% cache hit rate (only the 15% of misses reach a provider)
1M requests × 15% × $0.002 = $300/month

# Monthly savings: $1,700 (85% reduction)
```
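The arithmetic above generalizes to any request volume, per-request cost, and hit rate. A small helper sketching the model (the key assumption: cache hits cost nothing, only misses reach a paid provider):

```python
def monthly_cost(requests: int, cost_per_request: float, hit_rate: float = 0.0) -> float:
    """Provider spend per month; cache hits are served for free, misses are billed."""
    return requests * (1 - hit_rate) * cost_per_request

baseline = monthly_cost(1_000_000, 0.002)          # $2,000 without caching
with_cache = monthly_cost(1_000_000, 0.002, 0.85)  # $300 at an 85% hit rate
print(f"savings: ${baseline - with_cache:,.0f}/month")
```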
Supported Providers¶
| Provider | Models | Streaming | Embeddings |
|---|---|---|---|
| OpenAI | GPT-4o, GPT-5, all series | Yes | Yes |
| Anthropic | Claude 4.5, Claude 3 series | Yes | No |
| Google | Gemini 3.1, Vertex AI | Yes | Yes |
| Cohere | Command, Embed | Yes | Yes |
| Azure OpenAI | All GPT models | Yes | Yes |
| Local (Ollama, vLLM) | Any model | Yes | Yes |
Usage Examples¶
Basic Execution¶
```python
from stratarouter import Router
from stratarouter_runtime import CoreRuntimeBridge, RuntimeConfig

core = Router()
core.add_routes(routes)        # your route definitions
core.build_index(embeddings)   # embeddings for the route descriptions

config = RuntimeConfig(
    cache_enabled=True,
    batch_enabled=True,
    execution_timeout=60,
)
bridge = CoreRuntimeBridge(config)

async def handle_query(query: str, user_id: str):
    embedding = await get_embedding(query)  # your embedding function
    route_result = core.route(query, embedding)
    result = await bridge.execute(route_result, context={"user_id": user_id})
    return result.response
```
Semantic Caching¶
```python
config = RuntimeConfig(
    cache_enabled=True,
    cache_backend="redis",
    cache_ttl=3600,
    cache_similarity_threshold=0.95,
)
bridge = CoreRuntimeBridge(config)

result1 = await bridge.execute(...)  # cache miss, ~50ms
result2 = await bridge.execute(...)  # cache hit, ~4ms
print(result2.cache_hit)  # True
```
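Semantic caches of this kind typically compare query embeddings by cosine similarity against the configured threshold, so a paraphrased query can hit an earlier answer. A minimal illustration of that matching rule (the `is_cache_hit` helper and the toy embeddings are illustrative, not StrataRouter API):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

THRESHOLD = 0.95  # mirrors cache_similarity_threshold above

def is_cache_hit(query_emb: list[float], cached_emb: list[float]) -> bool:
    return cosine_similarity(query_emb, cached_emb) >= THRESHOLD

# A near-identical embedding clears the threshold; an unrelated one does not.
print(is_cache_hit([1.0, 0.0], [0.99, 0.05]))  # True
print(is_cache_hit([1.0, 0.0], [0.0, 1.0]))    # False
```

Raising the threshold trades hit rate for stricter matching; lowering it risks serving a cached answer to a genuinely different question.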
Batch Processing¶
```python
import asyncio

config = RuntimeConfig(
    batch_enabled=True,
    batch_window_ms=50,
    batch_max_size=32,
    batch_similarity_threshold=0.98,
)
bridge = CoreRuntimeBridge(config)

# Concurrent requests are automatically batched and deduplicated
results = await asyncio.gather(*[
    bridge.execute(decision1, ...),
    bridge.execute(decision2, ...),
    bridge.execute(decision3, ...),
])
```
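Deduplication of this kind generally works by letting requests that arrive within the batch window share a single in-flight result. A rough sketch of the idea using exact-match keys (StrataRouter additionally matches by similarity threshold; `execute_deduped` and `fake_llm` are illustrative names, not the library's API):

```python
import asyncio

_inflight: dict[str, asyncio.Future] = {}

async def execute_deduped(key: str, run):
    """If an identical request is already in flight, await its result instead of re-running."""
    if key in _inflight:
        return await _inflight[key]
    fut = asyncio.get_running_loop().create_future()
    _inflight[key] = fut
    try:
        result = await run()
        fut.set_result(result)  # wake every waiter with the shared result
        return result
    finally:
        _inflight.pop(key, None)

async def main():
    calls = 0
    async def fake_llm():
        nonlocal calls
        calls += 1            # counts how many times the "provider" is actually hit
        await asyncio.sleep(0.01)
        return "answer"
    results = await asyncio.gather(*[
        execute_deduped("same-query", fake_llm) for _ in range(3)
    ])
    return calls, results

calls, results = asyncio.run(main())
print(calls, results)  # one provider call serves all three requests
```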
Multi-Provider with Fallback¶
```python
from stratarouter_runtime import LLMClientRegistry

registry = LLMClientRegistry()
registry.register("openai", OpenAIClient(api_key=...))
registry.register("anthropic", AnthropicClient(api_key=...))
registry.register("local", LocalClient(endpoint=...))

result = await registry.complete(
    primary="openai",
    fallback=["anthropic", "local"],
    request=...,
)
```
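The failover behavior behind a registry like this can be sketched as an ordered try/except loop: attempt the primary, fall through to each fallback on error, and surface the last failure only if every provider fails. A simplified, synchronous illustration (all names here are hypothetical, not the actual implementation):

```python
class ProviderError(Exception):
    """Raised when a provider call fails (rate limit, outage, timeout)."""

def failing_provider(req: str) -> str:
    raise ProviderError("rate limited")

def complete_with_fallback(clients: dict, order: list[str], request: str) -> str:
    """Try providers in priority order; return the first success, raise if all fail."""
    last_err = None
    for name in order:
        try:
            return clients[name](request)
        except ProviderError as err:
            last_err = err  # remember the failure and move to the next provider
    raise last_err if last_err else ProviderError("no providers configured")

clients = {
    "openai": failing_provider,                  # primary is down
    "anthropic": lambda req: f"anthropic: {req}",  # fallback answers
}
print(complete_with_fallback(clients, ["openai", "anthropic"], "hello"))
```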
Configuration Reference¶
from stratarouter_runtime import RuntimeConfig
```python
config = RuntimeConfig(
    # Execution
    execution_timeout=60,               # seconds
    max_retries=3,
    retry_delay_ms=100,
    # Caching
    cache_enabled=True,
    cache_backend="redis",              # "redis" or "memory"
    cache_ttl=3600,
    cache_similarity_threshold=0.95,
    # Batching
    batch_enabled=True,
    batch_window_ms=50,
    batch_max_size=32,
    batch_similarity_threshold=0.98,
    # State
    state_backend="postgresql",
    checkpoint_interval=10,
    # Observability
    metrics_enabled=True,
    tracing_enabled=True,
    log_level="info",
)
```
Next Steps¶
- **Core-Runtime Bridge**: how routing decisions are translated into execution plans.
- **Caching System**: tune semantic cache hit rates and reduce LLM costs.
- **Provider Clients**: configure and manage LLM provider connections.
- **Observability**: Prometheus metrics, OpenTelemetry tracing, and structured logs.
- **Production Deployment**: deploy to Docker, Kubernetes, or your cloud provider.
- **Monitoring Guide**: set up dashboards, alerts, and SLO tracking.