Runtime Architecture

Comprehensive architecture guide for StrataRouter Runtime.


System Overview

StrataRouter Runtime provides production-grade execution infrastructure for semantic routing at scale.

graph TB
    subgraph "Application Layer"
        A[Web Applications]
        B[AI Agents]
        C[Workflows]
    end

    subgraph "Runtime Layer"
        D[Core-Runtime Bridge]
        E[Execution Engine]
        F[Cache Layer]
        G[Batch Processor]
    end

    subgraph "Infrastructure"
        H[Provider Clients]
        I[State Manager]
        J[Observability]
    end

    subgraph "External Services"
        K[(PostgreSQL)]
        L[(Redis)]
        M[OpenAI]
        N[Anthropic]
        O[Google]
    end

    A --> D
    B --> D
    C --> D

    D --> E
    D --> F
    D --> G

    E --> H
    E --> I
    E --> J

    F --> L
    I --> K

    H --> M
    H --> N
    H --> O

    style D fill:#14b8a6
    style E fill:#10b981
    style F fill:#f59e0b
    style G fill:#3b82f6

Core Components

1. Core-Runtime Bridge

Purpose: Connects routing decisions to execution

graph LR
    A[Route Decision] --> B[Bridge]
    B --> C[Validation]
    B --> D[Translation]
    B --> E[Context Enrichment]
    C --> F[Execution Plan]
    D --> F
    E --> F

    style B fill:#14b8a6

Key Responsibilities:

  • Translate route IDs to execution plans
  • Validate execution context
  • Enrich with user metadata
  • Collect feedback for learning

Example:

from stratarouter_runtime import CoreRuntimeBridge

bridge = CoreRuntimeBridge(
    core_router=router,
    runtime_config=config
)

result = await bridge.execute(
    query="Where's my invoice?",
    context={"user_id": "user-123", "org_id": "org-456"}
)


2. Execution Engine

Purpose: Safe, isolated execution with reliability features

graph TB
    A[Execution Request] --> B{Timeout Check}
    B -->|Within Limit| C[Execute]
    B -->|Exceeded| D[Timeout Error]

    C --> E{Success?}
    E -->|Yes| F[Return Result]
    E -->|No| G{Retries Left?}

    G -->|Yes| H[Exponential Backoff]
    H --> C
    G -->|No| I[Circuit Breaker]

    style C fill:#10b981
    style I fill:#ef4444

Features:

  • Process isolation with sandboxing
  • Configurable timeouts per operation
  • Exponential backoff with jitter
  • Circuit breakers to prevent cascading failures
  • Resource limits (CPU, memory, time)

Configuration:

from stratarouter_runtime import ExecutionEngine

engine = ExecutionEngine(
    timeout=60,                    # seconds
    max_retries=3,
    retry_delay_ms=100,
    circuit_breaker_threshold=5,
    max_memory_mb=512,
    max_cpu_percent=80
)
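The retry path in the diagram above (exponential backoff with jitter, then circuit breaker) can be sketched in plain asyncio. `execute_with_retries` is an illustrative helper under assumed semantics, not part of the stratarouter_runtime API:

```python
import asyncio
import random

async def execute_with_retries(op, max_retries=3, base_delay_ms=100, cap_ms=10_000):
    """Retry an async operation with exponential backoff and full jitter.

    op: zero-argument callable returning an awaitable.
    """
    for attempt in range(max_retries + 1):
        try:
            return await op()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: this is where a circuit breaker would trip
            # full jitter: sleep a random amount up to the capped exponential bound
            bound_ms = min(cap_ms, base_delay_ms * 2 ** attempt)
            await asyncio.sleep(random.uniform(0, bound_ms) / 1000)
```

With `base_delay_ms=100`, the successive backoff bounds are 100ms, 200ms, 400ms; randomizing within each bound spreads concurrent retries and avoids thundering herds.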


3. Provider Clients

Purpose: Unified interface for multiple LLM providers

graph TB
    A[LLM Request] --> B[Provider Registry]

    B --> C[OpenAI Client]
    B --> D[Anthropic Client]
    B --> E[Google Client]
    B --> F[Local Client]

    C --> G[GPT-3.5]
    C --> H[GPT-4]

    D --> I[Claude 3]
    D --> J[Claude Sonnet 4]

    E --> K[Gemini]
    E --> L[Vertex AI]

    F --> M[Ollama]
    F --> N[vLLM]

    style B fill:#14b8a6

Supported Providers:

  • ✅ OpenAI (GPT-3.5, GPT-4, Embeddings)
  • ✅ Anthropic (Claude 2, 3, Sonnet 4)
  • ✅ Google (Gemini, Vertex AI)
  • ✅ Cohere (Command, Embed)
  • ✅ Azure OpenAI
  • ✅ Local (Ollama, vLLM, HuggingFace)

Example:

from stratarouter_runtime import LLMClientRegistry, OpenAIClient, AnthropicClient, GoogleClient

registry = LLMClientRegistry()
registry.register("openai", OpenAIClient(api_key="..."))
registry.register("anthropic", AnthropicClient(api_key="..."))
registry.register("google", GoogleClient(api_key="..."))

# Execute with automatic fallback
result = await registry.complete(
    primary="openai",
    fallback=["anthropic", "google"],
    messages=[{"role": "user", "content": "Hello"}]
)


4. Cache Layer

Purpose: Intelligent semantic caching for cost and latency reduction

graph TB
    A[Query] --> B{Exact Match?}
    B -->|Yes| C[Return Cached<br/>< 1ms]
    B -->|No| D{Semantic Match?}
    D -->|Yes, >95%| E[Return Similar<br/>< 5ms]
    D -->|No| F[Execute LLM<br/>~50ms]
    F --> G[Store in Cache]
    G --> H[Return Result]

    style C fill:#10b981
    style E fill:#14b8a6
    style F fill:#f59e0b

Cache Types:

  • Exact Match: Hash-based, <1ms lookup
  • Semantic Match: Embedding similarity, <5ms lookup
  • Response Cache: Full response caching with TTL

Performance:

  • 85%+ hit rate in production workloads
  • 70-80% cost reduction
  • 10-15x latency improvement

Configuration:

from stratarouter_runtime import CacheManager

cache = CacheManager(
    backend="redis",              # "redis" or "memory"
    ttl=3600,                      # 1 hour
    similarity_threshold=0.95,     # 95% similarity
    max_cache_size_mb=1024        # 1GB
)

# Automatic caching
result = await cache.get_or_execute(
    key=query,
    embedding=embedding,
    executor=lambda: expensive_llm_call()
)
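Conceptually, the two-tier lookup works like the sketch below. `SemanticCacheSketch` and its pure-Python cosine similarity are illustrative stand-ins for the CacheManager internals, which use Redis and vector indexes in practice:

```python
import hashlib
import math

class SemanticCacheSketch:
    """Exact hash lookup first, then cosine-similarity match over stored embeddings."""

    def __init__(self, similarity_threshold=0.95):
        self.threshold = similarity_threshold
        self.exact = {}      # sha256(query) -> response
        self.semantic = []   # (embedding, response) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, query, embedding):
        key = hashlib.sha256(query.encode()).hexdigest()
        if key in self.exact:                          # exact-match fast path
            return self.exact[key]
        best = max(self.semantic, key=lambda e: self._cosine(embedding, e[0]), default=None)
        if best and self._cosine(embedding, best[0]) >= self.threshold:
            return best[1]                             # semantic hit above threshold
        return None                                    # miss: caller executes the LLM

    def put(self, query, embedding, response):
        self.exact[hashlib.sha256(query.encode()).hexdigest()] = response
        self.semantic.append((embedding, response))
```

A paraphrased query misses the hash table but can still hit semantically if its embedding is within the 0.95 similarity threshold of a stored one.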


5. Batch Processor

Purpose: Automatic request batching and deduplication

graph TB
    A[Request 1] --> B[Batch Window<br/>50ms]
    C[Request 2] --> B
    D[Request 3] --> B
    E[Request 4] --> B

    B --> F{Check Similarity}
    F --> G[Unique Requests<br/>N=2]
    G --> H[Execute Batch]
    H --> I[Return N=4<br/>Results]

    style B fill:#3b82f6
    style G fill:#10b981

Features:

  • Request coalescing within time window
  • Similarity-based deduplication (>98%)
  • Automatic result distribution

Benefits:

  • 3-5x throughput improvement
  • 40-60% cost reduction from deduplication

Configuration:

from stratarouter_runtime import BatchProcessor

batch = BatchProcessor(
    window_ms=50,          # Collect for 50ms
    max_size=32,           # Max 32 requests per batch
    dedup_threshold=0.98   # 98% similarity = duplicate
)
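Deduplication within the batch window can be sketched as greedy grouping by embedding similarity. `coalesce` is a hypothetical helper, not the BatchProcessor API:

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def coalesce(embeddings, dedup_threshold=0.98):
    """Group near-duplicate requests collected in one batch window.

    Returns (unique_indices, assignment): unique_indices are the requests that
    actually execute; assignment[i] is the group whose result request i receives.
    """
    unique, assignment = [], []
    for i, emb in enumerate(embeddings):
        for g, rep in enumerate(unique):
            if _cosine(emb, embeddings[rep]) >= dedup_threshold:
                assignment.append(g)   # duplicate: reuse the group's result
                break
        else:
            unique.append(i)           # novel request: starts a new group
            assignment.append(len(unique) - 1)
    return unique, assignment
```

For the four requests in the diagram, two pairs of near-duplicates collapse to 2 executions, and the assignment map fans the 2 results back out to all 4 callers.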


6. State Manager

Purpose: Persistent execution state with crash recovery

graph TB
    A[Execution Start] --> B[Checkpoint 1]
    B --> C[Step 1]
    C --> D[Checkpoint 2]
    D --> E[Step 2]
    E --> F[Checkpoint 3]
    F --> G[Step 3]
    G --> H[Completion]

    E -.->|Crash| I[Recovery]
    I -.->|Resume from| D

    style B fill:#3b82f6
    style D fill:#3b82f6
    style F fill:#3b82f6
    style I fill:#f59e0b

Features:

  • PostgreSQL backend for ACID guarantees
  • Automatic checkpointing at configurable intervals
  • Crash recovery with automatic resume
  • Full audit trail for compliance
  • Transaction support

Configuration:

from stratarouter_runtime import StateManager

state = StateManager(
    db_url="postgresql://localhost/stratarouter",
    checkpoint_interval=10,  # Every 10 steps
    retention_days=30
)

# Save checkpoint
await state.checkpoint(execution_id, state_data)

# Recover from crash
state_data = await state.recover(execution_id)
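The resume logic amounts to: recover the last checkpoint, skip completed steps, and checkpoint every N steps. `InMemoryState` and `run_steps` below are illustrative stand-ins (the real StateManager persists to PostgreSQL):

```python
class InMemoryState:
    """Minimal stand-in for StateManager: keeps the latest checkpoint per execution."""

    def __init__(self):
        self._checkpoints = {}

    def checkpoint(self, execution_id, step, data):
        self._checkpoints[execution_id] = {"step": step, "data": data}

    def recover(self, execution_id):
        return self._checkpoints.get(execution_id)

def run_steps(state, execution_id, steps, checkpoint_interval=2):
    """Resume from the last checkpoint (if any), checkpointing every N steps."""
    saved = state.recover(execution_id)
    start = saved["step"] if saved else 0       # completed steps are skipped
    data = list(saved["data"]) if saved else []
    for i in range(start, len(steps)):
        data.append(steps[i]())
        if (i + 1) % checkpoint_interval == 0:
            state.checkpoint(execution_id, i + 1, list(data))
    return data
```

If the process crashes mid-run, re-invoking `run_steps` with the same `execution_id` re-executes only the steps after the last checkpoint.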


7. Observability Stack

Purpose: Production monitoring and debugging

graph TB
    A[Runtime] --> B[Metrics]
    A --> C[Traces]
    A --> D[Logs]

    B --> E[Prometheus]
    C --> F[Jaeger/Tempo]
    D --> G[Loki/ES]

    E --> H[Grafana]
    F --> H
    G --> H

    H --> I[Dashboards]
    H --> J[Alerts]

    style B fill:#f59e0b
    style C fill:#3b82f6
    style D fill:#10b981

Metrics (Prometheus format):

# Request metrics
stratarouter_runtime_requests_total
stratarouter_runtime_requests_duration_seconds

# Cache metrics
stratarouter_runtime_cache_hits_total
stratarouter_runtime_cache_misses_total
stratarouter_runtime_cache_hit_rate

# Cost metrics
stratarouter_runtime_cost_usd_total
stratarouter_runtime_tokens_total

# Error metrics
stratarouter_runtime_errors_total
stratarouter_runtime_timeouts_total

Distributed Tracing (OpenTelemetry):

  • Full request flow visibility
  • Span attribution across services
  • Performance bottleneck identification
  • Error tracking and debugging

Structured Logging (JSON):

{
  "timestamp": "2026-01-15T10:30:45Z",
  "level": "info",
  "event": "execution_complete",
  "execution_id": "exec-123",
  "duration_ms": 45.2,
  "cache_hit": false,
  "provider": "openai",
  "model": "gpt-4",
  "tokens": 1250,
  "cost_usd": 0.0375
}
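A log line in that shape can be produced with the standard library alone; `JsonFormatter` below is an illustrative sketch, not the runtime's actual logger:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON, merging structured fields from `extra`."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("stratarouter_runtime")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Structured fields ride along in `extra` and land as top-level JSON keys
logger.info("execution_complete", extra={"fields": {
    "execution_id": "exec-123", "duration_ms": 45.2, "cache_hit": False,
}})
```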


Data Flow

Standard Request Flow

sequenceDiagram
    participant App as Application
    participant Bridge as Bridge
    participant Cache as Cache
    participant Batch as Batch
    participant Exec as Executor
    participant LLM as LLM Provider
    participant State as State Manager

    App->>Bridge: Execute Query
    Bridge->>Cache: Check Cache
    Cache-->>Bridge: Cache Miss
    Bridge->>Batch: Add to Batch

    Note over Batch: Wait 50ms or<br/>32 requests

    Batch->>Exec: Execute Batch
    Exec->>State: Save Checkpoint
    Exec->>LLM: API Call
    LLM-->>Exec: Response
    Exec->>State: Save Completion
    Exec->>Cache: Store Result
    Exec-->>Batch: Results
    Batch-->>Bridge: Deduplicated Results
    Bridge-->>App: Response

Cache Hit Flow

sequenceDiagram
    participant App as Application
    participant Bridge as Bridge
    participant Cache as Cache

    App->>Bridge: Execute Query
    Bridge->>Cache: Check Cache
    Cache-->>Bridge: Cache Hit! (4ms)
    Bridge-->>App: Cached Response

    Note over Bridge,Cache: 10-15x faster<br/>No LLM cost

Deployment Patterns

Pattern 1: Single Instance

graph TB
    subgraph "Single Server"
        A[Runtime Instance]
        B[(PostgreSQL)]
        C[(Redis)]
    end

    A --> B
    A --> C

    D[Load Balancer] --> A

    style A fill:#14b8a6

Use Case: Development, small deployments
Capacity: 1-5K requests/second
Cost: ~$50-100/month

Pattern 2: Horizontal Scaling

graph TB
    A[Load Balancer]

    subgraph "Runtime Instances"
        B[Runtime 1]
        C[Runtime 2]
        D[Runtime 3]
        E[Runtime N]
    end

    subgraph "Shared State"
        F[(PostgreSQL<br/>Primary)]
        G[(Redis<br/>Cluster)]
    end

    A --> B
    A --> C
    A --> D
    A --> E

    B --> F
    C --> F
    D --> F
    E --> F

    B --> G
    C --> G
    D --> G
    E --> G

    style B fill:#14b8a6
    style C fill:#14b8a6
    style D fill:#14b8a6
    style E fill:#14b8a6

Use Case: Production, high traffic
Capacity: 50K+ requests/second (linear scaling)
Cost: Scales with load

Pattern 3: High Availability

graph TB
    subgraph "Region 1 - Primary"
        A[LB 1]
        B[Runtime 1A]
        C[Runtime 1B]
        D[(PostgreSQL<br/>Primary)]
        E[(Redis 1)]
    end

    subgraph "Region 2 - Standby"
        F[LB 2]
        G[Runtime 2A]
        H[Runtime 2B]
        I[(PostgreSQL<br/>Replica)]
        J[(Redis 2)]
    end

    A --> B
    A --> C
    F --> G
    F --> H

    B --> D
    C --> D
    G --> I
    H --> I

    D -.->|Replication| I
    E -.->|Sync| J

    style B fill:#14b8a6
    style C fill:#14b8a6
    style G fill:#10b981
    style H fill:#10b981

Use Case: Enterprise, SLA requirements
Uptime: 99.95%+ guaranteed
RTO: < 30 seconds
RPO: < 1 minute


Performance Characteristics

Latency Breakdown (P99)

pie title "Latency Distribution (~52ms total)"
    "LLM API Call" : 45
    "Batch Processing" : 3
    "Cache Lookup" : 2
    "State Save" : 1.5
    "Bridge Overhead" : 0.5

With Cache Hit: ~4ms total (~13x faster)

Resource Usage

Resource | Idle    | Normal Load | Peak Load
---------|---------|-------------|----------
CPU      | 5%      | 45%         | 85%
Memory   | 500MB   | 2GB         | 4GB
Network  | 1 Mbps  | 10 Mbps     | 50 Mbps
Disk I/O | 10 IOPS | 100 IOPS    | 500 IOPS

Configuration Examples

Basic Configuration

from stratarouter_runtime import RuntimeConfig

config = RuntimeConfig(
    execution_timeout=60,
    cache_enabled=True,
    batch_enabled=True
)

Production Configuration

config = RuntimeConfig(
    # Execution
    execution_timeout=60,
    max_retries=3,
    retry_delay_ms=100,
    circuit_breaker_threshold=5,

    # Cache
    cache_backend="redis",
    cache_ttl=3600,
    cache_similarity_threshold=0.95,

    # Batch
    batch_enabled=True,
    batch_window_ms=50,
    batch_max_size=32,
    batch_dedup_threshold=0.98,

    # State
    state_backend="postgresql",
    checkpoint_interval=10,
    state_retention_days=30,

    # Observability
    metrics_enabled=True,
    metrics_port=9090,
    tracing_enabled=True,
    tracing_endpoint="http://jaeger:4317",
    log_level="info"
)

Enterprise Configuration

config = RuntimeConfig(
    # All production settings, plus:

    # Security
    auth_enabled=True,
    jwt_secret="your-secret-key",
    api_key_validation=True,

    # Multi-tenancy
    tenant_isolation=True,
    resource_quotas=True,

    # High Availability
    region="us-east-1",
    failover_region="us-west-2",
    replication_lag_ms=1000,

    # Compliance
    audit_logging=True,
    data_retention_days=2555,  # 7 years
    encryption_at_rest=True
)

Security Architecture

graph TB
    A[API Request] --> B{Authentication}
    B -->|Valid| C{Authorization}
    B -->|Invalid| D[401 Unauthorized]

    C -->|Allowed| E{Rate Limit}
    C -->|Denied| F[403 Forbidden]

    E -->|Within Limit| G[Execution]
    E -->|Exceeded| H[429 Too Many Requests]

    G --> I{Sandbox}
    I --> J[Execute Safely]

    J --> K{Audit Log}
    K --> L[Return Response]

    style B fill:#f59e0b
    style C fill:#f59e0b
    style I fill:#10b981

Security Features

  • Authentication: API keys, JWT, OAuth 2.0
  • Authorization: RBAC, resource-level permissions
  • Rate Limiting: Token bucket, per-user/tenant
  • Sandboxing: Process isolation, resource limits
  • Encryption: TLS in transit, AES-256 at rest
  • Audit Logging: All operations logged with context
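The token-bucket rate limiting listed above can be sketched per user or tenant as below. `TokenBucket` is illustrative, with an injectable clock so the refill logic is testable:

```python
import time

class TokenBucket:
    """Token bucket: refills `rate` tokens/second, up to a `capacity` burst."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller maps a refusal to HTTP 429
```

A gateway would typically keep one bucket per user or tenant in a dict (or in Redis for shared state across instances) and consult `allow()` before the authorization step hands the request to execution.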