# Runtime Architecture

Comprehensive architecture guide for StrataRouter Runtime.

## System Overview
StrataRouter Runtime provides production-grade execution infrastructure for semantic routing at scale.
```mermaid
graph TB
    subgraph "Application Layer"
        A[Web Applications]
        B[AI Agents]
        C[Workflows]
    end
    subgraph "Runtime Layer"
        D[Core-Runtime Bridge]
        E[Execution Engine]
        F[Cache Layer]
        G[Batch Processor]
    end
    subgraph "Infrastructure"
        H[Provider Clients]
        I[State Manager]
        J[Observability]
    end
    subgraph "External Services"
        K[(PostgreSQL)]
        L[(Redis)]
        M[OpenAI]
        N[Anthropic]
        O[Google]
    end
    A --> D
    B --> D
    C --> D
    D --> E
    D --> F
    D --> G
    E --> H
    E --> I
    E --> J
    F --> L
    I --> K
    H --> M
    H --> N
    H --> O
    style D fill:#14b8a6
    style E fill:#10b981
    style F fill:#f59e0b
    style G fill:#3b82f6
```
## Core Components

### 1. Core-Runtime Bridge
Purpose: Connects routing decisions to execution
```mermaid
graph LR
    A[Route Decision] --> B[Bridge]
    B --> C[Validation]
    B --> D[Translation]
    B --> E[Context Enrichment]
    C --> F[Execution Plan]
    D --> F
    E --> F
    style B fill:#14b8a6
```
Key Responsibilities:

- Translate route IDs to execution plans
- Validate execution context
- Enrich with user metadata
- Collect feedback for learning
Example:
```python
from stratarouter_runtime import CoreRuntimeBridge

bridge = CoreRuntimeBridge(
    core_router=router,
    runtime_config=config
)

result = await bridge.execute(
    query="Where's my invoice?",
    context={"user_id": "user-123", "org_id": "org-456"}
)
```
### 2. Execution Engine
Purpose: Safe, isolated execution with reliability features
```mermaid
graph TB
    A[Execution Request] --> B{Timeout Check}
    B -->|Within Limit| C[Execute]
    B -->|Exceeded| D[Timeout Error]
    C --> E{Success?}
    E -->|Yes| F[Return Result]
    E -->|No| G{Retries Left?}
    G -->|Yes| H[Exponential Backoff]
    H --> C
    G -->|No| I[Circuit Breaker]
    style C fill:#10b981
    style I fill:#ef4444
```
Features:

- Process isolation with sandboxing
- Configurable timeouts per operation
- Exponential backoff with jitter
- Circuit breakers to prevent cascading failures
- Resource limits (CPU, memory, time)
Configuration:
```python
from stratarouter_runtime import ExecutionEngine

engine = ExecutionEngine(
    timeout=60,                   # seconds
    max_retries=3,
    retry_delay_ms=100,
    circuit_breaker_threshold=5,
    max_memory_mb=512,
    max_cpu_percent=80
)
```
### 3. Provider Clients
Purpose: Unified interface for multiple LLM providers
```mermaid
graph TB
    A[LLM Request] --> B[Provider Registry]
    B --> C[OpenAI Client]
    B --> D[Anthropic Client]
    B --> E[Google Client]
    B --> F[Local Client]
    C --> G[GPT-3.5]
    C --> H[GPT-4]
    D --> I[Claude 3]
    D --> J[Claude Sonnet 4]
    E --> K[Gemini]
    E --> L[Vertex AI]
    F --> M[Ollama]
    F --> N[vLLM]
    style B fill:#14b8a6
```
Supported Providers:

- ✅ OpenAI (GPT-3.5, GPT-4, Embeddings)
- ✅ Anthropic (Claude 2, 3, Sonnet 4)
- ✅ Google (Gemini, Vertex AI)
- ✅ Cohere (Command, Embed)
- ✅ Azure OpenAI
- ✅ Local (Ollama, vLLM, HuggingFace)
Example:
```python
from stratarouter_runtime import LLMClientRegistry

registry = LLMClientRegistry()
registry.register("openai", OpenAIClient(api_key="..."))
registry.register("anthropic", AnthropicClient(api_key="..."))

# Execute with automatic fallback
result = await registry.complete(
    primary="openai",
    fallback=["anthropic", "google"],
    messages=[{"role": "user", "content": "Hello"}]
)
```
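Under the hood, fallback is an ordered loop over registered clients: try the primary, and on failure move down the fallback list. A minimal sketch of that pattern, assuming each client exposes an async `complete` method (this is not the registry's actual code):

```python
async def complete_with_fallback(clients, order, messages):
    """Try each named provider in order; return (name, response) from the first success."""
    errors = {}
    for name in order:
        client = clients.get(name)
        if client is None:
            continue
        try:
            return name, await client.complete(messages)
        except Exception as exc:  # in practice: provider-specific error types
            errors[name] = exc
    raise RuntimeError(f"all providers failed: {list(errors)}")
```

Recording which provider actually answered (the returned `name`) is what makes fallback observable in metrics and logs.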
### 4. Cache Layer
Purpose: Intelligent semantic caching for cost and latency reduction
```mermaid
graph TB
    A[Query] --> B{Exact Match?}
    B -->|Yes| C[Return Cached<br/>< 1ms]
    B -->|No| D{Semantic Match?}
    D -->|Yes, >95%| E[Return Similar<br/>< 5ms]
    D -->|No| F[Execute LLM<br/>~50ms]
    F --> G[Store in Cache]
    G --> H[Return Result]
    style C fill:#10b981
    style E fill:#14b8a6
    style F fill:#f59e0b
```
Cache Types:

- Exact Match: hash-based, <1ms lookup
- Semantic Match: embedding similarity, <5ms lookup
- Response Cache: full response caching with TTL

Performance:

- 85%+ hit rate in production workloads
- 70-80% cost reduction
- 10-15x latency improvement
Configuration:
```python
from stratarouter_runtime import CacheManager

cache = CacheManager(
    backend="redis",              # "redis" or "memory"
    ttl=3600,                     # 1 hour
    similarity_threshold=0.95,    # 95% similarity
    max_cache_size_mb=1024        # 1 GB
)

# Automatic caching
result = await cache.get_or_execute(
    key=query,
    embedding=embedding,
    executor=lambda: expensive_llm_call()
)
```
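A semantic match is essentially a similarity search over stored query embeddings. The sketch below is illustrative only (not the `CacheManager` internals) and shows the core check against the 0.95 threshold:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_lookup(embedding, cache, threshold=0.95):
    """Return the best cached response whose embedding clears the threshold.

    cache: iterable of (cached_embedding, cached_response) pairs.
    Returns None on a miss, signalling that the LLM call must run.
    """
    best_score, best_value = 0.0, None
    for cached_embedding, value in cache:
        score = cosine_similarity(embedding, cached_embedding)
        if score >= threshold and score > best_score:
            best_score, best_value = score, value
    return best_value
```

Production backends replace this linear scan with an approximate nearest-neighbor index, which is how the <5ms lookup time is achievable at scale.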
### 5. Batch Processor
Purpose: Automatic request batching and deduplication
```mermaid
graph TB
    A[Request 1] --> B[Batch Window<br/>50ms]
    C[Request 2] --> B
    D[Request 3] --> B
    E[Request 4] --> B
    B --> F{Check Similarity}
    F --> G[Unique Requests<br/>N=2]
    G --> H[Execute Batch]
    H --> I[Return N=4<br/>Results]
    style B fill:#3b82f6
    style G fill:#10b981
```
Features:

- Request coalescing within a time window
- Similarity-based deduplication (>98%)
- Automatic result distribution

Benefits:

- 3-5x throughput improvement
- 40-60% cost reduction from deduplication
Configuration:
```python
from stratarouter_runtime import BatchProcessor

batch = BatchProcessor(
    window_ms=50,           # Collect for 50ms
    max_size=32,            # Max 32 requests per batch
    dedup_threshold=0.98    # 98% similarity counts as a duplicate
)
```
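The deduplication step can be pictured as grouping requests whose embeddings clear the similarity threshold, executing each group once, and fanning the shared result back out. A rough sketch with hypothetical helper names (not the `BatchProcessor` internals):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def dedup_batch(requests, threshold=0.98):
    """Coalesce near-duplicate requests collected in one window.

    requests: list of (embedding, payload) pairs.
    Returns (unique, assignment), where assignment[i] indexes into unique,
    so every original caller can be answered from its group's one execution.
    """
    unique, assignment = [], []
    for emb, payload in requests:
        match = next(
            (j for j, (u_emb, _) in enumerate(unique)
             if cosine(emb, u_emb) >= threshold),
            None,
        )
        if match is None:
            unique.append((emb, payload))
            match = len(unique) - 1
        assignment.append(match)
    return unique, assignment
```

Only the `unique` requests hit the provider; caller `i` then receives `results[assignment[i]]`, which is where the deduplication cost savings come from.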
### 6. State Manager
Purpose: Persistent execution state with crash recovery
```mermaid
graph TB
    A[Execution Start] --> B[Checkpoint 1]
    B --> C[Step 1]
    C --> D[Checkpoint 2]
    D --> E[Step 2]
    E --> F[Checkpoint 3]
    F --> G[Step 3]
    G --> H[Completion]
    E -.->|Crash| I[Recovery]
    I -.->|Resume from| D
    style B fill:#3b82f6
    style D fill:#3b82f6
    style F fill:#3b82f6
    style I fill:#f59e0b
```
Features:

- PostgreSQL backend for ACID guarantees
- Automatic checkpointing at configurable intervals
- Crash recovery with automatic resume
- Full audit trail for compliance
- Transaction support
Configuration:
```python
from stratarouter_runtime import StateManager

state = StateManager(
    db_url="postgresql://localhost/stratarouter",
    checkpoint_interval=10,    # Every 10 steps
    retention_days=30
)

# Save checkpoint
await state.checkpoint(execution_id, state_data)

# Recover from crash
state_data = await state.recover(execution_id)
```
### 7. Observability Stack
Purpose: Production monitoring and debugging
```mermaid
graph TB
    A[Runtime] --> B[Metrics]
    A --> C[Traces]
    A --> D[Logs]
    B --> E[Prometheus]
    C --> F[Jaeger/Tempo]
    D --> G[Loki/ES]
    E --> H[Grafana]
    F --> H
    G --> H
    H --> I[Dashboards]
    H --> J[Alerts]
    style B fill:#f59e0b
    style C fill:#3b82f6
    style D fill:#10b981
```
Metrics (Prometheus format):
```text
# Request metrics
stratarouter_runtime_requests_total
stratarouter_runtime_requests_duration_seconds

# Cache metrics
stratarouter_runtime_cache_hits_total
stratarouter_runtime_cache_misses_total
stratarouter_runtime_cache_hit_rate

# Cost metrics
stratarouter_runtime_cost_usd_total
stratarouter_runtime_tokens_total

# Error metrics
stratarouter_runtime_errors_total
stratarouter_runtime_timeouts_total
```
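As an illustration of how counters and histograms like these are emitted, here is a minimal sketch using the open-source `prometheus_client` library. Only the metric names come from the list above; the label set and wiring are assumptions:

```python
from prometheus_client import Counter, Histogram, generate_latest

REQUESTS = Counter(
    "stratarouter_runtime_requests_total",
    "Total requests handled by the runtime",
    ["provider", "model"],
)
DURATION = Histogram(
    "stratarouter_runtime_requests_duration_seconds",
    "End-to-end request duration in seconds",
)
CACHE_HITS = Counter("stratarouter_runtime_cache_hits_total", "Cache hits")

def record_request(provider, model, duration_s, cache_hit):
    """Record one completed request against the metrics above."""
    REQUESTS.labels(provider=provider, model=model).inc()
    DURATION.observe(duration_s)
    if cache_hit:
        CACHE_HITS.inc()
```

`generate_latest()` then renders everything in the Prometheus text exposition format, ready to serve from a `/metrics` endpoint.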
Distributed Tracing (OpenTelemetry):

- Full request flow visibility
- Span attribution across services
- Performance bottleneck identification
- Error tracking and debugging
Structured Logging (JSON):
```json
{
  "timestamp": "2026-01-15T10:30:45Z",
  "level": "info",
  "event": "execution_complete",
  "execution_id": "exec-123",
  "duration_ms": 45.2,
  "cache_hit": false,
  "provider": "openai",
  "model": "gpt-4",
  "tokens": 1250,
  "cost_usd": 0.0375
}
```
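Events in this shape can be produced with a small `logging.Formatter` that serializes records as JSON. This is a minimal stdlib sketch; the `fields` convention for passing structured context is an assumption, not the runtime's actual logger:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON event."""
    def format(self, record):
        event = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Structured context passed via `extra={"fields": {...}}`
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)

logger = logging.getLogger("stratarouter.runtime")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("execution_complete", extra={"fields": {
    "execution_id": "exec-123", "duration_ms": 45.2, "cache_hit": False,
}})
```

One-line JSON events are what lets Loki or Elasticsearch index fields like `provider` and `cost_usd` without custom parsing rules.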
## Data Flow

### Standard Request Flow
```mermaid
sequenceDiagram
    participant App as Application
    participant Bridge as Bridge
    participant Cache as Cache
    participant Batch as Batch
    participant Exec as Executor
    participant LLM as LLM Provider
    participant State as State Manager

    App->>Bridge: Execute Query
    Bridge->>Cache: Check Cache
    Cache-->>Bridge: Cache Miss
    Bridge->>Batch: Add to Batch
    Note over Batch: Wait 50ms or<br/>32 requests
    Batch->>Exec: Execute Batch
    Exec->>State: Save Checkpoint
    Exec->>LLM: API Call
    LLM-->>Exec: Response
    Exec->>State: Save Completion
    Exec->>Cache: Store Result
    Exec-->>Batch: Results
    Batch-->>Bridge: Deduplicated Results
    Bridge-->>App: Response
```
### Cache Hit Flow
```mermaid
sequenceDiagram
    participant App as Application
    participant Bridge as Bridge
    participant Cache as Cache

    App->>Bridge: Execute Query
    Bridge->>Cache: Check Cache
    Cache-->>Bridge: Cache Hit! (4ms)
    Bridge-->>App: Cached Response
    Note over Bridge,Cache: 10-15x faster<br/>No LLM cost
```
## Deployment Patterns

### Pattern 1: Single Instance
```mermaid
graph TB
    subgraph "Single Server"
        A[Runtime Instance]
        B[(PostgreSQL)]
        C[(Redis)]
    end
    A --> B
    A --> C
    D[Load Balancer] --> A
    style A fill:#14b8a6
```
Use Case: Development, small deployments
Capacity: 1-5K requests/second
Cost: ~$50-100/month
### Pattern 2: Horizontal Scaling
```mermaid
graph TB
    A[Load Balancer]
    subgraph "Runtime Instances"
        B[Runtime 1]
        C[Runtime 2]
        D[Runtime 3]
        E[Runtime N]
    end
    subgraph "Shared State"
        F[(PostgreSQL<br/>Primary)]
        G[(Redis<br/>Cluster)]
    end
    A --> B
    A --> C
    A --> D
    A --> E
    B --> F
    C --> F
    D --> F
    E --> F
    B --> G
    C --> G
    D --> G
    E --> G
    style B fill:#14b8a6
    style C fill:#14b8a6
    style D fill:#14b8a6
    style E fill:#14b8a6
```
Use Case: Production, high traffic
Capacity: 50K+ requests/second (linear scaling)
Cost: Scales with load
### Pattern 3: High Availability
```mermaid
graph TB
    subgraph "Region 1 - Primary"
        A[LB 1]
        B[Runtime 1A]
        C[Runtime 1B]
        D[(PostgreSQL<br/>Primary)]
        E[(Redis 1)]
    end
    subgraph "Region 2 - Standby"
        F[LB 2]
        G[Runtime 2A]
        H[Runtime 2B]
        I[(PostgreSQL<br/>Replica)]
        J[(Redis 2)]
    end
    A --> B
    A --> C
    F --> G
    F --> H
    B --> D
    C --> D
    G --> I
    H --> I
    D -.->|Replication| I
    E -.->|Sync| J
    style B fill:#14b8a6
    style C fill:#14b8a6
    style G fill:#10b981
    style H fill:#10b981
```
Use Case: Enterprise, SLA requirements
Uptime: 99.95%+ guaranteed
RTO: < 30 seconds
RPO: < 1 minute
## Performance Characteristics

### Latency Breakdown (P99)
```mermaid
pie title "Latency Distribution (~50ms total)"
    "LLM API Call" : 45
    "Batch Processing" : 3
    "Cache Lookup" : 2
    "State Save" : 1.5
    "Bridge Overhead" : 0.5
```
With Cache Hit: ~4ms total (12.5x faster)
### Resource Usage
| Resource | Idle | Normal Load | Peak Load |
|---|---|---|---|
| CPU | 5% | 45% | 85% |
| Memory | 500MB | 2GB | 4GB |
| Network | 1 Mbps | 10 Mbps | 50 Mbps |
| Disk I/O | 10 IOPS | 100 IOPS | 500 IOPS |
## Configuration Examples

### Basic Configuration
```python
from stratarouter_runtime import RuntimeConfig

config = RuntimeConfig(
    execution_timeout=60,
    cache_enabled=True,
    batch_enabled=True
)
```
### Production Configuration
```python
config = RuntimeConfig(
    # Execution
    execution_timeout=60,
    max_retries=3,
    retry_delay_ms=100,
    circuit_breaker_threshold=5,

    # Cache
    cache_backend="redis",
    cache_ttl=3600,
    cache_similarity_threshold=0.95,

    # Batch
    batch_enabled=True,
    batch_window_ms=50,
    batch_max_size=32,
    batch_dedup_threshold=0.98,

    # State
    state_backend="postgresql",
    checkpoint_interval=10,
    state_retention_days=30,

    # Observability
    metrics_enabled=True,
    metrics_port=9090,
    tracing_enabled=True,
    tracing_endpoint="http://jaeger:4317",
    log_level="info"
)
```
### Enterprise Configuration
```python
config = RuntimeConfig(
    # All production settings, plus:

    # Security
    auth_enabled=True,
    jwt_secret="your-secret-key",
    api_key_validation=True,

    # Multi-tenancy
    tenant_isolation=True,
    resource_quotas=True,

    # High Availability
    region="us-east-1",
    failover_region="us-west-2",
    replication_lag_ms=1000,

    # Compliance
    audit_logging=True,
    data_retention_days=2555,    # 7 years
    encryption_at_rest=True
)
```
## Security Architecture
```mermaid
graph TB
    A[API Request] --> B{Authentication}
    B -->|Valid| C{Authorization}
    B -->|Invalid| D[401 Unauthorized]
    C -->|Allowed| E{Rate Limit}
    C -->|Denied| F[403 Forbidden]
    E -->|Within Limit| G[Execution]
    E -->|Exceeded| H[429 Too Many Requests]
    G --> I{Sandbox}
    I --> J[Execute Safely]
    J --> K{Audit Log}
    K --> L[Return Response]
    style B fill:#f59e0b
    style C fill:#f59e0b
    style I fill:#10b981
```
### Security Features
- Authentication: API keys, JWT, OAuth 2.0
- Authorization: RBAC, resource-level permissions
- Rate Limiting: Token bucket, per-user/tenant
- Sandboxing: Process isolation, resource limits
- Encryption: TLS in transit, AES-256 at rest
- Audit Logging: All operations logged with context
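The token-bucket algorithm behind the rate limiting above admits a burst up to the bucket capacity, then refills tokens at a steady rate; requests that find the bucket empty get a 429. A small illustrative sketch, not the runtime's implementation:

```python
import time

class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should respond 429 Too Many Requests
```

In a per-user or per-tenant setup, one bucket is kept per key (typically in Redis), so a noisy tenant exhausts only its own tokens.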
## Related Documentation
- Runtime Overview - Feature overview
- Core-Runtime Bridge - Integration layer
- Deployment Guide - Production deployment
- Monitoring - Observability setup