# Monitoring

Production observability guide: comprehensive monitoring setup for StrataRouter in production.
## Overview

StrataRouter provides production observability through:

- **Prometheus metrics**: performance and usage statistics
- **OpenTelemetry traces**: request flow and latency breakdown
- **Structured logs**: detailed event tracking
- **Health checks**: service availability monitoring
## Metrics

### Enable Prometheus
```python
from stratarouter.runtime import RuntimeExecutor, RuntimeConfig

# Expose a Prometheus scrape endpoint on port 9090.
config = RuntimeConfig(
    metrics_enabled=True,
    metrics_port=9090,
)

executor = RuntimeExecutor(router, config=config)
```
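With metrics enabled, Prometheus can collect them with a standard scrape job. A minimal sketch (the job name and target hostname are illustrative; the port matches `metrics_port` above):

```yaml
scrape_configs:
  - job_name: stratarouter
    scrape_interval: 15s
    static_configs:
      - targets: ["stratarouter:9090"]
```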
### Key Metrics

**Routing performance:**

```text
# Routing latency (milliseconds)
stratarouter_routing_latency_ms{quantile="0.5"}
stratarouter_routing_latency_ms{quantile="0.99"}

# Throughput
stratarouter_requests_total
stratarouter_requests_per_second

# Errors
stratarouter_errors_total{error_type="timeout|validation|internal"}
```
**Cache metrics:**

```text
# Hit rate
stratarouter_cache_hit_rate
stratarouter_cache_hits_total
stratarouter_cache_misses_total

# Size
stratarouter_cache_size_bytes
stratarouter_cache_entries
```
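The `stratarouter_cache_hit_rate` gauge can also be cross-checked by deriving the hit rate from the counters (assuming `_hits_total` and `_misses_total` are Prometheus counters):

```promql
rate(stratarouter_cache_hits_total[5m])
  / (rate(stratarouter_cache_hits_total[5m]) + rate(stratarouter_cache_misses_total[5m]))
```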
**Provider metrics:**

```text
# Per-provider latency
stratarouter_provider_latency_ms{provider="openai|anthropic|google"}

# Provider costs
stratarouter_provider_cost_usd{provider="openai|anthropic|google"}

# Provider errors
stratarouter_provider_errors_total{provider="...",error="..."}
```
### Grafana Dashboard

```json
{
  "dashboard": {
    "title": "StrataRouter Production",
    "panels": [
      {
        "title": "P99 Latency",
        "targets": [{
          "expr": "stratarouter_routing_latency_ms{quantile=\"0.99\"}"
        }]
      },
      {
        "title": "Requests/sec",
        "targets": [{
          "expr": "rate(stratarouter_requests_total[1m])"
        }]
      },
      {
        "title": "Cache Hit Rate",
        "targets": [{
          "expr": "stratarouter_cache_hit_rate"
        }]
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "rate(stratarouter_errors_total[1m])"
        }]
      }
    ]
  }
}
```
## Distributed Tracing

### Enable OpenTelemetry

```python
config = RuntimeConfig(
    tracing_enabled=True,
    tracing_endpoint="http://jaeger:4318",  # OTLP/HTTP collector endpoint
    tracing_sample_rate=0.1,                # sample 10% of requests
    service_name="stratarouter",
)
```
### Trace Structure

```text
Request [85ms]
├─ Route Selection [1.2ms]
│  ├─ Embedding Generation [0.8ms]
│  └─ HNSW Search [0.4ms]
├─ Cache Lookup [0.3ms]
├─ LLM Execution [82ms]
│  ├─ Provider Call [80ms]
│  └─ Response Parse [2ms]
└─ State Update [1.5ms]
```
### Custom Spans

```python
from stratarouter.runtime import trace_span

@trace_span("custom_processing")
def process_result(result):
    # Your processing logic here; it runs inside the "custom_processing" span.
    return result
```
## Logging

### Configure Logging

```python
import logging

config = RuntimeConfig(
    log_level=logging.INFO,
    log_format="json",
    log_structured=True,
)
```
### Log Levels

```python
# ERROR - critical issues
logging.error("Route lookup failed", extra={
    "route_id": route_id,
    "error": str(e),
})

# WARNING - degraded performance
logging.warning("Cache miss rate above threshold", extra={
    "hit_rate": 0.45,
    "threshold": 0.85,
})

# INFO - normal operations
logging.info("Request routed successfully", extra={
    "route_id": route_id,
    "confidence": 0.92,
    "latency_ms": 1.2,
})

# DEBUG - detailed diagnostics
logging.debug("HNSW search completed", extra={
    "neighbors": 50,
    "search_time_ms": 0.4,
})
```
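The calls above pass structured fields via `extra=`. StrataRouter's `log_format="json"` option handles serialization for you, but the idea can be sketched with a standard-library formatter (this class is illustrative, not part of StrataRouter):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    # Attributes present on every LogRecord; anything else arrived via `extra=`.
    _STANDARD = set(vars(logging.makeLogRecord({})))

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge the structured fields supplied through `extra=`.
        for key, value in vars(record).items():
            if key not in self._STANDARD:
                payload[key] = value
        return json.dumps(payload)
```

Attach it to any handler with `handler.setFormatter(JsonFormatter())`.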
### Structured Logging

```json
{
  "timestamp": "2026-01-11T10:30:45.123Z",
  "level": "INFO",
  "message": "Request routed successfully",
  "route_id": "billing",
  "confidence": 0.92,
  "latency_ms": 1.2,
  "cache_hit": true,
  "provider": "openai",
  "trace_id": "abc123"
}
```
## Health Checks

### Readiness Check

```json
{
  "status": "ready",
  "checks": {
    "router": "ok",
    "cache": "ok",
    "providers": {
      "openai": "ok",
      "anthropic": "ok"
    }
  }
}
```
### Liveness Check
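The liveness probe reports only whether the process itself is responsive; a minimal payload might look like this (the exact shape is an assumption, mirroring the readiness response):

```json
{
  "status": "alive"
}
```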
## Alerting

### Prometheus Alerts

```yaml
groups:
  - name: stratarouter
    rules:
      - alert: HighP99Latency
        expr: stratarouter_routing_latency_ms{quantile="0.99"} > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency: {{ $value }}ms"

      - alert: LowCacheHitRate
        expr: stratarouter_cache_hit_rate < 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit rate below 70%"

      - alert: HighErrorRate
        # Errors as a fraction of requests, so the threshold reads as a percentage.
        expr: rate(stratarouter_errors_total[5m]) / rate(stratarouter_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 1%"

      - alert: ProviderDown
        # Use rate(), not the raw counter: totals only ever increase, so a
        # "> 100" comparison would fire forever once crossed.
        expr: rate(stratarouter_provider_errors_total[5m]) > 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Provider {{ $labels.provider }} experiencing errors"
```
## Dashboards

### Production Dashboard

Key metrics to monitor:

- **Performance**
    - P50, P95, P99 latency
    - Requests per second
    - Error rate
- **Resources**
    - CPU usage
    - Memory usage
    - Cache size
- **Cache**
    - Hit rate
    - Miss rate
    - Evictions
- **Providers**
    - Per-provider latency
    - Per-provider costs
    - Provider errors
## Log Aggregation

### ELK Stack

```yaml
# Filebeat configuration
filebeat.inputs:
  - type: log
    paths:
      - /var/log/stratarouter/*.log
    json.keys_under_root: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
```
### CloudWatch Logs

```python
import watchtower

config = RuntimeConfig(
    log_handler=watchtower.CloudWatchLogHandler(
        log_group="/aws/stratarouter",
        stream_name="production",
    )
)
```
## Performance Monitoring

### Key Performance Indicators
| Metric | Target | Warning | Critical |
|---|---|---|---|
| P99 Latency | < 10ms | > 20ms | > 50ms |
| Cache Hit Rate | > 85% | < 70% | < 50% |
| Error Rate | < 0.1% | > 0.5% | > 1% |
| CPU Usage | < 70% | > 80% | > 90% |
| Memory Usage | < 80% | > 90% | > 95% |
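The thresholds above can be wired into automated checks. A small illustrative helper (not part of StrataRouter; names and numbers taken from the table):

```python
# Severity thresholds from the KPI table. `direction` says whether
# higher or lower values are worse for that metric.
KPI_THRESHOLDS = {
    "p99_latency_ms": {"warning": 20, "critical": 50, "direction": "above"},
    "cache_hit_rate": {"warning": 0.70, "critical": 0.50, "direction": "below"},
    "error_rate":     {"warning": 0.005, "critical": 0.01, "direction": "above"},
    "cpu_usage":      {"warning": 0.80, "critical": 0.90, "direction": "above"},
    "memory_usage":   {"warning": 0.90, "critical": 0.95, "direction": "above"},
}


def classify(metric: str, value: float) -> str:
    """Map a KPI reading to "ok", "warning", or "critical"."""
    t = KPI_THRESHOLDS[metric]
    worse_than = (lambda limit: value > limit) if t["direction"] == "above" \
        else (lambda limit: value < limit)
    if worse_than(t["critical"]):
        return "critical"
    if worse_than(t["warning"]):
        return "warning"
    return "ok"
```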
## Incident Response

### Runbook

**High latency:**

1. Check provider status
2. Review cache hit rate
3. Check resource usage
4. Review recent changes

**High error rate:**

1. Check logs for error patterns
2. Verify provider API keys
3. Check network connectivity
4. Review configuration

**Cache issues:**

1. Check Redis/cache backend status
2. Review cache size and evictions
3. Adjust TTL if needed
4. Check similarity threshold