Observability¶
Full production monitoring with metrics, traces, and logs.
Overview¶
StrataRouter Runtime provides comprehensive observability through:
- Prometheus Metrics - Performance and usage metrics
- OpenTelemetry Traces - Distributed tracing
- Structured Logging - JSON logs with context
Prometheus Metrics¶
Expose Metrics¶
from stratarouter_runtime import MetricsExporter
exporter = MetricsExporter(port=9090)
exporter.start()
# Metrics available at http://localhost:9090/metrics
Key Metrics¶
Request Metrics:
stratarouter_runtime_requests_total{provider,route,status}
stratarouter_runtime_request_duration_seconds{provider,route}
stratarouter_runtime_requests_in_flight
Cache Metrics:
stratarouter_runtime_cache_hits_total
stratarouter_runtime_cache_misses_total
stratarouter_runtime_cache_hit_rate
stratarouter_runtime_cache_size_bytes
Cost Metrics:
stratarouter_runtime_cost_usd_total{provider}
stratarouter_runtime_tokens_used_total{provider,model}
Error Metrics:
stratarouter_runtime_errors_total{type,provider}
stratarouter_runtime_timeouts_total{provider}
stratarouter_runtime_retries_total{provider}
OpenTelemetry Tracing¶
from stratarouter_runtime import TracingConfig
tracing = TracingConfig(
enabled=True,
endpoint="http://localhost:4317",
service_name="stratarouter",
sample_rate=0.1
)
Trace Example¶
Span: execute_query (50ms)
├─ Span: route_decision (1ms)
├─ Span: cache_lookup (2ms)
├─ Span: llm_call (45ms)
│ ├─ Span: openai_request (44ms)
│ └─ Span: response_parse (1ms)
└─ Span: cache_store (2ms)
Structured Logging¶
import logging
logger = logging.getLogger("stratarouter")
logger.setLevel(logging.INFO)
# Logs include:
{
"timestamp": "2026-01-11T10:30:00Z",
"level": "INFO",
"message": "Request completed",
"request_id": "req-123",
"user_id": "user-456",
"route_id": "billing",
"latency_ms": 45,
"cost_usd": 0.002
}
Grafana Dashboard¶
Import pre-built dashboard:
# Download dashboard JSON
curl -o stratarouter-dashboard.json \
https://raw.githubusercontent.com/agentdyne9/stratarouter/main/grafana/dashboard.json
# Import via Grafana API
curl -X POST http://localhost:3000/api/dashboards/import \
-H "Content-Type: application/json" \
-d @stratarouter-dashboard.json
The dashboard includes panels for:
- Request throughput and error rate
- P50/P95/P99 latency heatmaps
- Cache hit rate and cost savings
- Per-provider token usage and spend
Alerting¶
# alerts.yml
groups:
- name: stratarouter
rules:
- alert: HighErrorRate
expr: rate(stratarouter_runtime_errors_total[5m]) > 0.05
for: 5m
annotations:
summary: "Error rate above 5%"
- alert: LowCacheHitRate
expr: stratarouter_runtime_cache_hit_rate < 0.7
for: 10m
annotations:
summary: "Cache hit rate below 70%"
Key SLIs to Monitor¶
| SLI | Target | Alert Threshold |
|---|---|---|
| P99 Latency | < 100ms | > 500ms |
| Error Rate | < 1% | > 5% |
| Cache Hit Rate | > 80% | < 70% |
| Throughput | > 1K req/s | < 500 req/s |