Observability¶

Full production monitoring with metrics, traces, and logs.

Overview¶

StrataRouter Runtime provides comprehensive observability through:

Prometheus Metrics - Performance and usage metrics
OpenTelemetry Traces - Distributed tracing
Structured Logging - JSON logs with context

Prometheus Metrics¶

Expose Metrics¶

from stratarouter_runtime import MetricsExporter

exporter = MetricsExporter(port=9090)
exporter.start()

# Metrics available at http://localhost:9090/metrics
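
To confirm the endpoint is serving data, a scrape can be fetched and parsed with the standard library alone. This is a minimal sketch, assuming the exporter serves the standard Prometheus text exposition format; `parse_samples` and `scrape` are hypothetical helpers, not part of `stratarouter_runtime`, and the parser ignores escaped quotes inside label values.

```python
import re
from urllib.request import urlopen

# Matches sample lines such as: name{label="v",other="w"} 12.5
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def scrape(url="http://localhost:9090/metrics"):
    """Fetch the metrics endpoint and return its parsed samples."""
    with urlopen(url, timeout=5) as response:
        return parse_samples(response.read().decode("utf-8"))
```

Calling `scrape()` while the exporter is running returns one tuple per exposed sample, which is enough to check that the expected metric names are present.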

Key Metrics¶

Request Metrics:

stratarouter_runtime_requests_total{provider,route,status}
stratarouter_runtime_request_duration_seconds{provider,route}
stratarouter_runtime_requests_in_flight

Cache Metrics:

stratarouter_runtime_cache_hits_total
stratarouter_runtime_cache_misses_total
stratarouter_runtime_cache_hit_rate
stratarouter_runtime_cache_size_bytes

Cost Metrics:

stratarouter_runtime_cost_usd_total{provider}
stratarouter_runtime_tokens_used_total{provider,model}

Error Metrics:

stratarouter_runtime_errors_total{type,provider}
stratarouter_runtime_timeouts_total{provider}
stratarouter_runtime_retries_total{provider}
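
The `_total` metrics are cumulative counters, so dashboards usually work with rates and ratios derived from two scrapes rather than the raw values. The sketch below shows that arithmetic; the function names are hypothetical and the reset handling is a simplification of what Prometheus's `rate()` does.

```python
def counter_rate(prev, curr, interval_seconds):
    """Per-second increase of a cumulative counter between two scrapes."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # A counter that went down was reset (e.g. process restart): count from zero.
    increase = curr - prev if curr >= prev else curr
    return increase / interval_seconds


def ratio(numerator, denominator):
    """Safe ratio that returns 0.0 when there has been no traffic."""
    return numerator / denominator if denominator else 0.0


def cache_hit_rate(hits, misses):
    """Fraction of cache lookups that were hits."""
    return ratio(hits, hits + misses)


def error_ratio(errors, requests):
    """Fraction of requests that ended in an error."""
    return ratio(errors, requests)
```

For example, `cache_hit_rate` applied to the increases of `stratarouter_runtime_cache_hits_total` and `stratarouter_runtime_cache_misses_total` over a window gives the hit rate for that window.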

OpenTelemetry Tracing¶

from stratarouter_runtime import TracingConfig

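tracing = TracingConfig(
    enabled=True,
    endpoint="http://localhost:4317",  # OTLP collector endpoint (4317 is the default OTLP/gRPC port)
    service_name="stratarouter",       # service name attached to exported spans
    sample_rate=0.1                    # fraction of traces to sample (0.1 = 10%)
)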
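
A sample rate of 0.1 is commonly implemented as deterministic ratio sampling on the trace ID, so that every service makes the same keep-or-drop decision for a given trace. The sketch below illustrates that technique; it is an assumption about how such a setting typically behaves, not a description of StrataRouter's internals, and `should_sample` is a hypothetical name.

```python
import random

TRACE_ID_MASK = (1 << 64) - 1  # use the low 64 bits of the 128-bit trace ID


def should_sample(trace_id, sample_rate):
    """Keep a trace when its low 64 bits fall below sample_rate * 2**64."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    bound = round(sample_rate * (TRACE_ID_MASK + 1))
    return (trace_id & TRACE_ID_MASK) < bound


if __name__ == "__main__":
    rng = random.Random(0)
    trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
    kept = sum(should_sample(trace_id, 0.1) for trace_id in trace_ids)
    print(f"kept {kept / len(trace_ids):.3f} of traces")
```

Because the decision depends only on the trace ID, a trace is either recorded in full or not at all, which avoids partially sampled traces.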

Trace Example¶

Span: execute_query (50ms)
├─ Span: route_decision (1ms)
├─ Span: cache_lookup (2ms)
├─ Span: llm_call (45ms)
│  ├─ Span: openai_request (44ms)
│  └─ Span: response_parse (1ms)
└─ Span: cache_store (2ms)
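
A trace like this is a tree, and the time a span spends in its own code is its duration minus the durations of its children. The sketch below models the example with a hypothetical `Span` dataclass, assuming child spans run sequentially, and finds where the time goes.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: float
    children: list = field(default_factory=list)


def self_time_ms(span):
    """Time spent in the span itself, excluding its (sequential) children."""
    return span.duration_ms - sum(child.duration_ms for child in span.children)


def slowest_leaf(span):
    """Return the leaf span with the largest duration."""
    if not span.children:
        return span
    leaves = (slowest_leaf(child) for child in span.children)
    return max(leaves, key=lambda leaf: leaf.duration_ms)


# The trace from the example above.
trace = Span("execute_query", 50, [
    Span("route_decision", 1),
    Span("cache_lookup", 2),
    Span("llm_call", 45, [
        Span("openai_request", 44),
        Span("response_parse", 1),
    ]),
    Span("cache_store", 2),
])

print(slowest_leaf(trace).name, self_time_ms(trace))
```

In the example trace the slowest leaf is `openai_request` at 44ms, so the provider call accounts for most of the 50ms request.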

Structured Logging¶

import logging

logger = logging.getLogger("stratarouter")
logger.setLevel(logging.INFO)

Log records are emitted as JSON and include request context:
{
  "timestamp": "2026-01-11T10:30:00Z",
  "level": "INFO",
  "message": "Request completed",
  "request_id": "req-123",
  "user_id": "user-456",
  "route_id": "billing",
  "latency_ms": 45,
  "cost_usd": 0.002
}
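
Application code can emit records in the same shape using only the standard library. This is a minimal sketch: `JsonFormatter`, `CONTEXT_FIELDS`, and the `example_app` logger name are illustrative and are not StrataRouter's own formatter; context values are passed through the standard `extra=` argument.

```python
import json
import logging
from datetime import datetime, timezone

# Context keys copied onto the JSON record when present (illustrative list).
CONTEXT_FIELDS = ("request_id", "user_id", "route_id", "latency_ms", "cost_usd")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for key in CONTEXT_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("example_app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Request completed",
    extra={"request_id": "req-123", "route_id": "billing", "latency_ms": 45},
)
```

Keeping the context keys in one tuple makes it easy to see which fields downstream log queries can rely on.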

Grafana Dashboard¶

Import the pre-built dashboard:

# Download dashboard JSON
curl -o stratarouter-dashboard.json \
  https://raw.githubusercontent.com/agentdyne9/stratarouter/main/grafana/dashboard.json

# Import via Grafana API (add credentials, e.g. an Authorization header, if your instance requires authentication)
curl -X POST http://localhost:3000/api/dashboards/import \
  -H "Content-Type: application/json" \
  -d @stratarouter-dashboard.json
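
The same import can be scripted in Python. The sketch below rests on two assumptions that should be checked against your Grafana version: that the import endpoint expects the dashboard wrapped in an envelope with `dashboard`, `overwrite`, and `inputs` keys rather than the bare dashboard JSON, and that a bearer token is accepted for authentication. `build_import_payload` and `import_dashboard` are hypothetical helper names.

```python
import json
from urllib.request import Request, urlopen


def build_import_payload(dashboard, overwrite=True):
    """Wrap a dashboard definition in the envelope assumed by the import API."""
    return {"dashboard": dashboard, "overwrite": overwrite, "inputs": []}


def import_dashboard(path, base_url="http://localhost:3000", token=None):
    """POST a downloaded dashboard JSON file to a Grafana instance."""
    with open(path, encoding="utf-8") as handle:
        dashboard = json.load(handle)
    body = json.dumps(build_import_payload(dashboard)).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token:  # assumption: the instance accepts a bearer token
        headers["Authorization"] = f"Bearer {token}"
    request = Request(
        f"{base_url}/api/dashboards/import",
        data=body,
        headers=headers,
        method="POST",
    )
    with urlopen(request, timeout=10) as response:
        return json.load(response)
```

If the downloaded file is already wrapped in such an envelope, post it unchanged instead of wrapping it a second time.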

The dashboard includes panels for:

Request throughput and error rate
P50/P95/P99 latency heatmaps
Cache hit rate and cost savings
Per-provider token usage and spend

Alerting¶

# alerts.yml
groups:
  - name: stratarouter
    rules:
      - alert: HighErrorRate
        expr: sum(rate(stratarouter_runtime_errors_total[5m])) / sum(rate(stratarouter_runtime_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "Error rate above 5%"

      - alert: LowCacheHitRate
        expr: stratarouter_runtime_cache_hit_rate < 0.7
        for: 10m
        annotations:
          summary: "Cache hit rate below 70%"
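
The `for:` clause means the expression must stay true for the whole duration before the alert fires; until then the alert is only pending, and any evaluation where the condition is false resets the clock. The sketch below models that behaviour in simplified form; `AlertRule` is a hypothetical class, not a Prometheus API.

```python
class AlertRule:
    """Simplified model of a threshold alert with a `for` duration."""

    def __init__(self, threshold, for_seconds, below=False):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.below = below          # True: breach when value < threshold
        self.active_since = None    # time of the first consecutive breach

    def evaluate(self, value, now):
        """Return 'inactive', 'pending', or 'firing' for one evaluation."""
        if self.below:
            breached = value < self.threshold
        else:
            breached = value > self.threshold
        if not breached:
            self.active_since = None  # condition cleared: reset the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# LowCacheHitRate: hit rate below 0.7 for 10 minutes.
low_cache_hit_rate = AlertRule(threshold=0.7, for_seconds=600, below=True)
```

A short dip in the hit rate therefore never pages anyone; only a breach that persists across the full ten minutes does.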

Key SLIs to Monitor¶

SLI	Target	Alert Threshold
P99 Latency	< 100ms	> 500ms
Error Rate	< 1%	> 5%
Cache Hit Rate	> 80%	< 70%
Throughput	> 1K req/s	< 500 req/s
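
The table can be encoded as data so that a health check classifies each measured value. In this sketch the thresholds come from the table, while the `Sli` class, the metric keys, and the intermediate "warning" state (between target and alert threshold) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sli:
    name: str
    target: float
    alert: float
    higher_is_better: bool

    def status(self, value):
        """Classify a measured value as 'ok', 'warning', or 'alert'."""
        if self.higher_is_better:
            if value > self.target:
                return "ok"
            if value < self.alert:
                return "alert"
        else:
            if value < self.target:
                return "ok"
            if value > self.alert:
                return "alert"
        return "warning"  # between target and alert threshold


SLIS = {
    "p99_latency_ms": Sli("P99 Latency", target=100, alert=500, higher_is_better=False),
    "error_rate": Sli("Error Rate", target=0.01, alert=0.05, higher_is_better=False),
    "cache_hit_rate": Sli("Cache Hit Rate", target=0.80, alert=0.70, higher_is_better=True),
    "throughput_rps": Sli("Throughput", target=1000, alert=500, higher_is_better=True),
}

print(SLIS["p99_latency_ms"].status(300))
```

Values that miss the target without crossing the alert threshold are the ones worth watching on a dashboard before they page anyone.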

Next Steps¶

Production Monitoring - Full monitoring setup with Grafana dashboards.
Production Deployment - Deploy Runtime to production.