# Monitoring

Production observability guide: comprehensive monitoring setup for StrataRouter in production.
## Overview

StrataRouter provides production observability through:

- **Prometheus metrics**: performance and usage statistics
- **OpenTelemetry traces**: request flow and latency breakdown
- **Structured logs**: detailed event tracking
- **Health checks**: service availability monitoring
## Metrics

### Enable Prometheus
```python
from stratarouter.runtime import RuntimeExecutor, RuntimeConfig

# Expose a Prometheus scrape endpoint on port 9090.
config = RuntimeConfig(
    metrics_enabled=True,
    metrics_port=9090,
)

executor = RuntimeExecutor(router, config=config)
```
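With metrics enabled, Prometheus can collect them with a standard scrape job. A minimal sketch (the job name and target hostname are illustrative; the port matches `metrics_port` above):

```yaml
scrape_configs:
  - job_name: stratarouter
    scrape_interval: 15s
    static_configs:
      - targets: ["stratarouter:9090"]
```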
### Key Metrics

**Routing performance:**

```text
# Routing latency (milliseconds)
stratarouter_routing_latency_ms{quantile="0.5"}
stratarouter_routing_latency_ms{quantile="0.99"}

# Throughput
stratarouter_requests_total
stratarouter_requests_per_second

# Errors
stratarouter_errors_total{error_type="timeout|validation|internal"}
```
**Cache metrics:**

```text
# Hit rate
stratarouter_cache_hit_rate
stratarouter_cache_hits_total
stratarouter_cache_misses_total

# Size
stratarouter_cache_size_bytes
stratarouter_cache_entries
```
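The `stratarouter_cache_hit_rate` gauge can also be cross-checked by deriving the hit rate from the counters (assuming `_hits_total` and `_misses_total` are Prometheus counters):

```promql
rate(stratarouter_cache_hits_total[5m])
  / (rate(stratarouter_cache_hits_total[5m]) + rate(stratarouter_cache_misses_total[5m]))
```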
**Provider metrics:**

```text
# Per-provider latency
stratarouter_provider_latency_ms{provider="openai|anthropic|google"}

# Provider costs
stratarouter_provider_cost_usd{provider="openai|anthropic|google"}

# Provider errors
stratarouter_provider_errors_total{provider="...",error="..."}
```
### Grafana Dashboard

```json
{
  "dashboard": {
    "title": "StrataRouter Production",
    "panels": [
      {
        "title": "P99 Latency",
        "targets": [{
          "expr": "stratarouter_routing_latency_ms{quantile=\"0.99\"}"
        }]
      },
      {
        "title": "Requests/sec",
        "targets": [{
          "expr": "rate(stratarouter_requests_total[1m])"
        }]
      },
      {
        "title": "Cache Hit Rate",
        "targets": [{
          "expr": "stratarouter_cache_hit_rate"
        }]
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "rate(stratarouter_errors_total[1m])"
        }]
      }
    ]
  }
}
```
## Distributed Tracing

### Enable OpenTelemetry

```python
config = RuntimeConfig(
    tracing_enabled=True,
    tracing_endpoint="http://jaeger:4318",  # OTLP/HTTP collector endpoint
    tracing_sample_rate=0.1,                # sample 10% of requests
    service_name="stratarouter",
)
```
### Trace Structure

```text
Request [85ms]
├─ Route Selection [1.2ms]
│  ├─ Embedding Generation [0.8ms]
│  └─ HNSW Search [0.4ms]
├─ Cache Lookup [0.3ms]
├─ LLM Execution [82ms]
│  ├─ Provider Call [80ms]
│  └─ Response Parse [2ms]
└─ State Update [1.5ms]
```
### Custom Spans

```python
from stratarouter.runtime import trace_span

@trace_span("custom_processing")
def process_result(result):
    # Your processing logic here; it runs inside the "custom_processing" span.
    return result
```
## Logging

### Configure Logging

```python
import logging

config = RuntimeConfig(
    log_level=logging.INFO,
    log_format="json",
    log_structured=True,
)
```
### Log Levels

```python
# ERROR - critical issues
logging.error("Route lookup failed", extra={
    "route_id": route_id,
    "error": str(e),
})

# WARNING - degraded performance
logging.warning("Cache miss rate above threshold", extra={
    "hit_rate": 0.45,
    "threshold": 0.85,
})

# INFO - normal operations
logging.info("Request routed successfully", extra={
    "route_id": route_id,
    "confidence": 0.92,
    "latency_ms": 1.2,
})

# DEBUG - detailed diagnostics
logging.debug("HNSW search completed", extra={
    "neighbors": 50,
    "search_time_ms": 0.4,
})
```
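The calls above pass structured fields via `extra=`. StrataRouter's `log_format="json"` option handles serialization for you, but the idea can be sketched with a standard-library formatter (this class is illustrative, not part of StrataRouter):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    # Attributes present on every LogRecord; anything else arrived via `extra=`.
    _STANDARD = set(vars(logging.makeLogRecord({})))

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge the structured fields supplied through `extra=`.
        for key, value in vars(record).items():
            if key not in self._STANDARD:
                payload[key] = value
        return json.dumps(payload)
```

Attach it to any handler with `handler.setFormatter(JsonFormatter())`.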
### Structured Logging

```json
{
  "timestamp": "2026-01-11T10:30:45.123Z",
  "level": "INFO",
  "message": "Request routed successfully",
  "route_id": "billing",
  "confidence": 0.92,
  "latency_ms": 1.2,
  "cache_hit": true,
  "provider": "openai",
  "trace_id": "abc123"
}
```
## Health Checks

### Readiness Check

```json
{
  "status": "ready",
  "checks": {
    "router": "ok",
    "cache": "ok",
    "providers": {
      "openai": "ok",
      "anthropic": "ok"
    }
  }
}
```
### Liveness Check
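The liveness probe reports only whether the process itself is responsive; a minimal payload might look like this (the exact shape is an assumption, mirroring the readiness response):

```json
{
  "status": "alive"
}
```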
## Alerting

### Prometheus Alerts

```yaml
groups:
  - name: stratarouter
    rules:
      - alert: HighP99Latency
        expr: stratarouter_routing_latency_ms{quantile="0.99"} > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency: {{ $value }}ms"

      - alert: LowCacheHitRate
        expr: stratarouter_cache_hit_rate < 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit rate below 70%"

      - alert: HighErrorRate
        # Errors as a fraction of requests, so the threshold reads as a percentage.
        expr: rate(stratarouter_errors_total[5m]) / rate(stratarouter_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 1%"

      - alert: ProviderDown
        # Use rate(), not the raw counter: totals only ever increase, so a
        # "> 100" comparison would fire forever once crossed.
        expr: rate(stratarouter_provider_errors_total[5m]) > 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Provider {{ $labels.provider }} experiencing errors"
```
## Dashboards

### Production Dashboard

Key metrics to monitor:

- **Performance**
    - P50, P95, P99 latency
    - Requests per second
    - Error rate
- **Resources**
    - CPU usage
    - Memory usage
    - Cache size
- **Cache**
    - Hit rate
    - Miss rate
    - Evictions
- **Providers**
    - Per-provider latency
    - Per-provider costs
    - Provider errors
## Log Aggregation

### ELK Stack

```yaml
# Filebeat configuration
filebeat.inputs:
  - type: log
    paths:
      - /var/log/stratarouter/*.log
    json.keys_under_root: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
```
### CloudWatch Logs

```python
import watchtower

config = RuntimeConfig(
    log_handler=watchtower.CloudWatchLogHandler(
        log_group="/aws/stratarouter",
        stream_name="production",
    )
)
```
## Performance Monitoring

### Key Performance Indicators
| Metric | Target | Warning | Critical |
|---|---|---|---|
| P99 Latency | < 10ms | > 20ms | > 50ms |
| Cache Hit Rate | > 85% | < 70% | < 50% |
| Error Rate | < 0.1% | > 0.5% | > 1% |
| CPU Usage | < 70% | > 80% | > 90% |
| Memory Usage | < 80% | > 90% | > 95% |
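The thresholds above can be wired into automated checks. A small illustrative helper (not part of StrataRouter; names and numbers taken from the table):

```python
# Severity thresholds from the KPI table. `direction` says whether
# higher or lower values are worse for that metric.
KPI_THRESHOLDS = {
    "p99_latency_ms": {"warning": 20, "critical": 50, "direction": "above"},
    "cache_hit_rate": {"warning": 0.70, "critical": 0.50, "direction": "below"},
    "error_rate":     {"warning": 0.005, "critical": 0.01, "direction": "above"},
    "cpu_usage":      {"warning": 0.80, "critical": 0.90, "direction": "above"},
    "memory_usage":   {"warning": 0.90, "critical": 0.95, "direction": "above"},
}


def classify(metric: str, value: float) -> str:
    """Map a KPI reading to "ok", "warning", or "critical"."""
    t = KPI_THRESHOLDS[metric]
    worse_than = (lambda limit: value > limit) if t["direction"] == "above" \
        else (lambda limit: value < limit)
    if worse_than(t["critical"]):
        return "critical"
    if worse_than(t["warning"]):
        return "warning"
    return "ok"
```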
## Incident Response

### Runbook

**High latency:**

1. Check provider status
2. Review cache hit rate
3. Check resource usage
4. Review recent changes

**High error rate:**

1. Check logs for error patterns
2. Verify provider API keys
3. Check network connectivity
4. Review configuration

**Cache issues:**

1. Check Redis/cache backend status
2. Review cache size and evictions
3. Adjust TTL if needed
4. Check similarity threshold