
Production Observability Guide

Comprehensive monitoring setup for StrataRouter in production.


Overview

StrataRouter provides complete observability through:

  • Prometheus Metrics - Performance and usage statistics
  • OpenTelemetry Traces - Request flow and latency breakdown
  • Structured Logs - Detailed event tracking
  • Health Checks - Service availability monitoring

Metrics

Enable Prometheus

from stratarouter.runtime import RuntimeExecutor, RuntimeConfig

config = RuntimeConfig(
    metrics_enabled=True,
    metrics_port=9090
)

executor = RuntimeExecutor(router, config=config)
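With metrics enabled, the endpoint can be scraped by Prometheus. A minimal scrape config, assuming the service is reachable at `stratarouter:9090` (the hostname and job name here are placeholders for your deployment):

```yaml
scrape_configs:
  - job_name: stratarouter
    scrape_interval: 15s
    static_configs:
      - targets: ["stratarouter:9090"]
```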

Key Metrics

Routing Performance:

# Routing latency (milliseconds)
stratarouter_routing_latency_ms{quantile="0.5"}
stratarouter_routing_latency_ms{quantile="0.99"}

# Throughput
stratarouter_requests_total
stratarouter_requests_per_second

# Errors
stratarouter_errors_total{error_type="timeout|validation|internal"}

Cache Metrics:

# Hit rate
stratarouter_cache_hit_rate
stratarouter_cache_hits_total
stratarouter_cache_misses_total

# Size
stratarouter_cache_size_bytes
stratarouter_cache_entries

Provider Metrics:

# Per-provider latency
stratarouter_provider_latency_ms{provider="openai|anthropic|google"}

# Provider costs
stratarouter_provider_cost_usd{provider="openai|anthropic|google"}

# Provider errors
stratarouter_provider_errors_total{provider="...",error="..."}
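These series can be combined in PromQL for higher-level views. Two example queries, assuming the cost and error series are monotonic counters:

```promql
# Estimated USD spent per provider over the last hour
increase(stratarouter_provider_cost_usd[1h])

# Per-provider error rate (errors/sec) over 5 minutes
sum by (provider) (rate(stratarouter_provider_errors_total[5m]))
```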

Grafana Dashboard

{
  "dashboard": {
    "title": "StrataRouter Production",
    "panels": [
      {
        "title": "P99 Latency",
        "targets": [{
          "expr": "stratarouter_routing_latency_ms{quantile=\"0.99\"}"
        }]
      },
      {
        "title": "Requests/sec",
        "targets": [{
          "expr": "rate(stratarouter_requests_total[1m])"
        }]
      },
      {
        "title": "Cache Hit Rate",
        "targets": [{
          "expr": "stratarouter_cache_hit_rate"
        }]
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "rate(stratarouter_errors_total[1m])"
        }]
      }
    ]
  }
}

Distributed Tracing

Enable OpenTelemetry

config = RuntimeConfig(
    tracing_enabled=True,
    tracing_endpoint="http://jaeger:4318",
    tracing_sample_rate=0.1,  # 10% sampling
    service_name="stratarouter"
)

Trace Structure

Request [85ms]
├─ Route Selection [1.2ms]
│  ├─ Embedding Generation [0.8ms]
│  └─ HNSW Search [0.4ms]
├─ Cache Lookup [0.3ms]
├─ LLM Execution [82ms]
│  ├─ Provider Call [80ms]
│  └─ Response Parse [2ms]
└─ State Update [1.5ms]

Custom Spans

from stratarouter.runtime import trace_span

@trace_span("custom_processing")
def process_result(result):
    # Your processing logic here; work done in this function is
    # recorded under the "custom_processing" span
    return result
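If you want the same decorator shape outside StrataRouter, a dependency-free sketch is easy to hand-roll. Note that `timed_span` below is our own illustration (it only records wall-clock duration into a list), not the library's implementation:

```python
import functools
import time


def timed_span(name, sink):
    """Decorator: record the wall-clock duration (ms) of each call under `name`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Append (span name, elapsed milliseconds) even if the call raises
                sink.append((name, (time.perf_counter() - start) * 1000.0))
        return wrapper
    return decorator
```

In production you would emit the span to your tracing backend instead of a list, but the decorator structure is the same.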

Logging

Configure Logging

import logging

config = RuntimeConfig(
    log_level=logging.INFO,
    log_format="json",
    log_structured=True
)

Log Levels

# ERROR - Critical issues
logging.error("Route lookup failed", extra={
    "route_id": route_id,
    "error": str(e)
})

# WARNING - Degraded performance
logging.warning("Cache miss rate above threshold", extra={
    "hit_rate": 0.45,
    "threshold": 0.85
})

# INFO - Normal operations
logging.info("Request routed successfully", extra={
    "route_id": route_id,
    "confidence": 0.92,
    "latency_ms": 1.2
})

# DEBUG - Detailed diagnostics
logging.debug("HNSW search completed", extra={
    "neighbors": 50,
    "search_time_ms": 0.4
})

Structured Logging

{
  "timestamp": "2026-01-11T10:30:45.123Z",
  "level": "INFO",
  "message": "Request routed successfully",
  "route_id": "billing",
  "confidence": 0.92,
  "latency_ms": 1.2,
  "cache_hit": true,
  "provider": "openai",
  "trace_id": "abc123"
}
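If you are not using the built-in `log_format="json"`, a similar payload shape can be produced with the standard library alone. A minimal sketch (field names chosen to match the example above):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON objects."""

    # Standard LogRecord attributes we do not copy into the payload
    _RESERVED = set(vars(logging.makeLogRecord({})))

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge anything passed via `extra=` (e.g. route_id, confidence)
        for key, value in vars(record).items():
            if key not in self._RESERVED and not key.startswith("_"):
                payload[key] = value
        return json.dumps(payload)
```

Attach it to any handler with `handler.setFormatter(JsonFormatter())`; fields passed through `extra=` then appear as top-level JSON keys.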

Health Checks

Readiness Check

GET /health/ready
{
  "status": "ready",
  "checks": {
    "router": "ok",
    "cache": "ok",
    "providers": {
      "openai": "ok",
      "anthropic": "ok"
    }
  }
}
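A deployment script or orchestrator can evaluate this payload before admitting traffic. A small sketch that walks the (possibly nested) checks object; `is_ready` is a hypothetical helper, and the payload shape follows the example above:

```python
def is_ready(health: dict) -> bool:
    """Return True only if the service reports 'ready' and every check is 'ok'."""
    if health.get("status") != "ready":
        return False

    def all_ok(checks) -> bool:
        # Checks may be nested, e.g. per-provider status under "providers"
        if isinstance(checks, dict):
            return all(all_ok(v) for v in checks.values())
        return checks == "ok"

    return all_ok(health.get("checks", {}))
```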

Liveness Check

GET /health/live
{
  "status": "alive",
  "uptime_seconds": 86400,
  "version": "1.0.0"
}

Alerting

Prometheus Alerts

groups:
  - name: stratarouter
    rules:
      - alert: HighP99Latency
        expr: stratarouter_routing_latency_ms{quantile="0.99"} > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency: {{ $value }}ms"

      - alert: LowCacheHitRate
        expr: stratarouter_cache_hit_rate < 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit rate below 70%"

      - alert: HighErrorRate
        expr: rate(stratarouter_errors_total[5m]) / rate(stratarouter_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate > 1% of requests"

      - alert: ProviderDown
        expr: rate(stratarouter_provider_errors_total[5m]) > 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Provider {{ $labels.provider }} experiencing errors"

Dashboards

Production Dashboard

Key metrics to monitor:

  1. Performance
     • P50, P95, P99 latency
     • Requests per second
     • Error rate

  2. Resources
     • CPU usage
     • Memory usage
     • Cache size

  3. Cache
     • Hit rate
     • Miss rate
     • Evictions

  4. Providers
     • Per-provider latency
     • Per-provider costs
     • Provider errors

Log Aggregation

ELK Stack

# Filebeat configuration
filebeat.inputs:
  - type: log
    paths:
      - /var/log/stratarouter/*.log
    json.keys_under_root: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]

CloudWatch Logs

import watchtower

config = RuntimeConfig(
    log_handler=watchtower.CloudWatchLogHandler(
        log_group="/aws/stratarouter",
        stream_name="production"
    )
)

Performance Monitoring

Key Performance Indicators

Metric          Target    Warning   Critical
P99 Latency     < 10ms    > 20ms    > 50ms
Cache Hit Rate  > 85%     < 70%     < 50%
Error Rate      < 0.1%    > 0.5%    > 1%
CPU Usage       < 70%     > 80%     > 90%
Memory Usage    < 80%     > 90%     > 95%
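The table above can be encoded directly for automated checks. A minimal sketch, where `high_is_bad` distinguishes metrics that degrade upward (latency, errors, CPU, memory) from the one that degrades downward (cache hit rate); the metric keys are placeholders:

```python
# Thresholds from the KPI table above
KPI_THRESHOLDS = {
    "p99_latency_ms": {"warning": 20,    "critical": 50,   "high_is_bad": True},
    "cache_hit_rate": {"warning": 0.70,  "critical": 0.50, "high_is_bad": False},
    "error_rate":     {"warning": 0.005, "critical": 0.01, "high_is_bad": True},
    "cpu_usage":      {"warning": 0.80,  "critical": 0.90, "high_is_bad": True},
    "memory_usage":   {"warning": 0.90,  "critical": 0.95, "high_is_bad": True},
}


def classify(metric: str, value: float) -> str:
    """Map a metric reading to 'ok', 'warning', or 'critical'."""
    t = KPI_THRESHOLDS[metric]
    exceeds = lambda limit: value > limit if t["high_is_bad"] else value < limit
    if exceeds(t["critical"]):
        return "critical"
    if exceeds(t["warning"]):
        return "warning"
    return "ok"
```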

Incident Response

Runbook

High Latency:
  1. Check provider status
  2. Review cache hit rate
  3. Check resource usage
  4. Review recent changes

High Error Rate:
  1. Check logs for error patterns
  2. Verify provider API keys
  3. Check network connectivity
  4. Review configuration

Cache Issues:
  1. Check Redis/cache backend status
  2. Review cache size and evictions
  3. Adjust TTL if needed
  4. Check similarity threshold


Next Steps

Performance

Optimize your deployment

Tuning Guide →

Scaling

Handle more traffic

Scaling Guide →

Troubleshooting

Debug common issues

Troubleshoot →