
Performance

Benchmarks, optimization tips, and performance characteristics of StrataRouter Core.

Benchmarks

Latency Distribution

| Percentile | Latency | vs semantic-router |
|------------|---------|--------------------|
| P50        | <2ms    | 70x+ faster        |
| P75        | <4ms    | 40x+ faster        |
| P90        | <6ms    | 30x+ faster        |
| P95        | <8ms    | 25x+ faster        |
| P99        | <10ms   | 20x+ faster        |
| P99.9      | <15ms   | 15x+ faster        |

Throughput

| Configuration | Throughput    | Latency (P99) |
|---------------|---------------|---------------|
| Single core   | 2,500+ req/s  | <10ms         |
| 2 cores       | 5,000+ req/s  | <10ms         |
| 4 cores       | 10,000+ req/s | <10ms         |
| 8 cores       | 20,000+ req/s | <12ms         |

Memory Usage

| Routes | Memory | vs semantic-router |
|--------|--------|--------------------|
| 10     | ~35MB  | 60x less           |
| 100    | ~45MB  | 47x less           |
| 1K     | ~64MB  | 33x less           |
| 10K    | ~180MB | 12x less           |
| 100K   | ~1.2GB | 1.8x less          |

Accuracy

| Confidence Range | Accuracy | Count |
|------------------|----------|-------|
| 0.0-0.5          | 78.5%    | 234   |
| 0.5-0.7          | 85.2%    | 512   |
| 0.7-0.8          | 90.1%    | 823   |
| 0.8-0.9          | 94.7%    | 1,456 |
| 0.9-1.0          | 98.2%    | 3,211 |
| Overall          | 95.4%    | 6,236 |

Comparison

vs semantic-router

| Metric      | StrataRouter | semantic-router | Improvement |
|-------------|--------------|-----------------|-------------|
| P99 Latency | 8.7ms        | 178ms           | 20x faster  |
| Throughput  | 18K/s        | 450/s           | 40x higher  |
| Memory      | 64MB         | 2.1GB           | 33x less    |
| Accuracy    | 95.4%        | 84.7%           | +10.7 pp    |
| Cold Start  | 15ms         | 2.3s            | 153x faster |

vs LangChain Router

| Metric      | StrataRouter | LangChain | Improvement |
|-------------|--------------|-----------|-------------|
| P99 Latency | 8.7ms        | 145ms     | 17x faster  |
| Accuracy    | 95.4%        | 88.2%     | +7.2 pp     |
| Memory      | 64MB         | 850MB     | 13x less    |

Optimization Guide

1. Choose the Right Dimension

# Smaller dimension = faster, less accurate
router_fast = Router(dimension=128)  # 3x faster

# Balanced
router_balanced = Router(dimension=384)  # Recommended

# Larger dimension = slower, more accurate
router_accurate = Router(dimension=768)  # 1.5x slower
router_best = Router(dimension=1536)  # 3x slower

Recommendations:

- 384: best balance for most cases (all-MiniLM-L6-v2)
- 768: better accuracy for complex domains (BERT-base)
- 1536: maximum accuracy (OpenAI ada-002)

2. Tune Threshold

# Strict (fewer false positives, more rejections)
router = Router(threshold=0.8)

# Balanced (default)
router = Router(threshold=0.5)

# Lenient (fewer rejections, more false positives)
router = Router(threshold=0.3)

Impact on performance:

- Threshold has zero impact on latency
- It only affects which routes are returned
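Why the threshold cannot affect latency: it is a filter applied to scores the router has already computed. The sketch below makes that concrete; `scored_routes` is a hypothetical stand-in for the router's internal fused scores, not a documented StrataRouter structure.

```python
def apply_threshold(scored_routes, threshold):
    """Keep only candidates whose score clears the threshold.

    `scored_routes` is a list of (route_name, score) pairs, a stand-in
    for the fused scores the router has already computed. The scoring
    work is done either way; the threshold only prunes the result.
    """
    return [(name, score) for name, score in scored_routes if score >= threshold]

candidates = [("billing", 0.82), ("support", 0.55), ("chitchat", 0.31)]
apply_threshold(candidates, 0.8)  # strict: only ("billing", 0.82) survives
apply_threshold(candidates, 0.3)  # lenient: all three pass
```

Raising the threshold therefore trades recall for precision without touching the hot path.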

3. Optimize Route Count

# Latency scales with route count

# 10 routes: ~2ms
# 100 routes: ~3ms
# 1,000 routes: ~5ms
# 10,000 routes: ~12ms

Strategies:

- Keep routes under 1,000 for sub-5ms latency
- Use hierarchical routing for more routes
- Group similar routes together
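Hierarchical routing keeps per-query work small by routing in two stages: a coarse router picks a group, then a small per-group router picks the final route. This is a minimal sketch of the idea; StrataRouter has no built-in hierarchy API that the docs describe, so the routers here are plain callables.

```python
def route_hierarchical(query, emb, top_router, sub_routers):
    """Two-level routing: pick a coarse group, then route within it.

    With 10 groups of ~100 routes each, every query scores ~110
    candidates instead of 1,000. `top_router` and the values of
    `sub_routers` stand in for per-level Router instances.
    """
    group = top_router(query, emb)
    return group, sub_routers[group](query, emb)

# Toy routers standing in for real, embedding-based ones:
top = lambda q, e: "billing" if "invoice" in q else "support"
subs = {
    "billing": lambda q, e: "billing.refund",
    "support": lambda q, e: "support.password_reset",
}

route_hierarchical("invoice overdue", [], top, subs)
# → ("billing", "billing.refund")
```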

4. Batch Encode Queries

# ❌ Bad: One at a time
for query in queries:
    emb = model.encode([query])[0]
    result = router.route(query, emb.tolist())

# ✅ Good: Batch encode
embeddings = model.encode(queries, batch_size=32)
for query, emb in zip(queries, embeddings):
    result = router.route(query, emb.tolist())

Speedup: 5-10x for encoding, no change for routing

5. Reuse Router Instance

# ❌ Bad: Create new router each time
def route_query(query):
    router = Router(dimension=384)
    # ... setup ...
    return router.route(query, emb)

# ✅ Good: Create once, reuse
router = Router(dimension=384)
# ... setup once ...

def route_query(query):
    return router.route(query, emb)

Speedup: Eliminates 15ms cold start per query
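One simple way to enforce "create once, reuse" is a cached factory: the first call builds the router, every later call returns the same instance. `RouterStub` below is a placeholder for the real `Router(dimension=384)` plus route setup, so the sketch runs on its own.

```python
import functools

@functools.lru_cache(maxsize=1)
def get_router():
    """Build the router (and pay the cold start) exactly once per process."""
    class RouterStub:  # stand-in for Router(dimension=384) + route setup
        def route(self, query, emb):
            return {"route": "demo"}
    return RouterStub()

get_router() is get_router()  # → True: same instance every call
```

The same pattern works with any expensive singleton; for multi-process servers (e.g. gunicorn workers), each process builds its own copy once.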

6. Cache Embeddings

import functools

@functools.lru_cache(maxsize=1000)
def get_embedding(text: str):
    return model.encode([text])[0].tolist()

# Use cached embeddings
emb = get_embedding(query)
result = router.route(query, emb)

Speedup: Up to 100x for repeated queries
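One caveat with the pattern above: the raw text is the cache key, so `"Reset PW"` and `"reset pw  "` occupy separate entries. Normalizing the text *before* it reaches the cached function improves the hit rate. The encoder here is a counting stub, not a real model, so the cache behavior is visible.

```python
import functools

calls = {"n": 0}  # counts how often the (stub) encoder actually runs

@functools.lru_cache(maxsize=1000)
def _embed_cached(normalized_text: str):
    calls["n"] += 1                       # a real model.encode call would go here
    return [float(len(normalized_text))]  # stub embedding

def get_embedding(text: str):
    # Normalize BEFORE hitting the cache so trivially different spellings
    # of the same query share one cache entry.
    return _embed_cached(" ".join(text.lower().split()))

get_embedding("Reset my password")
get_embedding("  reset my password ")
calls["n"]  # → 1: the second lookup was a cache hit
```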

7. Reduce Top-K

# Default: top_k=5
router = Router(dimension=384, top_k=5)

# Faster: top_k=3
router = Router(dimension=384, top_k=3)

# Slower but more accurate: top_k=10
router = Router(dimension=384, top_k=10)

Impact:

- top_k=3: 20% faster, 2% less accurate
- top_k=10: 30% slower, 1% more accurate
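The reason a smaller top_k is cheaper: selecting the k best of n candidates costs O(n log k) with a heap, versus O(n log n) for a full sort. A minimal illustration with the standard library (this is the general technique, not StrataRouter's internal implementation):

```python
import heapq

def top_k_routes(scores, k):
    """Select the k best (route, score) pairs without sorting everything.

    heapq.nlargest keeps only k items in its heap, so shrinking top_k
    directly shrinks the selection work.
    """
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])

scores = [("a", 0.91), ("b", 0.40), ("c", 0.77), ("d", 0.63)]
top_k_routes(scores, 2)  # → [("a", 0.91), ("c", 0.77)]
```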

8. Enable SIMD

SIMD is automatically enabled on supported CPUs.

Check support:

import platform

print(platform.machine())  # e.g. x86_64 (AVX2-class) or arm64/aarch64 (NEON)
# Note: platform.processor() rarely lists SIMD flags. On Linux, check directly:
#   grep -om1 'avx512\|avx2' /proc/cpuinfo

Performance gain: 2-4x on dot product operations

Hardware Recommendations

Minimum

  • CPU: 2 cores, 2 GHz
  • RAM: 512MB
  • Disk: 100MB
  • Throughput: ~5K req/s

Recommended

  • CPU: 4 cores, 3 GHz, AVX2
  • RAM: 2GB
  • Disk: 1GB
  • Throughput: ~15K req/s

High Performance

  • CPU: 8+ cores, 3.5 GHz, AVX2/AVX512
  • RAM: 8GB
  • Disk: 5GB SSD
  • Throughput: ~100K req/s

Profiling

Measure Latency

import time

start = time.perf_counter()
result = router.route(query, embedding)
end = time.perf_counter()

print(f"Total: {(end-start)*1000:.2f}ms")
print(f"Reported: {result['latency_ms']:.2f}ms")
print(f"Overhead: {((end-start)*1000 - result['latency_ms']):.2f}ms")

Component Breakdown

# Typical latency breakdown (1K routes):
# - HNSW search: 0.8ms (35%)
# - Dense score: 0.2ms (9%)
# - Sparse score: 0.5ms (22%)
# - Rule match: 0.1ms (4%)
# - Fusion: 0.1ms (4%)
# - Calibration: 0.1ms (4%)
# - Overhead: 0.5ms (22%)
# Total: ~2.3ms

Memory Profiling

import tracemalloc

tracemalloc.start()

# Create router
router = Router(dimension=384)
# ... add routes, build index ...

current, peak = tracemalloc.get_traced_memory()
print(f"Current: {current / 1024**2:.1f}MB")
print(f"Peak: {peak / 1024**2:.1f}MB")

tracemalloc.stop()

Best Practices

1. Production Configuration

router = Router(
    dimension=384,      # Good balance
    threshold=0.5,      # Balanced
    top_k=5,           # Default
)

2. Load Test Before Deployment

import time
import numpy as np

queries = ["test query"] * 10000
embeddings = [np.random.randn(384).tolist() for _ in queries]

start = time.time()
for query, emb in zip(queries, embeddings):
    result = router.route(query, emb)
end = time.time()

throughput = len(queries) / (end - start)
avg_latency = ((end - start) / len(queries)) * 1000

print(f"Throughput: {throughput:.0f} req/s")
print(f"Avg Latency: {avg_latency:.2f}ms")

3. Monitor in Production

import time
from collections import deque

class LatencyMonitor:
    def __init__(self, window=1000):
        self.latencies = deque(maxlen=window)

    def record(self, latency_ms):
        self.latencies.append(latency_ms)

    def stats(self):
        if not self.latencies:  # avoid ZeroDivisionError before any traffic
            return {}
        sorted_lat = sorted(self.latencies)
        n = len(sorted_lat)
        return {
            'p50': sorted_lat[int(n * 0.50)],
            'p95': sorted_lat[int(n * 0.95)],
            'p99': sorted_lat[int(n * 0.99)],
            'avg': sum(sorted_lat) / n
        }

monitor = LatencyMonitor()

# Record latencies
result = router.route(query, embedding)
monitor.record(result['latency_ms'])

# Check stats periodically (assumes `requests` is your running request counter)
if requests % 1000 == 0:
    print(monitor.stats())

Troubleshooting

High Latency

Symptoms: P99 > 20ms

Causes & Solutions:

1. Too many routes (>10K). Solution: reduce routes or use hierarchical routing.
2. Large dimension (>768). Solution: use smaller embeddings.
3. Cold start. Solution: warm up the router at startup.
4. CPU contention. Solution: dedicate cores to the router.
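A warm-up pass is easy to add at startup: send a handful of throwaway requests before accepting traffic so the first real query does not pay the cold-start cost. The helper below takes the route function as a parameter (pass `router.route` in practice) so the sketch stays self-contained; the request count is an arbitrary choice, not a documented requirement.

```python
import random

def warm_up(route_fn, dimension=384, n_requests=20):
    """Send dummy traffic so caches and lazy-initialized paths are hot
    before the first real request. `route_fn` stands in for router.route."""
    for _ in range(n_requests):
        embedding = [random.random() for _ in range(dimension)]
        route_fn("warm-up query", embedding)

# At startup, before serving traffic:
# warm_up(router.route)
```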

High Memory

Symptoms: Memory > 1GB for <10K routes

Causes & Solutions:

1. Large embeddings (1536D). Solution: use 384D embeddings.
2. Too many routes. Solution: prune unused routes.
3. Memory leak (rare). Solution: update to the latest version.

Low Accuracy

Symptoms: Accuracy < 90%

Causes & Solutions:

1. Poor embeddings. Solution: use a better embedding model.
2. Overlapping routes. Solution: make routes more distinct.
3. Missing keywords. Solution: add more keywords.
4. Wrong threshold. Solution: tune the threshold.
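Threshold tuning is easiest with a small labeled set. The sketch below sweeps candidate thresholds and scores each one, treating an abstention (score below threshold) as correct only when the query is genuinely out of scope; the `(score, predicted, true)` tuple shape is an assumption for illustration, not a StrataRouter API.

```python
def sweep_threshold(predictions, thresholds):
    """Return (best_threshold, accuracy) over a labeled evaluation set.

    `predictions` is a list of (score, predicted_route, true_route)
    tuples; true_route is None for out-of-scope queries, where the
    correct behavior is to abstain.
    """
    best = None
    for t in thresholds:
        correct = 0
        for score, pred, true in predictions:
            if score >= t:
                correct += pred == true   # answered: must match the label
            else:
                correct += true is None   # abstained: correct only if out-of-scope
        acc = correct / len(predictions)
        if best is None or acc > best[1]:
            best = (t, acc)
    return best

preds = [(0.9, "a", "a"), (0.6, "b", "a"), (0.4, "c", None)]
sweep_threshold(preds, [0.3, 0.5, 0.7])  # → (0.5, 0.666...)
```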

Next Steps