Performance¶
Benchmarks, optimization tips, and performance characteristics of StrataRouter Core.
Benchmarks¶
Latency Distribution¶
| Percentile | Latency | vs semantic-router |
|---|---|---|
| P50 | <2ms | 70x+ faster |
| P75 | <4ms | 40x+ faster |
| P90 | <6ms | 30x+ faster |
| P95 | <8ms | 25x+ faster |
| P99 | <10ms | 20x+ faster |
| P99.9 | <15ms | 15x+ faster |
Throughput¶
| Configuration | Throughput | Latency (P99) |
|---|---|---|
| Single core | 2,500+ req/s | <10ms |
| 2 cores | 5,000+ req/s | <10ms |
| 4 cores | 10,000+ req/s | <10ms |
| 8 cores | 20,000+ req/s | <12ms |
Memory Usage¶
| Routes | Memory | vs semantic-router |
|---|---|---|
| 10 | ~35MB | 60x less |
| 100 | ~45MB | 47x less |
| 1K | ~64MB | 33x less |
| 10K | ~180MB | 12x less |
| 100K | ~1.2GB | 1.8x less |
Accuracy¶
| Confidence Range | Accuracy | Count |
|---|---|---|
| 0.0-0.5 | 78.5% | 234 |
| 0.5-0.7 | 85.2% | 512 |
| 0.7-0.8 | 90.1% | 823 |
| 0.8-0.9 | 94.7% | 1,456 |
| 0.9-1.0 | 98.2% | 3,211 |
| Overall | 95.4% | 6,236 |
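Since accuracy rises sharply with confidence, a common pattern is to accept a routed result only above a confidence bar and fall back otherwise. A minimal sketch, assuming `route()` returns a dict with `route` and `confidence` keys (the key names and fallback behavior here are illustrative, not a documented API):

```python
# Confidence-based gating: accept only high-confidence routing results.
# Per the table above, results with confidence >= 0.8 are ~95%+ accurate.

def gate(result: dict, min_confidence: float = 0.8):
    """Return the route name when confidence clears the bar;
    return None to signal the caller to fall back (e.g. to a default route)."""
    if result.get("route") is None:
        return None
    if result.get("confidence", 0.0) >= min_confidence:
        return result["route"]
    return None  # below the bar: caller falls back
```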
Comparison¶
vs semantic-router¶
| Metric | StrataRouter | semantic-router | Improvement |
|---|---|---|---|
| P99 Latency | 8.7ms | 178ms | 20x faster |
| Throughput | 18K/s | 450/s | 40x higher |
| Memory | 64MB | 2.1GB | 33x less |
| Accuracy | 95.4% | 84.7% | +12.7% |
| Cold Start | 15ms | 2.3s | 153x faster |
vs LangChain Router¶
| Metric | StrataRouter | LangChain | Improvement |
|---|---|---|---|
| P99 Latency | 8.7ms | 145ms | 17x faster |
| Accuracy | 95.4% | 88.2% | +8.2% |
| Memory | 64MB | 850MB | 13x less |
Optimization Guide¶
1. Choose the Right Dimension¶
```python
# Smaller dimension = faster, less accurate
router_fast = Router(dimension=128)      # 3x faster

# Balanced
router_balanced = Router(dimension=384)  # Recommended

# Larger dimension = slower, more accurate
router_accurate = Router(dimension=768)  # 1.5x slower
router_best = Router(dimension=1536)     # 3x slower
```
Recommendations:

- 384: Best balance for most cases (all-MiniLM-L6-v2)
- 768: Better accuracy for complex domains (BERT-base)
- 1536: Maximum accuracy (OpenAI ada-002)
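The router's dimension must match the embedding model's output size, or routing fails at setup time. A small lookup like the one below keeps the two in sync; the model-to-dimension pairings come from the recommendations above, while the helper itself is illustrative:

```python
# Map embedding model names to their output dimensions so the Router
# dimension can never silently drift from the encoder.
MODEL_DIMS = {
    "all-MiniLM-L6-v2": 384,        # balanced default
    "bert-base-uncased": 768,       # better accuracy, slower
    "text-embedding-ada-002": 1536, # maximum accuracy
}

def dimension_for(model_name: str) -> int:
    """Look up the embedding dimension for a known model name."""
    try:
        return MODEL_DIMS[model_name]
    except KeyError:
        raise ValueError(f"Unknown model {model_name!r}; add it to MODEL_DIMS")

# router = Router(dimension=dimension_for("all-MiniLM-L6-v2"))  # 384
```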
2. Tune Threshold¶
```python
# Strict (fewer false positives, more rejections)
router = Router(threshold=0.8)

# Balanced (default)
router = Router(threshold=0.5)

# Lenient (fewer rejections, more false positives)
router = Router(threshold=0.3)
```
Impact on performance:

- Threshold has zero impact on latency
- Only affects which routes are returned
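The threshold costs nothing at runtime because it is applied after scoring: it only filters the already-scored candidates. A minimal sketch of that filtering step (the candidate names and scores are made up for illustration):

```python
# Post-scoring threshold filter: keep candidates whose score clears
# the bar, best first. Scoring cost is unchanged regardless of threshold.

def apply_threshold(scored, threshold):
    kept = [(name, s) for name, s in scored if s >= threshold]
    return sorted(kept, key=lambda x: -x[1])

scored = [("billing", 0.82), ("support", 0.55), ("sales", 0.31)]
print(apply_threshold(scored, 0.8))  # strict: [('billing', 0.82)]
print(apply_threshold(scored, 0.3))  # lenient: all three candidates
```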
3. Optimize Route Count¶
```python
# Latency scales with route count
# 10 routes:     ~2ms
# 100 routes:    ~3ms
# 1,000 routes:  ~5ms
# 10,000 routes: ~12ms
```
Strategies:

- Keep routes under 1,000 for sub-5ms latency
- Use hierarchical routing for more routes
- Group similar routes together
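The hierarchical-routing idea can be sketched as a two-level lookup: pick a group by its centroid first, then search only that group's routes. With G groups of roughly N/G routes each, comparisons drop from N to G + N/G. Everything below (the cosine helper, the group layout, the 2-D toy embeddings) is a from-scratch illustration, not StrataRouter's API:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def route_hierarchical(query_emb, groups):
    """groups: {group_name: (centroid, {route_name: embedding})}"""
    # Level 1: nearest group centroid
    best_group = max(groups, key=lambda g: cosine(query_emb, groups[g][0]))
    # Level 2: nearest route within that group only
    _, routes = groups[best_group]
    return max(routes, key=lambda r: cosine(query_emb, routes[r]))

groups = {
    "money":  ([1.0, 0.0], {"billing": [0.9, 0.1], "refunds": [0.8, 0.2]}),
    "people": ([0.0, 1.0], {"hr": [0.1, 0.9], "support": [0.2, 0.8]}),
}
print(route_hierarchical([0.95, 0.05], groups))  # billing
```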
4. Batch Encode Queries¶
```python
# ❌ Bad: One at a time
for query in queries:
    emb = model.encode([query])[0]
    result = router.route(query, emb.tolist())

# ✅ Good: Batch encode
embeddings = model.encode(queries, batch_size=32)
for query, emb in zip(queries, embeddings):
    result = router.route(query, emb.tolist())
```
Speedup: 5-10x for encoding, no change for routing
5. Reuse Router Instance¶
```python
# ❌ Bad: Create a new router on every call
def route_query(query):
    router = Router(dimension=384)
    # ... setup ...
    return router.route(query, emb)

# ✅ Good: Create once, reuse
router = Router(dimension=384)
# ... setup once ...

def route_query(query):
    return router.route(query, emb)
```
Speedup: Eliminates 15ms cold start per query
6. Cache Embeddings¶
```python
import functools

@functools.lru_cache(maxsize=1000)
def get_embedding(text: str):
    return model.encode([text])[0].tolist()

# Use cached embeddings
emb = get_embedding(query)
result = router.route(query, emb)
```
Speedup: Up to 100x for repeated queries
7. Reduce Top-K¶
```python
# Default: top_k=5
router = Router(dimension=384, top_k=5)

# Faster: top_k=3
router = Router(dimension=384, top_k=3)

# Slower but more accurate: top_k=10
router = Router(dimension=384, top_k=10)
```
Impact:
- top_k=3: 20% faster, 2% less accurate
- top_k=10: 30% slower, 1% more accurate
8. Enable SIMD¶
SIMD is automatically enabled on supported CPUs.
Check support:
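On Linux, one way to inspect the CPU flags the gains above rely on is to read `/proc/cpuinfo`. This helper and its name are illustrative, not a StrataRouter API, and the check is Linux-specific:

```python
import platform

def cpu_simd_flags():
    """Report AVX2/AVX-512 availability by reading /proc/cpuinfo (Linux only).
    On other platforms, or if the file is unreadable, both report False."""
    flags = set()
    if platform.system() == "Linux":
        try:
            with open("/proc/cpuinfo") as f:
                for line in f:
                    if line.startswith("flags"):
                        flags.update(line.split(":", 1)[1].split())
                        break
        except OSError:
            pass
    return {"avx2": "avx2" in flags, "avx512": "avx512f" in flags}

print(cpu_simd_flags())
```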
Performance gain: 2-4x on dot product operations
Hardware Recommendations¶
Minimum¶
- CPU: 2 cores, 2 GHz
- RAM: 512MB
- Disk: 100MB
- Throughput: ~5K req/s
Recommended¶
- CPU: 4 cores, 3 GHz, AVX2
- RAM: 2GB
- Disk: 1GB
- Throughput: ~15K req/s
High Performance¶
- CPU: 8+ cores, 3.5 GHz, AVX2/AVX512
- RAM: 8GB
- Disk: 5GB SSD
- Throughput: ~100K req/s
Profiling¶
Measure Latency¶
```python
import time

start = time.perf_counter()
result = router.route(query, embedding)
end = time.perf_counter()

print(f"Total: {(end-start)*1000:.2f}ms")
print(f"Reported: {result['latency_ms']:.2f}ms")
print(f"Overhead: {((end-start)*1000 - result['latency_ms']):.2f}ms")
```
Component Breakdown¶
```python
# Typical latency breakdown (1K routes):
# - HNSW search:  0.8ms (35%)
# - Dense score:  0.2ms (9%)
# - Sparse score: 0.5ms (22%)
# - Rule match:   0.1ms (4%)
# - Fusion:       0.1ms (4%)
# - Calibration:  0.1ms (4%)
# - Overhead:     0.5ms (22%)
# Total: ~2.3ms
```
Memory Profiling¶
```python
import tracemalloc

tracemalloc.start()

# Create router
router = Router(dimension=384)
# ... add routes, build index ...

current, peak = tracemalloc.get_traced_memory()
print(f"Current: {current / 1024**2:.1f}MB")
print(f"Peak: {peak / 1024**2:.1f}MB")
tracemalloc.stop()
```
Best Practices¶
1. Production Configuration¶
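A reasonable starting point, using only the `Router` parameters shown elsewhere on this page (`dimension`, `threshold`, `top_k`); treat the values as defaults to tune against your own load test, not a prescription:

```python
# Production baseline, assembled from the recommendations above.
router = Router(
    dimension=384,  # all-MiniLM-L6-v2; best balance for most cases
    threshold=0.5,  # balanced default; raise to 0.8 to cut false positives
    top_k=5,        # default; lower to 3 for ~20% faster routing
)
```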
2. Load Test Before Deployment¶
```python
import time
import numpy as np

queries = ["test query"] * 10000
embeddings = [np.random.randn(384).tolist() for _ in queries]

start = time.time()
for query, emb in zip(queries, embeddings):
    result = router.route(query, emb)
end = time.time()

throughput = len(queries) / (end - start)
avg_latency = ((end - start) / len(queries)) * 1000
print(f"Throughput: {throughput:.0f} req/s")
print(f"Avg Latency: {avg_latency:.2f}ms")
```
3. Monitor in Production¶
```python
from collections import deque

class LatencyMonitor:
    """Track routing latencies over a sliding window and report percentiles."""

    def __init__(self, window=1000):
        self.latencies = deque(maxlen=window)

    def record(self, latency_ms):
        self.latencies.append(latency_ms)

    def stats(self):
        sorted_lat = sorted(self.latencies)
        n = len(sorted_lat)
        return {
            'p50': sorted_lat[int(n * 0.50)],
            'p95': sorted_lat[int(n * 0.95)],
            'p99': sorted_lat[int(n * 0.99)],
            'avg': sum(sorted_lat) / n,
        }

monitor = LatencyMonitor()

# Record latencies on every request
result = router.route(query, embedding)
monitor.record(result['latency_ms'])

# Check stats periodically (requests = your running request counter)
if requests % 1000 == 0:
    print(monitor.stats())
```
Troubleshooting¶
High Latency¶
Symptoms: P99 > 20ms
Causes & solutions:

1. Too many routes (>10K): reduce routes or use hierarchical routing
2. Large dimension (>768): use smaller embeddings
3. Cold start: warm up the router at startup
4. CPU contention: dedicate cores to the router
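Warming up at startup pays the cold-start cost once, before traffic arrives. A minimal sketch; `route_fn` stands in for `router.route`, and the helper name is illustrative:

```python
import random

def warm_up(route_fn, dimension: int, n: int = 50) -> int:
    """Issue n dummy routing calls with random embeddings so first-request
    latency is paid at startup. Returns the number of calls made."""
    for _ in range(n):
        emb = [random.gauss(0.0, 1.0) for _ in range(dimension)]
        route_fn("warmup", emb)
    return n

# At startup, before serving traffic:
# warm_up(router.route, dimension=384)
```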
High Memory¶
Symptoms: Memory > 1GB for <10K routes
Causes & solutions:

1. Large embeddings (1536D): use 384D embeddings
2. Too many routes: prune unused routes
3. Memory leak (rare): update to the latest version
Low Accuracy¶
Symptoms: Accuracy < 90%
Causes & solutions:

1. Poor embeddings: use a better embedding model
2. Overlapping routes: make routes more distinct
3. Missing keywords: add more keywords
4. Wrong threshold: tune the threshold
Next Steps¶
- Routing Engine - How it works
- Algorithms - Algorithm details
- Examples - Code examples