Performance¶
Benchmarks, optimization tips, and performance characteristics of StrataRouter Core.
Benchmarks¶
Latency Distribution¶
| Percentile | Latency | vs semantic-router |
|---|---|---|
| P50 | <2ms | 70x+ faster |
| P75 | <4ms | 40x+ faster |
| P90 | <6ms | 30x+ faster |
| P95 | <8ms | 25x+ faster |
| P99 | <10ms | 20x+ faster |
| P99.9 | <15ms | 15x+ faster |
Throughput¶
| Configuration | Throughput | Latency (P99) |
|---|---|---|
| Single core | 2,500+ req/s | <10ms |
| 2 cores | 5,000+ req/s | <10ms |
| 4 cores | 10,000+ req/s | <10ms |
| 8 cores | 20,000+ req/s | <12ms |
Memory Usage¶
| Routes | Memory | vs semantic-router |
|---|---|---|
| 10 | ~35MB | 60x less |
| 100 | ~45MB | 47x less |
| 1K | ~64MB | 33x less |
| 10K | ~180MB | 12x less |
| 100K | ~1.2GB | 1.8x less |
Accuracy¶
| Confidence Range | Accuracy | Count |
|---|---|---|
| 0.0-0.5 | 78.5% | 234 |
| 0.5-0.7 | 85.2% | 512 |
| 0.7-0.8 | 90.1% | 823 |
| 0.8-0.9 | 94.7% | 1,456 |
| 0.9-1.0 | 98.2% | 3,211 |
| Overall | 95.4% | 6,236 |
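Since accuracy rises sharply with confidence, a common pattern is to accept a routed result only above a confidence bar and fall back otherwise. A minimal sketch, assuming `route()` returns a dict with `route` and `confidence` keys (the key names and fallback behavior here are illustrative, not a documented API):

```python
# Confidence-based gating: accept only high-confidence routing results.
# Per the table above, results with confidence >= 0.8 are ~95%+ accurate.

def gate(result: dict, min_confidence: float = 0.8):
    """Return the route name when confidence clears the bar;
    return None to signal the caller to fall back (e.g. to a default route)."""
    if result.get("route") is None:
        return None
    if result.get("confidence", 0.0) >= min_confidence:
        return result["route"]
    return None  # below the bar: caller falls back
```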
Comparison¶
vs semantic-router¶
| Metric | StrataRouter | semantic-router | Improvement |
|---|---|---|---|
| P99 Latency | 8.7ms | 178ms | 20x faster |
| Throughput | 18K/s | 450/s | 40x higher |
| Memory | 64MB | 2.1GB | 33x less |
| Accuracy | 95.4% | 84.7% | +12.7% |
| Cold Start | 15ms | 2.3s | 153x faster |
vs LangChain Router¶
| Metric | StrataRouter | LangChain | Improvement |
|---|---|---|---|
| P99 Latency | 8.7ms | 145ms | 17x faster |
| Accuracy | 95.4% | 88.2% | +8.2% |
| Memory | 64MB | 850MB | 13x less |
Optimization Guide¶
1. Choose the Right Dimension¶
```python
# Smaller dimension = faster, less accurate
router_fast = Router(dimension=128)      # 3x faster

# Balanced
router_balanced = Router(dimension=384)  # Recommended

# Larger dimension = slower, more accurate
router_accurate = Router(dimension=768)  # 1.5x slower
router_best = Router(dimension=1536)     # 3x slower
```
Recommendations:

- 384: Best balance for most cases (all-MiniLM-L6-v2)
- 768: Better accuracy for complex domains (BERT-base)
- 1536: Maximum accuracy (OpenAI ada-002)
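The router's dimension must match the embedding model's output size, or routing fails at setup time. A small lookup like the one below keeps the two in sync; the model-to-dimension pairings come from the recommendations above, while the helper itself is illustrative:

```python
# Map embedding model names to their output dimensions so the Router
# dimension can never silently drift from the encoder.
MODEL_DIMS = {
    "all-MiniLM-L6-v2": 384,        # balanced default
    "bert-base-uncased": 768,       # better accuracy, slower
    "text-embedding-ada-002": 1536, # maximum accuracy
}

def dimension_for(model_name: str) -> int:
    """Look up the embedding dimension for a known model name."""
    try:
        return MODEL_DIMS[model_name]
    except KeyError:
        raise ValueError(f"Unknown model {model_name!r}; add it to MODEL_DIMS")

# router = Router(dimension=dimension_for("all-MiniLM-L6-v2"))  # 384
```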
2. Tune Threshold¶
```python
# Strict (fewer false positives, more rejections)
router = Router(threshold=0.8)

# Balanced (default)
router = Router(threshold=0.5)

# Lenient (fewer rejections, more false positives)
router = Router(threshold=0.3)
```
Impact on performance:

- Threshold has zero impact on latency
- Only affects which routes are returned
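The threshold costs nothing at runtime because it is applied after scoring: it only filters the already-scored candidates. A minimal sketch of that filtering step (the candidate names and scores are made up for illustration):

```python
# Post-scoring threshold filter: keep candidates whose score clears
# the bar, best first. Scoring cost is unchanged regardless of threshold.

def apply_threshold(scored, threshold):
    kept = [(name, s) for name, s in scored if s >= threshold]
    return sorted(kept, key=lambda x: -x[1])

scored = [("billing", 0.82), ("support", 0.55), ("sales", 0.31)]
print(apply_threshold(scored, 0.8))  # strict: [('billing', 0.82)]
print(apply_threshold(scored, 0.3))  # lenient: all three candidates
```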
3. Optimize Route Count¶
```python
# Latency scales with route count
# 10 routes:     ~2ms
# 100 routes:    ~3ms
# 1,000 routes:  ~5ms
# 10,000 routes: ~12ms
```
Strategies:

- Keep routes under 1,000 for sub-5ms latency
- Use hierarchical routing for more routes
- Group similar routes together
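The hierarchical-routing idea can be sketched as a two-level lookup: pick a group by its centroid first, then search only that group's routes. With G groups of roughly N/G routes each, comparisons drop from N to G + N/G. Everything below (the cosine helper, the group layout, the 2-D toy embeddings) is a from-scratch illustration, not StrataRouter's API:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def route_hierarchical(query_emb, groups):
    """groups: {group_name: (centroid, {route_name: embedding})}"""
    # Level 1: nearest group centroid
    best_group = max(groups, key=lambda g: cosine(query_emb, groups[g][0]))
    # Level 2: nearest route within that group only
    _, routes = groups[best_group]
    return max(routes, key=lambda r: cosine(query_emb, routes[r]))

groups = {
    "money":  ([1.0, 0.0], {"billing": [0.9, 0.1], "refunds": [0.8, 0.2]}),
    "people": ([0.0, 1.0], {"hr": [0.1, 0.9], "support": [0.2, 0.8]}),
}
print(route_hierarchical([0.95, 0.05], groups))  # billing
```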
4. Batch Encode Queries¶
```python
# ❌ Bad: One at a time
for query in queries:
    emb = model.encode([query])[0]
    result = router.route(query, emb.tolist())

# ✅ Good: Batch encode
embeddings = model.encode(queries, batch_size=32)
for query, emb in zip(queries, embeddings):
    result = router.route(query, emb.tolist())
```
Speedup: 5-10x for encoding, no change for routing
5. Reuse Router Instance¶
```python
# ❌ Bad: Create a new router on every call
def route_query(query):
    router = Router(dimension=384)
    # ... setup ...
    return router.route(query, emb)

# ✅ Good: Create once, reuse
router = Router(dimension=384)
# ... setup once ...

def route_query(query):
    return router.route(query, emb)
```
Speedup: Eliminates 15ms cold start per query
6. Cache Embeddings¶
```python
import functools

@functools.lru_cache(maxsize=1000)
def get_embedding(text: str):
    return model.encode([text])[0].tolist()

# Use cached embeddings
emb = get_embedding(query)
result = router.route(query, emb)
```
Speedup: Up to 100x for repeated queries
7. Reduce Top-K¶
```python
# Default: top_k=5
router = Router(dimension=384, top_k=5)

# Faster: top_k=3
router = Router(dimension=384, top_k=3)

# Slower but more accurate: top_k=10
router = Router(dimension=384, top_k=10)
```
Impact:
- top_k=3: 20% faster, 2% less accurate
- top_k=10: 30% slower, 1% more accurate
8. Enable SIMD¶
SIMD is automatically enabled on supported CPUs.
Check support:
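On Linux, one way to inspect the CPU flags the gains above rely on is to read `/proc/cpuinfo`. This helper and its name are illustrative, not a StrataRouter API, and the check is Linux-specific:

```python
import platform

def cpu_simd_flags():
    """Report AVX2/AVX-512 availability by reading /proc/cpuinfo (Linux only).
    On other platforms, or if the file is unreadable, both report False."""
    flags = set()
    if platform.system() == "Linux":
        try:
            with open("/proc/cpuinfo") as f:
                for line in f:
                    if line.startswith("flags"):
                        flags.update(line.split(":", 1)[1].split())
                        break
        except OSError:
            pass
    return {"avx2": "avx2" in flags, "avx512": "avx512f" in flags}

print(cpu_simd_flags())
```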
Performance gain: 2-4x on dot product operations
Hardware Recommendations¶
Minimum¶
- CPU: 2 cores, 2 GHz
- RAM: 512MB
- Disk: 100MB
- Throughput: ~5K req/s
Recommended¶
- CPU: 4 cores, 3 GHz, AVX2
- RAM: 2GB
- Disk: 1GB
- Throughput: ~15K req/s
High Performance¶
- CPU: 8+ cores, 3.5 GHz, AVX2/AVX512
- RAM: 8GB
- Disk: 5GB SSD
- Throughput: ~100K req/s
Profiling¶
Measure Latency¶
```python
import time

start = time.perf_counter()
result = router.route(query, embedding)
end = time.perf_counter()

print(f"Total: {(end-start)*1000:.2f}ms")
print(f"Reported: {result['latency_ms']:.2f}ms")
print(f"Overhead: {((end-start)*1000 - result['latency_ms']):.2f}ms")
```
Component Breakdown¶
```python
# Typical latency breakdown (1K routes):
# - HNSW search:  0.8ms (35%)
# - Dense score:  0.2ms (9%)
# - Sparse score: 0.5ms (22%)
# - Rule match:   0.1ms (4%)
# - Fusion:       0.1ms (4%)
# - Calibration:  0.1ms (4%)
# - Overhead:     0.5ms (22%)
# Total: ~2.3ms
```
Memory Profiling¶
```python
import tracemalloc

tracemalloc.start()

# Create router
router = Router(dimension=384)
# ... add routes, build index ...

current, peak = tracemalloc.get_traced_memory()
print(f"Current: {current / 1024**2:.1f}MB")
print(f"Peak: {peak / 1024**2:.1f}MB")
tracemalloc.stop()
```
Best Practices¶
1. Production Configuration¶
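A reasonable starting point, using only the `Router` parameters shown elsewhere on this page (`dimension`, `threshold`, `top_k`); treat the values as defaults to tune against your own load test, not a prescription:

```python
# Production baseline, assembled from the recommendations above.
router = Router(
    dimension=384,  # all-MiniLM-L6-v2; best balance for most cases
    threshold=0.5,  # balanced default; raise to 0.8 to cut false positives
    top_k=5,        # default; lower to 3 for ~20% faster routing
)
```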
2. Load Test Before Deployment¶
```python
import time
import numpy as np

queries = ["test query"] * 10000
embeddings = [np.random.randn(384).tolist() for _ in queries]

start = time.time()
for query, emb in zip(queries, embeddings):
    result = router.route(query, emb)
end = time.time()

throughput = len(queries) / (end - start)
avg_latency = ((end - start) / len(queries)) * 1000
print(f"Throughput: {throughput:.0f} req/s")
print(f"Avg Latency: {avg_latency:.2f}ms")
```
3. Monitor in Production¶
```python
from collections import deque

class LatencyMonitor:
    """Track routing latencies over a sliding window and report percentiles."""

    def __init__(self, window=1000):
        self.latencies = deque(maxlen=window)

    def record(self, latency_ms):
        self.latencies.append(latency_ms)

    def stats(self):
        sorted_lat = sorted(self.latencies)
        n = len(sorted_lat)
        return {
            'p50': sorted_lat[int(n * 0.50)],
            'p95': sorted_lat[int(n * 0.95)],
            'p99': sorted_lat[int(n * 0.99)],
            'avg': sum(sorted_lat) / n,
        }

monitor = LatencyMonitor()

# Record latencies on every request
result = router.route(query, embedding)
monitor.record(result['latency_ms'])

# Check stats periodically (requests = your running request counter)
if requests % 1000 == 0:
    print(monitor.stats())
```
Troubleshooting¶
High Latency¶
Symptoms: P99 > 20ms
Causes & solutions:

1. Too many routes (>10K): reduce routes or use hierarchical routing
2. Large dimension (>768): use smaller embeddings
3. Cold start: warm up the router at startup
4. CPU contention: dedicate cores to the router
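Warming up at startup pays the cold-start cost once, before traffic arrives. A minimal sketch; `route_fn` stands in for `router.route`, and the helper name is illustrative:

```python
import random

def warm_up(route_fn, dimension: int, n: int = 50) -> int:
    """Issue n dummy routing calls with random embeddings so first-request
    latency is paid at startup. Returns the number of calls made."""
    for _ in range(n):
        emb = [random.gauss(0.0, 1.0) for _ in range(dimension)]
        route_fn("warmup", emb)
    return n

# At startup, before serving traffic:
# warm_up(router.route, dimension=384)
```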
High Memory¶
Symptoms: Memory > 1GB for <10K routes
Causes & solutions:

1. Large embeddings (1536D): use 384D embeddings
2. Too many routes: prune unused routes
3. Memory leak (rare): update to the latest version
Low Accuracy¶
Symptoms: Accuracy < 90%
Causes & solutions:

1. Poor embeddings: use a better embedding model
2. Overlapping routes: make routes more distinct
3. Missing keywords: add more keywords
4. Wrong threshold: tune the threshold
Next Steps¶
- Routing Engine - How it works
- Algorithms - Algorithm details
- Examples - Code examples