# Performance Optimization Guide: Reducing P50/P99 from 60s to <10s

## Problem Analysis

- **Original latency**: P50 60s+, P99 60s+
- **Root cause**: Synchronous metric evaluation blocking the response stream
- **Current workflow**: Response → Stream → Wait for metrics (BLOCKING) → Return

## Solution: Async Non-Blocking Evaluation

- **New latency target**: P50 <5s, P99 <8s
- **New workflow**: Response → Stream → Fire background metric task → Return immediately

## Implementation Details

Minimal code sketches of these pieces appear after the expected performance results below.

### 1. **Parallel Metric Computation** (`evaluation_async.py`)

- **AnswerRelevancyMetric** and **FaithfulnessMetric** run concurrently
- Uses `ThreadPoolExecutor` with 2 workers (one per metric)
- The metric pair now takes ~2-3s (bounded by the slower metric) instead of 4-6s sequentially

**Performance Impact**: ~50% reduction in metric time (4-6s → 2-3s)

### 2. **Non-Blocking Background Evaluation**

- Metrics are computed AFTER the response is sent to the user
- The user doesn't wait for evaluation to complete
- Results are stored in a cache for later retrieval

**Performance Impact**: P99 response time drops from 60s+ to <8s

### 3. **Timeout Protection**

- Global timeout: 8 seconds (total for all metrics)
- Per-metric timeout: 5 seconds
- Gracefully degrades to a 0.0 score if a metric times out

**Benefits**: Prevents runaway evaluations from blocking the system

### 4. **Response Caching**

- Simple in-memory cache (LRU with a 1000-entry limit)
- Identical queries reuse cached metrics
- Key: MD5 hash of (question + answer + contexts)

**Performance Impact**: Repeated queries return metrics in <1ms

### 5. **Graceful Degradation**

- Individual metric failures don't crash the pipeline
- A failed metric returns a 0.0 score
- The system continues operating normally

## Configuration Options

In `server.py`:

```python
app.state.evaluator = AsyncRAGEvaluator(
    evaluation_timeout=8.0,       # Max 8 seconds total
    metric_timeout=5.0,           # Max 5 seconds per metric
    enable_cache=True,            # Use response cache
    enable_background_eval=True,  # Non-blocking mode
)
```

## Expected Performance Results

### Before Optimization

```
P50: ~45-60 seconds
P99: ~60+ seconds
Bottleneck: Synchronous metric evaluation
```

### After Optimization

```
P50: ~2-4 seconds (response + sources)
P99: ~8 seconds (response + sources + evaluation timeout)
Metrics: Computed asynchronously, returned when ready
```
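## Implementation Sketches

### Parallel evaluation, timeouts, and caching

The following is a minimal sketch of how items 1 and 3-5 above could fit together. It mirrors the constructor arguments from the Configuration Options section, but the internals, the `cache_size` and `metric_fns` parameters, and the `cache_key`/`get_cached` helpers are illustrative assumptions, not the actual contents of `evaluation_async.py`; the real AnswerRelevancyMetric and FaithfulnessMetric objects are represented here by plain callables.

```python
import hashlib
import time
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Dict, List, Optional


class AsyncRAGEvaluator:
    """Sketch: parallel metrics, timeouts, graceful degradation, LRU cache."""

    def __init__(
        self,
        evaluation_timeout: float = 8.0,   # max total time across all metrics
        metric_timeout: float = 5.0,       # max time for any single metric
        enable_cache: bool = True,
        cache_size: int = 1000,            # assumed LRU limit from the docs
    ):
        self.evaluation_timeout = evaluation_timeout
        self.metric_timeout = metric_timeout
        self.enable_cache = enable_cache
        self.cache_size = cache_size
        self._cache: OrderedDict = OrderedDict()
        # One worker per metric, so the two metrics run concurrently
        # instead of back to back.
        self._pool = ThreadPoolExecutor(max_workers=2)

    @staticmethod
    def cache_key(question: str, answer: str, contexts: List[str]) -> str:
        # Key: MD5 hash of (question + answer + contexts).
        payload = "\x1f".join([question, answer, *contexts])
        return hashlib.md5(payload.encode("utf-8")).hexdigest()

    def get_cached(self, key: str) -> Optional[Dict[str, float]]:
        return self._cache.get(key)

    def evaluate(
        self,
        question: str,
        answer: str,
        contexts: List[str],
        metric_fns: Dict[str, Callable[[str, str, List[str]], float]],
    ) -> Dict[str, float]:
        key = self.cache_key(question, answer, contexts)
        if self.enable_cache and key in self._cache:
            self._cache.move_to_end(key)              # LRU bookkeeping
            return self._cache[key]

        # Submit every metric at once; wall-clock cost is the slowest metric.
        futures = {name: self._pool.submit(fn, question, answer, contexts)
                   for name, fn in metric_fns.items()}

        deadline = time.monotonic() + self.evaluation_timeout
        scores: Dict[str, float] = {}
        for name, future in futures.items():
            remaining = min(self.metric_timeout, max(0.0, deadline - time.monotonic()))
            try:
                scores[name] = future.result(timeout=remaining)
            except FutureTimeout:
                scores[name] = 0.0                    # timed out -> degrade to 0.0
            except Exception:
                scores[name] = 0.0                    # metric crashed -> degrade to 0.0

        if self.enable_cache:
            self._cache[key] = scores
            if len(self._cache) > self.cache_size:
                self._cache.popitem(last=False)       # evict least recently used
        return scores
```

Note that `future.result` only stops waiting; it does not cancel the worker thread, so a timed-out metric keeps running until it finishes. With the pool capped at two workers this bounds how much runaway work can accumulate, and the caller has already received a degraded 0.0 score.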
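### Non-blocking background evaluation

Building on the sketch above, this is one way the `server.py` endpoint could return immediately and push evaluation into a background task. The `/api/chat` and `/api/metrics/{key}` routes, the `metrics_key` field, and the stub `generate_answer` pipeline are assumptions for illustration; the document only specifies that metrics are computed after the response is sent, stored in the cache, and surfaced to a frontend that handles `status: "computing"`.

```python
from contextlib import asynccontextmanager

from fastapi import BackgroundTasks, FastAPI, Request
from pydantic import BaseModel

# Stand-ins for the real metric objects wired up in evaluation_async.py.
METRIC_FNS = {
    "answer_relevancy": lambda q, a, ctx: 1.0,
    "faithfulness": lambda q, a, ctx: 1.0,
}


@asynccontextmanager
async def lifespan(app: FastAPI):
    # AsyncRAGEvaluator comes from the sketch above; arguments match
    # the Configuration Options block.
    app.state.evaluator = AsyncRAGEvaluator(
        evaluation_timeout=8.0,
        metric_timeout=5.0,
        enable_cache=True,
    )
    yield


app = FastAPI(lifespan=lifespan)


class ChatRequest(BaseModel):
    question: str


async def generate_answer(question: str) -> tuple[str, list[str]]:
    """Stub for the existing retrieval + generation pipeline."""
    return f"(answer to: {question})", ["retrieved context"]


@app.post("/api/chat")
async def chat(req: ChatRequest, request: Request, background_tasks: BackgroundTasks):
    answer, contexts = await generate_answer(req.question)
    evaluator = request.app.state.evaluator

    # Runs AFTER the response has been sent; Starlette executes sync
    # callables in a worker thread, so the event loop is never blocked.
    background_tasks.add_task(
        evaluator.evaluate, req.question, answer, contexts, METRIC_FNS
    )

    # The client gets the answer and sources immediately and polls for metrics.
    return {
        "answer": answer,
        "sources": contexts,
        "metrics": {"status": "computing"},
        "metrics_key": evaluator.cache_key(req.question, answer, contexts),
    }


@app.get("/api/metrics/{key}")
async def get_metrics(key: str, request: Request):
    scores = request.app.state.evaluator.get_cached(key)
    return scores if scores is not None else {"status": "computing"}
```

If the real server schedules evaluation with `asyncio.create_task` instead of `BackgroundTasks`, keep a reference to each task (for example in a set that discards tasks on completion); otherwise tasks can be garbage-collected mid-flight or pile up, which is what the "background task doesn't leak memory" checklist item below guards against.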
## Deployment Checklist

- [ ] Verify `evaluation_async.py` is imported correctly
- [ ] Confirm `AsyncRAGEvaluator` is initialized in the lifespan
- [ ] Update the frontend to handle `status: "computing"` in metrics
- [ ] Monitor P50/P99 latencies in production
- [ ] Verify the background task doesn't leak memory
- [ ] Consider a Redis-backed cache for distributed deployments

## Future Enhancements

1. **Distributed Caching**: Replace the in-memory cache with Redis
2. **Metrics Storage**: Store evaluation results in a database for analytics
3. **Weighted Metrics**: Weight older evaluations lower
4. **Model Quantization**: Use a quantized Ollama model for faster inference
5. **Metric Sampling**: Evaluate only 10% of requests and extrapolate the rest

## Monitoring Commands

```bash
# Check evaluator cache stats
GET /api/debug/cache-stats

# Clear cache if needed
POST /api/debug/clear-cache

# Monitor background tasks
docker-compose logs -f app | grep "Background"
```

## Cost/Benefit Analysis

| Aspect | Cost | Benefit |
|--------|------|---------|
| Complexity | Moderate (async handling) | Major (~7-8x faster) |
| Memory | +10MB (in-memory cache) | N/A |
| Latency | None | P99 drops from 60s+ to ~8s (~87% reduction) |
| User experience | Metrics arrive asynchronously | Instant response feedback |
| Reliability | Minor (timeout risk) | Better (timeout protection) |

## A/B Testing Option

Run both implementations side by side:

- **Control**: Blocking evaluation (current)
- **Test**: Non-blocking evaluation (new)

Monitor:

- P50/P99 latency
- User satisfaction
- Metric accuracy loss (if any)
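One lightweight way to run the comparison is deterministic bucketing on a session or user id, so each user consistently sees one variant and latency logs can be split by group. The `NONBLOCKING_EVAL_RATIO` environment variable and the log format below are assumptions, not existing settings:

```python
import hashlib
import os

# Fraction of traffic routed to the non-blocking evaluator (0.0 - 1.0).
TEST_RATIO = float(os.getenv("NONBLOCKING_EVAL_RATIO", "0.5"))


def in_test_group(session_id: str) -> bool:
    """Deterministically bucket a session so it always sees the same variant."""
    bucket = int(hashlib.md5(session_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < TEST_RATIO * 100


# Illustrative usage inside the chat endpoint:
#   variant = "nonblocking" if in_test_group(session_id) else "blocking"
#   logger.info("variant=%s latency_ms=%.0f", variant, elapsed_ms)
# Tagging each request's log line with its variant lets P50/P99 be compared per group.
```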