Performance Optimization Guide: Reducing P50/P99 from 60s to <10s

Problem Analysis

  • Original latency: P50 60s+, P99 60s+
  • Root cause: Synchronous metric evaluation blocking response stream
  • Current workflow: Response → Stream → Wait for metrics (BLOCKING) → Return

Solution: Async Non-Blocking Evaluation

  • New latency target: P50 <5s, P99 <8s
  • New workflow: Response → Stream → Fire background metric task → Return immediately

Implementation Details

1. Parallel Metric Computation (evaluation_async.py)

  • AnswerRelevancyMetric and FaithfulnessMetric run concurrently
  • Uses ThreadPoolExecutor with 2 workers (one per metric)
  • Both metrics together now take ~2-3s (bounded by the slower one) instead of 4-6s when run sequentially (see the sketch below)

Performance Impact: ~50% reduction in metric time (4-6s → 2-3s)
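
A minimal sketch of the concurrent pattern, assuming each metric can be wrapped as a plain callable returning a score; the function and argument names are illustrative, not the actual evaluation_async.py interface.

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

def compute_metrics_parallel(metric_fns: Dict[str, Callable[[], float]]) -> Dict[str, float]:
    """Run each metric in its own worker thread and collect the scores.

    With two metrics (answer relevancy and faithfulness), wall time is
    roughly that of the slower metric (~2-3s) instead of their sum (~4-6s).
    """
    with ThreadPoolExecutor(max_workers=len(metric_fns)) as pool:
        futures = {name: pool.submit(fn) for name, fn in metric_fns.items()}
        return {name: fut.result() for name, fut in futures.items()}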

2. Non-Blocking Background Evaluation

  • Metrics computed AFTER response is sent to user
  • User doesn't wait for evaluation to complete
  • Results stored in cache for later retrieval

Performance Impact: P99 response time drops from 60s+ to <8s
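
A minimal sketch of the fire-and-forget flow, assuming an asyncio-based server; retrieve, generate, and evaluate_in_background are placeholder names, not the repo's actual functions.

import asyncio

_background_tasks: set = set()

async def answer_then_evaluate(question: str, retrieve, generate, evaluator):
    """Return the answer first; metrics are computed after the fact."""
    contexts = await retrieve(question)
    answer = await generate(question, contexts)

    # Fire-and-forget: the user never waits on this task. A reference is
    # kept so the task is not garbage-collected before it finishes.
    task = asyncio.create_task(evaluator.evaluate_in_background(question, answer, contexts))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)

    return answer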

3. Timeout Protection

  • Global timeout: 8 seconds (total for all metrics)
  • Per-metric timeout: 5 seconds
  • Gracefully degrades to 0.0 score if metric times out

Benefit: Prevents runaway evaluations from blocking the system
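
The timeout layering could look like the sketch below (a 5s per-metric timeout inside an 8s global budget); metric_coros is an illustrative mapping of metric name to awaitable.

import asyncio
from typing import Awaitable, Dict

async def score_with_timeouts(metric_coros: Dict[str, Awaitable],
                              per_metric_timeout: float = 5.0,
                              total_timeout: float = 8.0) -> Dict[str, float]:
    """Each metric gets its own timeout; the batch gets a global budget.
    Anything that runs over degrades to 0.0 instead of blocking the caller."""

    async def guarded(name, coro):
        try:
            return name, await asyncio.wait_for(coro, timeout=per_metric_timeout)
        except asyncio.TimeoutError:
            return name, 0.0  # per-metric timeout: degrade to 0.0

    try:
        pairs = await asyncio.wait_for(
            asyncio.gather(*(guarded(n, c) for n, c in metric_coros.items())),
            timeout=total_timeout,
        )
        return dict(pairs)
    except asyncio.TimeoutError:
        return {name: 0.0 for name in metric_coros}  # global budget exhausted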

4. Response Caching

  • Simple in-memory cache (LRU with 1000 entry limit)
  • Identical queries reuse cached metrics
  • Key: MD5 hash of (question + answer + contexts)

Performance Impact: Repeated queries return metrics in <1ms
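
A sketch of the cache described above: an OrderedDict-based LRU with a 1000-entry cap, keyed by an MD5 of the concatenated question, answer, and contexts. Class and method names are illustrative.

import hashlib
from collections import OrderedDict
from typing import Dict, List, Optional

class MetricCache:
    """In-memory LRU cache so identical queries reuse computed metrics."""

    def __init__(self, max_entries: int = 1000):
        self._store: "OrderedDict[str, Dict[str, float]]" = OrderedDict()
        self._max_entries = max_entries

    @staticmethod
    def key(question: str, answer: str, contexts: List[str]) -> str:
        payload = (question + answer + "".join(contexts)).encode("utf-8")
        return hashlib.md5(payload).hexdigest()

    def get(self, key: str) -> Optional[Dict[str, float]]:
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key: str, scores: Dict[str, float]) -> None:
        self._store[key] = scores
        self._store.move_to_end(key)
        if len(self._store) > self._max_entries:
            self._store.popitem(last=False)  # evict least recently used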

5. Graceful Degradation

  • Individual metric failures don't crash pipeline
  • Failed metric returns 0.0 score
  • System continues operating normally
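
Per-metric isolation amounts to a try/except around each metric call, as in this sketch; the logger name is arbitrary.

import logging

logger = logging.getLogger("rag.evaluation")

def safe_score(metric_name: str, compute, default: float = 0.0) -> float:
    """Any metric failure is logged and mapped to the default score,
    so one broken metric never takes the whole pipeline down."""
    try:
        return compute()
    except Exception as exc:  # broad on purpose: isolate metric failures
        logger.warning("metric %s failed, returning %.1f: %s", metric_name, default, exc)
        return default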

Configuration Options

In server.py:

app.state.evaluator = AsyncRAGEvaluator(
    evaluation_timeout=8.0,      # Max 8 seconds total
    metric_timeout=5.0,          # Max 5 seconds per metric
    enable_cache=True,           # Use response cache
    enable_background_eval=True, # Non-blocking mode
)
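
For reference, wiring this into a FastAPI lifespan could look like the sketch below; the import path for AsyncRAGEvaluator is assumed.

from contextlib import asynccontextmanager
from fastapi import FastAPI

from evaluation_async import AsyncRAGEvaluator  # assumed import path

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Build the evaluator once at startup and share it via app.state.
    app.state.evaluator = AsyncRAGEvaluator(
        evaluation_timeout=8.0,
        metric_timeout=5.0,
        enable_cache=True,
        enable_background_eval=True,
    )
    yield
    # Shutdown: nothing to tear down for the in-memory cache.

app = FastAPI(lifespan=lifespan)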

Expected Performance Results

Before Optimization

P50: ~45-60 seconds
P99: ~60+ seconds
Bottleneck: Synchronous metric evaluation

After Optimization

P50: ~2-4 seconds (response + sources)
P99: ~8 seconds (response + sources + evaluation timeout)
Metrics: Computed asynchronously, returned when ready

Deployment Checklist

  • Verify evaluation_async.py is imported correctly
  • Confirm AsyncRAGEvaluator initialized in lifespan
  • Update frontend to handle status: "computing" in metrics
  • Monitor P50/P99 latencies in production
  • Verify background task doesn't leak memory
  • Consider Redis-backed cache for distributed deployments

Future Enhancements

  1. Distributed Caching: Replace in-memory cache with Redis
  2. Metrics Storage: Store evaluation results in DB for analytics
  3. Weighted Metrics: Weight older evaluations lower
  4. Model Quantization: Use quantized Ollama model for faster inference
  5. Metric Sampling: Evaluate only 10% of requests, extrapolate rest
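
Item 5 (metric sampling) could be gated with something as small as the sketch below, using the 10% rate mentioned above as the example.

import random

def should_evaluate(sample_rate: float = 0.10) -> bool:
    """Evaluate only ~sample_rate of requests; dashboards then
    extrapolate aggregate metric quality from the sampled subset."""
    return random.random() < sample_rate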

Monitoring Commands

# Check evaluator cache stats
GET /api/debug/cache-stats

# Clear cache if needed
POST /api/debug/clear-cache

# Monitor background tasks
docker-compose logs -f app | grep "Background"
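
A sketch of how the two debug routes above might be served from server.py; cache_stats() and clear_cache() are assumed helper methods on the evaluator, not confirmed API.

from fastapi import APIRouter, Request

debug_router = APIRouter(prefix="/api/debug")

@debug_router.get("/cache-stats")
async def cache_stats(request: Request):
    # Assumed helper: returns hit/miss counts and current cache size.
    return request.app.state.evaluator.cache_stats()

@debug_router.post("/clear-cache")
async def clear_cache(request: Request):
    # Assumed helper: drops all cached metric results.
    request.app.state.evaluator.clear_cache()
    return {"status": "cleared"}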

Cost/Benefit Analysis

Aspect          | Cost                       | Benefit
Complexity      | Moderate (async handling)  | Major (7-8x faster)
Memory          | +10MB (cache)              | N/A
Latency         | P99 from 60s → 8s          | 87% reduction
User Experience | Metrics async              | Instant feedback
Reliability     | Minor (timeout risk)       | Better (timeout protection)

A/B Testing Option

Run both implementations:

  • Control: Blocking evaluation (current)
  • Test: Non-blocking evaluation (new)

Monitor:

  • P50/P99 latency
  • User satisfaction
  • Metric accuracy loss (if any)