Performance Optimization Guide: Reducing P50/P99 from 60s to <10s

Problem Analysis

  • Original latency: P50 60s+, P99 60s+
  • Root cause: Synchronous metric evaluation blocking response stream
  • Current workflow: Response → Stream → Wait for metrics (BLOCKING) → Return

Solution: Async Non-Blocking Evaluation

  • New latency target: P50 <5s, P99 <8s
  • New workflow: Response → Stream → Fire background metric task → Return immediately

Implementation Details

1. Parallel Metric Computation (evaluation_async.py)

  • AnswerRelevancyMetric and FaithfulnessMetric run concurrently
  • Uses ThreadPoolExecutor with 2 workers (one per metric)
  • Both metrics together now take ~2-3s (bounded by the slower one) instead of 4-6s when run sequentially (see the sketch below)

Performance Impact: ~50% reduction in metric time (4-6s → 2-3s)
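
A minimal sketch of the concurrent pattern, assuming each metric can be wrapped as a plain callable returning a score; the function and argument names are illustrative, not the actual evaluation_async.py interface.

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

def compute_metrics_parallel(metric_fns: Dict[str, Callable[[], float]]) -> Dict[str, float]:
    """Run each metric in its own worker thread and collect the scores.

    With two metrics (answer relevancy and faithfulness), wall time is
    roughly that of the slower metric (~2-3s) instead of their sum (~4-6s).
    """
    with ThreadPoolExecutor(max_workers=len(metric_fns)) as pool:
        futures = {name: pool.submit(fn) for name, fn in metric_fns.items()}
        return {name: fut.result() for name, fut in futures.items()}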

2. Non-Blocking Background Evaluation

  • Metrics computed AFTER response is sent to user
  • User doesn't wait for evaluation to complete
  • Results stored in cache for later retrieval

Performance Impact: P99 response time drops from 60s+ to <8s
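
A minimal sketch of the fire-and-forget flow, assuming an asyncio-based server; retrieve, generate, and evaluate_in_background are placeholder names, not the repo's actual functions.

import asyncio

_background_tasks: set = set()

async def answer_then_evaluate(question: str, retrieve, generate, evaluator):
    """Return the answer first; metrics are computed after the fact."""
    contexts = await retrieve(question)
    answer = await generate(question, contexts)

    # Fire-and-forget: the user never waits on this task. A reference is
    # kept so the task is not garbage-collected before it finishes.
    task = asyncio.create_task(evaluator.evaluate_in_background(question, answer, contexts))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)

    return answer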

3. Timeout Protection

  • Global timeout: 8 seconds (total for all metrics)
  • Per-metric timeout: 5 seconds
  • Gracefully degrades to 0.0 score if metric times out

Benefit: Prevents runaway evaluations from blocking the system
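
The timeout layering could look like the sketch below (a 5s per-metric timeout inside an 8s global budget); metric_coros is an illustrative mapping of metric name to awaitable.

import asyncio
from typing import Awaitable, Dict

async def score_with_timeouts(metric_coros: Dict[str, Awaitable],
                              per_metric_timeout: float = 5.0,
                              total_timeout: float = 8.0) -> Dict[str, float]:
    """Each metric gets its own timeout; the batch gets a global budget.
    Anything that runs over degrades to 0.0 instead of blocking the caller."""

    async def guarded(name, coro):
        try:
            return name, await asyncio.wait_for(coro, timeout=per_metric_timeout)
        except asyncio.TimeoutError:
            return name, 0.0  # per-metric timeout: degrade to 0.0

    try:
        pairs = await asyncio.wait_for(
            asyncio.gather(*(guarded(n, c) for n, c in metric_coros.items())),
            timeout=total_timeout,
        )
        return dict(pairs)
    except asyncio.TimeoutError:
        return {name: 0.0 for name in metric_coros}  # global budget exhausted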

4. Response Caching

  • Simple in-memory cache (LRU with 1000 entry limit)
  • Identical queries reuse cached metrics
  • Key: MD5 hash of (question + answer + contexts)

Performance Impact: Repeated queries return metrics in <1ms
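
A sketch of the cache described above: an OrderedDict-based LRU with a 1000-entry cap, keyed by an MD5 of the concatenated question, answer, and contexts. Class and method names are illustrative.

import hashlib
from collections import OrderedDict
from typing import Dict, List, Optional

class MetricCache:
    """In-memory LRU cache so identical queries reuse computed metrics."""

    def __init__(self, max_entries: int = 1000):
        self._store: "OrderedDict[str, Dict[str, float]]" = OrderedDict()
        self._max_entries = max_entries

    @staticmethod
    def key(question: str, answer: str, contexts: List[str]) -> str:
        payload = (question + answer + "".join(contexts)).encode("utf-8")
        return hashlib.md5(payload).hexdigest()

    def get(self, key: str) -> Optional[Dict[str, float]]:
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key: str, scores: Dict[str, float]) -> None:
        self._store[key] = scores
        self._store.move_to_end(key)
        if len(self._store) > self._max_entries:
            self._store.popitem(last=False)  # evict least recently used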

5. Graceful Degradation

  • Individual metric failures don't crash pipeline
  • Failed metric returns 0.0 score
  • System continues operating normally
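
Per-metric isolation amounts to a try/except around each metric call, as in this sketch; the logger name is arbitrary.

import logging

logger = logging.getLogger("rag.evaluation")

def safe_score(metric_name: str, compute, default: float = 0.0) -> float:
    """Any metric failure is logged and mapped to the default score,
    so one broken metric never takes the whole pipeline down."""
    try:
        return compute()
    except Exception as exc:  # broad on purpose: isolate metric failures
        logger.warning("metric %s failed, returning %.1f: %s", metric_name, default, exc)
        return default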

Configuration Options

In server.py:

app.state.evaluator = AsyncRAGEvaluator(
    evaluation_timeout=8.0,      # Max 8 seconds total
    metric_timeout=5.0,          # Max 5 seconds per metric
    enable_cache=True,           # Use response cache
    enable_background_eval=True, # Non-blocking mode
)
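
For reference, wiring this into a FastAPI lifespan could look like the sketch below; the import path for AsyncRAGEvaluator is assumed.

from contextlib import asynccontextmanager
from fastapi import FastAPI

from evaluation_async import AsyncRAGEvaluator  # assumed import path

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Build the evaluator once at startup and share it via app.state.
    app.state.evaluator = AsyncRAGEvaluator(
        evaluation_timeout=8.0,
        metric_timeout=5.0,
        enable_cache=True,
        enable_background_eval=True,
    )
    yield
    # Shutdown: nothing to tear down for the in-memory cache.

app = FastAPI(lifespan=lifespan)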

Expected Performance Results

Before Optimization

P50: ~45-60 seconds
P99: ~60+ seconds
Bottleneck: Synchronous metric evaluation

After Optimization

P50: ~2-4 seconds (response + sources)
P99: ~8 seconds (response + sources + evaluation timeout)
Metrics: Computed asynchronously, returned when ready

Deployment Checklist

  • Verify evaluation_async.py is imported correctly
  • Confirm AsyncRAGEvaluator initialized in lifespan
  • Update frontend to handle status: "computing" in metrics
  • Monitor P50/P99 latencies in production
  • Verify background task doesn't leak memory
  • Consider Redis-backed cache for distributed deployments

Future Enhancements

  1. Distributed Caching: Replace in-memory cache with Redis
  2. Metrics Storage: Store evaluation results in DB for analytics
  3. Weighted Metrics: Weight older evaluations lower
  4. Model Quantization: Use quantized Ollama model for faster inference
  5. Metric Sampling: Evaluate only 10% of requests, extrapolate rest
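
Item 5 (metric sampling) could be gated with something as small as the sketch below, using the 10% rate mentioned above as the example.

import random

def should_evaluate(sample_rate: float = 0.10) -> bool:
    """Evaluate only ~sample_rate of requests; dashboards then
    extrapolate aggregate metric quality from the sampled subset."""
    return random.random() < sample_rate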

Monitoring Commands

# Check evaluator cache stats
GET /api/debug/cache-stats

# Clear cache if needed
POST /api/debug/clear-cache

# Monitor background tasks
docker-compose logs -f app | grep "Background"
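
A sketch of how the two debug routes above might be served from server.py; cache_stats() and clear_cache() are assumed helper methods on the evaluator, not confirmed API.

from fastapi import APIRouter, Request

debug_router = APIRouter(prefix="/api/debug")

@debug_router.get("/cache-stats")
async def cache_stats(request: Request):
    # Assumed helper: returns hit/miss counts and current cache size.
    return request.app.state.evaluator.cache_stats()

@debug_router.post("/clear-cache")
async def clear_cache(request: Request):
    # Assumed helper: drops all cached metric results.
    request.app.state.evaluator.clear_cache()
    return {"status": "cleared"}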

Cost/Benefit Analysis

Aspect          | Cost                       | Benefit
Complexity      | Moderate (async handling)  | Major (7-8x faster)
Memory          | +10MB (cache)              | N/A
Latency         | P99 from 60s → 8s          | 87% reduction
User Experience | Metrics async              | Instant feedback
Reliability     | Minor (timeout risk)       | Better (timeout protection)

A/B Testing Option

Run both implementations:

  • Control: Blocking evaluation (current)
  • Test: Non-blocking evaluation (new)

Monitor:

  • P50/P99 latency
  • User satisfaction
  • Metric accuracy loss (if any)