---
title: Chat with Your Data
emoji: 🦙
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app.py
pinned: false
license: gpl-3.0
---

Chat with your data 🚀

A production-grade document intelligence system built with Google DeepMind's Gemma 3 LLM served locally via Ollama. This system enables users to upload documents (PDFs, TXT, Markdown, etc.) and chat with their content using natural language queriesβ€”all processed locally for privacy and complete control.

Designed with modularity, performance, and production standards in mind, the system handles end-to-end RAG workflows including file ingestion, vector embeddings, history summarization, semantic search, context-aware generation, and streaming responses. Features include multi-file support per user, persistent session history, document management, and intelligent query caching for 700x faster responses on repeated queries.

Perfect for building educational assistants, personal knowledge bases, and enterprise document systems with full local control and no cloud dependencies.

Check out the live project deployment: HuggingFace Space Deployment Link

🎯 Project Details:

Aim

The core objective of this project is to build a robust RAG system with modern components, a clean modular design, and proper error handling.

Methodology

  1. Build a responsive Streamlit UI that lets users upload documents, preview them to verify correctness, and chat with them.
  2. Use FastAPI to build a backend that handles file uploads, document processing, user authentication, and streaming LLM responses.
  3. Build a modular LLM system using LangChain components for chains, embeddings, retrievers, vector storage, history management, output parsers, and overall LLM orchestration.
  4. Integrate the Gemma-3 LLM via Ollama for local inference.
  5. Use Qdrant for efficient vector storage, similarity search, and user-specific document storage and retrieval.
  6. Use PostgreSQL for user management, authentication, and data control.
  7. Create a dynamic Docker setup that can run as either a development or production environment.
  8. Deploy the project on Hugging Face Spaces for easy access and demonstration.

Due to hosting limitations of Gemma3, the Hugging Face Space deployment uses Google Gemini-2.0-Flash-Lite as the LLM backend.

RAG Samples:

  • Q: Highest possible grade: RAG Sample Q1 RAG Sample A1
  • Q: Formatted Output: RAG Sample Q2

Performance Benchmarks

Tested on local Ollama with Gemma3:latest on Apple Silicon:

| Scenario | Latency | Notes |
| --- | --- | --- |
| First Query (Cache Miss) | ~70-90s | Full LLM generation, response cached |
| Repeated Query (Cache Hit) | <100ms | Retrieved from in-memory cache, ~700x faster |
| With Metrics (Disabled) | ~70-90s | Metrics computed in background (non-blocking) |
| With Metrics (Enabled) | ~75-95s | +5s for Answer Relevancy & Faithfulness checks |
| P50 Latency (Mixed Workload)* | ~30-40s | With a 30-40% cache hit rate typical for RAG |
| P99 Latency (Mixed Workload)* | ~50-60s | Includes occasional cold cache misses |

*Assuming typical RAG usage patterns with 30-40% query repetition rate.

Features

⚡ Performance Optimizations

  • Query Response Caching: Intelligent in-memory caching of RAG responses with TTL-based expiration

    • Cache hits return in <100ms (vs 70+ seconds for LLM generation)
    • ~700x performance improvement for repeated queries
    • Global cache key based on normalized questions (identical questions = cache hit regardless of user)
    • Configurable TTL and max cache size with automatic LRU eviction
  • Async Evaluation: Non-blocking background task evaluation for metrics computation

    • Metrics computed asynchronously without blocking the response stream
    • Adds under ~8 seconds of overhead at P99 even with evaluation enabled
    • Production-grade error handling and timeouts
  • Reference-Free Metrics: LLM-as-Judge evaluation using only the question, the retrieved context, and the generated answer (see the evaluation sketch after this list)

    • Answer Relevancy: Measures how well the answer addresses the question
    • Faithfulness: Ensures the answer is grounded in the retrieved documents
    • No ground truth required (perfect for open-ended knowledge bases)
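
A minimal sketch of how these two features can fit together (illustrative only, not the project's exact code): the answer is returned to the client first, and DeepEval's reference-free LLM-as-Judge metrics are computed afterwards in a FastAPI background task. The `generate_answer` helper below is a hypothetical stand-in for the RAG chain.

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()


async def generate_answer(question: str) -> tuple[str, list[str]]:
    # Hypothetical stand-in for the RAG chain: returns the answer and the
    # retrieved context chunks it was grounded in.
    return "generated answer", ["retrieved chunk"]


def evaluate_answer(question: str, answer: str, contexts: list[str]) -> None:
    # Runs after the response has been sent, so it never blocks the stream.
    test_case = LLMTestCase(
        input=question, actual_output=answer, retrieval_context=contexts
    )
    for metric in (AnswerRelevancyMetric(), FaithfulnessMetric()):
        metric.measure(test_case)
        print(f"{metric.__class__.__name__}: {metric.score:.2f}")


@app.post("/chat")
async def chat(question: str, background_tasks: BackgroundTasks) -> dict:
    answer, contexts = await generate_answer(question)
    background_tasks.add_task(evaluate_answer, question, answer, contexts)
    return {"answer": answer}
```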

User Authentication

+ Authenticate users against a `PostgreSQL` database using `bcrypt`-based password hashing with salts.
    [![User Registration Screenshot](./Docs/1_Auth.png)](./Docs/1_Auth.png)
+ Store user data securely and automatically clear stale session data.
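
A minimal sketch of the bcrypt hashing scheme (illustrative; the project's actual auth module may differ):

```python
import bcrypt

def hash_password(plain: str) -> bytes:
    # gensalt() embeds a per-user salt into the resulting hash
    return bcrypt.hashpw(plain.encode("utf-8"), bcrypt.gensalt())

def verify_password(plain: str, hashed: bytes) -> bool:
    return bcrypt.checkpw(plain.encode("utf-8"), hashed)
```
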
  • UI and User Controls:

    • Build a responsive UI with Streamlit. Provide a chat interface for users to ask questions about their documents, get file previews, and receive context-aware responses. User File Preview
    • User-uploaded files and their corresponding data are tracked in a PostgreSQL database.
    • Allow users to delete their uploaded documents and manage their session history. User Chat Screenshot
    • Note: File previews are cached for 10 minutes, so a preview may remain available for up to that long after deletion.
    • Stream real-time LLM responses via FastAPI SSE, along with the retrieved documents and metadata for verification. Source Documents Screenshot
    • The UI also supports thinking models, showing the LLM's thought process while it generates responses. Thinking Model Screenshot
  • Per-user document management:

    • Support multi-file embeddings per user, allowing users to upload multiple documents and retrieve relevant information based on their queries.
    • Some documents can also be added as public documents accessible to all users (e.g., shared rulebooks, manuals, or documentation).
  • Embeddings, Vector Storage and Retrieval:

    • Implement vector embeddings using LangChain components to convert documents into vector representations.
    • The open-source mxbai-embed-large model, a lightweight and efficient embedding model, is used to generate embeddings.
    • Use Qdrant for efficient vector storage and retrieval of user-specific + public documents with persistent storage.
    • Integrate similarity search and document retrieval with Gemma-based LLM responses.
  • FastAPI Backend:

    • Build a FastAPI backend to handle file uploads, document processing, user authentication, and streaming LLM responses.
    • Integrate with the LLM System module to handle LLM tasks.
    • Provide status updates to the UI for long-running tasks: Step By Step Updates Screenshot
    • Implement Server-Sent Events (SSE) for real-time streaming of LLM responses to the frontend, using NDJSON as the data-transfer format (see the streaming sketch after this list). SSE Streaming Screenshot
    • Provide the UI with retrieved documents and metadata for verification of responses.
  • LLM System:

    • Modular LLM system using LangChain components (see the RAG-chain sketch after this list) for:
      1. Document Ingestion: Load files and process them into document chunks.
      2. Vector Embedding: Convert documents into vector representations.
      3. History Summarization: Summarize the user's session history into a standalone query for searching the vector store and retrieving relevant documents.
      4. Document Retrieval: Fetch relevant documents based on the standalone query and the user's metadata filters.
      5. History Management: Maintain session history for context-aware interactions.
      6. Response Generation: Generate context-aware responses using the LLM.
      7. Tracing: Trace LLM interactions with LangSmith for debugging and monitoring.
      8. Models: Use Ollama to run the Gemma-3 LLM and mxbai embeddings locally for inference, ensuring low latency and privacy.
  • Dockerization:

    • Create a dynamic Docker setup that can run as either a development or production environment.
    • Use a single Dockerfile to run both the FastAPI and Streamlit servers in one container (mainly due to Hugging Face Spaces limitations).
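
A minimal sketch of the SSE/NDJSON streaming described under "FastAPI Backend" (assumed shape, not the project's exact endpoint): each chunk is a JSON object on its own line, which the frontend can parse incrementally.

```python
import json
from collections.abc import AsyncIterator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def stream_answer(question: str) -> AsyncIterator[str]:
    # Hypothetical token stream; in practice these chunks come from the LLM chain.
    for token in ("Hello", ", ", "world"):
        yield json.dumps({"type": "token", "data": token}) + "\n"
    # Final event carries the retrieved source documents for verification.
    yield json.dumps({"type": "sources", "data": []}) + "\n"


@app.get("/chat/stream")
async def chat_stream(question: str) -> StreamingResponse:
    return StreamingResponse(
        stream_answer(question), media_type="application/x-ndjson"
    )
```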

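A minimal sketch of the LLM System flow described above (illustrative; names and parameters are assumptions, not the project's actual module layout), using langchain-ollama and langchain-qdrant:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore

# Models served locally by Ollama
llm = ChatOllama(model="gemma3:latest", base_url="http://localhost:11434")
embeddings = OllamaEmbeddings(model="mxbai-embed-large")

# Qdrant-backed retriever over an existing collection
vector_store = QdrantVectorStore.from_existing_collection(
    collection_name="rag_documents",
    embedding=embeddings,
    url="http://localhost:6333",
)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context:\n\n{context}"),
    ("human", "{question}"),
])
chain = prompt | llm | StrOutputParser()


def answer(question: str) -> str:
    # Retrieval + context-aware generation (history summarization omitted)
    docs = retriever.invoke(question)
    context = "\n\n".join(d.page_content for d in docs)
    return chain.invoke({"context": context, "question": question})
```
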
🧑‍💻 Tech Stack

  • 🦜 LangChain - LLM orchestration and RAG chains
  • ⚡ FastAPI - High-performance async API backend
  • 👑 Streamlit - Interactive web UI for document management and chat
  • 🐋 Docker - Containerization for reproducible deployments
  • 🦙 Ollama - Local inference engine
    • Gemma-3 - Large language model
    • mxbai-embed-large - Embedding model for semantic search
  • 🗄️ PostgreSQL - User authentication, file metadata, and session management
  • 🎯 Qdrant - Vector database for semantic search and document embeddings
  • 🛠️ DeepEval - LLM-as-Judge evaluation metrics (Answer Relevancy, Faithfulness)
  • 💾 In-Memory Cache - Query response caching with TTL (700x faster on cache hits)
  • 🛡️ bcrypt - Secure password hashing for authentication
  • 📊 LangSmith - LLM tracing and monitoring

Others:

  • 🤗 Hugging Face Spaces:
    • Deploy the project in a Docker container using the Dockerfile.
  • :octocat: GitHub Actions and Branch Protection:
    • Auto-deploy the repository to Hugging Face Spaces.
    • Scan the code for secret leaks.
    • Fail the commit check if any secrets are detected.

🛠️ Installation

Choose your deployment approach:

Virtual Environment

Best for: Local development, debugging, and testing with hot-reload capability.

Prerequisites:

  • Python 3.12+
  • Ollama with gemma3:latest model
  • PostgreSQL (optional, can use memory backend)
  • Redis (optional, for distributed cache)
  • Qdrant (optional, can run via Docker or use in-memory mode)

Step 1: Clone & Setup

git clone --depth 1 https://github.com/Bbs1412/chat-with-your-data.git
cd chat-with-your-data

# Create virtual environment
python3 -m venv venv

# Activate environment
source venv/bin/activate  # Linux/Mac
# or
venv\Scripts\activate  # Windows

Step 2: Install Dependencies

pip install --upgrade pip
pip install -r requirements.txt

Installs:

  • FastAPI + Uvicorn (async API server)
  • Streamlit (interactive web UI)
  • LangChain + Ollama integration
  • Qdrant client (vector search)
  • PostgreSQL driver (psycopg2)
  • Redis client
  • DeepEval (LLM-as-Judge metrics)
  • All other dependencies

Step 3: Configure Environment (Optional)

Create .env file in project root:

# Ollama configuration
OLLAMA_BASE_URL=http://localhost:11434

# Qdrant vector database
QDRANT_URL=http://localhost:6333
QDRANT_COLLECTION_NAME=rag_documents

# PostgreSQL (if not using defaults)
# DATABASE_URL=postgresql://user:password@localhost/ragdb

# Redis for chat history backend
REDIS_URL=redis://localhost:6379/0
REDIS_HISTORY_TTL_SECONDS=0

# Optional: LangSmith tracing
LANGCHAIN_TRACING_V2=false
# LANGCHAIN_API_KEY=your_key_here

Step 4: Start Services (3 Terminal Windows)

Terminal 1 - Ollama:

ollama serve
# Or ensure model is pulled:
ollama pull gemma3:latest

Terminal 2 - FastAPI Backend:

cd chat-with-your-data
source venv/bin/activate
cd server
uvicorn server:app --reload --port 8000

Access at: http://localhost:8000 (Swagger UI at http://localhost:8000/docs)

Terminal 3 - Streamlit Frontend:

cd chat-with-your-data
source venv/bin/activate
streamlit run app.py --server.port 8501

Access at: http://localhost:8501

Optional: Start PostgreSQL & Redis (macOS with Homebrew)

# PostgreSQL
brew services start postgresql

# Redis
brew services start redis

# Check status
brew services list | grep -E "postgresql|redis"

Optional: Qdrant Vector Database

Option A: Docker (Recommended)

docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant:v1.12.5

Access UI at: http://localhost:6333/dashboard

Option B: In-Memory Mode

Edit server/llm_system/core/qdrant_database.py and use in-memory client for ephemeral storage.
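
For example, a minimal sketch of the switch (the exact variable names in qdrant_database.py may differ):

```python
from qdrant_client import QdrantClient

client = QdrantClient(":memory:")                      # ephemeral, lost on restart
# client = QdrantClient(url="http://localhost:6333")   # persistent Qdrant server
```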

Quick Test

# 1. Open browser β†’ http://localhost:8501
# 2. Register/Login with test credentials
# 3. Upload a PDF or text document
# 4. Embed the document
# 5. Ask questions about it
# 6. Check FastAPI logs for cache hits/misses (⚡ Cache HIT!)

Troubleshooting

| Issue | Solution |
| --- | --- |
| Port 8000 in use | `lsof -ti:8000 \| xargs kill -9` |
| Ollama not found | `brew install ollama && ollama pull gemma3:latest` |
| PostgreSQL error | Use the memory backend: `HISTORY_BACKEND=memory` in config |
| Redis not found | `brew install redis && brew services start redis` |
| Qdrant connection error | Run the Docker version or edit the config for in-memory mode |

πŸ‹ Docker

Best for: Production deployments, CI/CD pipelines, and consistent environments across machines.

Features:

  • Complete isolation with containerization
  • All services bundled (PostgreSQL, Qdrant, Redis)
  • Easy deployment to cloud platforms
  • No dependency conflicts on host machine
  • docker-compose for multi-container orchestration

Quick Start (Recommended)

Prerequisites:

  • Docker & Docker Compose installed
  • 4GB+ RAM available
  • Ports 8000, 8501, 6333, 5432, and 6379 available

Step 1: Build & Start All Services

cd chat-with-your-data

# Build fresh image without cache
docker-compose build --no-cache

# Start all containers (detached mode)
docker-compose up -d

# Verify all services are running
docker-compose ps

Step 2: Access Services

FastAPI Backend:     http://localhost:8000
FastAPI Swagger UI:  http://localhost:8000/docs
Streamlit Frontend:  http://localhost:8501
Qdrant UI:          http://localhost:6333/dashboard
Redis Commander:    http://localhost:8081
PostgreSQL:         localhost:5432

Step 3: View Logs

# View all logs
docker-compose logs -f

# View specific service
docker-compose logs -f app
docker-compose logs -f postgres
docker-compose logs -f qdrant

Step 4: Stop Services

# Stop all containers
docker-compose down

# Stop and remove volumes (complete reset)
docker-compose down -v

Architecture Overview

Services (docker-compose):

| Service | Image | Port | Purpose |
| --- | --- | --- | --- |
| app | python:3.12-slim | 8000, 8501 | FastAPI + Streamlit |
| postgres | postgres:16-alpine | 5432 | User data & file metadata |
| qdrant | qdrant/qdrant:v1.12.5 | 6333 | Vector database |
| redis | redis:8.0-alpine | 6379 | Cache & session history |
| redis-commander | redis-commander:latest | 8081 | Redis UI (optional) |

Environment Variables:

# Created automatically via docker-compose
OLLAMA_BASE_URL: http://ollama:11434
QDRANT_URL: http://qdrant:6333
DATABASE_URL: postgresql://raguser:ragpass@postgres/ragdb
REDIS_URL: redis://redis:6379/0

Configuration

docker-compose.yml includes:

  • PostgreSQL with health checks
  • Qdrant with persistent storage
  • Redis with persistence
  • Redis Commander for debugging
  • Custom network for service-to-service communication
  • Volume management for data persistence

Production Deployment

For cloud deployment (AWS, GCP, Azure):

  1. Build image for production:

    docker build -t your-registry/chat-with-your-data:latest .
    
  2. Push to registry:

    docker push your-registry/chat-with-your-data:latest
    
  3. Deploy with orchestration (Kubernetes, ECS, Cloud Run):

    # Update image reference in deployment configs
    kubectl apply -f k8s-deployment.yaml  # if using K8s
    
  4. Environment configuration:

    • Set OLLAMA_BASE_URL to your LLM provider
    • Configure QDRANT_URL for cloud Qdrant instance
    • Set database credentials in secrets management
    • Configure CORS for your domain

Development with Docker

If using Docker for development (with Ollama on host):

# Expose host Ollama to container
docker run \
  -p 8000:8000 \
  -p 8501:8501 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v $(pwd):/fastAPI \
  chat-with-your-data-app:latest

Troubleshooting

| Issue | Solution |
| --- | --- |
| Port already in use | `docker-compose down`, then retry or change ports in docker-compose.yml |
| Out of memory | Increase Docker memory: Preferences → Resources → Memory |
| PostgreSQL won't start | Delete the volume: `docker volume rm chat-with-your-data_postgres-data` |
| Qdrant connection error | Wait ~10s for startup: `docker-compose logs qdrant` |
| Container won't build | `docker-compose build --no-cache` to rebuild fresh |

Comparison: Virtual Environment vs Docker

| Aspect | Virtual Environment | Docker |
| --- | --- | --- |
| Setup Time | 5-10 min | 2-3 min |
| Dependencies | Manual (Ollama, PG, Redis) | Automatic |
| Hot Reload | ✅ Yes | ❌ No (manual rebuild) |
| Isolation | ❌ Shared system | ✅ Complete |
| Debugging | ✅ Direct Python logs | Logs via docker-compose |
| Production Ready | ⚠️ With care | ✅ Yes |
| CI/CD Ready | ❌ No | ✅ Yes |
| Disk Space | ~500 MB | ~2 GB |

Recommendation:

  • Development: Use Virtual Environment for faster iteration and hot-reload
  • Production: Use Docker for consistency, isolation, and easy deployment

Running the Pre-Built Production Image (Hugging Face Spaces build)

  1. Create the container. GOOGLE_API_KEY is a necessary environment variable, and only port 7860 is exposed:

    docker create --name chat-with-your-data-cont-prod \
        -e GOOGLE_API_KEY="" \
        -p 7860:7860 \
        BBS/chat-with-your-data:prod
    
  2. Start the Docker container:

    docker start -a chat-with-your-data-cont-prod
    
  3. You can now access the Project at http://localhost:7860

🛡️ Extra Measures

Reset Project

For Docker Deployment:

# Remove all containers and volumes (complete reset)
docker-compose down -v

# Rebuild and restart
docker-compose build --no-cache
docker-compose up -d

# View initialization logs
docker-compose logs -f app

For Virtual Environment:

# Remove cache files
find . -type d -name "__pycache__" -exec rm -r {} +  # Linux/Mac
# or
Get-ChildItem -Recurse -Directory -Filter "__pycache__" | Remove-Item -Recurse -Force  # Windows

# Clear user uploads
rm -rf ./user_uploads/

# Reset local PostgreSQL (if using Homebrew)
brew services stop postgresql
rm -rf /usr/local/var/postgres
brew services start postgresql

Persistent Storage with Docker

To persist user uploads and database across container restarts:

The docker-compose.yml already includes persistent volumes:

  • postgres-data: PostgreSQL database persistence
  • qdrant-data: Qdrant vector store persistence
  • redis-data: Redis cache persistence
  • ./user_uploads: Local directory mounted for user files

No additional setup needed - data persists automatically!

Using Host Ollama with Docker

If Ollama is running on your host machine:

macOS/Windows:

  • Works automatically via host.docker.internal:11434
  • Set in docker-compose: OLLAMA_BASE_URL=http://host.docker.internal:11434

Linux:

  • In docker-compose.yml, either use host networking or map the host gateway:
    network_mode: "host"
    # Or add to the service:
    extra_hosts:
      - "host.docker.internal:host-gateway"
    
  • With plain docker run, add: --add-host=host.docker.internal:host-gateway
    

Ollama Models:

  • To change LLM or Embedding model:

    • Go to the ./server/llm_system/config.py file.
    • It is the central configuration file for the project; any constant used across the project can be changed there.
    • Two different models are defined in the config, but the same model is currently used for both response generation and summarization; to change this, update the summarization model in server.py (≈ line 63).
  • To change inference device:

    • The LLM is configured to run on the GPU and the embedding model on the CPU.
    • A value of 0 means 100% CPU, -1 means 100% GPU, and any other number offloads that many of the model's layers to the GPU.
    • Delete this parameter if you are unsure of these values or your hardware's capabilities; Ollama dynamically offloads layers to the GPU based on available resources.
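
As a minimal sketch of that device split (assuming langchain-ollama's model classes and their num_gpu option; the values shown are illustrative, not the project's exact config):

```python
from langchain_ollama import ChatOllama, OllamaEmbeddings

# num_gpu is assumed to be supported by both classes here
llm = ChatOllama(model="gemma3:latest", num_gpu=-1)                   # LLM fully on GPU
embeddings = OllamaEmbeddings(model="mxbai-embed-large", num_gpu=0)   # embeddings on CPU
```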

If you are using Docker, make these changes in the ./docker/dev_* files instead.

To test some sub-components:

  • Run them as modules from the server directory so that relative imports resolve correctly:
    cd server
    python -m llm_system.utils.loader
    

Cache Architecture & Configuration

How the Cache Works

The system implements an intelligent query response caching layer that dramatically improves performance for repeated questions:

  1. Cache Key Generation: Uses SHA256 hash of the normalized question (lowercase, trimmed)

    • Global cache: Same question returns cached answer regardless of user
    • Normalized matching: Handles minor casing and whitespace variations
  2. Cache Operations:

    • On Request: Check if question exists in cache (if not expired)
    • Cache Hit: Return cached response in <100ms (no LLM call needed)
    • Cache Miss: Generate response via LLM, then cache for future requests
  3. TTL & Eviction:

    • Default TTL: 1 hour (configurable via ResponseCache(ttl_seconds=3600))
    • Max Size: 500 responses (configurable _CACHE_MAX_SIZE)
    • Eviction: LRU (least recently used) when cache exceeds max size
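
A minimal sketch of this behaviour (illustrative only, not the project's ResponseCache implementation):

```python
import hashlib
import time
from collections import OrderedDict


class ResponseCacheSketch:
    """SHA-256 key over the normalized question, TTL expiry, LRU eviction."""

    def __init__(self, ttl_seconds: int = 3600, max_size: int = 500):
        self.ttl = ttl_seconds
        self.max_size = max_size
        self._store: OrderedDict[str, tuple[float, str]] = OrderedDict()

    @staticmethod
    def _key(question: str) -> str:
        return hashlib.sha256(question.strip().lower().encode()).hexdigest()

    def get(self, question: str) -> str | None:
        key = self._key(question)
        entry = self._store.get(key)
        if entry is None:
            return None                      # cache miss
        created_at, answer = entry
        if time.time() - created_at > self.ttl:
            del self._store[key]             # expired entry
            return None
        self._store.move_to_end(key)         # mark as recently used
        return answer                        # cache hit

    def set(self, question: str, answer: str) -> None:
        key = self._key(question)
        self._store[key] = (time.time(), answer)
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used
```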

Configuration

Edit /server/llm_system/config.py or use environment variables:

# In server.py lifespan:
app.state.response_cache = ResponseCache(ttl_seconds=3600)  # 1 hour TTL

Monitoring Cache Performance

Check cache stats via debug endpoint:

curl http://localhost:8000/cache-debug | jq

Response shows current cache size and stored keys:

{
  "cache_size": 5,
  "cache_keys": ["61bf208516ecd284", "7a2c443f92b8e1ac"],
  "entries": [...]
}

Check logs for cache operations:

docker-compose exec app grep "Cache HIT\|CACHE HIT" /fastAPI/app.log

Disabling Cache (if needed)

For real-time information without caching:

# In server.py, comment out the cache check:
# if not dummy:
#     cached_answer = response_cache.get(chat_request.query, session_id)

🚀 Future Work

  • Add support for more file formats (DOCX, PPTX, Excel, etc.)
  • Implement web-based document loading (scrape websites on-the-fly)
  • Redis-backed distributed cache for multi-instance deployments
  • PostgreSQL migration for production-scale user management
  • Implement semantic caching (cache based on question embeddings, not exact match)
  • Add document versioning and change tracking
  • GPU optimization for faster embedding generation
  • LLM fine-tuning on domain-specific documents
  • Advanced analytics dashboard for cache hit rates and performance metrics
  • Multi-language support and cross-lingual search

🤝 Contributions

Any contributions or suggestions are welcome!

📜 License

Code-License

  • This project is licensed under the GNU General Public License v3.0
  • See the LICENSE file for details.
  • You can use the code with proper credits to the author.

📧 Contact