---
title: Chat with Your Data
emoji: πŸ¦™
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app.py
pinned: false
license: gpl-3.0
---

# `Chat with your data` πŸš€

A **production-grade document intelligence system** built with Google DeepMind's **Gemma 3** LLM served locally via Ollama. This system enables users to upload documents (PDFs, TXT, Markdown, etc.) and chat with their content using natural language queriesβ€”all processed locally for privacy and complete control.

Designed with **modularity, performance, and production standards** in mind, the system handles end-to-end RAG workflows including file ingestion, vector embeddings, history summarization, semantic search, context-aware generation, and streaming responses. Features include multi-file support per user, persistent session history, document management, and **intelligent query caching** for 700x faster responses on repeated queries.

Perfect for building **educational assistants, personal knowledge bases, and enterprise document systems** with full local control and no cloud dependencies.

Check out the live project deployment:

[![HuggingFace Space Deployment Link](https://img.shields.io/badge/sanchit--shaleen/-chat--with--your--data-ff8800.svg?logo=huggingface)](https://huggingface.co/spaces/sanchit-shaleen/chat-with-your-data)

# πŸ“ƒ Index:

- [RAG with Gemma-3](#rag-with-gemma-3)
- [Project Details](#-project-details)
  - [Aim](#aim)
  - [Methodology](#methodology)
  - [Features](#features)
- [Tech Stack](#-tech-stack)
- [Installation](#%EF%B8%8F-installation)
  - [Virtual Environment (Development)](#virtual-environment)
  - [Docker (Production)](#-docker)
  - [Comparison & Recommendation](#comparison-virtual-environment-vs-docker)
- [Extra Measures](#%EF%B8%8F-extra-measures)
  - [Reset Project](#reset-project)
  - [Persistent Storage](#persistent-storage-with-docker)
  - [Host Ollama Setup](#using-host-ollama-with-docker)
- [Future Work](#-future-work)
- [Contributions](#-contributions)
- [License](#-license)
- [Contact](#-contact)

# 🎯 Project Details:

## Aim

The core objective of this project is to build a **robust RAG system** with modern components, a clean modular design, and proper error handling.

## Methodology

1. Make a responsive UI in `Streamlit` that lets users upload documents, preview them to confirm correctness, and interact with them.
1. Use `FastAPI` to build a backend that handles file uploads, document processing, user authentication, and streaming LLM responses.
1. Code a modular `LLM System` using `LangChain` components for chains, embeddings, retrievers, vector storage, history management, output parsers, and overall LLM orchestration.
1. Integrate the locally hosted `Gemma-3` LLM using `Ollama` for local inference.
1. Use `Qdrant` for efficient vector storage, similarity search, and user-specific document storage and retrieval.
1. Use `PostgreSQL` for user management, authentication, and data control.
1. Create a dynamic `Docker` setup for easy deployment as either a development or production environment.
1. Deploy the project on `Hugging Face Spaces` for easy access and demonstration.

> [!Note]
> Due to hosting limitations of Gemma3, the Hugging Face Space deployment uses `Google Gemini-2.0-Flash-Lite` as the LLM backend.
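To make the workflow above concrete, here is a minimal, illustrative sketch of the retrieval-and-generation step using the same stack (LangChain + Ollama + Qdrant). The collection name, prompt, and wiring are assumptions for demonstration, not the project's actual modules:

```python
# Minimal RAG sketch (illustrative only): retrieve chunks from Qdrant and
# answer with a locally served Gemma-3 model via Ollama.
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from qdrant_client import QdrantClient

embeddings = OllamaEmbeddings(model="mxbai-embed-large")
client = QdrantClient(url="http://localhost:6333")

# Assumes the "rag_documents" collection already exists and holds embedded chunks.
vector_store = QdrantVectorStore(
    client=client, collection_name="rag_documents", embedding=embeddings
)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})

llm = ChatOllama(model="gemma3:latest")
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context.\n\n{context}"),
    ("human", "{question}"),
])

def answer(question: str) -> str:
    # Retrieve relevant chunks, stuff them into the prompt, and generate.
    docs = retriever.invoke(question)
    context = "\n\n".join(doc.page_content for doc in docs)
    chain = prompt | llm | StrOutputParser()
    return chain.invoke({"context": context, "question": question})
```

The real system layers history summarization, per-user metadata filters, streaming, and response caching on top of this basic flow.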
## RAG Samples:

- Q: Highest possible grade:
  [![RAG Sample Q1](./Docs/5_Doc_Retr_Que.png)](./Docs/5_Doc_Retr_Que.png)
  [![RAG Sample A1](./Docs/5_Doc_Retr_Ans.png)](./Docs/5_Doc_Retr_Ans.png)
- Q: Formatted Output:
  [![RAG Sample Q2](./Docs/2.3_Sample_Ans.png)](./Docs/2.3_Sample_Ans.png)

## Performance Benchmarks

Tested on local Ollama with Gemma3:latest on Apple Silicon:

| Scenario | Latency | Notes |
|----------|---------|-------|
| **First Query (Cache Miss)** | ~70-90s | Full LLM generation, response cached |
| **Repeated Query (Cache Hit)** | <100ms | Retrieved from in-memory cache, 700x faster |
| **With Metrics (Disabled)** | ~70-90s | Metrics computed in background (non-blocking) |
| **With Metrics (Enabled)** | ~75-95s | +5s for Answer Relevancy & Faithfulness checks |
| **P50 Latency (Mixed Workload)**\* | ~30-40s | With 30-40% cache hit rate typical for RAG |
| **P99 Latency (Mixed Workload)**\* | ~50-60s | Includes occasional cold cache misses |

\*Assuming typical RAG usage patterns with a 30-40% query repetition rate.

## Features

### ⚑ Performance Optimizations

- **Query Response Caching**: Intelligent in-memory caching of RAG responses with TTL-based expiration
  - Cache hits return in <100ms (vs 70+ seconds for LLM generation)
  - ~700x performance improvement for repeated queries
  - Global cache key based on normalized questions (identical questions = cache hit regardless of user)
  - Configurable TTL and max cache size with automatic LRU eviction
- **Async Evaluation**: Non-blocking background task evaluation for metrics computation
  - Metrics computed asynchronously without blocking the response stream
  - <8 second P99 latency even with evaluation enabled
  - Production-grade error handling and timeouts
- **Reference-Free Metrics**: LLM-as-Judge evaluation using only the question and generated answer
  - Answer Relevancy: Measures how well the answer addresses the question
  - Faithfulness: Ensures the answer is grounded in the retrieved documents
  - No ground truth required (perfect for open-ended knowledge bases)

### User Authentication

+ Authenticate users against the `PostgreSQL` database with `bcrypt`-based password hashing and salting.
  [![User Registration Screenshot](./Docs/1_Auth.png)](./Docs/1_Auth.png)
+ Store user data securely and auto-clear stale session data.

### UI and User Controls

+ Build a responsive UI as a `Streamlit` app. Provide a chat interface for users to ask questions about their documents, get file previews, and receive context-aware responses.
  [![User File Preview](./Docs/2.2_Preview.png)](./Docs/2.2_Preview.png)
+ User-uploaded files and their corresponding data are tracked in a **PostgreSQL** database.
+ Allow users to delete their uploaded documents and manage their session history.
  [![User Chat Screenshot](./Docs/3_Clear_Chat_Hist.png)](./Docs/3_Clear_Chat_Hist.png)
+ Note: File previews are cached for 10 minutes, so a preview may remain available for that duration even after the file is deleted.
+ Works with FastAPI SSE to show real-time responses from the LLM, along with the retrieved documents and metadata for verification (a small client-side sketch follows this section).
  [![Source Documents Screenshot](./Docs/7_Metadata_n_Src.png)](./Docs/7_Metadata_n_Src.png)
+ The UI also supports **thinking models**, showing the LLM's thought process while generating responses.
  [![Thinking Model Screenshot](./Docs/8_Thinking_Support.png)](./Docs/8_Thinking_Support.png)
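For illustration, a client could consume the NDJSON stream roughly like this; the endpoint path and payload keys below are assumptions, not the project's actual API:

```python
# Illustrative client-side consumption of an NDJSON streaming response.
import json
import requests

API_URL = "http://localhost:8000/chat"  # assumed endpoint; adjust to the real route

def stream_answer(question: str):
    """Yield answer tokens as they arrive, one JSON object per line (NDJSON)."""
    with requests.post(API_URL, json={"query": question}, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            event = json.loads(line)
            if event.get("type") == "token":
                yield event.get("data", "")

# Inside the Streamlit app, the generator can be rendered incrementally:
# import streamlit as st
# st.write_stream(stream_answer(user_question))
```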
### User-wise Document Management

+ Support **multi-file embeddings** per user, allowing users to upload multiple documents and retrieve relevant information based on their queries.
+ Some documents can also be added as ***public*** documents, accessible to all users (such as shared rulebooks, manuals, or documentation).

### Embeddings, Vector Storage and Retrieval

+ Implement **vector embeddings** using `LangChain` components to convert documents into vector representations.
+ The open-source `mxbai-embed-large` model is used for generating embeddings; it is lightweight and efficient.
+ Use `Qdrant` for efficient vector storage and retrieval of user-specific and public documents, with persistent storage.
+ Integrate **similarity search** and document retrieval with Gemma-based LLM responses.

### FastAPI Backend

+ Build a **FastAPI** backend to handle file uploads, document processing, user authentication, and streaming LLM responses.
+ Integrate with the `LLM System` module to handle LLM tasks.
+ Provide status updates to the UI for long-running tasks:
  [![Step By Step Updates Screenshot](./Docs/5_Doc_Retr_Que.png)](./Docs/5_Doc_Retr_Que.png)
+ Implement **Server-Sent Events (`SSE`)** for real-time streaming of LLM responses to the frontend, using the ***NDJSON*** format for data transfer (a server-side sketch follows the Features section).
  [![SSE Streaming Screenshot](./Docs/6_Streaming_Resp.png)](./Docs/6_Streaming_Resp.png)
+ Provide the UI with retrieved documents and metadata for verification of responses.

### LLM System

+ Modular `LLM System` using `LangChain` components for:
  1. **Document Ingestion**: Load files and process them into document chunks.
  1. **Vector Embedding**: Convert documents into vector representations.
  1. **History Summarization**: Summarize user session history for querying vector embeddings and retrieving relevant documents.
  1. **Document Retrieval**: Fetch relevant documents based on the standalone query and the user's metadata filters.
  1. **History Management**: Maintain session history for context-aware interactions.
  1. **Response Generation**: Generate context-aware responses using the LLM.
  1. **Tracing**: Enable tracing of LLM interactions using `LangSmith` for debugging and monitoring.
  1. **Models**: Use `Ollama` to run the **Gemma-3** LLM and **mxbai embeddings** locally for inference, ensuring low latency and privacy.

### Dockerization

+ Create a dynamic `Docker` setup for easy deployment as either a development or production environment.
+ Use the [`Dockerfile`](./Dockerfile) to run both the [FastAPI](./server/server.py) and [Streamlit](./app.py) servers in a single container (mainly due to Hugging Face Spaces limitations).
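As referenced in the FastAPI Backend bullets above, a server-side NDJSON streaming endpoint can be sketched as follows. This is a minimal standalone example; the route name, request model, and placeholder token generator are assumptions rather than the project's actual code:

```python
# Minimal sketch of NDJSON streaming from FastAPI (illustrative only).
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    query: str

async def generate_events(query: str):
    # In the real system this would stream tokens from the RAG chain;
    # here we just emit a few placeholder events, one JSON object per line.
    for token in ["Answering", " your", " question..."]:
        yield json.dumps({"type": "token", "data": token}) + "\n"
    yield json.dumps({"type": "done"}) + "\n"

@app.post("/chat")
async def chat(request: ChatRequest):
    # NDJSON: newline-delimited JSON, flushed as it is produced.
    return StreamingResponse(
        generate_events(request.query),
        media_type="application/x-ndjson",
    )
```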
# πŸ§‘β€πŸ’» Tech Stack

- 🦜 **LangChain** - LLM orchestration and RAG chains
- ⚑ **FastAPI** - High-performance async API backend
- πŸ‘‘ **Streamlit** - Interactive web UI for document management and chat
- πŸ‹ **Docker** - Containerization for reproducible deployments
- πŸ¦™ **Ollama** - Local inference engine
  - Gemma-3 - Large language model
  - mxbai-embed-large - Embedding model for semantic search
- πŸ—„οΈ **PostgreSQL** - User authentication, file metadata, and session management
- 🎯 **Qdrant** - Vector database for semantic search and document embeddings
- πŸ› οΈ **DeepEval** - LLM-as-Judge evaluation metrics (Answer Relevancy, Faithfulness)
- πŸ’Ύ **In-Memory Cache** - Query response caching with TTL (700x faster on cache hits)
- πŸ›‘οΈ **bcrypt** - Secure password hashing for authentication
- πŸ“Š **LangSmith** - LLM tracing and monitoring

Others:

- πŸ€— **Hugging Face Spaces**:
  + Deploy the project in a Docker container using the Dockerfile.
- :octocat: **GitHub Actions and Branch Protection**:
  + Process the repository for auto-deployment to Hugging Face Spaces.
  + Check for any secret leaks in code.
  + Fail the commit on any secret leaks.

# πŸ› οΈ Installation

Choose your deployment approach:

- **πŸ–₯️ [Virtual Environment (Development)](#virtual-environment)**: Local development with hot-reload
- **πŸ‹ [Docker (Production)](#-docker)**: Containerized setup with full isolation

## Virtual Environment

**Best for**: Local development, debugging, and testing with hot-reload capability.

**Prerequisites:**
- Python 3.12+
- Ollama with the `gemma3:latest` model
- PostgreSQL (optional, can use the memory backend)
- Redis (optional, for distributed cache)
- Qdrant (optional, can run via Docker or use in-memory mode)

### Step 1: Clone & Setup

```bash
git clone --depth 1 https://github.com/Bbs1412/chat-with-your-data.git
cd chat-with-your-data

# Create virtual environment
python3 -m venv venv

# Activate environment
source venv/bin/activate   # Linux/Mac
# or
venv\Scripts\activate      # Windows
```

### Step 2: Install Dependencies

```bash
pip install --upgrade pip
pip install -r requirements.txt
```

**Installs:**
- FastAPI + Uvicorn (async API server)
- Streamlit (interactive web UI)
- LangChain + Ollama integration
- Qdrant client (vector search)
- PostgreSQL driver (psycopg2)
- Redis client
- DeepEval (LLM-as-Judge metrics)
- All other dependencies

### Step 3: Configure Environment (Optional)

Create a `.env` file in the project root:

```bash
# Ollama configuration
OLLAMA_BASE_URL=http://localhost:11434

# Qdrant vector database
QDRANT_URL=http://localhost:6333
QDRANT_COLLECTION_NAME=rag_documents

# PostgreSQL (if not using defaults)
# DATABASE_URL=postgresql://user:password@localhost/ragdb

# Redis for chat history backend
REDIS_URL=redis://localhost:6379/0
REDIS_HISTORY_TTL_SECONDS=0

# Optional: LangSmith tracing
LANGCHAIN_TRACING_V2=false
# LANGCHAIN_API_KEY=your_key_here
```

### Step 4: Start Services (3 Terminal Windows)

**Terminal 1 - Ollama:**
```bash
ollama serve
# Or ensure the model is pulled:
ollama pull gemma3:latest
```

**Terminal 2 - FastAPI Backend:**
```bash
cd chat-with-your-data
source venv/bin/activate
cd server
uvicorn server:app --reload --port 8000
```

Access at:
- **API**: http://localhost:8000
- **Swagger UI**: http://localhost:8000/docs
- **Cache Debug**: http://localhost:8000/cache-debug

**Terminal 3 - Streamlit Frontend:**
```bash
cd chat-with-your-data
source venv/bin/activate
streamlit run app.py --server.port 8501
```

Access at: http://localhost:8501

### Optional: Start PostgreSQL & Redis (macOS with Homebrew)

```bash
# PostgreSQL
brew services start postgresql

# Redis
brew services start redis

# Check status
brew services list | grep -E "postgresql|redis"
```

### Optional: Qdrant Vector Database

**Option A: Docker (Recommended)**
```bash
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant:v1.12.5
```
Access the UI at: http://localhost:6333/dashboard

**Option B: In-Memory Mode**

Edit `server/llm_system/core/qdrant_database.py` and use the in-memory client for ephemeral storage.

### Quick Test

```bash
# 1. Open browser β†’ http://localhost:8501
# 2. Register/Login with test credentials
# 3. Upload a PDF or text document
# 4. Embed the document
# 5. Ask questions about it
# 6. Check FastAPI logs for cache hits/misses (⚑ Cache HIT!)
```
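If you prefer a scripted check over the manual steps above, a small smoke test like the following (purely illustrative; it only touches URLs already listed in this section) can confirm that the services are reachable:

```python
# Quick reachability check for the locally running services.
import requests

SERVICES = {
    "FastAPI (Swagger)": "http://localhost:8000/docs",
    "Cache debug": "http://localhost:8000/cache-debug",
    "Qdrant dashboard": "http://localhost:6333/dashboard",
    "Streamlit UI": "http://localhost:8501",
}

for name, url in SERVICES.items():
    try:
        response = requests.get(url, timeout=5)
        print(f"{name}: HTTP {response.status_code}")
    except requests.RequestException as exc:
        print(f"{name}: unreachable ({exc})")
```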
### Troubleshooting

| Issue | Solution |
|-------|----------|
| **Port 8000 in use** | `lsof -ti:8000 \| xargs kill -9` |
| **Ollama not found** | `brew install ollama && ollama pull gemma3:latest` |
| **PostgreSQL error** | Use the memory backend: `HISTORY_BACKEND=memory` in config |
| **Redis not found** | `brew install redis && brew services start redis` |
| **Qdrant connection error** | Run the Docker version or edit the config for in-memory mode |

## πŸ‹ Docker

**Best for**: Production deployments, CI/CD pipelines, and consistent environments across machines.

**Features:**
- Complete isolation with containerization
- All services bundled (PostgreSQL, Qdrant, Redis)
- Easy deployment to cloud platforms
- No dependency conflicts on the host machine
- docker-compose for multi-container orchestration

### Quick Start (Recommended)

**Prerequisites:**
- Docker & Docker Compose installed
- 4GB+ RAM available
- Ports 8000, 8501, 6333, 5432, and 6379 available

**Step 1: Build & Start All Services**

```bash
cd chat-with-your-data

# Build fresh image without cache
docker-compose build --no-cache

# Start all containers (detached mode)
docker-compose up -d

# Verify all services are running
docker-compose ps
```

**Step 2: Access Services**

```
FastAPI Backend:     http://localhost:8000
FastAPI Swagger UI:  http://localhost:8000/docs
Streamlit Frontend:  http://localhost:8501
Qdrant UI:           http://localhost:6333/dashboard
Redis Commander:     http://localhost:8081
PostgreSQL:          localhost:5432
```

**Step 3: View Logs**

```bash
# View all logs
docker-compose logs -f

# View specific service
docker-compose logs -f app
docker-compose logs -f postgres
docker-compose logs -f qdrant
```

**Step 4: Stop Services**

```bash
# Stop all containers
docker-compose down

# Stop and remove volumes (complete reset)
docker-compose down -v
```

### Architecture Overview

**Services (docker-compose):**

| Service | Image | Port | Purpose |
|---------|-------|------|---------|
| **app** | python:3.12-slim | 8000, 8501 | FastAPI + Streamlit |
| **postgres** | postgres:16-alpine | 5432 | User data & file metadata |
| **qdrant** | qdrant/qdrant:v1.12.5 | 6333 | Vector database |
| **redis** | redis:8.0-alpine | 6379 | Cache & session history |
| **redis-commander** | redis-commander:latest | 8081 | Redis UI (optional) |

**Environment Variables:**

```yaml
# Created automatically via docker-compose
OLLAMA_BASE_URL: http://ollama:11434
QDRANT_URL: http://qdrant:6333
DATABASE_URL: postgresql://raguser:ragpass@postgres/ragdb
REDIS_URL: redis://redis:6379/0
```

### Configuration

**docker-compose.yml** includes:
- PostgreSQL with health checks
- Qdrant with persistent storage
- Redis with persistence
- Redis Commander for debugging
- A custom network for service-to-service communication
- Volume management for data persistence

### Production Deployment

For cloud deployment (AWS, GCP, Azure):

1. **Build the image for production:**
   ```bash
   docker build -t your-registry/chat-with-your-data:latest .
   ```
2. **Push to a registry:**
   ```bash
   docker push your-registry/chat-with-your-data:latest
   ```
3. **Deploy with orchestration** (Kubernetes, ECS, Cloud Run):
   ```bash
   # Update image reference in deployment configs
   kubectl apply -f k8s-deployment.yaml  # if using K8s
   ```
4. **Environment configuration:**
   - Set `OLLAMA_BASE_URL` to your LLM provider
   - Configure `QDRANT_URL` for a cloud Qdrant instance
   - Set database credentials in secrets management
   - Configure CORS for your domain

### Development with Docker

If using Docker for development (with Ollama on the host):

```bash
# Expose host Ollama to the container
docker run \
  -p 8000:8000 \
  -p 8501:8501 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v $(pwd):/fastAPI \
  chat-with-your-data-app:latest
```

### Troubleshooting

| Issue | Solution |
|-------|----------|
| **Port already in use** | `docker-compose down` then retry, or change ports in docker-compose.yml |
| **Out of memory** | Increase Docker memory: Preferences β†’ Resources β†’ Memory |
| **PostgreSQL won't start** | Delete volume: `docker volume rm chat-with-your-data_postgres-data` |
| **Qdrant connection error** | Wait 10s for startup: `docker-compose logs qdrant` |
| **Container won't build** | `docker-compose build --no-cache` to rebuild fresh |

---

## Comparison: Virtual Environment vs Docker

| Aspect | Virtual Environment | Docker |
|--------|---------------------|--------|
| **Setup Time** | 5-10 min | 2-3 min |
| **Dependencies** | Manual (Ollama, PG, Redis) | Automatic |
| **Hot Reload** | βœ… Yes | ❌ No (manual rebuild) |
| **Isolation** | ❌ Shared system | βœ… Complete |
| **Debugging** | βœ… Direct Python logs | Logs via docker-compose |
| **Production Ready** | ⚠️ With care | βœ… Yes |
| **CI/CD Ready** | ❌ | βœ… Yes |
| **Disk Space** | ~500MB | ~2GB |

### Recommendation:

- **Development**: Use the Virtual Environment for faster iteration and hot-reload
- **Production**: Use Docker for consistency, isolation, and easy deployment

# πŸ›‘οΈ Extra Measures

## Reset Project

### For Docker Deployment:

```bash
# Remove all containers and volumes (complete reset)
docker-compose down -v

# Rebuild and restart
docker-compose build --no-cache
docker-compose up -d

# View initialization logs
docker-compose logs -f app
```

### For Virtual Environment:

```bash
# Remove cache files
find . -type d -name "__pycache__" -exec rm -r {} +   # Linux/Mac
# or
Get-ChildItem -Recurse -Directory -Filter "__pycache__" | Remove-Item -Recurse -Force   # Windows

# Clear user uploads
rm -rf ./user_uploads/

# Reset local PostgreSQL (if using Homebrew)
brew services stop postgresql
rm -rf /usr/local/var/postgres
brew services start postgresql
```

## Persistent Storage with Docker

To persist user uploads and the database across container restarts, the docker-compose.yml already includes persistent volumes:

- `postgres-data`: PostgreSQL database persistence
- `qdrant-data`: Qdrant vector store persistence
- `redis-data`: Redis cache persistence
- `./user_uploads`: Local directory mounted for user files

**No additional setup needed** - data persists automatically!
## Using Host Ollama with Docker

If Ollama is running on your host machine:

**macOS/Windows:**
- Works automatically via `host.docker.internal:11434`
- Set in docker-compose: `OLLAMA_BASE_URL=http://host.docker.internal:11434`

**Linux:**
- Add a network flag to docker-compose:
  ```yaml
  network_mode: "host"
  # Or add to services:
  extra_hosts:
    - "host.docker.internal:host-gateway"
  ```
- Or, when running the container directly with `docker run`, pass `--add-host=host.docker.internal:host-gateway`.

## Ollama Models:

- To change the LLM or embedding model:
    + Go to the [`./server/llm_system/config.py`](./server/llm_system/config.py) file.
    + It is the central configuration file for the project.
    + Any constant used in the project can be changed there.
    + Two different models are saved in the config, but the same model is currently used for both response generation and summarization. To change this, update the summarization model in `server.py` (β‰ˆ line 63).
- To change the inference device:
    + The LLM model is configured to run on the GPU and the embedding model on the CPU.
    + If you want to use the GPU for embeddings too, change the **num_gpu** parameter in [`./server/llm_system/core/database.py`](./server/llm_system/core/database.py) (β‰ˆ line 58).
    + 0 means 100% CPU, -1 means 100% GPU, and any other number specifies how many of the model's layers are offloaded to the GPU.
    + Delete this parameter if you are unsure of these values or your hardware capabilities; Ollama dynamically offloads layers to the GPU based on available resources.

> [!Note]
> If you are using Docker, make sure to make these changes in the [`./docker/dev_*`](./docker/) files.

## To test some sub-components:

- Run them from the `server` directory; this ensures that relative imports work correctly in the project.

```bash
cd server
python -m llm_system.utils.loader
```

## Cache Architecture & Configuration

### How the Cache Works

The system implements an intelligent query response caching layer that dramatically improves performance for repeated questions (a minimal sketch of this pattern is shown at the end of this section):

1. **Cache Key Generation**: Uses a SHA256 hash of the normalized question (lowercase, trimmed)
   - Global cache: The same question returns the cached answer regardless of user
   - Normalized matching: Handles minor whitespace/punctuation variations
2. **Cache Operations**:
   - **On Request**: Check whether the question exists in the cache (and has not expired)
   - **Cache Hit**: Return the cached response in <100ms (no LLM call needed)
   - **Cache Miss**: Generate the response via the LLM, then cache it for future requests
3. **TTL & Eviction**:
   - Default TTL: 1 hour (configurable via `ResponseCache(ttl_seconds=3600)`)
   - Max Size: 500 responses (configurable via `_CACHE_MAX_SIZE`)
   - Eviction: LRU (least recently used) when the cache exceeds its max size

### Configuration

Edit `/server/llm_system/config.py` or use environment variables:

```python
# In server.py lifespan:
app.state.response_cache = ResponseCache(ttl_seconds=3600)  # 1 hour TTL
```

### Monitoring Cache Performance

Check cache stats via the debug endpoint:

```bash
curl http://localhost:8000/cache-debug | jq
```

The response shows the current cache size and stored keys:

```json
{
  "cache_size": 5,
  "cache_keys": ["61bf208516ecd284", "7a2c443f92b8e1ac"],
  "entries": [...]
}
```

Check the logs for cache operations:

```bash
docker-compose exec app grep "Cache HIT\|CACHE HIT" /fastAPI/app.log
```

### Disabling Cache (if needed)

For real-time information without caching:

```python
# In server.py, comment out the cache check:
# if not dummy:
#     cached_answer = response_cache.get(chat_request.query, session_id)
```
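For reference, the TTL + LRU behaviour described above can be sketched in a few lines. This is an illustrative stand-in, not the project's actual `ResponseCache` implementation:

```python
# Minimal TTL + LRU response cache sketch (names are illustrative assumptions).
import hashlib
import time
from collections import OrderedDict

class SimpleResponseCache:
    def __init__(self, ttl_seconds: int = 3600, max_size: int = 500):
        self.ttl = ttl_seconds
        self.max_size = max_size
        self._store: OrderedDict[str, tuple[float, str]] = OrderedDict()

    @staticmethod
    def _key(question: str) -> str:
        # Normalize (trim, lowercase) then hash, so minor variations still hit.
        return hashlib.sha256(question.strip().lower().encode()).hexdigest()

    def get(self, question: str) -> str | None:
        key = self._key(question)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if time.time() - stored_at > self.ttl:
            del self._store[key]              # expired entry
            return None
        self._store.move_to_end(key)          # mark as recently used
        return answer

    def set(self, question: str, answer: str) -> None:
        key = self._key(question)
        self._store[key] = (time.time(), answer)
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)   # evict least recently used
```

An `OrderedDict` keeps usage order, so LRU eviction reduces to popping the oldest entry once the size limit is exceeded.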
---

# πŸš€ Future Work

- Add support for more file formats (DOCX, PPTX, Excel, etc.)
- Implement web-based document loading (scrape websites on the fly)
- Redis-backed distributed cache for multi-instance deployments
- PostgreSQL migration for production-scale user management
- Implement semantic caching (cache based on question embeddings, not exact matches)
- Add document versioning and change tracking
- GPU optimization for faster embedding generation
- LLM fine-tuning on domain-specific documents
- Advanced analytics dashboard for cache hit rates and performance metrics
- Multi-language support and cross-lingual search

# 🀝 Contributions

Any contributions or suggestions are welcome!

# πŸ“œ License

[![Code-License](https://img.shields.io/badge/License%20-GNU%20--%20GPL%20v3.0-blue.svg?logo=GNU)](https://www.gnu.org/licenses/gpl-3.0)

- This project is licensed under the `GNU General Public License v3.0`
- See the [LICENSE](LICENSE) file for details.
- You can use the code with proper credit to the author.

# πŸ“§ Contact

- **Forked from** - [Bbs1412/chat-with-your-data](https://github.com/Bbs1412/chat-with-your-data)
- **Original Author** - Bhushan Songire
- **Enhanced & Deployed by** - Sanchit Shaleen