---
title: Chat with Your Data
emoji: 🦙
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app.py
pinned: false
license: gpl-3.0
---

Chat with your data 🚀

A production-grade document intelligence system built with Google DeepMind's Gemma 3 LLM served locally via Ollama. This system enables users to upload documents (PDFs, TXT, Markdown, etc.) and chat with their content using natural language queriesβ€”all processed locally for privacy and complete control.

Designed with modularity, performance, and production standards in mind, the system handles end-to-end RAG workflows including file ingestion, vector embeddings, history summarization, semantic search, context-aware generation, and streaming responses. Features include multi-file support per user, persistent session history, document management, and intelligent query caching for 700x faster responses on repeated queries.

Perfect for building educational assistants, personal knowledge bases, and enterprise document systems with full local control and no cloud dependencies.

Check out the live project deployment: HuggingFace Space Deployment Link

🎯 Project Details:

Aim

The core objective of this project is to build a robust RAG system with modern components, a clean modular design, and proper error handling.

Methodology

  1. Build a responsive Streamlit UI that lets users upload documents, preview them to verify correctness, and chat with them.
  2. Use FastAPI to build a backend that handles file uploads, document processing, user authentication, and streaming LLM responses.
  3. Build a modular LLM system using LangChain components for chains, embeddings, retrievers, vector storage, history management, output parsers, and overall LLM orchestration.
  4. Integrate the Gemma-3 LLM via Ollama for local inference.
  5. Use Qdrant for efficient vector storage, similarity search, and user-specific document storage and retrieval.
  6. Use PostgreSQL for user management, authentication, and data control.
  7. Create a dynamic Docker setup that can run as either a development or production environment.
  8. Deploy the project on Hugging Face Spaces for easy access and demonstration.

Due to hosting limitations of Gemma3, the Hugging Face Space deployment uses Google Gemini-2.0-Flash-Lite as the LLM backend.

RAG Samples:

  • Q: Highest possible grade: RAG Sample Q1 RAG Sample A1
  • Q: Formatted Output: RAG Sample Q2

Performance Benchmarks

Tested on local Ollama with Gemma3:latest on Apple Silicon:

| Scenario | Latency | Notes |
| --- | --- | --- |
| First Query (Cache Miss) | ~70-90s | Full LLM generation, response cached |
| Repeated Query (Cache Hit) | <100ms | Retrieved from in-memory cache, ~700x faster |
| With Metrics (Disabled) | ~70-90s | Metrics computed in background (non-blocking) |
| With Metrics (Enabled) | ~75-95s | +5s for Answer Relevancy & Faithfulness checks |
| P50 Latency (Mixed Workload)* | ~30-40s | With a 30-40% cache hit rate typical for RAG |
| P99 Latency (Mixed Workload)* | ~50-60s | Includes occasional cold cache misses |

*Assuming typical RAG usage patterns with 30-40% query repetition rate.

Features

⚡ Performance Optimizations

  • Query Response Caching: Intelligent in-memory caching of RAG responses with TTL-based expiration

    • Cache hits return in <100ms (vs 70+ seconds for LLM generation)
    • ~700x performance improvement for repeated queries
    • Global cache key based on normalized questions (identical questions = cache hit regardless of user)
    • Configurable TTL and max cache size with automatic LRU eviction
  • Async Evaluation: Non-blocking background task evaluation for metrics computation

    • Metrics computed asynchronously without blocking the response stream
    • Adds under ~8 seconds of overhead at P99 even with evaluation enabled
    • Production-grade error handling and timeouts
  • Reference-Free Metrics: LLM-as-Judge evaluation using only the question, the retrieved context, and the generated answer (see the evaluation sketch after this list)

    • Answer Relevancy: Measures how well the answer addresses the question
    • Faithfulness: Ensures the answer is grounded in the retrieved documents
    • No ground truth required (perfect for open-ended knowledge bases)
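
A minimal sketch of how these two features can fit together (illustrative only, not the project's exact code): the answer is returned to the client first, and DeepEval's reference-free LLM-as-Judge metrics are computed afterwards in a FastAPI background task. The `generate_answer` helper below is a hypothetical stand-in for the RAG chain.

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()


async def generate_answer(question: str) -> tuple[str, list[str]]:
    # Hypothetical stand-in for the RAG chain: returns the answer and the
    # retrieved context chunks it was grounded in.
    return "generated answer", ["retrieved chunk"]


def evaluate_answer(question: str, answer: str, contexts: list[str]) -> None:
    # Runs after the response has been sent, so it never blocks the stream.
    test_case = LLMTestCase(
        input=question, actual_output=answer, retrieval_context=contexts
    )
    for metric in (AnswerRelevancyMetric(), FaithfulnessMetric()):
        metric.measure(test_case)
        print(f"{metric.__class__.__name__}: {metric.score:.2f}")


@app.post("/chat")
async def chat(question: str, background_tasks: BackgroundTasks) -> dict:
    answer, contexts = await generate_answer(question)
    background_tasks.add_task(evaluate_answer, question, answer, contexts)
    return {"answer": answer}
```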

User Authentication

+ Authenticate users against a `PostgreSQL` database using `bcrypt`-based password hashing with salts.
    [![User Registration Screenshot](./Docs/1_Auth.png)](./Docs/1_Auth.png)
+ Store user data securely and automatically clear stale session data.
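
A minimal sketch of the bcrypt hashing scheme (illustrative; the project's actual auth module may differ):

```python
import bcrypt

def hash_password(plain: str) -> bytes:
    # gensalt() embeds a per-user salt into the resulting hash
    return bcrypt.hashpw(plain.encode("utf-8"), bcrypt.gensalt())

def verify_password(plain: str, hashed: bytes) -> bool:
    return bcrypt.checkpw(plain.encode("utf-8"), hashed)
```
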
  • UI and User Controls:

    • Build a responsive UI with Streamlit. Provide a chat interface for users to ask questions about their documents, get file previews, and receive context-aware responses. User File Preview
    • User-uploaded files and their corresponding data are tracked in a PostgreSQL database.
    • Allow users to delete their uploaded documents and manage their session history. User Chat Screenshot
    • Note: File previews are cached for 10 minutes, so a preview may remain available for up to that long after deletion.
    • Stream real-time LLM responses via FastAPI SSE, along with the retrieved documents and metadata for verification. Source Documents Screenshot
    • The UI also supports thinking models, showing the LLM's thought process while it generates responses. Thinking Model Screenshot
  • Per-user document management:

    • Support multi-file embeddings per user, allowing users to upload multiple documents and retrieve relevant information based on their queries.
    • Some documents can also be added as public documents accessible to all users (e.g., shared rulebooks, manuals, or documentation).
  • Embeddings, Vector Storage and Retrieval:

    • Implement vector embeddings using LangChain components to convert documents into vector representations.
    • The open-source mxbai-embed-large model, a lightweight and efficient embedding model, is used to generate embeddings.
    • Use Qdrant for efficient vector storage and retrieval of user-specific + public documents with persistent storage.
    • Integrate similarity search and document retrieval with Gemma-based LLM responses.
  • FastAPI Backend:

    • Build a FastAPI backend to handle file uploads, document processing, user authentication, and streaming LLM responses.
    • Integrate with the LLM System module to handle LLM tasks.
    • Provide status updates to the UI for long-running tasks: Step By Step Updates Screenshot
    • Implement Server-Sent Events (SSE) for real-time streaming of LLM responses to the frontend, using NDJSON as the data-transfer format (see the streaming sketch after this list). SSE Streaming Screenshot
    • Provide the UI with retrieved documents and metadata for verification of responses.
  • LLM System:

    • Modular LLM system using LangChain components (see the RAG-chain sketch after this list) for:
      1. Document Ingestion: Load files and process them into document chunks.
      2. Vector Embedding: Convert documents into vector representations.
      3. History Summarization: Summarize the user's session history into a standalone query for searching the vector store and retrieving relevant documents.
      4. Document Retrieval: Fetch relevant documents based on the standalone query and the user's metadata filters.
      5. History Management: Maintain session history for context-aware interactions.
      6. Response Generation: Generate context-aware responses using the LLM.
      7. Tracing: Trace LLM interactions with LangSmith for debugging and monitoring.
      8. Models: Use Ollama to run the Gemma-3 LLM and mxbai embeddings locally for inference, ensuring low latency and privacy.
  • Dockerization:

    • Create a dynamic Docker setup that can run as either a development or production environment.
    • Use a single Dockerfile to run both the FastAPI and Streamlit servers in one container (mainly due to Hugging Face Spaces limitations).
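
A minimal sketch of the SSE/NDJSON streaming described under "FastAPI Backend" (assumed shape, not the project's exact endpoint): each chunk is a JSON object on its own line, which the frontend can parse incrementally.

```python
import json
from collections.abc import AsyncIterator

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def stream_answer(question: str) -> AsyncIterator[str]:
    # Hypothetical token stream; in practice these chunks come from the LLM chain.
    for token in ("Hello", ", ", "world"):
        yield json.dumps({"type": "token", "data": token}) + "\n"
    # Final event carries the retrieved source documents for verification.
    yield json.dumps({"type": "sources", "data": []}) + "\n"


@app.get("/chat/stream")
async def chat_stream(question: str) -> StreamingResponse:
    return StreamingResponse(
        stream_answer(question), media_type="application/x-ndjson"
    )
```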

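A minimal sketch of the LLM System flow described above (illustrative; names and parameters are assumptions, not the project's actual module layout), using langchain-ollama and langchain-qdrant:

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore

# Models served locally by Ollama
llm = ChatOllama(model="gemma3:latest", base_url="http://localhost:11434")
embeddings = OllamaEmbeddings(model="mxbai-embed-large")

# Qdrant-backed retriever over an existing collection
vector_store = QdrantVectorStore.from_existing_collection(
    collection_name="rag_documents",
    embedding=embeddings,
    url="http://localhost:6333",
)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context:\n\n{context}"),
    ("human", "{question}"),
])
chain = prompt | llm | StrOutputParser()


def answer(question: str) -> str:
    # Retrieval + context-aware generation (history summarization omitted)
    docs = retriever.invoke(question)
    context = "\n\n".join(d.page_content for d in docs)
    return chain.invoke({"context": context, "question": question})
```
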
🧑‍💻 Tech Stack

  • 🦜 LangChain - LLM orchestration and RAG chains
  • ⚡ FastAPI - High-performance async API backend
  • 👑 Streamlit - Interactive web UI for document management and chat
  • 🐋 Docker - Containerization for reproducible deployments
  • 🦙 Ollama - Local inference engine
    • Gemma-3 - Large language model
    • mxbai-embed-large - Embedding model for semantic search
  • 🗄️ PostgreSQL - User authentication, file metadata, and session management
  • 🎯 Qdrant - Vector database for semantic search and document embeddings
  • 🛠️ DeepEval - LLM-as-Judge evaluation metrics (Answer Relevancy, Faithfulness)
  • 💾 In-Memory Cache - Query response caching with TTL (700x faster on cache hits)
  • 🛡️ bcrypt - Secure password hashing for authentication
  • 📊 LangSmith - LLM tracing and monitoring

Others:

  • 🤗 Hugging Face Spaces:
    • Deploy the project in a Docker container using the Dockerfile.
  • :octocat: GitHub Actions and Branch Protection:
    • Auto-deploy the repository to Hugging Face Spaces.
    • Scan the code for secret leaks.
    • Fail the commit check if any secrets are detected.

🛠️ Installation

Choose your deployment approach:

Virtual Environment

Best for: Local development, debugging, and testing with hot-reload capability.

Prerequisites:

  • Python 3.12+
  • Ollama with gemma3:latest model
  • PostgreSQL (optional, can use memory backend)
  • Redis (optional, for distributed cache)
  • Qdrant (optional, can run via Docker or use in-memory mode)

Step 1: Clone & Setup

git clone --depth 1 https://github.com/Bbs1412/chat-with-your-data.git
cd chat-with-your-data

# Create virtual environment
python3 -m venv venv

# Activate environment
source venv/bin/activate  # Linux/Mac
# or
venv\Scripts\activate  # Windows

Step 2: Install Dependencies

pip install --upgrade pip
pip install -r requirements.txt

Installs:

  • FastAPI + Uvicorn (async API server)
  • Streamlit (interactive web UI)
  • LangChain + Ollama integration
  • Qdrant client (vector search)
  • PostgreSQL driver (psycopg2)
  • Redis client
  • DeepEval (LLM-as-Judge metrics)
  • All other dependencies

Step 3: Configure Environment (Optional)

Create .env file in project root:

# Ollama configuration
OLLAMA_BASE_URL=http://localhost:11434

# Qdrant vector database
QDRANT_URL=http://localhost:6333
QDRANT_COLLECTION_NAME=rag_documents

# PostgreSQL (if not using defaults)
# DATABASE_URL=postgresql://user:password@localhost/ragdb

# Redis for chat history backend
REDIS_URL=redis://localhost:6379/0
REDIS_HISTORY_TTL_SECONDS=0

# Optional: LangSmith tracing
LANGCHAIN_TRACING_V2=false
# LANGCHAIN_API_KEY=your_key_here

Step 4: Start Services (3 Terminal Windows)

Terminal 1 - Ollama:

ollama serve
# Or ensure model is pulled:
ollama pull gemma3:latest

Terminal 2 - FastAPI Backend:

cd chat-with-your-data
source venv/bin/activate
cd server
uvicorn server:app --reload --port 8000

Access at: http://localhost:8000 (Swagger UI at http://localhost:8000/docs)

Terminal 3 - Streamlit Frontend:

cd chat-with-your-data
source venv/bin/activate
streamlit run app.py --server.port 8501

Access at: http://localhost:8501

Optional: Start PostgreSQL & Redis (macOS with Homebrew)

# PostgreSQL
brew services start postgresql

# Redis
brew services start redis

# Check status
brew services list | grep -E "postgresql|redis"

Optional: Qdrant Vector Database

Option A: Docker (Recommended)

docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant:v1.12.5

Access UI at: http://localhost:6333/dashboard

Option B: In-Memory Mode

Edit server/llm_system/core/qdrant_database.py and use in-memory client for ephemeral storage.
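
For example, a minimal sketch of the switch (the exact variable names in qdrant_database.py may differ):

```python
from qdrant_client import QdrantClient

client = QdrantClient(":memory:")                      # ephemeral, lost on restart
# client = QdrantClient(url="http://localhost:6333")   # persistent Qdrant server
```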

Quick Test

# 1. Open browser β†’ http://localhost:8501
# 2. Register/Login with test credentials
# 3. Upload a PDF or text document
# 4. Embed the document
# 5. Ask questions about it
# 6. Check FastAPI logs for cache hits/misses (⚡ Cache HIT!)

Troubleshooting

| Issue | Solution |
| --- | --- |
| Port 8000 in use | `lsof -ti:8000 \| xargs kill -9` |
| Ollama not found | `brew install ollama && ollama pull gemma3:latest` |
| PostgreSQL error | Use the memory backend: `HISTORY_BACKEND=memory` in config |
| Redis not found | `brew install redis && brew services start redis` |
| Qdrant connection error | Run the Docker version or edit the config for in-memory mode |

πŸ‹ Docker

Best for: Production deployments, CI/CD pipelines, and consistent environments across machines.

Features:

  • Complete isolation with containerization
  • All services bundled (PostgreSQL, Qdrant, Redis)
  • Easy deployment to cloud platforms
  • No dependency conflicts on host machine
  • docker-compose for multi-container orchestration

Quick Start (Recommended)

Prerequisites:

  • Docker & Docker Compose installed
  • 4GB+ RAM available
  • Ports 8000, 8501, 6333, 5432, and 6379 available

Step 1: Build & Start All Services

cd chat-with-your-data

# Build fresh image without cache
docker-compose build --no-cache

# Start all containers (detached mode)
docker-compose up -d

# Verify all services are running
docker-compose ps

Step 2: Access Services

FastAPI Backend:     http://localhost:8000
FastAPI Swagger UI:  http://localhost:8000/docs
Streamlit Frontend:  http://localhost:8501
Qdrant UI:          http://localhost:6333/dashboard
Redis Commander:    http://localhost:8081
PostgreSQL:         localhost:5432

Step 3: View Logs

# View all logs
docker-compose logs -f

# View specific service
docker-compose logs -f app
docker-compose logs -f postgres
docker-compose logs -f qdrant

Step 4: Stop Services

# Stop all containers
docker-compose down

# Stop and remove volumes (complete reset)
docker-compose down -v

Architecture Overview

Services (docker-compose):

| Service | Image | Port | Purpose |
| --- | --- | --- | --- |
| app | python:3.12-slim | 8000, 8501 | FastAPI + Streamlit |
| postgres | postgres:16-alpine | 5432 | User data & file metadata |
| qdrant | qdrant/qdrant:v1.12.5 | 6333 | Vector database |
| redis | redis:8.0-alpine | 6379 | Cache & session history |
| redis-commander | redis-commander:latest | 8081 | Redis UI (optional) |

Environment Variables:

# Created automatically via docker-compose
OLLAMA_BASE_URL: http://ollama:11434
QDRANT_URL: http://qdrant:6333
DATABASE_URL: postgresql://raguser:ragpass@postgres/ragdb
REDIS_URL: redis://redis:6379/0

Configuration

docker-compose.yml includes:

  • PostgreSQL with health checks
  • Qdrant with persistent storage
  • Redis with persistence
  • Redis Commander for debugging
  • Custom network for service-to-service communication
  • Volume management for data persistence

Production Deployment

For cloud deployment (AWS, GCP, Azure):

  1. Build image for production:

    docker build -t your-registry/chat-with-your-data:latest .
    
  2. Push to registry:

    docker push your-registry/chat-with-your-data:latest
    
  3. Deploy with orchestration (Kubernetes, ECS, Cloud Run):

    # Update image reference in deployment configs
    kubectl apply -f k8s-deployment.yaml  # if using K8s
    
  4. Environment configuration:

    • Set OLLAMA_BASE_URL to your LLM provider
    • Configure QDRANT_URL for cloud Qdrant instance
    • Set database credentials in secrets management
    • Configure CORS for your domain

Development with Docker

If using Docker for development (with Ollama on host):

# Expose host Ollama to container
docker run \
  -p 8000:8000 \
  -p 8501:8501 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v $(pwd):/fastAPI \
  chat-with-your-data-app:latest

Troubleshooting

| Issue | Solution |
| --- | --- |
| Port already in use | `docker-compose down`, then retry or change ports in docker-compose.yml |
| Out of memory | Increase Docker memory: Preferences → Resources → Memory |
| PostgreSQL won't start | Delete the volume: `docker volume rm chat-with-your-data_postgres-data` |
| Qdrant connection error | Wait ~10s for startup: `docker-compose logs qdrant` |
| Container won't build | `docker-compose build --no-cache` to rebuild fresh |

Comparison: Virtual Environment vs Docker

| Aspect | Virtual Environment | Docker |
| --- | --- | --- |
| Setup Time | 5-10 min | 2-3 min |
| Dependencies | Manual (Ollama, PG, Redis) | Automatic |
| Hot Reload | ✅ Yes | ❌ No (manual rebuild) |
| Isolation | ❌ Shared system | ✅ Complete |
| Debugging | ✅ Direct Python logs | Logs via docker-compose |
| Production Ready | ⚠️ With care | ✅ Yes |
| CI/CD Ready | ❌ No | ✅ Yes |
| Disk Space | ~500 MB | ~2 GB |

Recommendation:

  • Development: Use Virtual Environment for faster iteration and hot-reload
  • Production: Use Docker for consistency, isolation, and easy deployment

Running the Pre-Built Production Image (Hugging Face Spaces build)

  1. Create the container. GOOGLE_API_KEY is a necessary environment variable, and only port 7860 is exposed:

    docker create --name chat-with-your-data-cont-prod \
        -e GOOGLE_API_KEY="" \
        -p 7860:7860 \
        BBS/chat-with-your-data:prod
    
  2. Start the Docker container:

    docker start -a chat-with-your-data-cont-prod
    
  3. You can now access the Project at http://localhost:7860

🛡️ Extra Measures

Reset Project

For Docker Deployment:

# Remove all containers and volumes (complete reset)
docker-compose down -v

# Rebuild and restart
docker-compose build --no-cache
docker-compose up -d

# View initialization logs
docker-compose logs -f app

For Virtual Environment:

# Remove cache files
find . -type d -name "__pycache__" -exec rm -r {} +  # Linux/Mac
# or
Get-ChildItem -Recurse -Directory -Filter "__pycache__" | Remove-Item -Recurse -Force  # Windows

# Clear user uploads
rm -rf ./user_uploads/

# Reset local PostgreSQL (if using Homebrew)
brew services stop postgresql
rm -rf /usr/local/var/postgres
brew services start postgresql

Persistent Storage with Docker

To persist user uploads and database across container restarts:

The docker-compose.yml already includes persistent volumes:

  • postgres-data: PostgreSQL database persistence
  • qdrant-data: Qdrant vector store persistence
  • redis-data: Redis cache persistence
  • ./user_uploads: Local directory mounted for user files

No additional setup needed - data persists automatically!

Using Host Ollama with Docker

If Ollama is running on your host machine:

macOS/Windows:

  • Works automatically via host.docker.internal:11434
  • Set in docker-compose: OLLAMA_BASE_URL=http://host.docker.internal:11434

Linux:

  • In docker-compose.yml, either use host networking or map the host gateway:
    network_mode: "host"
    # Or add to the service:
    extra_hosts:
      - "host.docker.internal:host-gateway"
    
  • With plain docker run, add: --add-host=host.docker.internal:host-gateway
    

Ollama Models:

  • To change LLM or Embedding model:

    • Go to the ./server/llm_system/config.py file.
    • It is the central configuration file for the project; any constant used across the project can be changed there.
    • Two different models are defined in the config, but the same model is currently used for both response generation and summarization; to change this, update the summarization model in server.py (≈ line 63).
  • To change inference device:

    • The LLM is configured to run on the GPU and the embedding model on the CPU.
    • A value of 0 means 100% CPU, -1 means 100% GPU, and any other number offloads that many of the model's layers to the GPU.
    • Delete this parameter if you are unsure of these values or your hardware's capabilities; Ollama dynamically offloads layers to the GPU based on available resources.
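
As a minimal sketch of that device split (assuming langchain-ollama's model classes and their num_gpu option; the values shown are illustrative, not the project's exact config):

```python
from langchain_ollama import ChatOllama, OllamaEmbeddings

# num_gpu is assumed to be supported by both classes here
llm = ChatOllama(model="gemma3:latest", num_gpu=-1)                   # LLM fully on GPU
embeddings = OllamaEmbeddings(model="mxbai-embed-large", num_gpu=0)   # embeddings on CPU
```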

If you are using Docker, make these changes in the ./docker/dev_* files instead.

To test some sub-components:

  • Run them as modules from the server directory so that relative imports resolve correctly:
    cd server
    python -m llm_system.utils.loader
    

Cache Architecture & Configuration

How the Cache Works

The system implements an intelligent query response caching layer that dramatically improves performance for repeated questions:

  1. Cache Key Generation: Uses SHA256 hash of the normalized question (lowercase, trimmed)

    • Global cache: Same question returns cached answer regardless of user
    • Normalized matching: Handles minor casing and whitespace variations
  2. Cache Operations:

    • On Request: Check if question exists in cache (if not expired)
    • Cache Hit: Return cached response in <100ms (no LLM call needed)
    • Cache Miss: Generate response via LLM, then cache for future requests
  3. TTL & Eviction:

    • Default TTL: 1 hour (configurable via ResponseCache(ttl_seconds=3600))
    • Max Size: 500 responses (configurable _CACHE_MAX_SIZE)
    • Eviction: LRU (least recently used) when cache exceeds max size
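
A minimal sketch of this behaviour (illustrative only, not the project's ResponseCache implementation):

```python
import hashlib
import time
from collections import OrderedDict


class ResponseCacheSketch:
    """SHA-256 key over the normalized question, TTL expiry, LRU eviction."""

    def __init__(self, ttl_seconds: int = 3600, max_size: int = 500):
        self.ttl = ttl_seconds
        self.max_size = max_size
        self._store: OrderedDict[str, tuple[float, str]] = OrderedDict()

    @staticmethod
    def _key(question: str) -> str:
        return hashlib.sha256(question.strip().lower().encode()).hexdigest()

    def get(self, question: str) -> str | None:
        key = self._key(question)
        entry = self._store.get(key)
        if entry is None:
            return None                      # cache miss
        created_at, answer = entry
        if time.time() - created_at > self.ttl:
            del self._store[key]             # expired entry
            return None
        self._store.move_to_end(key)         # mark as recently used
        return answer                        # cache hit

    def set(self, question: str, answer: str) -> None:
        key = self._key(question)
        self._store[key] = (time.time(), answer)
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used
```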

Configuration

Edit /server/llm_system/config.py or use environment variables:

# In server.py lifespan:
app.state.response_cache = ResponseCache(ttl_seconds=3600)  # 1 hour TTL

Monitoring Cache Performance

Check cache stats via debug endpoint:

curl http://localhost:8000/cache-debug | jq

Response shows current cache size and stored keys:

{
  "cache_size": 5,
  "cache_keys": ["61bf208516ecd284", "7a2c443f92b8e1ac"],
  "entries": [...]
}

Check logs for cache operations:

docker-compose exec app grep "Cache HIT\|CACHE HIT" /fastAPI/app.log

Disabling Cache (if needed)

For real-time information without caching:

# In server.py, comment out the cache check:
# if not dummy:
#     cached_answer = response_cache.get(chat_request.query, session_id)

🚀 Future Work

  • Add support for more file formats (DOCX, PPTX, Excel, etc.)
  • Implement web-based document loading (scrape websites on-the-fly)
  • Redis-backed distributed cache for multi-instance deployments
  • PostgreSQL migration for production-scale user management
  • Implement semantic caching (cache based on question embeddings, not exact match)
  • Add document versioning and change tracking
  • GPU optimization for faster embedding generation
  • LLM fine-tuning on domain-specific documents
  • Advanced analytics dashboard for cache hit rates and performance metrics
  • Multi-language support and cross-lingual search

🤝 Contributions

Any contributions or suggestions are welcome!

📜 License

Code-License

  • This project is licensed under the GNU General Public License v3.0
  • See the LICENSE file for details.
  • You can use the code with proper credits to the author.

📧 Contact