---
title: Chat with Your Data
emoji: π¦
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app.py
pinned: false
license: gpl-3.0
---
Chat with your data
A production-grade document intelligence system built with Google DeepMind's Gemma 3 LLM served locally via Ollama. This system enables users to upload documents (PDFs, TXT, Markdown, etc.) and chat with their content using natural language queries, all processed locally for privacy and complete control.
Designed with modularity, performance, and production standards in mind, the system handles end-to-end RAG workflows including file ingestion, vector embeddings, history summarization, semantic search, context-aware generation, and streaming responses. Features include multi-file support per user, persistent session history, document management, and intelligent query caching for 700x faster responses on repeated queries.
Perfect for building educational assistants, personal knowledge bases, and enterprise document systems with full local control and no cloud dependencies.
Check out the live project deployment:
Index:
- RAG with Gemma-3
- Project Details
- Tech Stack
- Installation
- Extra Measures
- Future Work
- Contributions
- License
- Contact
Project Details:
Aim
The core objective of this project is to build a robust RAG system with modern components, a clean modular design, and proper error handling.
Methodology
- Make a responsive UI in `Streamlit` allowing users to upload documents, get previews to ensure correctness, and interact with them.
- Use `FastAPI` to build a backend that handles file uploads, document processing, user authentication and streaming LLM responses.
- Code a modular `LLM System` using `LangChain` components for chains, embeddings, retrievers, vector storage, history management, output parsers and overall LLM orchestration.
- Integrate the locally hosted `Gemma-3` LLM using `Ollama` for local inference.
- Use `Qdrant` for efficient vector storage, similarity search and user-specific document storage and retrieval.
- Use `PostgreSQL` for user management, authentication, and data control.
- Create a dynamic `Docker` setup for easy deployment as either a development or production environment.
- Deploy the project on `Hugging Face Spaces` for easy access and demonstration.
Due to hosting limitations of Gemma-3, the Hugging Face Space deployment uses `Google Gemini-2.0-Flash-Lite` as the LLM backend.
RAG Samples:
Performance Benchmarks
Tested on local Ollama with `gemma3:latest` on Apple Silicon:
| Scenario | Latency | Notes |
|---|---|---|
| First Query (Cache Miss) | ~70-90s | Full LLM generation, response cached |
| Repeated Query (Cache Hit) | <100ms | Retrieved from in-memory cache, 700x faster |
| With Metrics (Disabled) | ~70-90s | Metrics computed in background (non-blocking) |
| With Metrics (Enabled) | ~75-95s | +5s for Answer Relevancy & Faithfulness checks |
| *P50 Latency (Mixed Workload) | ~30-40s | With 30-40% cache hit rate typical for RAG |
| *P99 Latency (Mixed Workload) | ~50-60s | Includes occasional cold cache misses |
*Assuming typical RAG usage patterns with 30-40% query repetition rate.
Features
Performance Optimizations
Query Response Caching: Intelligent in-memory caching of RAG responses with TTL-based expiration
- Cache hits return in <100ms (vs 70+ seconds for LLM generation)
- ~700x performance improvement for repeated queries
- Global cache key based on normalized questions (identical questions = cache hit regardless of user)
- Configurable TTL and max cache size with automatic LRU eviction
Async Evaluation: Non-blocking background task evaluation for metrics computation
- Metrics computed asynchronously without blocking the response stream
- <8 second P99 latency even with evaluation enabled
- Production-grade error handling and timeouts
Reference-Free Metrics: LLM-as-Judge evaluation using only the question and generated answer
- Answer Relevancy: Measures how well the answer addresses the question
- Faithfulness: Ensures the answer is grounded in the retrieved documents
- No ground truth required (perfect for open-ended knowledge bases)
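The snippet below is a minimal sketch of how such reference-free checks can be computed with DeepEval; the helper function and threshold values are illustrative, not this project's actual evaluation code.

```python
# Minimal sketch (assumption): reference-free scoring with DeepEval.
# The judge LLM is whatever DeepEval is configured with (e.g. via env vars).
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase


def evaluate_answer(question: str, answer: str, retrieved_chunks: list[str]) -> dict:
    """Score a generated answer without any ground-truth reference."""
    test_case = LLMTestCase(
        input=question,
        actual_output=answer,
        retrieval_context=retrieved_chunks,  # needed for the Faithfulness check
    )

    relevancy = AnswerRelevancyMetric(threshold=0.7)
    faithfulness = FaithfulnessMetric(threshold=0.7)

    relevancy.measure(test_case)
    faithfulness.measure(test_case)

    return {
        "answer_relevancy": relevancy.score,
        "faithfulness": faithfulness.score,
    }
```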
User Authentication
- Authenticate users using a `PostgreSQL` database and `bcrypt`-based password hashing with salt (a minimal hashing sketch follows below).

- Store user data securely and automatically clear stale session data.
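A minimal sketch of the hashing/verification flow with `bcrypt` (function names here are illustrative, not the project's actual auth module):

```python
# Minimal sketch (assumption): bcrypt hashing as it could back the PostgreSQL auth.
import bcrypt


def hash_password(plain: str) -> bytes:
    # gensalt() embeds a per-user salt into the resulting hash
    return bcrypt.hashpw(plain.encode("utf-8"), bcrypt.gensalt())


def verify_password(plain: str, stored_hash: bytes) -> bool:
    # checkpw re-derives the hash using the salt stored inside stored_hash
    return bcrypt.checkpw(plain.encode("utf-8"), stored_hash)
```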
UI and User Controls:
- Build a responsive UI as a `Streamlit` app. Provide a chat interface for users to ask questions about their documents, get file previews and receive context-aware responses.
- User-uploaded files and corresponding data are tracked in a PostgreSQL database.
- Allow users to delete their uploaded documents and manage their session history.

- Note: File previews are cached for 10 minutes, so even after deletion, a file preview might remain available for that duration.
- Also works with FastAPI SSE to show real-time responses from the LLM along with retrieved documents and metadata for verification.

- The UI also supports thinking models, showing the LLM's thought process while generating responses.

User-wise document management:
- Support multi-file embeddings per user, allowing users to upload multiple documents and retrieve relevant information based on their queries.
- Some documents can also be added as public documents, which can be accessed by all users (like shared rulebooks, manuals or documentation).
Embeddings, Vector Storage and Retrieval:
- Implement vector embeddings using `LangChain` components to convert documents into vector representations.
- The open-source `mxbai-embed-large` model is used for generating embeddings; it is a lightweight and efficient embedding model.
- Use `Qdrant` for efficient vector storage and retrieval of user-specific + public documents with persistent storage.
- Integrate similarity search and document retrieval with Gemma-based LLM responses.
FastAPI Backend:
- Build a FastAPI backend to handle file uploads, document processing, user authentication, and streaming LLM responses.
- Integrate with the `LLM System` module to handle LLM tasks.
- Provide status updates to the UI for long-running tasks:

- Implement Server-Sent Events (`SSE`) for real-time streaming of LLM responses to the frontend, using NDJSON for data transfer (see the sketch after this list).
- Provide UI with retrieved documents and metadata for verification of responses.
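A minimal sketch of NDJSON streaming with FastAPI's `StreamingResponse`; the endpoint path, request model and the two helper functions are illustrative placeholders, not this project's actual API:

```python
# Minimal sketch (assumptions): endpoint path, payload shape and helpers are illustrative.
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()


class ChatRequest(BaseModel):
    query: str
    session_id: str


async def retrieve_documents(query: str, session_id: str) -> list[dict]:
    # Placeholder retrieval step; the real system queries Qdrant here.
    return [{"source": "example.pdf", "page": 1}]


async def generate_answer(query: str, docs: list[dict]):
    # Placeholder token stream; the real system streams from the LLM here.
    for token in ["Hello", ", ", "world"]:
        yield token


@app.post("/chat/stream")
async def chat_stream(req: ChatRequest):
    async def ndjson_stream():
        # First frame: retrieved documents + metadata so the UI can verify answers
        docs = await retrieve_documents(req.query, req.session_id)
        yield json.dumps({"type": "context", "documents": docs}) + "\n"

        # Then stream LLM tokens as they are generated
        async for token in generate_answer(req.query, docs):
            yield json.dumps({"type": "token", "content": token}) + "\n"

        yield json.dumps({"type": "done"}) + "\n"

    return StreamingResponse(ndjson_stream(), media_type="application/x-ndjson")
```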
LLM System:
- Modular `LLM System` using `LangChain` components (see the chain-wiring sketch after this list) for:
  - Document Ingestion: Load files and process them into document chunks.
  - Vector Embedding: Convert documents into vector representations.
  - History Summarization: Summarize user session history for querying vector embeddings and retrieving relevant documents.
  - Document Retrieval: Fetch relevant documents based on the standalone query and the user's metadata filters.
  - History Management: Maintain session history for context-aware interactions.
  - Response Generation: Generate context-aware responses using the LLM.
- Tracing: Enable tracing of LLM interactions using `LangSmith` for debugging and monitoring.
- Models: Use `Ollama` to run the Gemma-3 LLM and mxbai embeddings locally for inference, ensuring low latency and privacy.
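A minimal sketch of how such a history-aware RAG chain can be wired from LangChain components; model names, collection name and prompts are illustrative and may differ from the actual `LLM System` module:

```python
# Minimal sketch (assumptions): model names, collection name and prompts are illustrative.
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

llm = ChatOllama(model="gemma3:latest")
embeddings = OllamaEmbeddings(model="mxbai-embed-large")

vector_store = QdrantVectorStore(
    client=QdrantClient(url="http://localhost:6333"),
    collection_name="rag_documents",
    embedding=embeddings,
)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})

# 1) Rewrite the user question into a standalone query using chat history
condense_prompt = ChatPromptTemplate.from_messages([
    ("system", "Rewrite the user question as a standalone query given the chat history."),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])
history_aware_retriever = create_history_aware_retriever(llm, retriever, condense_prompt)

# 2) Answer using the retrieved context
answer_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context:\n\n{context}"),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])
combine_docs_chain = create_stuff_documents_chain(llm, answer_prompt)
rag_chain = create_retrieval_chain(history_aware_retriever, combine_docs_chain)

# result = rag_chain.invoke({"input": "What does the document say?", "chat_history": []})
```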
Dockerization:
- Create a dynamic `Docker` setup for easy deployment as either a development or production environment.
- Use a `Dockerfile` to manage both the FastAPI and Streamlit servers in a single container (mainly due to Hugging Face Spaces limitations).
Tech Stack
- LangChain - LLM orchestration and RAG chains
- FastAPI - High-performance async API backend
- Streamlit - Interactive web UI for document management and chat
- Docker - Containerization for reproducible deployments
- Ollama - Local inference engine
  - Gemma-3 - Large language model
  - mxbai-embed-large - Embedding model for semantic search
- PostgreSQL - User authentication, file metadata, and session management
- Qdrant - Vector database for semantic search and document embeddings
- DeepEval - LLM-as-Judge evaluation metrics (Answer Relevancy, Faithfulness)
- In-Memory Cache - Query response caching with TTL (700x faster on cache hits)
- bcrypt - Secure password hashing for authentication
- LangSmith - LLM tracing and monitoring
Others:
- Hugging Face Spaces:
  - Deploy the project in a Docker container using the Dockerfile.
- :octocat: GitHub Actions and Branch Protection:
  - Process the repository for auto-deployment to Hugging Face Spaces.
  - Check for any secret leaks in code.
  - Fail the commit on any secret leaks.
Installation
Choose your deployment approach:
- Virtual Environment (Development): Local development with hot-reload
- Docker (Production): Containerized setup with full isolation
Virtual Environment
Best for: Local development, debugging, and testing with hot-reload capability.
Prerequisites:
- Python 3.12+
- Ollama with the `gemma3:latest` model
- PostgreSQL (optional, can use memory backend)
- Redis (optional, for distributed cache)
- Qdrant (optional, can run via Docker or use in-memory mode)
Step 1: Clone & Setup
git clone --depth 1 https://github.com/Bbs1412/chat-with-your-data.git
cd chat-with-your-data
# Create virtual environment
python3 -m venv venv
# Activate environment
source venv/bin/activate # Linux/Mac
# or
venv\Scripts\activate # Windows
Step 2: Install Dependencies
pip install --upgrade pip
pip install -r requirements.txt
Installs:
- FastAPI + Uvicorn (async API server)
- Streamlit (interactive web UI)
- LangChain + Ollama integration
- Qdrant client (vector search)
- PostgreSQL driver (psycopg2)
- Redis client
- DeepEval (LLM-as-Judge metrics)
- All other dependencies
Step 3: Configure Environment (Optional)
Create a `.env` file in the project root:
# Ollama configuration
OLLAMA_BASE_URL=http://localhost:11434
# Qdrant vector database
QDRANT_URL=http://localhost:6333
QDRANT_COLLECTION_NAME=rag_documents
# PostgreSQL (if not using defaults)
# DATABASE_URL=postgresql://user:password@localhost/ragdb
# Redis for chat history backend
REDIS_URL=redis://localhost:6379/0
REDIS_HISTORY_TTL_SECONDS=0
# Optional: LangSmith tracing
LANGCHAIN_TRACING_V2=false
# LANGCHAIN_API_KEY=your_key_here
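A minimal sketch of how these variables could be read (assuming `python-dotenv`; variable names follow the example above, and the actual `config.py` may differ):

```python
# Minimal sketch (assumption): loading the .env values above with python-dotenv.
import os

from dotenv import load_dotenv

load_dotenv()  # picks up the .env file from the project root

OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
QDRANT_COLLECTION_NAME = os.getenv("QDRANT_COLLECTION_NAME", "rag_documents")
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/0")
REDIS_HISTORY_TTL_SECONDS = int(os.getenv("REDIS_HISTORY_TTL_SECONDS", "0"))
LANGCHAIN_TRACING_V2 = os.getenv("LANGCHAIN_TRACING_V2", "false").lower() == "true"
```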
Step 4: Start Services (3 Terminal Windows)
Terminal 1 - Ollama:
ollama serve
# Or ensure model is pulled:
ollama pull gemma3:latest
Terminal 2 - FastAPI Backend:
cd chat-with-your-data
source venv/bin/activate
cd server
uvicorn server:app --reload --port 8000
Access at:
- API: http://localhost:8000
- Swagger UI: http://localhost:8000/docs
- Cache Debug: http://localhost:8000/cache-debug
Terminal 3 - Streamlit Frontend:
cd chat-with-your-data
source venv/bin/activate
streamlit run app.py --server.port 8501
Access at: http://localhost:8501
Optional: Start PostgreSQL & Redis (macOS with Homebrew)
# PostgreSQL
brew services start postgresql
# Redis
brew services start redis
# Check status
brew services list | grep -E "postgresql|redis"
Optional: Qdrant Vector Database
Option A: Docker (Recommended)
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant:v1.12.5
Access UI at: http://localhost:6333/dashboard
Option B: In-Memory Mode
Edit `server/llm_system/core/qdrant_database.py` and use the in-memory client for ephemeral storage, as sketched below.
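A minimal sketch of what the in-memory variant could look like (the collection name and the 1024-dim vector size for `mxbai-embed-large` are assumptions here):

```python
# Minimal sketch (assumption): ephemeral Qdrant client for local experiments.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(":memory:")  # lives only for the process lifetime

client.create_collection(
    collection_name="rag_documents",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)
```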
Quick Test
# 1. Open browser β http://localhost:8501
# 2. Register/Login with test credentials
# 3. Upload a PDF or text document
# 4. Embed the document
# 5. Ask questions about it
# 6. Check FastAPI logs for cache hits/misses (Cache HIT!)
Troubleshooting
| Issue | Solution |
|---|---|
| Port 8000 in use | `lsof -ti:8000 \| xargs kill -9` |
| Ollama not found | brew install ollama && ollama pull gemma3:latest |
| PostgreSQL error | Use memory backend: HISTORY_BACKEND=memory in config |
| Redis not found | brew install redis && brew services start redis |
| Qdrant connection error | Run Docker version or edit config for in-memory mode |
Docker
Best for: Production deployments, CI/CD pipelines, and consistent environments across machines.
Features:
- Complete isolation with containerization
- All services bundled (PostgreSQL, Qdrant, Redis)
- Easy deployment to cloud platforms
- No dependency conflicts on host machine
- docker-compose for multi-container orchestration
Quick Start (Recommended)
Prerequisites:
- Docker & Docker Compose installed
- 4GB+ RAM available
- Ports 8000, 8501, 6333, 5432, 6379 available
Step 1: Build & Start All Services
cd chat-with-your-data
# Build fresh image without cache
docker-compose build --no-cache
# Start all containers (detached mode)
docker-compose up -d
# Verify all services are running
docker-compose ps
Step 2: Access Services
FastAPI Backend: http://localhost:8000
FastAPI Swagger UI: http://localhost:8000/docs
Streamlit Frontend: http://localhost:8501
Qdrant UI: http://localhost:6333/dashboard
Redis Commander: http://localhost:8081
PostgreSQL: localhost:5432
Step 3: View Logs
# View all logs
docker-compose logs -f
# View specific service
docker-compose logs -f app
docker-compose logs -f postgres
docker-compose logs -f qdrant
Step 4: Stop Services
# Stop all containers
docker-compose down
# Stop and remove volumes (complete reset)
docker-compose down -v
Architecture Overview
Services (docker-compose):
| Service | Image | Port | Purpose |
|---|---|---|---|
| app | python:3.12-slim | 8000, 8501 | FastAPI + Streamlit |
| postgres | postgres:16-alpine | 5432 | User data & file metadata |
| qdrant | qdrant/qdrant:v1.12.5 | 6333 | Vector database |
| redis | redis:8.0-alpine | 6379 | Cache & session history |
| redis-commander | redis-commander:latest | 8081 | Redis UI (optional) |
Environment Variables:
# Created automatically via docker-compose
OLLAMA_BASE_URL: http://ollama:11434
QDRANT_URL: http://qdrant:6333
DATABASE_URL: postgresql://raguser:ragpass@postgres/ragdb
REDIS_URL: redis://redis:6379/0
Configuration
docker-compose.yml includes:
- PostgreSQL with health checks
- Qdrant with persistent storage
- Redis with persistence
- Redis Commander for debugging
- Custom network for service-to-service communication
- Volume management for data persistence
Production Deployment
For cloud deployment (AWS, GCP, Azure):
1. Build the image for production:
   `docker build -t your-registry/chat-with-your-data:latest .`
2. Push it to your registry:
   `docker push your-registry/chat-with-your-data:latest`
3. Deploy with orchestration (Kubernetes, ECS, Cloud Run), updating the image reference in your deployment configs first:
   `kubectl apply -f k8s-deployment.yaml  # if using K8s`
4. Configure the environment:
   - Set `OLLAMA_BASE_URL` to your LLM provider
   - Configure `QDRANT_URL` for your cloud Qdrant instance
   - Set database credentials in secrets management
   - Configure CORS for your domain
Development with Docker
If using Docker for development (with Ollama on host):
# Expose host Ollama to container
docker run \
-p 8000:8000 \
-p 8501:8501 \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
-v $(pwd):/fastAPI \
chat-with-your-data-app:latest
Troubleshooting
| Issue | Solution |
|---|---|
| Port already in use | docker-compose down then retry, or change ports in docker-compose.yml |
| Out of memory | Increase Docker memory: Preferences → Resources → Memory |
| PostgreSQL won't start | Delete volume: docker volume rm chat-with-your-data_postgres-data |
| Qdrant connection error | Wait 10s for startup: docker-compose logs qdrant |
| Container won't build | docker-compose build --no-cache to rebuild fresh |
Comparison: Virtual Environment vs Docker
| Aspect | Virtual Environment | Docker |
|---|---|---|
| Setup Time | 5-10 min | 2-3 min |
| Dependencies | Manual (Ollama, PG, Redis) | Automatic |
| Hot Reload | ✅ Yes | ❌ No (manual rebuild) |
| Isolation | ❌ Shared system | ✅ Complete |
| Debugging | ✅ Direct Python logs | Logs via docker-compose |
| Production Ready | ⚠️ With care | ✅ Yes |
| CI/CD Ready | ❌ | ✅ Yes |
| Disk Space | ~500MB | ~2GB |
Recommendation:
- Development: Use Virtual Environment for faster iteration and hot-reload
- Production: Use Docker for consistency, isolation, and easy deployment
# This is necessary env variable
-e GOOGLE_API_KEY=""
# Port mapping, only 7860 is exposed
-p 7860:7860
BBS/chat-with-your-data:prod
Start the Docker container:
docker start -a chat-with-your-data-cont-prod

You can now access the project at http://localhost:7860
Extra Measures
Reset Project
For Docker Deployment:
# Remove all containers and volumes (complete reset)
docker-compose down -v
# Rebuild and restart
docker-compose build --no-cache
docker-compose up -d
# View initialization logs
docker-compose logs -f app
For Virtual Environment:
# Remove cache files
find . -type d -name "__pycache__" -exec rm -r {} + # Linux/Mac
# or
Get-ChildItem -Recurse -Directory -Filter "__pycache__" | Remove-Item -Recurse -Force # Windows
# Clear user uploads
rm -rf ./user_uploads/
# Reset local PostgreSQL (if using Homebrew)
brew services stop postgresql
rm -rf /usr/local/var/postgres
brew services start postgresql
Persistent Storage with Docker
To persist user uploads and database across container restarts:
The docker-compose.yml already includes persistent volumes:
- `postgres-data`: PostgreSQL database persistence
- `qdrant-data`: Qdrant vector store persistence
- `redis-data`: Redis cache persistence
- `./user_uploads`: Local directory mounted for user files
No additional setup needed - data persists automatically!
Using Host Ollama with Docker
If Ollama is running on your host machine:
macOS/Windows:
- Works automatically via `host.docker.internal:11434`
- Set in docker-compose: `OLLAMA_BASE_URL=http://host.docker.internal:11434`
Linux:
- Add the host-gateway mapping: `--add-host=host.docker.internal:host-gateway`
- Or use `network_mode: "host"`
- Or add to the service in docker-compose: `extra_hosts: ["host.docker.internal:host-gateway"]`
Ollama Models:
To change the LLM or embedding model:
- Go to the `./server/llm_system/config.py` file. It is the central configuration file for the project.
- Any constant can be changed there to be used in the project.
- There are two different models saved in the config, but the same model is currently used for both response generation and summarization; if you want to change that, update the summarization model in `server.py` (around line 63).
To change the inference device:
- The LLM model is configured to run on the GPU and the embedding model on the CPU.
- If you want to use the GPU for embeddings too, change the `num_gpu` parameter in `./server/llm_system/core/database.py` (around line 58), as sketched after this list.
- `0` means 100% CPU, `-1` means 100% GPU, and any other number specifies how many of the model's layers are offloaded to the GPU.
- Delete this parameter if you are unsure of these values or your hardware capabilities; Ollama dynamically offloads layers to the GPU based on available resources.
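A minimal sketch of the idea, assuming the `langchain-ollama` integration in use exposes Ollama's `num_gpu` option:

```python
# Minimal sketch (assumption): num_gpu controls layer offloading; your version of
# the langchain-ollama integration must expose this Ollama option.
from langchain_ollama import ChatOllama, OllamaEmbeddings

llm = ChatOllama(model="gemma3:latest", num_gpu=-1)                   # -1: offload all layers to GPU
embeddings = OllamaEmbeddings(model="mxbai-embed-large", num_gpu=0)   # 0: run fully on CPU

# Omitting num_gpu entirely lets Ollama decide based on available VRAM.
```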
If you are using Docker, make sure to make these changes in the `./docker/dev_*` files.
To test some sub-components, run them as modules from the `server` directory; this ensures that relative imports work correctly in the project:
- `cd server`
- `python -m llm_system.utils.loader`
Cache Architecture & Configuration
How the Cache Works
The system implements an intelligent query response caching layer that dramatically improves performance for repeated questions:
Cache Key Generation: Uses SHA256 hash of the normalized question (lowercase, trimmed)
- Global cache: Same question returns cached answer regardless of user
- Normalized matching: Handles minor whitespace/punctuation variations
Cache Operations:
- On Request: Check if question exists in cache (if not expired)
- Cache Hit: Return cached response in <100ms (no LLM call needed)
- Cache Miss: Generate response via LLM, then cache for future requests
TTL & Eviction:
- Default TTL: 1 hour (configurable via `ResponseCache(ttl_seconds=3600)`)
- Max Size: 500 responses (configurable via `_CACHE_MAX_SIZE`)
- Eviction: LRU (least recently used) when the cache exceeds max size
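Putting the points above together, a minimal sketch of such a cache (the actual `ResponseCache` in this repo may differ in details):

```python
# Minimal sketch (assumption): in-memory cache with SHA256-normalized keys,
# TTL expiry and LRU eviction, as described above.
import hashlib
import time
from collections import OrderedDict


class ResponseCache:
    def __init__(self, ttl_seconds: int = 3600, max_size: int = 500):
        self.ttl_seconds = ttl_seconds
        self.max_size = max_size
        self._store: OrderedDict[str, tuple[float, str]] = OrderedDict()

    @staticmethod
    def _key(question: str) -> str:
        # Normalize (lowercase, trimmed) so minor variations still hit the cache
        return hashlib.sha256(question.strip().lower().encode("utf-8")).hexdigest()

    def get(self, question: str) -> str | None:
        key = self._key(question)
        entry = self._store.get(key)
        if entry is None:
            return None
        created_at, answer = entry
        if time.time() - created_at > self.ttl_seconds:
            del self._store[key]          # expired
            return None
        self._store.move_to_end(key)      # mark as recently used
        return answer

    def set(self, question: str, answer: str) -> None:
        key = self._key(question)
        self._store[key] = (time.time(), answer)
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used
```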
Configuration
Edit /server/llm_system/config.py or use environment variables:
# In server.py lifespan:
app.state.response_cache = ResponseCache(ttl_seconds=3600) # 1 hour TTL
Monitoring Cache Performance
Check cache stats via debug endpoint:
curl http://localhost:8000/cache-debug | jq
Response shows current cache size and stored keys:
{
"cache_size": 5,
"cache_keys": ["61bf208516ecd284", "7a2c443f92b8e1ac"],
"entries": [...]
}
Check logs for cache operations:
docker-compose exec app grep "Cache HIT\|CACHE HIT" /fastAPI/app.log
Disabling Cache (if needed)
For real-time information without caching:
# In server.py, comment out the cache check:
# if not dummy:
# cached_answer = response_cache.get(chat_request.query, session_id)
Future Work
- Add support for more file formats (DOCX, PPTX, Excel, etc.)
- Implement web-based document loading (scrape websites on-the-fly)
- Redis-backed distributed cache for multi-instance deployments
- PostgreSQL migration for production-scale user management
- Implement semantic caching (cache based on question embeddings, not exact match)
- Add document versioning and change tracking
- GPU optimization for faster embedding generation
- LLM fine-tuning on domain-specific documents
- Advanced analytics dashboard for cache hit rates and performance metrics
- Multi-language support and cross-lingual search
Contributions
Any contributions or suggestions are welcome!
License
- This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.
- You can use the code with proper credits to the author.
Contact
- Forked from - Bbs1412/chat-with-your-data
- Original Author - Bhushan Songire
- Enhanced & Deployed by - Sanchit Shaleen

