# MiniEmbed: Tiny, Powerful Embedding Models from Scratch
MiniEmbed is an ultra-compact text embedding model (Bi-Encoder) built entirely from scratch -- no HuggingFace Transformers, no pre-trained weights, just pure PyTorch.
**GitHub:** [github.com/bhandarisuraz/miniembed](https://github.com/bhandarisuraz/miniembed) (full repo with examples, tests, an interactive demo, and documentation)
| Spec | Value |
|---|---|
| Parameters | ~10.8M |
| Model Size | ~42 MB |
| Embedding Dim | 256 |
| Vocab Size | 30,000 |
| Max Seq Length | 128 tokens |
| Architecture | 4-layer Transformer Encoder |
| Pooling | Mean Pooling + L2 Normalization |
| Training Loss | MNRL (Multiple Negatives Ranking Loss) |
| Training Data | ~3.8M pairs (NQ, GooAQ, MSMARCO, WDC, ECInstruct) |
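As a rough sanity check on the parameter count (a back-of-the-envelope sketch that ignores biases and LayerNorm weights), the token embedding table and the four encoder layers account for essentially all of the ~10.8M parameters:

```python
# Back-of-the-envelope parameter count (biases and LayerNorm omitted).
vocab, d_model, d_ff, layers = 30_000, 256, 1024, 4
emb = vocab * d_model                 # token embedding table: ~7.68M
attn = 4 * d_model * d_model          # Q, K, V, and output projections
ffn = 2 * d_model * d_ff              # two feed-forward matrices
per_layer = attn + ffn                # ~0.79M per encoder layer
print(f"{(emb + layers * per_layer) / 1e6:.1f}M")  # -> 10.8M
```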
## Quick Start
```bash
pip install torch numpy scikit-learn huggingface_hub
```
```python
from huggingface_hub import snapshot_download

# Download the model files (one-time)
model_dir = snapshot_download("surazbhandari/miniembed")

# Make the bundled src/ package importable
import sys
sys.path.insert(0, model_dir)

from src.inference import EmbeddingInference

# Load the model
model = EmbeddingInference.from_pretrained(model_dir)

# 1. Similarity
score = model.similarity("Machine learning is great", "AI is wonderful")
print(f"Similarity: {score:.4f}")  # 0.4287

# 2. Embeddings
embeddings = model.encode(["Machine learning is great", "AI is wonderful"])

# 3. Manual cosine similarity
# Since embeddings are L2-normalized, the dot product is the cosine similarity
import numpy as np
score = np.dot(embeddings[0], embeddings[1])
print(f"Similarity: {score:.4f}")

# 4. Semantic search
docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
results = model.search("deep learning frameworks", docs, top_k=2)
for r in results:
    print(f"  [{r['score']:.3f}] {r['text']}")
# [0.498] Neural networks learn patterns
# [0.413] Python is great for AI

# 5. Clustering
result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
for cluster_id, texts in result['texts_by_cluster'].items():
    print(f"Cluster {cluster_id + 1}: {texts}")
# Cluster 1: ['Pizza is food']
# Cluster 2: ['ML is cool', 'AI rocks']
```
## Also Available via GitHub
```bash
git clone https://github.com/bhandarisuraz/miniembed.git
cd miniembed
pip install -r requirements.txt

python -c "
from src.inference import EmbeddingInference
model = EmbeddingInference.from_pretrained('models/mini')
print(model.similarity('hello world', 'hi there'))
"
```
## Capabilities
- Semantic Search -- Find meaning-based matches, not keyword overlap.
- Re-Ranking -- Sort candidates by true semantic relevance (see the sketch after this list).
- Clustering -- Group texts into logical categories automatically.
- Product Matching -- Match items across platforms with messy titles.
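The Quick Start shows no dedicated re-ranking call, but because `encode()` returns L2-normalized vectors, a minimal re-ranker is just a matrix product. A sketch, reusing the `model` object loaded above (the query and candidate strings here are made up for illustration):

```python
import numpy as np

query = "lightweight embedding models"
candidates = ["MiniEmbed is a tiny bi-encoder",
              "I love pizza",
              "Compact text encoders punch above their weight"]

q_emb = model.encode([query])[0]        # (256,)
c_emb = model.encode(candidates)        # (3, 256)
scores = c_emb @ q_emb                  # cosine similarities (unit vectors)
for i in np.argsort(scores)[::-1]:      # highest similarity first
    print(f"{scores[i]:.3f}  {candidates[i]}")
```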
## Architecture
Custom 4-layer Transformer encoder built from first principles:
- Token Embedding (30K vocab) + Sinusoidal Positional Encoding
- 4x Pre-LayerNorm Transformer Encoder Layers
  - Multi-Head Self-Attention (4 heads, d_k=64)
  - Position-wise Feed-Forward (GELU activation, d_ff=1024)
- Mean Pooling over non-padded tokens
- L2 Normalization (unit hypersphere projection)
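The positional encoding and the pooling head are small enough to sketch inline. Names and shapes below are illustrative; the real implementation lives in `src/model.py`:

```python
import math
import torch
import torch.nn.functional as F

def sinusoidal_positions(seq_len: int, d_model: int = 256) -> torch.Tensor:
    """Fixed sinusoidal positional encodings, shape (seq_len, d_model)."""
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def pool_and_normalize(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool non-padded tokens, then project onto the unit hypersphere.

    hidden: (batch, seq_len, d_model) encoder output
    mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = mask.unsqueeze(-1).float()
    summed = (hidden * mask).sum(dim=1)       # sum over non-padded tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return F.normalize(summed / counts, p=2, dim=-1)
```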
## Training
Trained on ~3.8 million text pairs from public datasets:
| Dataset | Type |
|---|---|
| Natural Questions (NQ) | Q&A / General |
| GooAQ | Knowledge Search |
| WDC Product Matching | E-commerce |
| ECInstruct | E-commerce Tasks |
| MS MARCO | Web Search |
Training details:
- Training time: ~49 hours
- Final loss: 0.0748
- Optimizer: AdamW
- Batch size: 256
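MNRL treats every other positive in a batch as a negative for a given query, so a batch of 256 pairs yields 255 in-batch negatives per query. A minimal sketch over L2-normalized embeddings (the `scale` factor is an assumption -- a typical value for similarity logits, not a hyperparameter documented here):

```python
import torch
import torch.nn.functional as F

def mnrl(query_emb: torch.Tensor, pos_emb: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """Multiple Negatives Ranking Loss over L2-normalized (batch, dim) embeddings."""
    scores = query_emb @ pos_emb.T * scale                        # (batch, batch) cosine logits
    labels = torch.arange(scores.size(0), device=scores.device)  # diagonal = true pairs
    return F.cross_entropy(scores, labels)
```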
## Files
```
surazbhandari/miniembed
|-- README.md            # This model card
|-- config.json          # Architecture config
|-- model.safetensors    # Pre-trained weights (Safe & Fast)
|-- model.pt             # Pre-trained weights (Legacy PyTorch)
|-- tokenizer.json       # 30K word-level vocabulary
|-- training_info.json   # Training metadata
|-- src/
    |-- __init__.py
    |-- model.py         # Full architecture code
    |-- tokenizer.py     # Tokenizer implementation
    |-- inference.py     # High-level API (supports HF auto-download)
```
## Limitations
- Word-level tokenizer (no subword/BPE) -- unknown words map to [UNK] (see the sketch after this list)
- 128 token max sequence length
- Trained primarily on English text
- Best suited for short-form text (queries, product titles, sentences)
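To make the first limitation concrete, here is a hypothetical sketch of word-level tokenization with an `[UNK]` fallback (the toy vocabulary is made up; the real tokenizer lives in `src/tokenizer.py`):

```python
# Toy word-level tokenizer: any word outside the vocabulary becomes [UNK],
# so rare words and typos lose their identity.
vocab = {"[PAD]": 0, "[UNK]": 1, "machine": 2, "learning": 3}

def tokenize(text: str) -> list[int]:
    return [vocab.get(w, vocab["[UNK]"]) for w in text.lower().split()]

print(tokenize("machine learning rocks"))  # [2, 3, 1] -- "rocks" maps to [UNK]
```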
## Citation
```bibtex
@software{Bhandari_MiniEmbed_2026,
  author  = {Bhandari, Suraj},
  title   = {{MiniEmbed: Tiny, Powerful Embedding Models from Scratch}},
  url     = {https://github.com/bhandarisuraz/miniembed},
  version = {1.0.0},
  year    = {2026}
}
```
## License
MIT