MiniEmbed: Tiny, Powerful Embedding Models from Scratch

MiniEmbed is an ultra-compact text embedding model (bi-encoder) built entirely from scratch in PyTorch. No HuggingFace Transformers, no borrowed pre-trained weights -- just pure PyTorch, trained from scratch.

GitHub: github.com/bhandarisuraz/miniembed (full repo with examples, tests, interactive demo, and documentation)

Spec             Value
Parameters       ~10.8M
Model Size       ~42 MB
Embedding Dim    256
Vocab Size       30,000
Max Seq Length   128 tokens
Architecture     4-layer Transformer Encoder
Pooling          Mean Pooling + L2 Normalization
Training Loss    Multiple Negatives Ranking Loss (MNRL)
Training Data    ~3.8M pairs (NQ, GooAQ, MS MARCO, WDC, ECInstruct)

Quick Start

# Install dependencies first:
pip install torch numpy scikit-learn huggingface_hub

from huggingface_hub import snapshot_download

# Download model (one-time)
model_dir = snapshot_download("surazbhandari/miniembed")

# Add src to path
import sys
sys.path.insert(0, model_dir)

from src.inference import EmbeddingInference

# Load model
model = EmbeddingInference.from_pretrained(model_dir)

# 1. Similarity
score = model.similarity("Machine learning is great", "AI is wonderful")
print(f"Similarity: {score:.4f}")  # 0.4287

# 2. Normal Embeddings
embeddings = model.encode(["Machine learning is great", "AI is wonderful"])

# 3. Manual Cosine Similarity
# Since embeddings are L2-normalized, dot product is cosine similarity
import numpy as np
score = np.dot(embeddings[0], embeddings[1])
print(f"Similarity: {score:.4f}")

# Semantic Search
docs = ["Python is great for AI", "I love pizza", "Neural networks learn patterns"]
results = model.search("deep learning frameworks", docs, top_k=2)
for r in results:
    print(f"  [{r['score']:.3f}] {r['text']}")
# [0.498] Neural networks learn patterns
# [0.413] Python is great for AI

# Clustering
result = model.cluster_texts(["ML is cool", "Pizza is food", "AI rocks"], n_clusters=2)
for cluster_id, texts in result['texts_by_cluster'].items():
    print(f"Cluster {cluster_id + 1}: {texts}")
# Cluster 1: ['Pizza is food']
# Cluster 2: ['ML is cool', 'AI rocks']

Also Available via GitHub

git clone https://github.com/bhandarisuraz/miniembed.git
cd miniembed
pip install -r requirements.txt

python -c "
from src.inference import EmbeddingInference
model = EmbeddingInference.from_pretrained('models/mini')
print(model.similarity('hello world', 'hi there'))
"

Capabilities

  • Semantic Search -- Find meaning-based matches, not keyword overlap.
  • Re-Ranking -- Sort candidates by true semantic relevance (see the sketch after this list).
  • Clustering -- Group texts into logical categories automatically.
  • Product Matching -- Match items across platforms with messy titles.
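
Re-ranking, for example, reduces to one encode() call and a dot product. This is a minimal sketch, assuming model is the EmbeddingInference instance from Quick Start and that encode() returns L2-normalized NumPy arrays (as noted above); the query and candidates are invented for illustration:

import numpy as np

query = "affordable wireless headphones"
candidates = [
    "Bluetooth earbuds under $30",
    "Wired studio headphones",
    "How to pair a Bluetooth speaker",
]

q_emb = model.encode([query])[0]   # shape: (256,)
c_embs = model.encode(candidates)  # shape: (len(candidates), 256)

# Embeddings are L2-normalized, so dot product = cosine similarity
scores = c_embs @ q_emb
for i in np.argsort(-scores):      # best match first
    print(f"  [{scores[i]:.3f}] {candidates[i]}")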

Architecture

Custom 4-layer Transformer encoder built from first principles (a minimal sketch follows the list):

  • Token Embedding (30K vocab) + Sinusoidal Positional Encoding
  • 4x Pre-LayerNorm Transformer Encoder Layers
  • Multi-Head Self-Attention (4 heads, d_k=64)
  • Position-wise Feed-Forward (GELU activation, d_ff=1024)
  • Mean Pooling over non-padded tokens
  • L2 Normalization (unit hypersphere projection)
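
The sketch below shows an illustrative equivalent of this stack using PyTorch's built-in TransformerEncoder. The real src/model.py is written from scratch, so this is not the repo's code; hyperparameters come from the spec table above, while dropout and other details are left at PyTorch defaults as an assumption.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniEmbedSketch(nn.Module):
    def __init__(self, vocab_size=30_000, d_model=256, n_heads=4,
                 d_ff=1024, n_layers=4, max_len=128):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        # Fixed sinusoidal positional encodings
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        # Pre-LayerNorm encoder layers with GELU feed-forward
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, d_ff, activation="gelu",
            norm_first=True, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, ids, pad_mask):
        # ids: (B, T) token ids; pad_mask: (B, T) bool, True where padding
        x = self.tok(ids) + self.pe[: ids.size(1)]
        h = self.encoder(x, src_key_padding_mask=pad_mask)
        # Mean-pool over non-padded tokens only
        keep = (~pad_mask).unsqueeze(-1).float()
        pooled = (h * keep).sum(1) / keep.sum(1).clamp(min=1e-9)
        return F.normalize(pooled, p=2, dim=-1)  # unit hypersphere

# ids = torch.randint(1, 30_000, (2, 16)); mask = torch.zeros(2, 16, dtype=torch.bool)
# print(MiniEmbedSketch()(ids, mask).shape)  # torch.Size([2, 256])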

Training

Trained on ~3.8 million text pairs from public datasets:

Dataset                  Type
Natural Questions (NQ)   Q&A / General
GooAQ                    Knowledge Search
WDC Product Matching     E-commerce
ECInstruct               E-commerce Tasks
MS MARCO                 Web Search

Training details:

  • Training time: ~49 hours
  • Final loss: 0.0748
  • Optimizer: AdamW
  • Batch size: 256
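
Conceptually, MNRL treats every other positive in the batch as a negative for a given query: similarities between all queries and all positives form a (B, B) matrix, and cross-entropy pushes the diagonal (the true pairs) above everything else. A minimal sketch; the scale factor is an assumed value, not one taken from the repo's training config:

import torch
import torch.nn.functional as F

def mnr_loss(q, p, scale=20.0):
    # q, p: (B, 256) L2-normalized embeddings of (query, positive) pairs
    # (B, B) similarity matrix: row i scores query i against every positive
    sims = q @ p.T * scale
    # The matching positive sits on the diagonal; all others are negatives
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(sims, labels)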

Files

surazbhandari/miniembed
|-- README.md           # This model card
|-- config.json         # Architecture config
|-- model.safetensors   # Pre-trained weights (Safe & Fast)
|-- model.pt            # Pre-trained weights (Legacy PyTorch)
|-- tokenizer.json      # 30K word-level vocabulary
|-- training_info.json  # Training metadata
|-- src/
|   |-- __init__.py
|   |-- model.py        # Full architecture code
|   |-- tokenizer.py    # Tokenizer implementation
|   |-- inference.py    # High-level API (supports HF auto-download)

Limitations

  • Word-level tokenizer (no subword/BPE) -- unknown words map to [UNK] (illustrated after this list)
  • 128 token max sequence length
  • Trained primarily on English text
  • Best suited for short-form text (queries, product titles, sentences)
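
To illustrate the first limitation: a word-level tokenizer looks whole words up in a fixed vocabulary and falls back to [UNK] for anything unseen, so rare words lose all meaning. This is a conceptual sketch with a made-up vocabulary, not the repo's src/tokenizer.py:

# Conceptual word-level tokenization with an [UNK] fallback
vocab = {"[UNK]": 0, "neural": 1, "networks": 2, "learn": 3}

def tokenize(text):
    return [vocab.get(w, vocab["[UNK]"]) for w in text.lower().split()]

print(tokenize("Neural networks learn quickly"))  # [1, 2, 3, 0] -- "quickly" -> [UNK]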

Citation

@software{Bhandari_MiniEmbed_2026,
  author  = {Bhandari, Suraj},
  title   = {{MiniEmbed: Tiny, Powerful Embedding Models from Scratch}},
  url     = {https://github.com/bhandarisuraz/miniembed},
  version = {1.0.0},
  year    = {2026}
}

License

MIT
