TinyV4: 11M Bilingual Base Model
TinyV4 is a compact 11 million parameter bilingual (Indonesian & English) base model. Think of it as a solid foundation: pre-trained and ready to be fine-tuned for your specific downstream task.
At just 58 MB, it's small enough to run anywhere. Smart enough to be worth your time.
What is this?
Most base models start at 100M+ parameters. Want to experiment with fine-tuning? You need a GPU. Want to iterate fast? Good luck.
TinyV4 is different. It has 11M parameters with a Mixture-of-Experts architecture, pre-trained on bilingual data so it already understands both Indonesian and English. You bring the task, it brings the foundation.
Why use TinyV4 as your base?
| Reason | Why it matters |
|---|---|
| 11M params | Fine-tune in minutes, not days |
| 58 MB | Fits anywhere: mobile, edge, browser |
| CPU-friendly | No GPU? No problem |
| Bilingual | Already understands ID + EN |
| MoE architecture | Efficient capacity without the bloat |
| MIT license | Permissive: use, modify, and ship it, including commercially |
Architecture
| Component | Spec |
|---|---|
| Parameters | 11,034,955 |
| Dimension | 128 |
| Layers | 6 |
| Attention Heads | 4 (Query), 4 (Index) |
| MoE Experts | 4 routed + 1 shared |
| Active Experts | 2 per token |
| Vocab Size | 32,000 |
| Max Sequence | 512 tokens |
| File Size | 58 MB |
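As a rough sanity check on the table, the arithmetic below works out how much of the parameter budget the token embedding takes and what the weights would occupy at 32-bit precision. Storing weights as float32 is an assumption; the source does not state the checkpoint dtype or whether the output head is stored separately, so the estimate is not expected to match the 58 MB file size exactly.

```python
# Values taken from the architecture table.
TOTAL_PARAMS = 11_034_955
VOCAB_SIZE = 32_000
DIM = 128

# A vocab x dim embedding table (shared with the output head when weights are tied).
embedding_params = VOCAB_SIZE * DIM
embedding_share = embedding_params / TOTAL_PARAMS

# Assumption: 4 bytes per parameter (float32).
BYTES_PER_PARAM = 4
fp32_megabytes = TOTAL_PARAMS * BYTES_PER_PARAM / 1e6

print(f"Embedding parameters: {embedding_params:,}")
print(f"Share of total: {embedding_share:.1%}")
print(f"float32 weights: {fp32_megabytes:.1f} MB")
```

Under these assumptions the embedding table alone takes more than a third of the parameters, leaving the rest for the attention layers and experts.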
Built with Mixture-of-Experts (MoE), Sinkhorn-Knopp load balancing, Multi-Token Prediction (MTP), and Hierarchical Compressed Attention: techniques typically reserved for models 100x larger. We just refused to believe you need billions of parameters to be useful.
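To make the "4 routed + 1 shared, 2 active per token" rows concrete, here is a minimal sketch of top-2 gating in plain Python. It assumes a common formulation (softmax over router scores, keep the two largest, renormalize, and always add the shared expert). TinyV4's actual router and its Sinkhorn-Knopp balancing are not shown, and the names `top2_gates` and `moe_output` are hypothetical, not taken from the model's code.

```python
import math

def top2_gates(router_scores, k=2):
    """Return {expert_index: weight} for the k highest-scoring routed experts."""
    # Softmax over the routed experts (shifted by the max for numerical stability).
    m = max(router_scores)
    exps = [math.exp(s - m) for s in router_scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts and renormalize their weights to sum to 1.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return {i: probs[i] / kept for i in top}

def moe_output(x, router_scores, routed_experts, shared_expert):
    """Add the gated outputs of the selected routed experts to the shared expert."""
    gates = top2_gates(router_scores)
    out = list(shared_expert(x))  # the shared expert runs for every token
    for i, w in gates.items():
        y = routed_experts[i](x)  # only the selected experts are evaluated
        out = [o + w * v for o, v in zip(out, y)]
    return out

# Toy example: 4 routed experts that scale the input, 1 shared identity expert.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
shared = lambda x: list(x)
scores = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

print(top2_gates(scores))
print(moe_output([1.0, 1.0], scores, experts, shared))
```

Only two of the four routed experts run per token, which is where the "capacity without the bloat" argument comes from: the parameter count grows with the number of experts, while per-token compute grows only with the number of active ones.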
What can you fine-tune it for?
TinyV4 is a blank canvas. Some ideas:
- Translation (ID ↔ EN): it already has bilingual foundations
- Text classification: sentiment, topic, intent
- Story generation: fine-tune on your own narrative dataset
- Chat / instruction following: add conversation data
- Code generation: yes, even at 11M, it can learn patterns
- Domain-specific tasks: medical, legal, technical. Your data, your model
The point is: you control the final model. TinyV4 just gives you a running start.
Quick Start
```bash
pip install transformers safetensors torch
```
Load the base model
```python
from transformers import AutoTokenizer, AutoModel

# Load model & tokenizer (trust_remote_code=True because the architecture is custom)
model = AutoModel.from_pretrained("ukung/tinyv4", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ukung/tinyv4")

# Tie embeddings (custom step for TinyV4)
model.head.weight = model.embed.weight
model.eval()

print(f"Loaded: {sum(p.numel() for p in model.parameters()):,} params")
```
Generate text (zero-shot)
```python
import torch

@torch.no_grad()
def generate(prompt, max_new_tokens=60, temperature=0.8, top_k=40):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        idx = input_ids[:, -512:]  # stay within the 512-token context limit
        logits, _, _ = model(idx)
        logits = logits[:, -1, :] / temperature
        # Top-k filtering: mask every logit below the k-th largest
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[:, [-1]]] = float("-inf")
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

# Try it out
print(generate("Once upon a time,"))
print(generate("Pada suatu hari,"))
```
Fine-tune for your task
```python
from torch.optim import AdamW
from safetensors.torch import save_model

model.train()
optimizer = AdamW(model.parameters(), lr=3e-4)

# Your dataset, your task
for batch in your_dataloader:
    logits, mtp_logits, bal_loss = model(batch)
    loss = compute_your_loss(logits, batch)  # placeholder: define this for your task
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Save your fine-tuned model
# save_model handles the tied head/embed weights; save_file raises on tensors that share memory
save_model(model, "my-finetuned-model.safetensors")
```
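The loop above leaves `compute_your_loss` undefined. For a causal language-modeling objective, one plausible definition is the shifted cross-entropy sketched below. It assumes `logits` has shape `(batch, seq_len, vocab)` and `batch` is a `(batch, seq_len)` tensor of token ids; neither shape is confirmed by the source. The sketch also ignores `mtp_logits` and `bal_loss`, which you may want to add as auxiliary terms.

```python
import torch
import torch.nn.functional as F

def compute_your_loss(logits, batch):
    """Next-token cross-entropy: position t predicts token t + 1."""
    # Drop the last prediction (nothing follows it) and the first target (nothing predicts it).
    pred = logits[:, :-1, :]
    target = batch[:, 1:]
    # cross_entropy expects (N, vocab) logits and (N,) class indices.
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))

# Shape check with random data (vocab size 32,000 as in the architecture table).
dummy_logits = torch.randn(2, 8, 32_000)
dummy_batch = torch.randint(0, 32_000, (2, 8))
loss = compute_your_loss(dummy_logits, dummy_batch)
print(loss.item())
```

For classification or other tasks the loss will look different; this version only covers continued language-model training on your own text.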
Comparison: Sub-100M Base Models
Let's be honest: most base models under 100M parameters are either:
- Distilled from larger models (not truly small)
- Overly specialized (can't adapt to new tasks)
- Poorly architected (waste parameters on the wrong things)
TinyV4 is different. At 11M parameters, it delivers:
- Real bilingual understanding: not just token overlap
- MoE efficiency: 4 experts, 2 active, more capacity per parameter
- Proven adaptability: fine-tunes well across diverse tasks
- Zero-shot generation: coherent output without any task-specific training
We're not saying 11M beats 1B. We're saying that at this size, nothing else gives you this much to work with.
Pre-training Details
| Metric | Value |
|---|---|
| Steps | 5,000 |
| Final Loss | 3.97 |
| Optimizer | AdamW |
| Schedule | Cosine decay with warmup |
| Weight Decay | 0.01 |
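The table names the schedule but not its shape. A common form of cosine decay with linear warmup is sketched below; the peak learning rate, warmup length, and floor are illustrative assumptions, since the source reports only the 5,000-step total.

```python
import math

def lr_at(step, total_steps=5_000, warmup_steps=200, peak_lr=3e-4, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr.

    warmup_steps, peak_lr and min_lr are hypothetical; only total_steps comes from the table.
    """
    if step < warmup_steps:
        # Linear ramp up to peak_lr over the warmup phase.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 199, 200, 2_600, 5_000):
    print(s, lr_at(s))
```

With these placeholder values the rate peaks at the end of warmup, falls to half the peak midway through the decay phase, and reaches the floor at the final step.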
Limitations
Be realistic about what 11M parameters can do:
- Zero-shot output will be basic: this is a base model, not a finished product
- Long-form coherence requires fine-tuning with appropriate data
- Domain expertise needs your data: it won't magically know medical terms or legal jargon
- Reasoning is limited: complex logical chains need more parameters
Think of TinyV4 as the best possible starting point at 11M. Not the finish line.
License
MIT: use it, modify it, ship it. The license asks only that you keep the copyright and permission notice in copies of the software.
Citation
@misc{tinyv4-11m,
title = {TinyV4: An 11M Bilingual Base Model with Mixture-of-Experts},
year = {2025},
url = {https://huggingface.co/ukung/tinyv4}
}