TinyV4 — 11M Bilingual Base Model

TinyV4 is a compact, 11-million-parameter bilingual (Indonesian and English) base model. Think of it as a solid foundation: pre-trained and ready to be fine-tuned for your specific downstream task.

At just 58 MB, it's small enough to run anywhere. Smart enough to be worth your time.

What is this?

Most base models start at 100M+ parameters. Want to experiment with fine-tuning? You need a GPU. Want to iterate fast? Good luck.

TinyV4 is different. It packs 11M parameters into a Mixture-of-Experts architecture and is pre-trained on bilingual data, so it already understands both Indonesian and English. You bring the task, it brings the foundation.

Why use TinyV4 as your base?

Reason	Why it matters
11M params	Fine-tune in minutes, not days
58 MB	Fits anywhere — mobile, edge, browser
CPU-friendly	No GPU? No problem
Bilingual	Already understands ID + EN
MoE architecture	Efficient capacity without the bloat
MIT license	Permissive: commercial use and modification allowed

Architecture

Component	Spec
Parameters	11,034,955
Dimension	128
Layers	6
Attention Heads	4 (Query), 4 (Index)
MoE Experts	4 routed + 1 shared
Active Experts	2 per token
Vocab Size	32,000
Max Sequence	512 tokens
File Size	58 MB
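
The expert rows in the table (4 routed + 1 shared, 2 active per token) can be illustrated with a toy router. This is a minimal plain-Python sketch, not TinyV4's implementation: it assumes the 2 active experts are picked from the 4 routed ones by softmax score, that their gate weights are renormalized to sum to 1, and that the shared expert is always added. `route_token` and the scalar "experts" are hypothetical names used only for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(x, router_scores, routed_experts, shared_expert, top_k=2):
    """Combine the top_k routed experts (gate-weighted) with the always-on shared expert."""
    probs = softmax(router_scores)
    # Indices of the top_k highest-probability routed experts
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Assumption: renormalize the chosen gate weights so they sum to 1
    norm = sum(probs[i] for i in chosen)
    gates = {i: probs[i] / norm for i in chosen}
    out = shared_expert(x)
    for i, g in gates.items():
        out += g * routed_experts[i](x)
    return out, gates

# Toy scalar "experts": each one just scales its input
routed_experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
shared_expert = lambda x: 0.5 * x

out, gates = route_token(1.0, [0.1, 2.0, 0.3, 2.0], routed_experts, shared_expert)
print(sorted(gates), out)
```

Only two of the four routed experts run for this token, which is where the compute saving comes from: the parameter count covers all experts, while the per-token work covers only the active ones.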

Built with Mixture-of-Experts (MoE), Sinkhorn-Knopp load balancing, Multi-Token Prediction (MTP), and Hierarchical Compressed Attention — techniques typically reserved for models 100x larger. We just refused to believe you need billions of parameters to be useful.
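
The card names Sinkhorn-Knopp load balancing without describing how TinyV4 applies it. As background, here is a minimal plain-Python sketch of the generic Sinkhorn-Knopp iteration on a token-by-expert score matrix. The `sinkhorn` function, the iteration count, and the toy scores are assumptions for illustration, not the model's actual routing code.

```python
import math

def sinkhorn(scores, n_iters=20):
    """Alternately normalize rows and columns of exp(scores).

    Rows are tokens, columns are experts. After convergence each token's
    weights sum to 1 and each expert receives an equal share of the tokens.
    """
    n_tokens, n_experts = len(scores), len(scores[0])
    p = [[math.exp(s) for s in row] for row in scores]
    target = n_tokens / n_experts  # balanced load per expert
    for _ in range(n_iters):
        # Row step: each token's assignment weights sum to 1
        for row in p:
            total = sum(row)
            for j in range(n_experts):
                row[j] /= total
        # Column step: each expert's total weight is rescaled to the target
        for j in range(n_experts):
            col = sum(p[i][j] for i in range(n_tokens))
            for i in range(n_tokens):
                p[i][j] *= target / col
    return p

# Toy scores: every token prefers expert 0, with small per-token variation
scores = [[(2.0 if j == 0 else 0.0) + 0.1 * ((i * j) % 3) for j in range(4)]
          for i in range(8)]
balanced = sinkhorn(scores)
col_sums = [sum(balanced[i][j] for i in range(8)) for j in range(4)]
print([round(c, 3) for c in col_sums])
```

Even though the raw scores favor expert 0, the balanced matrix spreads the total weight evenly across the four experts.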

What can you fine-tune it for?

TinyV4 is a blank canvas. Some ideas:

Translation (ID ↔ EN) — it already has bilingual foundations
Text classification — sentiment, topic, intent
Story generation — fine-tune on your own narrative dataset
Chat / instruction following — add conversation data
Code generation — yes, even at 11M, it can learn patterns
Domain-specific tasks — medical, legal, technical — your data, your model

The point is: you control the final model. TinyV4 just gives you a running start.
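
For the translation idea in the list above, one way to prepare data for a base language model is to join each sentence pair into a single training string. The sketch below is only illustrative: the `<id>`/`<en>` tag template and the `format_pair` helper are hypothetical and not part of TinyV4. The template is a free choice, as long as the same one is used at inference time.

```python
def format_pair(src, tgt, direction="id-en"):
    """Join one sentence pair into a single training string (hypothetical template)."""
    src_tag, tgt_tag = ("<id>", "<en>") if direction == "id-en" else ("<en>", "<id>")
    return f"{src_tag} {src.strip()} {tgt_tag} {tgt.strip()}"

pairs = [("Selamat pagi.", "Good morning."), ("Terima kasih.", "Thank you.")]

# Emit both directions so the model sees ID -> EN and EN -> ID
examples = [format_pair(s, t, "id-en") for s, t in pairs]
examples += [format_pair(t, s, "en-id") for s, t in pairs]
print(examples[0])
```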

Quick Start

pip install transformers safetensors torch

Load the base model

from transformers import AutoTokenizer, AutoModel

# Load model & tokenizer (trust_remote_code=True because the architecture is custom)
model = AutoModel.from_pretrained("ukung/tinyv4", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ukung/tinyv4")

# Tie embeddings (custom step for TinyV4)
model.head.weight = model.embed.weight
model.eval()

print(f"Loaded: {sum(p.numel() for p in model.parameters()):,} params")

Generate text (zero-shot)

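import torch

@torch.no_grad()
def generate(prompt, max_new_tokens=60, temperature=0.8, top_k=40):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    for _ in range(max_new_tokens):
        # Keep only the last 512 tokens (the model's max sequence length)
        idx = input_ids[:, -512:]
        logits, _, _ = model(idx)
        # Take the logits for the final position and apply temperature
        logits = logits[:, -1, :] / temperature

        # Top-k filtering: mask everything below the k-th largest logit
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[:, [-1]]] = float('-inf')
        probs = torch.softmax(logits, dim=-1)

        # Sample the next token and append it to the sequence
        next_token = torch.multinomial(probs, 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)

        # Stop early at end-of-sequence
        if next_token.item() == tokenizer.eos_token_id:
            break

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)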

# Try it out
print(generate("Once upon a time,"))
print(generate("Pada suatu hari,"))

Fine-tune for your task

from torch.optim import AdamW

model.train()
optimizer = AdamW(model.parameters(), lr=3e-4)

# Your dataset, your task
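for batch in your_dataloader:
    # Going by the names, the forward pass returns next-token logits,
    # Multi-Token Prediction logits, and a load-balancing loss
    logits, mtp_logits, bal_loss = model(batch)
    # compute_your_loss is a placeholder for your task's loss function
    loss = compute_your_loss(logits, batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()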

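# Save your fine-tuned model
# save_model handles the tied head/embedding weights, which save_file
# rejects because the two tensors share memory
from safetensors.torch import save_model
save_model(model, "my-finetuned-model.safetensors")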
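
`compute_your_loss` in the loop above is a placeholder. For plain language-model fine-tuning, a common choice is shifted next-token cross-entropy. The sketch below assumes `logits` has shape (batch, seq, vocab) and that the batch is a tensor of token ids with shape (batch, seq). The `next_token_loss` name, the `bal_weight` value, and treating `bal_loss` as an additive auxiliary term are assumptions, not documented TinyV4 behavior.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, input_ids, bal_loss=None, bal_weight=0.01, pad_token_id=None):
    """Cross-entropy between position t's logits and the token at position t + 1."""
    # Drop the last prediction and the first label so the two line up
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    if pad_token_id is not None:
        # -100 is the default ignore_index of F.cross_entropy
        shift_labels = shift_labels.masked_fill(shift_labels == pad_token_id, -100)
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
    if bal_loss is not None:
        # Assumption: the auxiliary balance term is simply added with a small weight
        loss = loss + bal_weight * bal_loss
    return loss

# Toy check with random tensors standing in for model output
demo_logits = torch.randn(2, 8, 32)
demo_ids = torch.randint(0, 32, (2, 8))
print(next_token_loss(demo_logits, demo_ids).item())
```

In the training loop this would replace the placeholder call, for example `loss = next_token_loss(logits, batch, bal_loss)`.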

Comparison: Sub-100M Base Models

Let's be honest — most base models under 100M parameters are either:

Distilled from larger models (not truly small)
Overly specialized (can't adapt to new tasks)
Poorly architected (waste parameters on the wrong things)

TinyV4 is different. At 11M parameters, it delivers:

Real bilingual understanding — not just token overlap
MoE efficiency — 4 experts, 2 active, more capacity per parameter
Proven adaptability — fine-tunes well across diverse tasks
Zero-shot generation — coherent output without any task-specific training

We're not saying 11M beats 1B. We're saying that at this size, nothing else gives you this much to work with.

Pre-training Details

Metric	Value
Steps	5,000
Final Loss	3.97
Optimizer	AdamW
Schedule	Cosine decay with warmup
Weight Decay	0.01
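
The table lists cosine decay with warmup but not the warmup length or the peak learning rate. Below is a minimal sketch of such a schedule, assuming linear warmup. Only the 5,000 total steps comes from the table; `warmup_steps=250`, `peak_lr=3e-4`, and `min_lr=0.0` are placeholder values, not TinyV4's actual settings.

```python
import math

def lr_at(step, total_steps=5000, warmup_steps=250, peak_lr=3e-4, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 249, 2625, 5000):
    print(s, f"{lr_at(s):.2e}")
```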

Limitations

Be realistic about what 11M parameters can do:

Zero-shot output will be basic — this is a base model, not a finished product
Long-form coherence requires fine-tuning with appropriate data
Domain expertise needs your data — it won't magically know medical terms or legal jargon
Reasoning is limited — complex logical chains need more parameters

Think of TinyV4 as the best possible starting point at 11M. Not the finish line.

License

MIT: use it, modify it, ship it. The MIT license only asks that you keep the copyright and license notice with any copies.

Citation

@misc{tinyv4-11m,
  title  = {TinyV4: An 11M Bilingual Base Model with Mixture-of-Experts},
  year   = {2025},
  url    = {https://huggingface.co/ukung/tinyv4}
}

Safetensors model size: 15.2M params (tensor type F32)
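
The 58 MB file size can be checked against the parameter counts on this card. A rough sketch, assuming 4 bytes per F32 parameter and ignoring the file header; `checkpoint_size` is a hypothetical helper, and 15.2M is the rounded figure shown for the Safetensors file.

```python
def checkpoint_size(n_params, bytes_per_param=4):
    """Return the raw tensor size as (decimal megabytes, binary mebibytes)."""
    n_bytes = n_params * bytes_per_param
    return n_bytes / 1e6, n_bytes / 2**20

for label, n in [("architecture table", 11_034_955), ("safetensors file", 15_200_000)]:
    mb, mib = checkpoint_size(n)
    print(f"{label}: {mb:.1f} MB, {mib:.1f} MiB")
```

Under these assumptions the 58 MB figure lines up with the 15.2M count read in binary megabytes, rather than with the 11,034,955 parameters in the architecture table.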