---
license: mit
tags:
- binary-neural-network
- zero-tokenization
- wire-speed-learning
- bit-level
- byte-level
language:
- en
pipeline_tag: text-generation
---

# Binary Transformers: Learning Language from Raw Binary

**Zero-tokenization transformers that learn directly from network bytes, bits, and beyond.**

This repository contains four novel transformer architectures exploring the limits of minimal-vocabulary learning:

| Model | Vocab | Input | Weights | Description |
|-------|-------|-------|---------|-------------|
| **Byte-level** | 256 | bytes (0x00-0xFF) | real | One token per byte value |
| **Bit-level** | 2 | bits (0, 1) | real | Pure binary, 8 tokens per byte |
| **Dibit** | 4 | dibits (00, 01, 10, 11) | real | 2-bit tokens, 4 per byte |
| **Pure Binary** | 2 | bits (0, 1) | **binary (-1/+1)** | BITS ALL THE WAY DOWN |

## Why?

Traditional LLMs use tokenizers (BPE, SentencePiece) with 32k-256k vocabularies. This creates:

- Tokenizer overhead and complexity
- Language and domain bias baked into the vocabulary
- A preprocessing bottleneck

**What if we eliminated tokenization entirely?**

These models learn directly from raw binary data: no tokenizer, no preprocessing, just bytes flowing into neural networks. The ultimate goal is **wire-speed learning**, where models absorb network traffic in real time.
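
As a concrete picture of what "no tokenizer" means here, the sketch below maps the same raw bytes to token IDs at each granularity from the table above. These helpers are illustrative (the MSB-first bit order is an assumption), not code taken from the trainers in this repository.

```python
# Hypothetical helpers: raw bytes -> token IDs at each granularity.

def byte_tokens(data: bytes) -> list[int]:
    """One token per byte, vocab = 256."""
    return list(data)

def bit_tokens(data: bytes) -> list[int]:
    """Eight tokens per byte (MSB first), vocab = 2."""
    return [(b >> i) & 1 for b in data for i in range(7, -1, -1)]

def dibit_tokens(data: bytes) -> list[int]:
    """Four 2-bit tokens per byte (MSB first), vocab = 4."""
    return [(b >> i) & 0b11 for b in data for i in range(6, -1, -2)]

if __name__ == "__main__":
    raw = b"Hi"
    print(byte_tokens(raw))   # [72, 105]
    print(bit_tokens(raw))    # [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
    print(dibit_tokens(raw))  # [1, 0, 2, 0, 1, 2, 2, 1]
```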

## Results (Live Experiments - 16 Jan 2026)

### Byte-Level (vocab=256)

```
Data:   350KB web crawl
BPB:    4.68 (vs 8.0 random = 41% compression)
Speed:  8.7 KB/s data throughput
Params: 0.6M
```

Learns HTML structure, XML tags, and timestamps from raw bytes.
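
Bits-per-byte (BPB) is usually the model's cross-entropy converted from nats to bits, with one prediction per byte token. Assuming that convention here, a minimal sketch of the conversion (the numbers are chosen to match the figures above; the helper is not from the trainers):

```python
import math

def bits_per_byte(ce_loss_nats: float) -> float:
    """Convert per-byte-token cross-entropy (in nats) to bits per byte."""
    return ce_loss_nats / math.log(2)

# Uniform guessing over 256 byte values gives the 8.0 BPB random baseline.
assert abs(bits_per_byte(math.log(256)) - 8.0) < 1e-9

# A cross-entropy of ~3.24 nats/byte corresponds to ~4.68 BPB,
# i.e. 1 - 4.68/8.0 ~= 41% below the random baseline ("41% compression").
print(round(bits_per_byte(3.244), 2))  # 4.68
```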

### Bit-Level (vocab=2)

```
Data:    550KB
Entropy: 1.008 bits/bit (vs 1.0 random)
Speed:   0.7 KB/s
Params:  85M
```

Pure binary learning: the model discovers byte boundaries and ASCII structure from raw 0s and 1s.

### Dibit (vocab=4: 00, 01, 10, 11)

```
Data:   437KB
BPB:    7.55 (vs 8.0 random = 5.7% compression)
Speed:  0.25 KB/s
Params: 37.8M
```

2-bit tokens give 2x the context efficiency of bit-level. **Best compression among the sub-byte models so far!**
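
The "2x context efficiency" is plain arithmetic on tokens-per-byte: for a fixed token window, each granularity sees a different amount of raw data. A quick illustration (ctx=2048 is borrowed from the example CONFIG further down):

```python
# Effective context in raw bytes for a fixed token window.
# Tokens per byte: byte-level = 1, dibit = 4, bit-level = 8.

def context_in_bytes(ctx_tokens: int, tokens_per_byte: int) -> int:
    return ctx_tokens // tokens_per_byte

CTX = 2048  # token context window, as in the example CONFIG below
for name, tpb in [("byte-level", 1), ("dibit", 4), ("bit-level", 8)]:
    print(f"{name:10s}: {context_in_bytes(CTX, tpb)} bytes per window")
# byte-level: 2048 bytes, dibit: 512 bytes, bit-level: 256 bytes
# -> dibit covers 2x the raw data of bit-level for the same ctx.
```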

### Pure Binary (vocab=2, binary weights)

```
Data:          806KB
Entropy:       0.995 bits/bit (0.5% compression)
Binary params: 99.8%
Params:        4.7M
```

**BITS ALL THE WAY DOWN** - input bits, binary weights (-1/+1), output bits.
On specialized hardware, this enables XNOR + popcount operations in place of multiply-accumulates.
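
To make the XNOR + popcount claim concrete: a dot product between two {-1, +1} vectors equals `2 * popcount(XNOR(a, b)) - n` when the vectors are packed as bits (1 -> +1, 0 -> -1). A tiny self-check of that identity (illustrative arithmetic, not a kernel from this repository):

```python
import random

def dot_pm1(a, b):
    """Reference dot product over {-1, +1} vectors."""
    return sum(x * y for x, y in zip(a, b))

def dot_xnor_popcount(a_bits, b_bits, n):
    """Same dot product from bit-packed operands: 2 * popcount(XNOR) - n."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask
    return 2 * bin(xnor).count("1") - n

n = 16
bits_a, bits_b = random.getrandbits(n), random.getrandbits(n)
vec_a = [1 if (bits_a >> i) & 1 else -1 for i in range(n)]
vec_b = [1 if (bits_b >> i) & 1 else -1 for i in range(n)]
assert dot_pm1(vec_a, vec_b) == dot_xnor_popcount(bits_a, bits_b, n)
```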

## Architecture

All models use a standard transformer architecture with:

- Causal self-attention
- GELU activations
- LayerNorm
- AdamW optimizer
- Straight-Through Estimator (STE) for binary weight gradients (see the sketch below)
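
For readers unfamiliar with the STE trick, the sketch below shows one common way to binarize a linear layer's weights to -1/+1 in the forward pass while letting gradients flow to the underlying real-valued weights. It is a generic illustration, not the layer used in `purebit_trainer.py`.

```python
import torch
import torch.nn as nn

class BinarizeSTE(torch.autograd.Function):
    """sign(w) in the forward pass; clipped identity gradient in the backward pass."""

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        # Map weights to {-1, +1}; treat sign(0) as +1.
        return torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # Pass gradients straight through, but only where |w| <= 1 (hard-tanh clip).
        return grad_output * (w.abs() <= 1).to(grad_output.dtype)

class BinaryLinear(nn.Module):
    """Linear layer whose weights are re-binarized with the STE on every forward pass."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.uniform_(self.weight, -0.1, 0.1)

    def forward(self, x):
        w_bin = BinarizeSTE.apply(self.weight)
        return x @ w_bin.t()

layer = BinaryLinear(8, 4)
layer(torch.randn(2, 8)).sum().backward()  # gradients reach the real-valued weights
print(layer.weight.grad.shape)             # torch.Size([4, 8])
```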

### Key Innovation: Online Learning

Unlike traditional batch training, these models learn from streaming data (see the sketch below):

- Micro-batches (32-512 tokens)
- Single pass, no data curation
- Compatible with real-time network streams
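
A minimal sketch of what such a streaming loop can look like: read raw bytes from stdin in small chunks, convert each chunk to byte tokens, and take one optimizer step per chunk. The tiny placeholder model stands in for the actual transformer trainers in this repository.

```python
import sys

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyByteLM(nn.Module):
    """Placeholder next-byte predictor; the real trainers use a transformer instead."""

    def __init__(self, vocab=256, d=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.head = nn.Linear(d, vocab)

    def forward(self, x):
        return self.head(self.emb(x))

def stream_train(model, optimizer, chunk_bytes=256):
    """Single-pass online training on raw bytes piped to stdin (e.g. `cat data.bin | ...`)."""
    while True:
        chunk = sys.stdin.buffer.read(chunk_bytes)  # one micro-batch of raw bytes
        if not chunk or len(chunk) < 2:
            break
        tokens = torch.tensor(list(chunk), dtype=torch.long).unsqueeze(0)
        inputs, targets = tokens[:, :-1], tokens[:, 1:]
        logits = model(inputs)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

if __name__ == "__main__":
    model = TinyByteLM()
    stream_train(model, torch.optim.AdamW(model.parameters(), lr=3e-4))
```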

## Usage

### Byte-Level

```bash
# Pipe any data source
cat data.bin | python byte_trainer.py
curl -s http://example.com | python byte_trainer.py
zcat crawl.jsonl.gz | python byte_trainer.py
```

### Bit-Level

```bash
cat data.bin | python bit_trainer.py
```

### Dibit (2-bit tokens)

```bash
cat data.bin | python dibit_trainer.py
```

### Pure Binary (binary weights)

```bash
cat data.bin | python purebit_trainer.py
```

## Configuration

Edit the `CONFIG` dict in each trainer:

```python
CONFIG = {
    "d": 256,       # embedding dimension
    "layers": 6,    # transformer layers
    "heads": 8,     # attention heads
    "vocab": 2,     # vocabulary size (2, 4, or 256 depending on the trainer)
    "ctx": 2048,    # context length in tokens
}
```

## Files

```
byte_trainer.py     # Vocab=256, one token per byte
bit_trainer.py      # Vocab=2, pure bits
dibit_trainer.py    # Vocab=4, 2-bit tokens (00, 01, 10, 11)
purebit_trainer.py  # Vocab=2 + binary weights (-1/+1)
```

## Insights

1. **Byte-level is the sweet spot** - a 256-token vocabulary captures ASCII structure efficiently while eliminating tokenizer overhead

2. **Bit-level works, but slowly** - sequences are 8x longer, so each forward pass covers 8x less raw data

3. **Dibit is a balance** - 2-bit tokens give 2x the context of bit-level while staying "pure binary"

4. **Binary weights are viable** - with 99.8% of parameters binary, the model learns almost as well as with real weights, enabling massive hardware speedups

5. **HTML is natural SFT** - web data contains instruction-following patterns: `<h3>Question</h3><p>Answer`, `<dt>Term</dt><dd>Definition</dd>`, JSON Q&A

## Future Work

- Scale to billions of parameters
- Custom CUDA kernels for binary ops (XNOR + popcount)
- FPGA/ASIC implementation for true wire-speed learning
- Hierarchical binary models (bit → byte → word emergence)

## Citation

```bibtex
@misc{opentransformer2026binary,
  title={Binary Transformers: Learning Language from Raw Binary},
  author={OpenTransformer},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/OpenTransformer/binary-transformers}
}
```

## License

MIT

## Acknowledgments

Built with PyTorch. Trained on vast.ai GPU instances. Part of the AGILLM research project.
|