---
license: mit
tags:
- binary-neural-network
- zero-tokenization
- wire-speed-learning
- bit-level
- byte-level
language:
- en
pipeline_tag: text-generation
---
# Binary Transformers: Learning Language from Raw Binary
**Zero-tokenization transformers that learn directly from network bytes, bits, and beyond.**
This repository contains four novel transformer architectures exploring the limits of minimal vocabulary learning:
| Model | Vocab | Input | Weights | Description |
|-------|-------|-------|---------|-------------|
| **Byte-level** | 256 | bytes (0x00-0xFF) | real | One token per byte value |
| **Bit-level** | 2 | bits (0, 1) | real | Pure binary, 8 tokens per byte |
| **Dibit** | 4 | dibits (00,01,10,11) | real | 2-bit tokens, 4 per byte |
| **Pure Binary** | 2 | bits (0, 1) | **binary (-1/+1)** | BITS ALL THE WAY DOWN |
## Why?
Traditional LLMs use tokenizers (BPE, SentencePiece) with 32k-256k vocabulary. This creates:
- Tokenizer overhead and complexity
- Language/domain bias baked into vocabulary
- Preprocessing bottleneck
**What if we eliminated tokenization entirely?**
These models learn directly from raw binary data - no tokenizer, no preprocessing, just bytes flowing into neural networks. The ultimate goal: **wire-speed learning** where models absorb network traffic in real-time.
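In the byte-level case, "tokenization" reduces to reading raw bytes, since every byte value is its own token ID. A minimal sketch (function names are illustrative, not taken from the trainers):

```python
def byte_tokenize(data: bytes) -> list[int]:
    # Each byte IS a token ID: a vocab of exactly 256, no lookup table or merges.
    return list(data)

def byte_detokenize(ids: list[int]) -> bytes:
    # Decoding is the identity mapping back to bytes.
    return bytes(ids)

ids = byte_tokenize(b"Hi!")
assert ids == [72, 105, 33]
assert byte_detokenize(ids) == b"Hi!"
```

Because encoding and decoding are identity maps, there is no out-of-vocabulary case and no preprocessing step between the wire and the model.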
## Results (Live Experiments - 16 Jan 2026)
### Byte-Level (vocab=256)
```
Data: 350KB web crawl
BPB: 4.68 (vs 8.0 random = 41% compression)
Speed: 8.7 KB/s learning rate
Params: 0.6M
```
Learns HTML structure, XML tags, timestamps from raw bytes.
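The bits-per-byte (BPB) figure is the model's average cross-entropy converted from nats to bits; the "8.0 random" baseline is what a uniform distribution over 256 byte values scores. A quick sketch of the conversion (assuming the loss is reported in nats per byte-token):

```python
import math

def bits_per_byte(nll_nats: float) -> float:
    # Convert average negative log-likelihood (nats per byte-token) to bits.
    return nll_nats / math.log(2)

# A uniform model over 256 byte values scores -ln(1/256) nats per token,
# which converts to exactly the 8.0 bpb baseline quoted above.
assert abs(bits_per_byte(math.log(256)) - 8.0) < 1e-9
```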
### Bit-Level (vocab=2)
```
Data: 550KB
Entropy: 1.008 bit/bit (vs 1.0 random; ~0.8% above the baseline)
Speed: 0.7 KB/s
Params: 85M
```
Pure binary learning - discovers byte boundaries and ASCII from 0s and 1s.
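Feeding a transformer individual bits means unpacking each byte into eight bit-tokens, which is why bit-level sequences are 8x longer than byte-level ones. A stdlib-only sketch of the unpacking (MSB-first ordering is an assumption here, not confirmed by the trainers):

```python
def bytes_to_bits(data: bytes) -> list[int]:
    # MSB-first: each byte becomes 8 bit-tokens, so sequences grow 8x.
    return [(b >> (7 - i)) & 1 for b in data for i in range(8)]

# 'A' is 0x41 = 0b01000001
assert bytes_to_bits(b"A") == [0, 1, 0, 0, 0, 0, 0, 1]
```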
### Dibit (vocab=4: 00,01,10,11)
```
Data: 437KB
BPB: 7.55 (vs 8.0 random = 5.7% compression)
Speed: 0.25 KB/s
Params: 37.8M
```
2-bit tokens provide 2x context efficiency vs bit-level. **Best compression among the binary-alphabet models so far!**
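Dibit encoding splits each byte into four 2-bit tokens instead of eight 1-bit ones, halving sequence length relative to bit-level. An illustrative helper (MSB-first ordering assumed):

```python
def bytes_to_dibits(data: bytes) -> list[int]:
    # MSB-first: each byte splits into four 2-bit tokens with values 0-3.
    return [(b >> shift) & 0b11 for b in data for shift in (6, 4, 2, 0)]

# 'A' is 0x41 = 0b01_00_00_01 -> dibits 1, 0, 0, 1
assert bytes_to_dibits(b"A") == [1, 0, 0, 1]
```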
### Pure Binary (vocab=2, binary weights)
```
Data: 806KB
Entropy: 0.995 bit/bit (0.5% compression)
Binary params: 99.8%
Params: 4.7M
```
**BITS ALL THE WAY DOWN** - input bits, binary weights (-1/+1), output bits.
On specialized hardware, this enables XNOR+popcount operations instead of multiply-accumulate.
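The XNOR+popcount trick works because, for weights and activations in {-1, +1} packed as bits (1 for +1, 0 for -1), XNOR marks the positions where the two vectors agree, and the real-valued dot product equals `2 * agreements - n`. A pure-Python sketch of that identity (not the models' actual kernel):

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    # a_bits/b_bits pack n values in {-1, +1} as bits (1 -> +1, 0 -> -1).
    mask = (1 << n) - 1
    agreements = bin(~(a_bits ^ b_bits) & mask).count("1")  # XNOR + popcount
    return 2 * agreements - n  # equals the real-valued dot product

# a = [+1, -1, +1, +1] -> 0b1011 (MSB-first), b = [+1, +1, -1, +1] -> 0b1101
# Real dot product: (+1)(+1) + (-1)(+1) + (+1)(-1) + (+1)(+1) = 0
assert binary_dot(0b1011, 0b1101, 4) == 0
```

On hardware, the XNOR and popcount each cover 64+ weight positions per instruction, replacing 64+ multiply-accumulates.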
## Architecture
All models use standard transformer architecture with:
- Causal self-attention
- GELU activation
- LayerNorm
- AdamW optimizer
- Straight-Through Estimator (STE) for binary weight gradients
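The STE keeps real-valued "latent" weights, binarizes them in the forward pass, and pretends the binarization was the identity in the backward pass. A NumPy sketch of one common variant (the hard-tanh-clipped STE; the trainers may differ in details):

```python
import numpy as np

def binarize_forward(w: np.ndarray) -> np.ndarray:
    # Forward pass uses sign(w) in {-1, +1}; zeros map to +1 here.
    return np.where(w >= 0, 1.0, -1.0)

def ste_backward(w: np.ndarray, grad_out: np.ndarray) -> np.ndarray:
    # STE: pass the gradient straight through, zeroed where |w| > 1
    # so saturated latent weights stop drifting further.
    return grad_out * (np.abs(w) <= 1.0)

w = np.array([-2.0, -0.3, 0.4, 1.5])
g = np.ones_like(w)
assert binarize_forward(w).tolist() == [-1.0, -1.0, 1.0, 1.0]
assert ste_backward(w, g).tolist() == [0.0, 1.0, 1.0, 0.0]
```

The optimizer (AdamW here) updates the latent real weights; only the forward pass ever sees the binary values.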
### Key Innovation: Online Learning
Unlike traditional batch training, these models learn from streaming data:
- Micro-batches (32-512 tokens)
- Single-pass, no data curation
- Real-time network stream compatible
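The streaming loop can be as simple as slicing fixed-size micro-batches off an incoming byte stream. A hypothetical helper illustrating the single-pass pattern (the trainers' actual loop may buffer differently):

```python
import io

def micro_batches(stream, batch_tokens: int = 256):
    # Yield fixed-size chunks of byte-tokens from a stream, single pass.
    # A trailing partial chunk is dropped rather than padded.
    buf = b""
    while True:
        chunk = stream.read(batch_tokens - len(buf))
        if not chunk:
            return
        buf += chunk
        if len(buf) == batch_tokens:
            yield list(buf)
            buf = b""

# Simulate a 600-byte stream: two full 256-token micro-batches, remainder dropped.
batches = list(micro_batches(io.BytesIO(b"a" * 600)))
assert len(batches) == 2 and all(len(b) == 256 for b in batches)
```

In the trainers this would wrap `sys.stdin.buffer`, which is what makes the `cat data.bin | python …` pipelines below work.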
## Usage
### Byte-Level
```bash
# Pipe any data source
cat data.bin | python byte_trainer.py
curl -s http://example.com | python byte_trainer.py
zcat crawl.jsonl.gz | python byte_trainer.py
```
### Bit-Level
```bash
cat data.bin | python bit_trainer.py
```
### Dibit (2-bit tokens)
```bash
cat data.bin | python dibit_trainer.py
```
### Pure Binary (binary weights)
```bash
cat data.bin | python purebit_trainer.py
```
## Configuration
Edit the CONFIG dict in each trainer:
```python
CONFIG = {
    "d": 256,      # embedding dimension
    "layers": 6,   # transformer layers
    "heads": 8,    # attention heads
    "vocab": 2,    # vocabulary size
    "ctx": 2048,   # context length
}
```
## Files
```
byte_trainer.py # Vocab=256, one token per byte
bit_trainer.py # Vocab=2, pure bits
dibit_trainer.py # Vocab=4, 2-bit tokens (00,01,10,11)
purebit_trainer.py # Vocab=2 + binary weights (-1/+1)
```
## Insights
1. **Byte-level is the sweet spot** - a 256-entry vocab captures ASCII structure efficiently while eliminating tokenizer overhead
2. **Bit-level works but slow** - 8x longer sequences mean 8x less context per forward pass
3. **Dibit balances** - 2-bit tokens give 2x context vs bit-level while staying "pure binary"
4. **Binary weights viable** - 99.8% binary params learn almost as well as real weights, enabling massive hardware speedups
5. **HTML is natural SFT** - Web data contains instruction-following patterns: `<h3>Question</h3><p>Answer`, `<dt>Term</dt><dd>Definition</dd>`, JSON Q&A
## Future Work
- Scale to billions of parameters
- Custom CUDA kernels for binary ops (XNOR + popcount)
- FPGA/ASIC implementation for true wire-speed learning
- Hierarchical binary models (bit → byte → word emergence)
## Citation
```bibtex
@misc{opentransformer2026binary,
  title={Binary Transformers: Learning Language from Raw Binary},
  author={OpenTransformer},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/OpenTransformer/binary-transformers}
}
```
## License
MIT
## Acknowledgments
Built with PyTorch. Trained on vast.ai GPU instances. Part of the AGILLM research project.