---
license: mit
tags:
- binary-neural-network
- zero-tokenization
- wire-speed-learning
- bit-level
- byte-level
language:
- en
pipeline_tag: text-generation
---

# Binary Transformers: Learning Language from Raw Binary

**Zero-tokenization transformers that learn directly from network bytes, bits, and beyond.**

This repository contains four novel transformer architectures exploring the limits of minimal-vocabulary learning:

| Model | Vocab | Input | Weights | Description |
|-------|-------|-------|---------|-------------|
| **Byte-level** | 256 | bytes (0x00-0xFF) | real | One token per byte value |
| **Bit-level** | 2 | bits (0, 1) | real | Pure binary, 8 tokens per byte |
| **Dibit** | 4 | dibits (00, 01, 10, 11) | real | 2-bit tokens, 4 per byte |
| **Pure Binary** | 2 | bits (0, 1) | **binary (-1/+1)** | BITS ALL THE WAY DOWN |

## Why?

Traditional LLMs use tokenizers (BPE, SentencePiece) with 32k-256k vocabularies. This creates:

- Tokenizer overhead and complexity
- Language and domain bias baked into the vocabulary
- A preprocessing bottleneck

**What if we eliminated tokenization entirely?**

These models learn directly from raw binary data: no tokenizer, no preprocessing, just bytes flowing into neural networks. The ultimate goal is **wire-speed learning**, where models absorb network traffic in real time.
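
As a concrete picture of what "no tokenizer" means here, the sketch below maps the same raw bytes to token IDs at each granularity from the table above. These helpers are illustrative (the MSB-first bit order is an assumption), not code taken from the trainers in this repository.

```python
# Hypothetical helpers: raw bytes -> token IDs at each granularity.

def byte_tokens(data: bytes) -> list[int]:
    """One token per byte, vocab = 256."""
    return list(data)

def bit_tokens(data: bytes) -> list[int]:
    """Eight tokens per byte (MSB first), vocab = 2."""
    return [(b >> i) & 1 for b in data for i in range(7, -1, -1)]

def dibit_tokens(data: bytes) -> list[int]:
    """Four 2-bit tokens per byte (MSB first), vocab = 4."""
    return [(b >> i) & 0b11 for b in data for i in range(6, -1, -2)]

if __name__ == "__main__":
    raw = b"Hi"
    print(byte_tokens(raw))   # [72, 105]
    print(bit_tokens(raw))    # [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1]
    print(dibit_tokens(raw))  # [1, 0, 2, 0, 1, 2, 2, 1]
```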

## Results (Live Experiments - 16 Jan 2026)

### Byte-Level (vocab=256)

```
Data:   350KB web crawl
BPB:    4.68 (vs 8.0 random = 41% compression)
Speed:  8.7 KB/s data throughput
Params: 0.6M
```

Learns HTML structure, XML tags, and timestamps from raw bytes.
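
Bits-per-byte (BPB) is usually the model's cross-entropy converted from nats to bits, with one prediction per byte token. Assuming that convention here, a minimal sketch of the conversion (the numbers are chosen to match the figures above; the helper is not from the trainers):

```python
import math

def bits_per_byte(ce_loss_nats: float) -> float:
    """Convert per-byte-token cross-entropy (in nats) to bits per byte."""
    return ce_loss_nats / math.log(2)

# Uniform guessing over 256 byte values gives the 8.0 BPB random baseline.
assert abs(bits_per_byte(math.log(256)) - 8.0) < 1e-9

# A cross-entropy of ~3.24 nats/byte corresponds to ~4.68 BPB,
# i.e. 1 - 4.68/8.0 ~= 41% below the random baseline ("41% compression").
print(round(bits_per_byte(3.244), 2))  # 4.68
```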

### Bit-Level (vocab=2)

```
Data:    550KB
Entropy: 1.008 bits/bit (vs 1.0 random)
Speed:   0.7 KB/s
Params:  85M
```

Pure binary learning: the model discovers byte boundaries and ASCII structure from raw 0s and 1s.

### Dibit (vocab=4: 00, 01, 10, 11)

```
Data:   437KB
BPB:    7.55 (vs 8.0 random = 5.7% compression)
Speed:  0.25 KB/s
Params: 37.8M
```

2-bit tokens give 2x the context efficiency of bit-level. **Best compression among the sub-byte models so far!**
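
The "2x context efficiency" is plain arithmetic on tokens-per-byte: for a fixed token window, each granularity sees a different amount of raw data. A quick illustration (ctx=2048 is borrowed from the example CONFIG further down):

```python
# Effective context in raw bytes for a fixed token window.
# Tokens per byte: byte-level = 1, dibit = 4, bit-level = 8.

def context_in_bytes(ctx_tokens: int, tokens_per_byte: int) -> int:
    return ctx_tokens // tokens_per_byte

CTX = 2048  # token context window, as in the example CONFIG below
for name, tpb in [("byte-level", 1), ("dibit", 4), ("bit-level", 8)]:
    print(f"{name:10s}: {context_in_bytes(CTX, tpb)} bytes per window")
# byte-level: 2048 bytes, dibit: 512 bytes, bit-level: 256 bytes
# -> dibit covers 2x the raw data of bit-level for the same ctx.
```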

### Pure Binary (vocab=2, binary weights)

```
Data:          806KB
Entropy:       0.995 bits/bit (0.5% compression)
Binary params: 99.8%
Params:        4.7M
```

**BITS ALL THE WAY DOWN** - input bits, binary weights (-1/+1), output bits.
On specialized hardware, this enables XNOR + popcount operations in place of multiply-accumulates.
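
To make the XNOR + popcount claim concrete: a dot product between two {-1, +1} vectors equals `2 * popcount(XNOR(a, b)) - n` when the vectors are packed as bits (1 -> +1, 0 -> -1). A tiny self-check of that identity (illustrative arithmetic, not a kernel from this repository):

```python
import random

def dot_pm1(a, b):
    """Reference dot product over {-1, +1} vectors."""
    return sum(x * y for x, y in zip(a, b))

def dot_xnor_popcount(a_bits, b_bits, n):
    """Same dot product from bit-packed operands: 2 * popcount(XNOR) - n."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask
    return 2 * bin(xnor).count("1") - n

n = 16
bits_a, bits_b = random.getrandbits(n), random.getrandbits(n)
vec_a = [1 if (bits_a >> i) & 1 else -1 for i in range(n)]
vec_b = [1 if (bits_b >> i) & 1 else -1 for i in range(n)]
assert dot_pm1(vec_a, vec_b) == dot_xnor_popcount(bits_a, bits_b, n)
```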

## Architecture

All models use a standard transformer architecture with:

- Causal self-attention
- GELU activations
- LayerNorm
- AdamW optimizer
- Straight-Through Estimator (STE) for binary weight gradients (see the sketch below)
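
For readers unfamiliar with the STE trick, the sketch below shows one common way to binarize a linear layer's weights to -1/+1 in the forward pass while letting gradients flow to the underlying real-valued weights. It is a generic illustration, not the layer used in `purebit_trainer.py`.

```python
import torch
import torch.nn as nn

class BinarizeSTE(torch.autograd.Function):
    """sign(w) in the forward pass; clipped identity gradient in the backward pass."""

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        # Map weights to {-1, +1}; treat sign(0) as +1.
        return torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w))

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # Pass gradients straight through, but only where |w| <= 1 (hard-tanh clip).
        return grad_output * (w.abs() <= 1).to(grad_output.dtype)

class BinaryLinear(nn.Module):
    """Linear layer whose weights are re-binarized with the STE on every forward pass."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.uniform_(self.weight, -0.1, 0.1)

    def forward(self, x):
        w_bin = BinarizeSTE.apply(self.weight)
        return x @ w_bin.t()

layer = BinaryLinear(8, 4)
layer(torch.randn(2, 8)).sum().backward()  # gradients reach the real-valued weights
print(layer.weight.grad.shape)             # torch.Size([4, 8])
```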

### Key Innovation: Online Learning

Unlike traditional batch training, these models learn from streaming data (see the sketch below):

- Micro-batches (32-512 tokens)
- Single pass, no data curation
- Compatible with real-time network streams
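
A minimal sketch of what such a streaming loop can look like: read raw bytes from stdin in small chunks, convert each chunk to byte tokens, and take one optimizer step per chunk. The tiny placeholder model stands in for the actual transformer trainers in this repository.

```python
import sys

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyByteLM(nn.Module):
    """Placeholder next-byte predictor; the real trainers use a transformer instead."""

    def __init__(self, vocab=256, d=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.head = nn.Linear(d, vocab)

    def forward(self, x):
        return self.head(self.emb(x))

def stream_train(model, optimizer, chunk_bytes=256):
    """Single-pass online training on raw bytes piped to stdin (e.g. `cat data.bin | ...`)."""
    while True:
        chunk = sys.stdin.buffer.read(chunk_bytes)  # one micro-batch of raw bytes
        if not chunk or len(chunk) < 2:
            break
        tokens = torch.tensor(list(chunk), dtype=torch.long).unsqueeze(0)
        inputs, targets = tokens[:, :-1], tokens[:, 1:]
        logits = model(inputs)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

if __name__ == "__main__":
    model = TinyByteLM()
    stream_train(model, torch.optim.AdamW(model.parameters(), lr=3e-4))
```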

## Usage

### Byte-Level

```bash
# Pipe any data source
cat data.bin | python byte_trainer.py
curl -s http://example.com | python byte_trainer.py
zcat crawl.jsonl.gz | python byte_trainer.py
```

### Bit-Level

```bash
cat data.bin | python bit_trainer.py
```

### Dibit (2-bit tokens)

```bash
cat data.bin | python dibit_trainer.py
```

### Pure Binary (binary weights)

```bash
cat data.bin | python purebit_trainer.py
```

## Configuration

Edit the `CONFIG` dict in each trainer:

```python
CONFIG = {
    "d": 256,       # embedding dimension
    "layers": 6,    # transformer layers
    "heads": 8,     # attention heads
    "vocab": 2,     # vocabulary size (2, 4, or 256 depending on the trainer)
    "ctx": 2048,    # context length in tokens
}
```

## Files

```
byte_trainer.py     # Vocab=256, one token per byte
bit_trainer.py      # Vocab=2, pure bits
dibit_trainer.py    # Vocab=4, 2-bit tokens (00, 01, 10, 11)
purebit_trainer.py  # Vocab=2 + binary weights (-1/+1)
```

## Insights

1. **Byte-level is the sweet spot** - a 256-token vocabulary captures ASCII structure efficiently while eliminating tokenizer overhead

2. **Bit-level works, but slowly** - sequences are 8x longer, so each forward pass covers 8x less raw data

3. **Dibit is a balance** - 2-bit tokens give 2x the context of bit-level while staying "pure binary"

4. **Binary weights are viable** - with 99.8% of parameters binary, the model learns almost as well as with real weights, enabling massive hardware speedups

5. **HTML is natural SFT** - web data contains instruction-following patterns: `<h3>Question</h3><p>Answer`, `<dt>Term</dt><dd>Definition</dd>`, JSON Q&A

## Future Work

- Scale to billions of parameters
- Custom CUDA kernels for binary ops (XNOR + popcount)
- FPGA/ASIC implementation for true wire-speed learning
- Hierarchical binary models (bit → byte → word emergence)

## Citation

```bibtex
@misc{opentransformer2026binary,
  title={Binary Transformers: Learning Language from Raw Binary},
  author={OpenTransformer},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/OpenTransformer/binary-transformers}
}
```

## License

MIT

## Acknowledgments

Built with PyTorch. Trained on vast.ai GPU instances. Part of the AGILLM research project.
|