- toksuite/supertoken_models-llama_meta-llama-Llama-3.2-1B-textmatched
  2B • Updated • 35
- toksuite/supertoken_models-llama_Qwen-Qwen3-8B-textmatched
  2B • Updated • 36
- toksuite/supertoken_models-llama_common-pile-comma-v0.1-textmatched
  2B • Updated • 34
- toksuite/supertoken_models-llama_meta-llama-Llama-3.2-300M
  0.6B • Updated • 57
TokSuite
community
AI & ML interests
Tokenization, Robustness, LLMs
Organization Card
TokSuite is a collection of models and benchmarks designed to isolate and study the impact of tokenization on language model behavior across English, Chinese, Turkish, Italian, and Farsi, as well as STEM and mathematical text. It includes fourteen models that share the same architecture, training data, training budget, and initialization and differ only in their tokenizers, alongside a set of benchmarks that evaluate performance under real-world perturbations that affect tokenization.
Our code is available at https://github.com/r-three/Tokenizers.
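The repositories listed on this page appear to follow two naming patterns, so a small helper can build repository IDs from a tokenizer tag or a benchmark subset. This is a minimal sketch, not part of the TokSuite codebase: the prefixes are inferred from the listings here, the function names are hypothetical, and `load_benchmark` assumes the `datasets` library is installed and that the chosen dataset loads with `load_dataset` (some datasets may need a configuration name, passed here as the optional `config` argument).

```python
from typing import Optional

# Prefixes inferred from the repository names listed on this page (an assumption,
# not a documented naming contract).
MODEL_PREFIX = "toksuite/supertoken_models-llama_"
DATASET_PREFIX = "toksuite/tokenizer_robustness_completion_"


def model_repo_id(tokenizer_tag: str) -> str:
    """Build a model repository ID from a tokenizer tag such as 'gpt2'."""
    return MODEL_PREFIX + tokenizer_tag


def dataset_repo_id(subset: str) -> str:
    """Build a benchmark repository ID from a subset name such as 'math'."""
    return DATASET_PREFIX + subset


def load_benchmark(subset: str, config: Optional[str] = None):
    """Download one benchmark from the Hub (needs `datasets` and network access)."""
    # Imported lazily so the ID helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset(dataset_repo_id(subset), config)


if __name__ == "__main__":
    print(model_repo_id("gpt2"))
    print(dataset_repo_id("math"))
```

Calling `load_benchmark("math")` would then fetch the math robustness benchmark, provided the pattern above holds for that dataset.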
- toksuite/tokenizer_robustness_completion_stem
  Viewer • Updated • 614 • 200
- toksuite/tokenizer_robustness_completion_italian
  Viewer • Updated • 1.09k • 391
- toksuite/tokenizer_robustness_completion_english
  Viewer • Updated • 1.14k • 554
- toksuite/tokenizer_robustness_completion_math
  Viewer • Updated • 189 • 336
Models (20)
- toksuite/supertoken_models-llama_tiktoken-gpt-4o
  Text Generation • 2B • Updated • 52
- toksuite/supertoken_models-llama_CohereLabs-aya-expanse-8b
  Text Generation • 2B • Updated • 53
- toksuite/supertoken_models-llama_meta-llama-Llama-3.2-1B
  Text Generation • 2B • Updated • 94
- toksuite/supertoken_models-llama_google-gemma-2-2b
  Text Generation • 2B • Updated • 237
- toksuite/supertoken_models-llama_bigscience-bloom
  Text Generation • 2B • Updated • 91
- toksuite/supertoken_models-llama_gpt2
  Text Generation • 1B • Updated • 74
- toksuite/supertoken_models-llama_mistralai-tekken
  Text Generation • 2B • Updated • 42
- toksuite/supertoken_models-llama_tokenmonster-englishcode-32000-consistent-v1
  Text Generation • 1B • Updated • 42
- toksuite/supertoken_models-llama_Qwen-Qwen3-8B
  Text Generation • 2B • Updated • 36
- toksuite/supertoken_models-llama_google-bert-bert-base-multilingual-cased
  Text Generation • 2B • Updated • 54
Datasets (10)
- toksuite/tokenizer_robustness_completion_general
  Viewer • Updated • 68 • 73
- toksuite/tokenizer_robustness_completion_italian
  Viewer • Updated • 1.09k • 391
- toksuite/tokenizer_robustness_completion_farsi
  Viewer • Updated • 747 • 139
- toksuite/tokenizer_robustness_completion_turkish
  Viewer • Updated • 621 • 264
- toksuite/tokenizer_robustness_completion_math
  Viewer • Updated • 189 • 336
- toksuite/tokenizer_robustness_completion_chinese
  Viewer • Updated • 485 • 645
- toksuite/tokenizer_robustness_completion_stem
  Viewer • Updated • 614 • 200
- toksuite/tokenizer_robustness_completion_english
  Viewer • Updated • 1.14k • 554
- toksuite/toksuite_pretraining_data
  Viewer • Updated • 107M • 512
- toksuite/Qwen-Qwen3-8B-toksuite-detokenized
  Viewer • Updated • 28M • 355