stefan-it committed
Commit b2cebd5 · verified · 1 Parent(s): 567a7d4

docs: add more sections, incl. reference to original GitHub project

Files changed (1)
  1. README.md +20 -1
README.md CHANGED
@@ -9,4 +9,23 @@ pinned: false
 
  # 🇩🇪 German Tokenizer Benchmark
 
- Useful resources for building a Tokenizer Benchmark for German.
+ A curated collection of German datasets for comprehensive tokenizer evaluation across diverse domains and text types.
+
+ This organization hosts datasets used by the [GerTokEval](https://github.com/stefan-it/german-tokenizer-benchmark) framework to evaluate German tokenizers with standardized metrics and fairness analysis.
+
+ ## 🔎 Datasets
+
+ The following datasets are currently supported in the main framework:
+
+ * [GermEval 2014](https://huggingface.co/datasets/german-tokenizer-benchmark/germeval14)
+ * [BIOfid](https://huggingface.co/datasets/german-tokenizer-benchmark/biofid)
+ * [CO-Fun](https://huggingface.co/datasets/german-tokenizer-benchmark/co-funer)
+ * [German LER](https://huggingface.co/datasets/german-tokenizer-benchmark/german-ler)
+ * [DFKI MobIE](https://huggingface.co/datasets/german-tokenizer-benchmark/mobie)
+ * [UD HDT](https://huggingface.co/datasets/german-tokenizer-benchmark/ud-hdt)
+
+ The main goal for choosing these datasets is to evaluate tokenizers on a broad range of domains.
+
+ # ❤️ Acknowledgements
+
+ Many thanks to [Clara Meister](https://github.com/cimeister) for releasing the amazing [TokEval framework](https://github.com/cimeister/tokenizer-analysis-suite)!
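
The dataset links added in this commit all follow the pattern `https://huggingface.co/datasets/german-tokenizer-benchmark/<name>`, so each one corresponds to a Hub repository ID of the form `german-tokenizer-benchmark/<name>`. The following is a minimal sketch of how those IDs might be collected and loaded. It assumes the third-party `datasets` library is installed; the helper names `repo_id` and `load_benchmark_dataset` are hypothetical and not part of GerTokEval, and the README does not say whether a given dataset needs a config name or a specific split.

```python
# Hub organization and dataset slugs, copied from the URLs in the README diff.
ORG = "german-tokenizer-benchmark"
DATASET_SLUGS = {
    "GermEval 2014": "germeval14",
    "BIOfid": "biofid",
    "CO-Fun": "co-funer",
    "German LER": "german-ler",
    "DFKI MobIE": "mobie",
    "UD HDT": "ud-hdt",
}


def repo_id(name: str) -> str:
    """Return the Hub repository ID for a dataset name (hypothetical helper)."""
    return f"{ORG}/{DATASET_SLUGS[name]}"


def load_benchmark_dataset(name: str, **kwargs):
    """Load one dataset with datasets.load_dataset (hypothetical helper).

    Assumption: the repository loads with default settings. If it does not,
    pass a config name or split through kwargs.
    """
    # Imported lazily so the ID helpers work without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id(name), **kwargs)


if __name__ == "__main__":
    # Print the repository ID for every dataset listed in the README.
    for name in DATASET_SLUGS:
        print(f"{name}: {repo_id(name)}")
```

Calling `load_benchmark_dataset("BIOfid")` would then request `german-tokenizer-benchmark/biofid` from the Hub, which requires network access.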