---
license: apache-2.0
language:
- multilingual
- en
- de
- fr
- es
- pt
- nl
base_model: distilbert-base-multilingual-cased
tags:
- token-classification
- semantic-parsing
- hypergraph
- nlp
pipeline_tag: token-classification
library_name: transformers
---

# Atom Classifier

A multilingual token classifier for **semantic hypergraph parsing**. It classifies each token in a sentence into one of 39 semantic atom types/subtypes, serving as the first stage (alpha) of the [Alpha-Beta semantic hypergraph parser](https://github.com/hyperquest-hq/hyperbase-parser-ab).

## Model Details

- **Architecture:** DistilBertForTokenClassification
- **Base model:** distilbert-base-multilingual-cased
- **Labels:** 39 semantic atom types
- **Max sequence length:** 512

## Label Taxonomy

Atoms are typed according to the [Semantic Hyperedge (SH) notation system](https://hyperquest.ai/hyperbase/manual/notation/). The 7 main types and their subtypes:

### Concepts (C)

| Label | Description |
|-------|-------------|
| `C` | Generic concept |
| `Cc` | Common noun |
| `Cp` | Proper noun |
| `Ca` | Adjective (as concept) |
| `Ci` | Pronoun |
| `Cd` | Determiner (as concept) |
| `Cm` | Nominal modifier |
| `Cw` | Interrogative word |
| `C#` | Number |

### Predicates (P)

| Label | Description |
|-------|-------------|
| `P` | Generic predicate |
| `Pd` | Declarative predicate |
| `P!` | Imperative predicate |

### Modifiers (M)

| Label | Description |
|-------|-------------|
| `M` | Generic modifier |
| `Ma` | Adjective modifier |
| `Mc` | Conceptual modifier |
| `Md` | Determiner modifier |
| `Me` | Adverbial modifier |
| `Mi` | Infinitive particle |
| `Mj` | Conjunctional modifier |
| `Ml` | Particle |
| `Mm` | Modal (auxiliary verb) |
| `Mn` | Negation |
| `Mp` | Possessive modifier |
| `Ms` | Superlative modifier |
| `Mt` | Prepositional modifier |
| `Mv` | Verbal modifier |
| `Mw` | Specifier |
| `M#` | Number modifier |
| `M=` | Comparative modifier |
| `M^` | Degree modifier |

### Builders (B)

| Label | Description |
|-------|-------------|
| `B` | Generic builder |
| `Bp` | Possessive builder |
| `Br` | Relational builder (preposition) |

### Triggers (T)

| Label | Description |
|-------|-------------|
| `T` | Generic trigger |
| `Tt` | Temporal trigger |
| `Tv` | Verbal trigger |

### Conjunctions (J)

| Label | Description |
|-------|-------------|
| `J` | Generic conjunction |
| `Jr` | Relational conjunction |

### Special

| Label | Description |
|-------|-------------|
| `X` | Excluded token (punctuation, etc.) |

## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("hyperquest/atom-classifier")
model = AutoModelForTokenClassification.from_pretrained("hyperquest/atom-classifier")

sentence = "Berlin is the capital of Germany."
encoded = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
# offset_mapping is not a model input, so remove it before the forward pass
offset_mapping = encoded.pop("offset_mapping")

with torch.no_grad():
    outputs = model(**encoded)

# Highest-scoring label id for each token of the single input sentence
predictions = outputs.logits.argmax(-1)[0].tolist()
word_ids = encoded.word_ids(0)

for idx, word_id in enumerate(word_ids):
    # word_id is None for special tokens such as [CLS] and [SEP]
    if word_id is not None:
        start, end = offset_mapping[0][idx].tolist()
        label = model.config.id2label[predictions[idx]]
        print(f"{sentence[start:end]:15s} -> {label}")
```

## Intended Use

This model is designed to be used as the first stage of the Alpha-Beta semantic hypergraph parser (`hyperbase-parser-ab`). It assigns atom types to tokens, which are then combined into nested hypergraph structures by a rule-based grammar in the beta stage.

## Part of

- [hyperbase](https://github.com/hyperquest-hq/hyperbase) -- Semantic Hypergraph toolkit
- [hyperbase-parser-ab](https://github.com/hyperquest-hq/hyperbase-parser-ab) -- Alpha-Beta parser
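
## Example: word-level labels

The snippet in the Usage section prints one label per subword piece, so a word that the tokenizer splits into several pieces appears on several lines. The sketch below shows one possible way to collapse those pieces into a single label per word by keeping the label of the first piece. This is a common convention for token classifiers, not necessarily what `hyperbase-parser-ab` does internally. The helper name `word_level_labels` is hypothetical, and the word ids, offsets, and labels in the demo are hand-written for illustration; they are not actual model output.

```python
from typing import List, Optional, Tuple


def word_level_labels(
    sentence: str,
    word_ids: List[Optional[int]],
    offsets: List[Tuple[int, int]],
    labels: List[str],
) -> List[Tuple[str, str]]:
    """Collapse per-subword labels into one (word, label) pair per word.

    Assumption: the label of the first subword piece represents the word.
    """
    words = {}  # word_id -> [start_char, end_char, label]
    for word_id, (start, end), label in zip(word_ids, offsets, labels):
        if word_id is None:
            # Special tokens ([CLS], [SEP]) do not belong to any word
            continue
        if word_id not in words:
            # First piece of the word: record its span and keep its label
            words[word_id] = [start, end, label]
        else:
            # Later piece: only extend the character span
            words[word_id][1] = end
    return [
        (sentence[start:end], label)
        for start, end, label in (words[k] for k in sorted(words))
    ]


# Hand-written illustrative inputs (NOT real model output). "Germany" is
# deliberately split into two pieces to show the merge.
sentence = "Berlin is the capital of Germany."
word_ids = [None, 0, 1, 2, 3, 4, 5, 5, 6, None]
offsets = [(0, 0), (0, 6), (7, 9), (10, 13), (14, 21),
           (22, 24), (25, 28), (28, 32), (32, 33), (0, 0)]
labels = ["X", "Cp", "Pd", "Md", "Cc", "Br", "Cp", "Cp", "X", "X"]

result = word_level_labels(sentence, word_ids, offsets, labels)
for word, label in result:
    print(f"{word:15s} -> {label}")
```

With the real model, `word_ids` would come from `encoded.word_ids(0)`, `offsets` from `offset_mapping[0].tolist()`, and `labels` from mapping each prediction through `model.config.id2label`, as in the Usage snippet.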