# Atom Classifier

A multilingual token classifier for semantic hypergraph parsing. It classifies each token in a sentence into one of 39 semantic atom types and subtypes, serving as the first (alpha) stage of the Alpha-Beta semantic hypergraph parser.
## Model Details
- Architecture: DistilBertForTokenClassification
- Base model: distilbert-base-multilingual-cased
- Labels: 39 semantic atom types
- Max sequence length: 512
## Label Taxonomy

Atoms are typed according to the Semantic Hyperedge (SH) notation system. The 7 main types and their subtypes:
### Concepts (C)

| Label | Description |
|-------|-------------|
| C | Generic concept |
| Cc | Common noun |
| Cp | Proper noun |
| Ca | Adjective (as concept) |
| Ci | Pronoun |
| Cd | Determiner (as concept) |
| Cm | Nominal modifier |
| Cw | Interrogative word |
| C# | Number |
### Predicates (P)

| Label | Description |
|-------|-------------|
| P | Generic predicate |
| Pd | Declarative predicate |
| P! | Imperative predicate |
### Modifiers (M)

| Label | Description |
|-------|-------------|
| M | Generic modifier |
| Ma | Adjective modifier |
| Mc | Conceptual modifier |
| Md | Determiner modifier |
| Me | Adverbial modifier |
| Mi | Infinitive particle |
| Mj | Conjunctional modifier |
| Ml | Particle |
| Mm | Modal (auxiliary verb) |
| Mn | Negation |
| Mp | Possessive modifier |
| Ms | Superlative modifier |
| Mt | Prepositional modifier |
| Mv | Verbal modifier |
| Mw | Specifier |
| M# | Number modifier |
| M= | Comparative modifier |
| M^ | Degree modifier |
### Builders (B)

| Label | Description |
|-------|-------------|
| B | Generic builder |
| Bp | Possessive builder |
| Br | Relational builder (preposition) |
### Triggers (T)

| Label | Description |
|-------|-------------|
| T | Generic trigger |
| Tt | Temporal trigger |
| Tv | Verbal trigger |
### Conjunctions (J)

| Label | Description |
|-------|-------------|
| J | Generic conjunction |
| Jr | Relational conjunction |
### Special

| Label | Description |
|-------|-------------|
| X | Excluded token (punctuation, etc.) |
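Each label encodes its main type as its first character, with an optional subtype suffix (e.g. `Cp` is a `C` concept with subtype `p`). A minimal sketch of decomposing labels programmatically, using plain Python (the helper `split_label` is illustrative, not part of the model's API):

```python
# The 7 main atom types from the taxonomy above.
MAIN_TYPES = {"C", "P", "M", "B", "T", "J", "X"}

def split_label(label: str) -> tuple[str, str]:
    """Split an atom label into (main type, subtype suffix).

    The first character is the main type; the remainder (possibly
    empty) is the subtype, e.g. "Cp" -> ("C", "p"), "M#" -> ("M", "#").
    """
    main, subtype = label[0], label[1:]
    if main not in MAIN_TYPES:
        raise ValueError(f"unknown atom type: {label!r}")
    return main, subtype

print(split_label("Cp"))  # ('C', 'p')
print(split_label("M="))  # ('M', '=')
print(split_label("X"))   # ('X', '')
```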
## Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("hyperquest/atom-classifier")
model = AutoModelForTokenClassification.from_pretrained("hyperquest/atom-classifier")

sentence = "Berlin is the capital of Germany."
encoded = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = encoded.pop("offset_mapping")  # not a model input

with torch.no_grad():
    outputs = model(**encoded)
predictions = outputs.logits.argmax(-1)[0].tolist()

word_ids = encoded.word_ids(0)
for idx, word_id in enumerate(word_ids):
    if word_id is not None:  # skip special tokens ([CLS], [SEP])
        start, end = offset_mapping[0][idx].tolist()
        label = model.config.id2label[predictions[idx]]
        print(f"{sentence[start:end]:15s} -> {label}")
```
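Because the WordPiece tokenizer may split a word into several subword tokens, a word can receive multiple predictions. A common convention is to keep only the label of each word's first subword. A minimal sketch of that aggregation, using mock `word_ids` and label sequences rather than real model output (the helper name `first_subword_labels` is illustrative):

```python
def first_subword_labels(word_ids, labels):
    """Keep one label per word: the prediction on its first subword.

    word_ids: per-token word index (None for special tokens), as
              returned by BatchEncoding.word_ids().
    labels:   per-token predicted label strings.
    """
    word_labels = {}
    for word_id, label in zip(word_ids, labels):
        if word_id is not None and word_id not in word_labels:
            word_labels[word_id] = label
    return [word_labels[i] for i in sorted(word_labels)]

# Mock example: word 5 ("Germany") is split into two subwords,
# so word id 5 appears twice; only its first label is kept.
word_ids = [None, 0, 1, 2, 3, 4, 5, 5, 6, None]
labels = ["X", "Cp", "Pd", "Md", "Cc", "Br", "Cp", "Cp", "X", "X"]
print(first_subword_labels(word_ids, labels))
# ['Cp', 'Pd', 'Md', 'Cc', 'Br', 'Cp', 'X']
```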
## Intended Use

This model is designed as the first stage of the Alpha-Beta semantic hypergraph parser (hyperbase-parser-ab). It assigns atom types to tokens, which are then combined into nested hypergraph structures by a rule-based grammar in the beta stage.
## Part of