# Atom Classifier

A multilingual token classifier for semantic hypergraph parsing. It classifies each token in a sentence into one of 39 semantic atom types and subtypes, serving as the first (alpha) stage of the Alpha-Beta semantic hypergraph parser.
## Model Details
- Architecture: DistilBertForTokenClassification
- Base model: distilbert-base-multilingual-cased
- Labels: 39 semantic atom types
- Max sequence length: 512
## Label Taxonomy

Atoms are typed according to the Semantic Hyperedge (SH) notation system. The 7 main types and their subtypes:
### Concepts (C)

| Label | Description |
|-------|-------------|
| C | Generic concept |
| Cc | Common noun |
| Cp | Proper noun |
| Ca | Adjective (as concept) |
| Ci | Pronoun |
| Cd | Determiner (as concept) |
| Cm | Nominal modifier |
| Cw | Interrogative word |
| C# | Number |
### Predicates (P)

| Label | Description |
|-------|-------------|
| P | Generic predicate |
| Pd | Declarative predicate |
| P! | Imperative predicate |
### Modifiers (M)

| Label | Description |
|-------|-------------|
| M | Generic modifier |
| Ma | Adjective modifier |
| Mc | Conceptual modifier |
| Md | Determiner modifier |
| Me | Adverbial modifier |
| Mi | Infinitive particle |
| Mj | Conjunctional modifier |
| Ml | Particle |
| Mm | Modal (auxiliary verb) |
| Mn | Negation |
| Mp | Possessive modifier |
| Ms | Superlative modifier |
| Mt | Prepositional modifier |
| Mv | Verbal modifier |
| Mw | Specifier |
| M# | Number modifier |
| M= | Comparative modifier |
| M^ | Degree modifier |
### Builders (B)

| Label | Description |
|-------|-------------|
| B | Generic builder |
| Bp | Possessive builder |
| Br | Relational builder (preposition) |
### Triggers (T)

| Label | Description |
|-------|-------------|
| T | Generic trigger |
| Tt | Temporal trigger |
| Tv | Verbal trigger |
### Conjunctions (J)

| Label | Description |
|-------|-------------|
| J | Generic conjunction |
| Jr | Relational conjunction |
### Special

| Label | Description |
|-------|-------------|
| X | Excluded token (punctuation, etc.) |
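Each label encodes its main type as its first character, with an optional subtype suffix (e.g. `Cp` is a `C` concept with subtype `p`). A minimal sketch of decomposing labels programmatically, using plain Python (the helper `split_label` is illustrative, not part of the model's API):

```python
# The 7 main atom types from the taxonomy above.
MAIN_TYPES = {"C", "P", "M", "B", "T", "J", "X"}

def split_label(label: str) -> tuple[str, str]:
    """Split an atom label into (main type, subtype suffix).

    The first character is the main type; the remainder (possibly
    empty) is the subtype, e.g. "Cp" -> ("C", "p"), "M#" -> ("M", "#").
    """
    main, subtype = label[0], label[1:]
    if main not in MAIN_TYPES:
        raise ValueError(f"unknown atom type: {label!r}")
    return main, subtype

print(split_label("Cp"))  # ('C', 'p')
print(split_label("M="))  # ('M', '=')
print(split_label("X"))   # ('X', '')
```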
## Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("hyperquest/atom-classifier")
model = AutoModelForTokenClassification.from_pretrained("hyperquest/atom-classifier")

sentence = "Berlin is the capital of Germany."
encoded = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = encoded.pop("offset_mapping")  # not a model input

with torch.no_grad():
    outputs = model(**encoded)
predictions = outputs.logits.argmax(-1)[0].tolist()

word_ids = encoded.word_ids(0)
for idx, word_id in enumerate(word_ids):
    if word_id is not None:  # skip special tokens ([CLS], [SEP])
        start, end = offset_mapping[0][idx].tolist()
        label = model.config.id2label[predictions[idx]]
        print(f"{sentence[start:end]:15s} -> {label}")
```
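Because the WordPiece tokenizer may split a word into several subword tokens, a word can receive multiple predictions. A common convention is to keep only the label of each word's first subword. A minimal sketch of that aggregation, using mock `word_ids` and label sequences rather than real model output (the helper name `first_subword_labels` is illustrative):

```python
def first_subword_labels(word_ids, labels):
    """Keep one label per word: the prediction on its first subword.

    word_ids: per-token word index (None for special tokens), as
              returned by BatchEncoding.word_ids().
    labels:   per-token predicted label strings.
    """
    word_labels = {}
    for word_id, label in zip(word_ids, labels):
        if word_id is not None and word_id not in word_labels:
            word_labels[word_id] = label
    return [word_labels[i] for i in sorted(word_labels)]

# Mock example: word 5 ("Germany") is split into two subwords,
# so word id 5 appears twice; only its first label is kept.
word_ids = [None, 0, 1, 2, 3, 4, 5, 5, 6, None]
labels = ["X", "Cp", "Pd", "Md", "Cc", "Br", "Cp", "Cp", "X", "X"]
print(first_subword_labels(word_ids, labels))
# ['Cp', 'Pd', 'Md', 'Cc', 'Br', 'Cp', 'X']
```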
## Intended Use

This model is designed as the first stage of the Alpha-Beta semantic hypergraph parser (hyperbase-parser-ab). It assigns atom types to tokens, which are then combined into nested hypergraph structures by a rule-based grammar in the beta stage.
## Part of