# BERTose and AFFINose Training Code
This repository is the public training and reproducibility companion for the BERTose and AFFINose Hugging Face release. It contains the code, configs, tokenizer assets, split files, and provenance notes needed to understand how the released checkpoints were trained and evaluated.
The released inference checkpoints live separately:

- BERTose glycan encoder: `supanthadey1/bertose-glycan-encoder`
- BERTose IAR resolver: `supanthadey1/bertose-iar-resolver`
- AFFINose interaction model: `supanthadey1/affinose-interaction-model`
- Cloud inference notebook: `supanthadey1/bertose-affinose-inference`
## Repository Map
| Public workflow | Files |
|---|---|
| BERTose multimodal pretraining | `code/training/train_multimodal.py`, `code/training/multimodal_dataset.py`, `code/training/multimodal_masking.py`, `configs/` |
| BERTose WURCS-BPE tokenizer training | `code/training/train_wurcs_bpe.py`, `code/model/wurcs_bpe_tokenizer.py`, `data/vocab/` |
| BERTose IAR / contrastive refinement | `code/contrastive/`, `code/contrastive_training/`, `code/tokenizer/generate_bpe_ambiguity.py` |
| AFFINose interaction training | `code/affinose/README.md` |
| Benchmark reproduction | `code/benchmarks/`, `code/downstream_tasks/utils/` |
| Embedding and biology probes | `code/probes/` |
| Split and vocabulary assets | `data/splits/`, `data/vocab/` |
| Compute and lineage provenance | `provenance/` |
Some executable filenames and directories retain historical names so that provenance remains traceable to the original training runs. In public-facing text, use BERTose for the glycan encoder / IAR workflows and AFFINose for the protein-glycan interaction workflow.
## What Is Included
- Core BERTose architecture and dataset utilities.
- BERTose multimodal pretraining entrypoint and configs.
- WURCS-BPE tokenizer training and ambiguity-token generation code.
- Contrastive refinement / IAR training and negative generation scripts.
- AFFINose interaction-model data construction, split generation, training, and inference code.
- Benchmark and probe scripts used for manuscript analyses.
- Vocabulary files, ambiguity-token maps, train/validation split metadata, and leakage-exclusion lists.
- Compute-provenance notes and representative launch scripts.
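The split metadata and leakage-exclusion lists can be sanity-checked with a few set operations. The sketch below is illustrative only: it assumes the splits reduce to plain lists of identifiers, and the function and toy IDs are hypothetical, not the repository's actual split format.

```python
# Hypothetical sanity check for a train/validation split plus a
# leakage-exclusion list. Identifier names are made up for illustration.

def check_split(train_ids, val_ids, excluded_ids):
    """Return any split problems: train/validation overlap, and excluded
    identifiers that leaked into either side."""
    train, val, excluded = set(train_ids), set(val_ids), set(excluded_ids)
    return {
        "train_val_overlap": train & val,
        "excluded_in_train": excluded & train,
        "excluded_in_val": excluded & val,
    }

# Toy example with made-up glycan IDs:
problems = check_split(
    train_ids=["G001", "G002", "G003"],
    val_ids=["G004", "G002"],   # G002 leaks from train
    excluded_ids=["G004"],      # G004 should appear nowhere
)
print(problems["train_val_overlap"])  # → {'G002'}
print(problems["excluded_in_val"])    # → {'G004'}
```

An empty set for every key means the split is internally consistent under this (simplified) view.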
## What Is Not Included
Large training artifacts are intentionally not bundled here:
- Full pretraining corpus pickles such as `sequences_bpe.pkl`.
- Multi-GB intermediate mapping files.
- Full training checkpoints; released checkpoints are hosted in the separate model repositories listed above.
- ESM-C protein embeddings required for AFFINose training; users should generate or provide those separately according to the ESM-C access rules.
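Since the ESM-C embeddings are not bundled, users must generate them and store them in a form the AFFINose pipeline can read. The layout below is purely hypothetical, using a single pickle keyed by protein ID with plain Python lists standing in for embedding vectors; consult `code/affinose/README.md` for the format the code actually expects.

```python
import pickle
from pathlib import Path

# Hypothetical cache layout: one pickle mapping protein IDs to embedding
# vectors (plain lists here; real ESM-C embeddings are float arrays).
# The format expected by the AFFINose code may differ.

def save_embeddings(cache_path, embeddings):
    """Write a {protein_id: vector} mapping to a pickle file."""
    path = Path(cache_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as fh:
        pickle.dump(embeddings, fh)

def load_embedding(cache_path, protein_id):
    """Load one protein's vector back from the cache."""
    with open(cache_path, "rb") as fh:
        return pickle.load(fh)[protein_id]

save_embeddings("embeddings/esmc_cache.pkl", {"P12345": [0.1, 0.2, 0.3]})
print(load_embedding("embeddings/esmc_cache.pkl", "P12345"))
```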
## Quick Import Check
After cloning or downloading the repository, the following lightweight import check should succeed without access to the full training data:
```bash
cd code
PYTHONDONTWRITEBYTECODE=1 python - <<'PY'
import importlib

for module in [
    "training.multimodal_dataset",
    "training.masking",
    "training.multimodal_masking",
    "training.train_multimodal",
    "training.train_wurcs_bpe",
    "model.dataset",
    "downstream_tasks.utils.tokenizer",
    "downstream_tasks.utils.wurcs_bpe_tokenizer",
    "downstream_tasks.utils.dataset",
]:
    importlib.import_module(module)
    print("ok", module)
PY
```
Full training requires the original large data artifacts and appropriate GPU resources.
## Environment
Install the core dependencies with:
```bash
pip install -r requirements.txt
```
Some probe scripts also use optional scientific plotting and glycan-analysis packages. See `requirements.txt` for the split between core and optional dependencies.
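Before running the probe scripts, it can help to check which optional packages are importable in your environment. The package names below are illustrative guesses, not the authoritative list; `requirements.txt` remains the source of truth.

```python
import importlib.util

def available(module_name):
    """Return True if a module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None

# Illustrative optional packages; see requirements.txt for the real split
# between core and optional dependencies.
for name in ["matplotlib", "seaborn", "glycowork"]:
    status = "available" if available(name) else "missing (optional)"
    print(f"{name}: {status}")
```

A "missing" entry only matters if you intend to run the probes that use that package; the core training code does not require the optional extras.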
## Reproducibility Notes
- `MANIFEST.md` documents the file provenance and model-lineage mapping.
- `RELEASE_AUDIT.md` documents the public-release state, fixed gaps, verification checks, and known scope limits.
- Cluster launch scripts under `provenance/compute_provenance/` preserve the original compute context. They are provenance records, not portable one-command launchers for every environment.
## License And Citation
This repository is released as a research reproducibility companion for the BERTose and AFFINose manuscript. The Hugging Face metadata uses `license: other` until the final publication license and citation are assigned. Please cite the associated manuscript once available.