# BERTose and AFFINose Training Code
This repository is the public training and reproducibility companion for the BERTose and AFFINose Hugging Face release. It contains the code, configs, tokenizer assets, split files, and provenance notes needed to understand how the released checkpoints were trained and evaluated.
The released inference checkpoints live separately:

- BERTose glycan encoder: `supanthadey1/bertose-glycan-encoder`
- BERTose IAR resolver: `supanthadey1/bertose-iar-resolver`
- AFFINose interaction model: `supanthadey1/affinose-interaction-model`
- Cloud inference notebook: `supanthadey1/bertose-affinose-inference`
## Repository Map
| Public workflow | Files |
|---|---|
| BERTose multimodal pretraining | `code/training/train_multimodal.py`, `code/training/multimodal_dataset.py`, `code/training/multimodal_masking.py`, `configs/` |
| BERTose WURCS-BPE tokenizer training | `code/training/train_wurcs_bpe.py`, `code/model/wurcs_bpe_tokenizer.py`, `data/vocab/` |
| BERTose IAR / contrastive refinement | `code/contrastive/`, `code/contrastive_training/`, `code/tokenizer/generate_bpe_ambiguity.py` |
| AFFINose interaction training | `code/affinose/README.md` |
| Benchmark reproduction | `code/benchmarks/`, `code/downstream_tasks/utils/` |
| Embedding and biology probes | `code/probes/` |
| Split and vocabulary assets | `data/splits/`, `data/vocab/` |
| Compute and lineage provenance | `provenance/` |
Some executable filenames and directories retain historical names so that provenance remains traceable to the original training runs. In public-facing text, use BERTose for the glycan encoder / IAR workflows and AFFINose for the protein-glycan interaction workflow.
## What Is Included
- Core BERTose architecture and dataset utilities.
- BERTose multimodal pretraining entrypoint and configs.
- WURCS-BPE tokenizer training and ambiguity-token generation code.
- Contrastive refinement / IAR training and negative generation scripts.
- AFFINose interaction-model data construction, split generation, training, and inference code.
- Benchmark and probe scripts used for manuscript analyses.
- Vocabulary files, ambiguity-token maps, train/validation split metadata, and leakage-exclusion lists.
- Compute-provenance notes and representative launch scripts.
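The split metadata and leakage-exclusion lists can be sanity-checked with a few set operations. The sketch below is illustrative only: it assumes the splits reduce to plain lists of identifiers, and the function and toy IDs are hypothetical, not the repository's actual split format.

```python
# Hypothetical sanity check for a train/validation split plus a
# leakage-exclusion list. Identifier names are made up for illustration.

def check_split(train_ids, val_ids, excluded_ids):
    """Return any split problems: train/validation overlap, and excluded
    identifiers that leaked into either side."""
    train, val, excluded = set(train_ids), set(val_ids), set(excluded_ids)
    return {
        "train_val_overlap": train & val,
        "excluded_in_train": excluded & train,
        "excluded_in_val": excluded & val,
    }

# Toy example with made-up glycan IDs:
problems = check_split(
    train_ids=["G001", "G002", "G003"],
    val_ids=["G004", "G002"],   # G002 leaks from train
    excluded_ids=["G004"],      # G004 should appear nowhere
)
print(problems["train_val_overlap"])  # → {'G002'}
print(problems["excluded_in_val"])    # → {'G004'}
```

An empty set for every key means the split is internally consistent under this (simplified) view.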
## What Is Not Included
Large training artifacts are intentionally not bundled here:
- Full pretraining corpus pickles such as `sequences_bpe.pkl`.
- Multi-GB intermediate mapping files.
- Full training checkpoints; released checkpoints are hosted in the separate model repositories listed above.
- ESM-C protein embeddings required for AFFINose training; users should generate or provide those separately according to the ESM-C access rules.
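Since the ESM-C embeddings are not bundled, users must generate them and store them in a form the AFFINose pipeline can read. The layout below is purely hypothetical, using a single pickle keyed by protein ID with plain Python lists standing in for embedding vectors; consult `code/affinose/README.md` for the format the code actually expects.

```python
import pickle
from pathlib import Path

# Hypothetical cache layout: one pickle mapping protein IDs to embedding
# vectors (plain lists here; real ESM-C embeddings are float arrays).
# The format expected by the AFFINose code may differ.

def save_embeddings(cache_path, embeddings):
    """Write a {protein_id: vector} mapping to a pickle file."""
    path = Path(cache_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as fh:
        pickle.dump(embeddings, fh)

def load_embedding(cache_path, protein_id):
    """Load one protein's vector back from the cache."""
    with open(cache_path, "rb") as fh:
        return pickle.load(fh)[protein_id]

save_embeddings("embeddings/esmc_cache.pkl", {"P12345": [0.1, 0.2, 0.3]})
print(load_embedding("embeddings/esmc_cache.pkl", "P12345"))
```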
## Quick Import Check
After cloning or downloading the repository, the following lightweight import check should succeed without access to the full training data:
```bash
cd code
PYTHONDONTWRITEBYTECODE=1 python - <<'PY'
import importlib

for module in [
    "training.multimodal_dataset",
    "training.masking",
    "training.multimodal_masking",
    "training.train_multimodal",
    "training.train_wurcs_bpe",
    "model.dataset",
    "downstream_tasks.utils.tokenizer",
    "downstream_tasks.utils.wurcs_bpe_tokenizer",
    "downstream_tasks.utils.dataset",
]:
    importlib.import_module(module)
    print("ok", module)
PY
```
Full training requires the original large data artifacts and appropriate GPU resources.
## Environment
Install the core dependencies with:
```bash
pip install -r requirements.txt
```
Some probe scripts also use optional scientific plotting and glycan-analysis packages. See `requirements.txt` for the split between core and optional dependencies.
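Before running the probe scripts, it can help to check which optional packages are importable in your environment. The package names below are illustrative guesses, not the authoritative list; `requirements.txt` remains the source of truth.

```python
import importlib.util

def available(module_name):
    """Return True if a module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None

# Illustrative optional packages; see requirements.txt for the real split
# between core and optional dependencies.
for name in ["matplotlib", "seaborn", "glycowork"]:
    status = "available" if available(name) else "missing (optional)"
    print(f"{name}: {status}")
```

A "missing" entry only matters if you intend to run the probes that use that package; the core training code does not require the optional extras.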
## Reproducibility Notes
- `MANIFEST.md` documents the file provenance and model-lineage mapping.
- `RELEASE_AUDIT.md` documents the public-release state, fixed gaps, verification checks, and known scope limits.
- Cluster launch scripts under `provenance/compute_provenance/` preserve the original compute context. They are provenance records, not portable one-command launchers for every environment.
## License And Citation
This repository is released as a research reproducibility companion for the BERTose and AFFINose manuscript. The Hugging Face metadata uses `license: other` until the final publication license and citation are assigned. Please cite the associated manuscript once available.