# 🎯 RoBERTa Clickbait Classifier

A clickbait detection model built on RoBERTa-base (125M parameters), fine-tuned on three public English clickbait datasets that were combined and deduplicated.
## 🚀 Quick Start
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="ENTUM-AI/roberta-clickbait-classifier")

# Clickbait
result = classifier("You Won't BELIEVE What This Celebrity Did Next!")
print(result)  # [{'label': 'Clickbait', 'score': 0.99...}]

# Non-clickbait
result = classifier("Federal Reserve raises interest rates by 0.25 percentage points")
print(result)  # [{'label': 'Non-Clickbait', 'score': 0.99...}]
```
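If you'd rather skip the `pipeline` helper, the same checkpoint can be loaded directly for finer control over tokenization and scores. A minimal sketch using the standard `transformers` classes (the printed label assumes the mapping given in the Model Details table below):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model directly instead of via the pipeline helper
tokenizer = AutoTokenizer.from_pretrained("ENTUM-AI/roberta-clickbait-classifier")
model = AutoModelForSequenceClassification.from_pretrained("ENTUM-AI/roberta-clickbait-classifier")

text = "You Won't BELIEVE What This Celebrity Did Next!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)[0]
pred = probs.argmax().item()
print(model.config.id2label[pred], probs[pred].item())
```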
## Model Details

| Property | Value |
|---|---|
| Architecture | RoBERTa-base (125M parameters) |
| Task | Binary text classification |
| Labels | Clickbait (1), Non-Clickbait (0) |
| Language | English |
| License | Apache 2.0 |
| Max input length | 128 tokens |
## 📊 Training Data
Three public English clickbait datasets, combined and deduplicated:
| Dataset | Description |
|---|---|
| christinacdl/Clickbait_New | 58.6K samples from multiple sources |
| marksverdhei/clickbait_title_classification | 32K samples (Chakraborty et al., ASONAM 2016) |
| contemmcm/clickbait | 26K samples |
After deduplication and balancing: ~48K samples (train/val/test split 85/10/5).
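The exact preprocessing pipeline isn't published in this card, so the snippet below is only a rough illustration of deduplication plus class balancing; the toy stand-in frames, the column names, and the majority-class downsampling strategy are all assumptions:

```python
import pandas as pd

# Toy stand-ins for the three source datasets; the real ones would be
# loaded from the Hugging Face Hub (column names here are assumptions)
sources = [
    pd.DataFrame({"text": ["You Won't BELIEVE This!", "Fed raises rates"],
                  "label": [1, 0]}),
    pd.DataFrame({"text": ["Fed raises rates", "10 Tricks Doctors Hate"],
                  "label": [0, 1]}),
]

combined = pd.concat(sources, ignore_index=True)

# Deduplicate on the text field so overlapping sources don't inflate counts
# or leak between splits
combined = combined.drop_duplicates(subset="text").reset_index(drop=True)

# Balance classes by downsampling every class to the minority-class size
n = combined["label"].value_counts().min()
balanced = pd.concat(
    [g.sample(n=n, random_state=42) for _, g in combined.groupby("label")]
).reset_index(drop=True)
print(balanced["label"].value_counts())
```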
## ⚙️ Training
Fine-tuned with the Hugging Face Trainer using the AdamW optimizer, a linear learning-rate schedule with warmup, and early stopping on validation F1 (see the sketch below).
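The actual hyperparameters aren't listed here, so the values below (learning rate, batch size, epochs, warmup ratio, patience) are placeholders, and a tiny toy dataset stands in for the real ~48K-sample corpus. The sketch only mirrors the described setup: `Trainer` (which defaults to AdamW) with a linear schedule, warmup, and early stopping tracked on F1:

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Toy stand-in for the real training corpus
raw = Dataset.from_dict({
    "text": ["You Won't BELIEVE This!", "Fed raises rates by 0.25 points"] * 8,
    "label": [1, 0] * 8,
})
ds = raw.map(
    lambda b: tokenizer(b["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"f1": f1_score(labels, np.argmax(logits, axis=-1))}

args = TrainingArguments(
    output_dir="clickbait-roberta",
    learning_rate=2e-5,              # placeholder
    per_device_train_batch_size=32,  # placeholder
    num_train_epochs=5,              # placeholder
    lr_scheduler_type="linear",      # linear decay after warmup
    warmup_ratio=0.1,                # placeholder warmup fraction
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",      # best checkpoint / early stopping track F1
)

trainer = Trainer(
    model=AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2),
    args=args,
    train_dataset=ds,
    eval_dataset=ds,  # toy: reuse the same tiny set for eval
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```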
## 💡 Use Cases
- News aggregators → filter low-quality clickbait articles (see the filtering sketch below)
- Social media → content moderation and feed quality scoring
- Browser extensions → warn users about clickbait headlines
- Email filters → detect clickbait-style subject lines
- Content platforms → automated content quality assessment
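Most of these reduce to scoring text and acting on a confidence threshold. A minimal sketch, where the 0.9 cutoff and the keep/drop policy are assumptions to tune per application:

```python
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="ENTUM-AI/roberta-clickbait-classifier")

headlines = [
    "You Won't BELIEVE What This Celebrity Did Next!",
    "Federal Reserve raises interest rates by 0.25 percentage points",
]

# Keep only headlines that are not confidently flagged as clickbait
THRESHOLD = 0.9  # assumed cutoff; tune per application
kept = [
    h for h, r in zip(headlines, classifier(headlines))
    if not (r["label"] == "Clickbait" and r["score"] >= THRESHOLD)
]
print(kept)
```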
## ⚠️ Limitations
- English only
- Optimized for short texts (headlines, titles, tweets); longer inputs are truncated to 128 tokens (see the truncation sketch below)
- Reflects patterns and biases present in the training data sources
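To see exactly how much of a long input survives the 128-token window, truncation can be made explicit with the tokenizer; a small sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ENTUM-AI/roberta-clickbait-classifier")

long_text = "This headline keeps going and going " * 50  # well past 128 tokens

enc = tokenizer(long_text, truncation=True, max_length=128)
print(len(enc["input_ids"]))  # 128: everything beyond this is dropped
print(tokenizer.decode(enc["input_ids"][-10:]))  # tail of what the model actually sees
```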