# Math Question Classifier - Quick Start Guide

## Execution Order

### Setup (Blocks 1-7)
**Run once to set up the environment and define classes**

1. **Block 1**: Install packages
2. **Block 2**: Import libraries  
3. **Block 3**: Set data path
4. **Block 4**: Convert JSON to Parquet (one-time data preparation)
5. **Block 5**: Define MathDatasetLoader class
6. **Block 6**: Define MathFeatureExtractor class
7. **Block 7**: Define MathQuestionClassifier class

### Training & Evaluation (Blocks 8-13)
**Run to train and evaluate models**

8. **Block 8**: Load dataset from Parquet files
9. **Block 9**: Extract features (text preprocessing + math symbols + numeric)
10. **Block 10**: Vectorize features (TF-IDF + scaling)
11. **Block 11**: Train 5 models and compare performance
12. **Block 12**: Detailed evaluation of best model
13. **Block 13**: Complete test set analysis with 6 visualizations

---

## What Each Block Does

### Blocks 1-3: Environment Setup
- Installs scikit-learn, pandas, matplotlib, seaborn, nltk
- Imports all necessary libraries
- Sets path to data directory (`./math`)

### Block 4: Data Consolidation
**Purpose**: Convert JSON files to Parquet format
- **Input**: `./math/train/` and `./math/test/` folders with JSON files
- **Output**: `train.parquet` and `test.parquet`
- **Benefit**: Typically 10-100x faster loading than reading the individual JSON files
- **Run**: Only once (skip if Parquet files already exist)
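
A minimal sketch of what this consolidation step could look like, assuming each JSON file holds a single object and that the files sit somewhere under the split folder. The function names `collect_json_records` and `json_to_parquet` and the `source_dir` column are hypothetical, not taken from the notebook, and `DataFrame.to_parquet` needs `pyarrow` or `fastparquet` installed.

```python
import json
from pathlib import Path

import pandas as pd


def collect_json_records(split_dir):
    """Read every *.json file under split_dir into a list of dicts."""
    records = []
    for path in sorted(Path(split_dir).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            record = json.load(f)  # assumption: one JSON object per file
        # Keep the parent folder name in case it encodes the topic (assumption).
        record["source_dir"] = path.parent.name
        records.append(record)
    return records


def json_to_parquet(split_dir, out_path):
    """Consolidate one split into a single Parquet file."""
    df = pd.DataFrame(collect_json_records(split_dir))
    df.to_parquet(out_path, index=False)
    return df


# Example usage (skipped when the Parquet file already exists):
# for split in ("train", "test"):
#     out = Path(f"{split}.parquet")
#     if not out.exists():
#         json_to_parquet(Path("./math") / split, out)
```

Checking for an existing Parquet file before converting keeps the step safe to re-run.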

### Blocks 5-7: Class Definitions
Define three main classes:
- **MathDatasetLoader**: Loads Parquet files, shows statistics
- **MathFeatureExtractor**: Cleans LaTeX, extracts math symbols, preprocesses text
- **MathQuestionClassifier**: Trains models, evaluates performance

### Block 8: Load Data
- Loads `train.parquet` and `test.parquet`
- Shows class distribution for train and test sets
- Displays 2 bar charts (train/test distribution)

### Block 9: Feature Extraction
Extracts three types of features:
1. **Text features**: Preprocessed text (LaTeX cleaning, lemmatization)
2. **Math symbol features**: 10 binary indicators (has_fraction, has_sqrt, etc.)
3. **Numeric features**: 5 statistical measures (num_count, avg_number, etc.)
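
To make the symbol and numeric features concrete, here is a small regex-based sketch. Only the names `has_fraction`, `has_sqrt`, `num_count`, and `avg_number` come from this guide; the patterns themselves, the extra indicators, and the helper function names are illustrative assumptions and may differ from what `MathFeatureExtractor` actually does.

```python
import re
from statistics import mean

# Hypothetical patterns; the real extractor defines 10 indicators.
SYMBOL_PATTERNS = {
    "has_fraction": r"\\frac|\d\s*/\s*\d",
    "has_sqrt": r"\\sqrt",
    "has_exponent": r"\^",
    "has_equation": r"=",
}

# Unsigned integers and decimals; signs are ignored in this sketch.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def math_symbol_features(text):
    """Return a 0/1 indicator for each symbol pattern found in the text."""
    return {name: int(bool(re.search(pattern, text)))
            for name, pattern in SYMBOL_PATTERNS.items()}


def numeric_features(text):
    """Return simple statistics over the numbers that appear in the text."""
    numbers = [float(n) for n in NUMBER_RE.findall(text)]
    return {
        "num_count": len(numbers),
        "avg_number": mean(numbers) if numbers else 0.0,
    }


problem = r"Simplify $\frac{3}{4} + \sqrt{16}$."
print(math_symbol_features(problem))
print(numeric_features(problem))
```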

### Block 10: Vectorization
- Creates TF-IDF features (up to 5000 dimensions; unigrams, bigrams, and trigrams)
- Scales additional features to [0,1] using MinMaxScaler
- **Critical**: Fits ONLY on training data (prevents data leakage)
- Converts to CSR format for efficient operations
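
The steps above can be sketched as follows. The toy texts and the three-column `train_extra` / `test_extra` arrays are hypothetical stand-ins for the Block 9 outputs, and the variable names are illustrative rather than the notebook's own.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for the Block 9 outputs (hypothetical data).
train_text = [
    "solve for x in the equation",
    "find the area of the triangle",
    "how many ways to choose",
]
test_text = ["find x in the triangle"]
train_extra = np.array([[1, 0, 2.0], [0, 1, 3.0], [0, 0, 5.0]])
test_extra = np.array([[1, 1, 4.0]])

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
scaler = MinMaxScaler()

# Fit on training data only, then reuse the fitted objects on the test data.
X_train_tfidf = vectorizer.fit_transform(train_text)
X_test_tfidf = vectorizer.transform(test_text)
train_extra_scaled = scaler.fit_transform(train_extra)
test_extra_scaled = scaler.transform(test_extra)

# Stacking may yield a non-CSR sparse format; .tocsr() guarantees row slicing works.
X_train = hstack([X_train_tfidf, csr_matrix(train_extra_scaled)]).tocsr()
X_test = hstack([X_test_tfidf, csr_matrix(test_extra_scaled)]).tocsr()

print(X_train.shape, X_test.shape)
```

Because the scaler is fitted on the training set alone, a test value outside the training range can land slightly outside [0,1]; `MinMaxScaler(clip=True)` is one way to guard against that if a downstream model requires non-negative inputs.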

### Block 11: Model Training
Trains 5 optimized models:
1. **Naive Bayes** (baseline)
2. **Logistic Regression** (linear classifier)
3. **SVM** (maximum margin)
4. **Random Forest** (ensemble)
5. **Gradient Boosting** (sequential ensemble)

**Output**:
- Comparison table with Accuracy, F1-Score, Training Time
- 2 bar charts comparing performance and speed
- Selects best model automatically
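
A rough sketch of such a comparison loop, run on synthetic data so it finishes in seconds. Several details here are assumptions rather than facts about the notebook: `LinearSVC` as the SVM variant, weighted F1 as the averaging mode, selecting the best model by F1, and the reduced `n_estimators` values.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the Block 10 feature matrices (hypothetical data).
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)

# MultinomialNB needs non-negative inputs, so scale to [0,1] (fit on train only).
scaler = MinMaxScaler(clip=True).fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(
        C=1.0, max_iter=1000, class_weight="balanced"),
    "SVM": LinearSVC(C=1.0, class_weight="balanced"),
    "Random Forest": RandomForestClassifier(
        n_estimators=50, max_depth=30, class_weight="balanced", random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=50, subsample=0.8, random_state=0),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred, average="weighted"),
        "train_time_s": elapsed,
    }

# Pick the model with the highest F1-score.
best_name = max(results, key=lambda n: results[n]["f1"])
for name, metrics in results.items():
    print(f"{name:20s} acc={metrics['accuracy']:.3f} "
          f"f1={metrics['f1']:.3f} time={metrics['train_time_s']:.2f}s")
print("Best:", best_name)
```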

### Block 12: Detailed Evaluation
- Confusion matrix visualization
- Classification report (precision, recall, F1 per class)
- Feature importance (for tree-based models)

### Block 13: Complete Analysis
**Comprehensive evaluation on entire test set**

**6 Visualizations**:
1. Confusion Matrix (absolute counts)
2. Normalized Confusion Matrix (proportions)
3. F1-Score by Topic (horizontal bar chart)
4. Precision vs Recall (scatter plot, size = support)
5. Test Set Distribution (bar chart)
6. Confidence Distribution (histogram: correct vs incorrect)

**Analysis Sections**:
- Overall performance (accuracy, F1-score)
- Per-class metrics table
- Confusion pair analysis
- Summary statistics
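
As an illustration of the normalized confusion matrix and the confusion pair analysis, the sketch below ranks the off-diagonal cells of a confusion matrix. The labels and predictions are made-up toy values, and the variable names are hypothetical; the notebook's own analysis may be organized differently.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and predictions (hypothetical data).
labels = ["algebra", "intermediate_algebra", "prealgebra"]
y_true = ["algebra", "algebra", "algebra",
          "intermediate_algebra", "intermediate_algebra",
          "prealgebra", "prealgebra"]
y_pred = ["algebra", "intermediate_algebra", "intermediate_algebra",
          "intermediate_algebra", "algebra",
          "prealgebra", "algebra"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Row-normalize so each row shows proportions of that true class.
# This assumes every class appears at least once in y_true.
cm_norm = cm / cm.sum(axis=1, keepdims=True)

# Zero the diagonal so only misclassifications remain, then rank them.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
pairs = [(labels[i], labels[j], int(off_diag[i, j]))
         for i, j in zip(*np.nonzero(off_diag))]
pairs.sort(key=lambda p: p[2], reverse=True)

for true_label, pred_label, count in pairs:
    print(f"{true_label} predicted as {pred_label}: {count}")
```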

---

## Expected Results

### Model Performance (F1-Score)
- **Gradient Boosting**: 86-90%
- **Logistic Regression**: 85-89%
- **SVM**: 84-88%
- **Naive Bayes**: 78-82%
- **Random Forest**: 75-82% (expected to underperform on sparse features)

### Training Time
- **Naive Bayes**: ~10 seconds
- **Logistic Regression**: ~30 seconds
- **SVM**: ~2 minutes
- **Random Forest**: ~3 minutes
- **Gradient Boosting**: ~5 minutes

### Per-Topic Performance
**High Performance** (F1 > 90%):
- counting_and_probability
- number_theory

**Medium Performance** (F1: 85-90%):
- geometry
- precalculus

**Challenging** (F1: 80-85%), mainly because of confusion between similar topics:
- algebra ↔ intermediate_algebra (similar concepts)
- prealgebra ↔ algebra (overlapping operations)

---

## Key Design Decisions

### 1. Data Leakage Prevention
**Critical**: TF-IDF vectorizer fitted ONLY on training data
```
Train/Test Split → Fit Vectorizer on Train → Transform Both
```
Without this, vocabulary and IDF statistics from the test set leak into training, inflating measured performance by an estimated 1-3%.

### 2. Feature Engineering
**Hybrid approach**:
- TF-IDF (5000 features): Captures text content
- Math symbols (10 features): Topic indicators (e.g., integrals → calculus)
- Numeric features (5 features): Statistical properties

**Why no hand-crafted keywords?**
Avoided topic-specific keyword lists to prevent heuristic bias. Let the model learn discriminative vocabulary from data.

### 3. Hyperparameter Optimization
All models use optimized parameters:
- **C=1.0** (SVM/Logistic): Balanced regularization
- **max_depth=30** (Random Forest): Sufficient complexity
- **subsample=0.8** (Gradient Boosting): Stochastic sampling prevents overfitting

### 4. Class Imbalance Handling
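`class_weight='balanced'` automatically sets class weights inversely proportional to class frequencies. It applies only to estimators that accept `class_weight` (Logistic Regression, SVM, Random Forest); scikit-learn's Naive Bayes classifiers and `GradientBoostingClassifier` do not take this parameter.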
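
For reference, scikit-learn documents the balanced weight of a class as `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces that formula in plain Python on a made-up label list; the helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Compute n_samples / (n_classes * class_count) for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Toy label distribution (hypothetical): 6 / 3 / 1 samples per class.
y = ["algebra"] * 6 + ["geometry"] * 3 + ["number_theory"] * 1
weights = balanced_class_weights(y)
for cls, weight in weights.items():
    print(f"{cls}: {weight:.3f}")
```

Rare classes receive larger weights, so their errors count more in the training loss.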

---

## Methodology

### Problem Type
**Supervised Multi-Class Text Classification**

**Why Classification (not Clustering)?**
- Categories are predefined and labeled
- Objective: Assign to known subtopic
- Not discovering latent groups
- Supervised learning with known labels

### Pipeline
```
JSON Files
    ↓
Parquet Conversion (Block 4)
    ↓
Feature Extraction (Block 9)
    ↓
TF-IDF Vectorization (Block 10)
    ↓
Model Training (Block 11)
    ↓
Evaluation (Blocks 12-13)
```

### Feature Vector
```
Total: 5015 dimensions
├── TF-IDF: 5000 (unigrams, bigrams, trigrams)
├── Math Symbols: 10 (binary indicators)
└── Numeric: 5 (scaled to [0,1])
```

---

## Troubleshooting

### "No data loaded"
**Solution**: Check data path in Block 3
```python
DATA_PATH = './math'  # Adjust to your path
```

### "NameError: name 'results' is not defined"
**Solution**: Run the blocks in order. Blocks 12-13 use the `results` variable created in Block 11.

### "ValueError: Negative values"
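**Solution**: This error usually comes from Naive Bayes, which rejects negative feature values. Make sure Block 10 ran to completion so that MinMaxScaler has scaled the additional features to [0,1].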

### "TypeError: coo_matrix not subscriptable"
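**Solution**: Stacking sparse matrices can produce a COO matrix, which cannot be indexed or sliced. Block 10 converts the result to CSR format, so make sure it runs completely.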

### Model underperforms
**Check**:
1. Data leakage prevented? (Vectorizer fitted on train only)
2. Features extracted correctly? (Block 9 output)
3. Class distribution balanced? (Block 8 charts)

---

## Performance Optimization

### Speed Up Training
```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Reduce vocabulary
vectorizer_config = {'max_features': 2000}

# Fewer trees
rf = RandomForestClassifier(n_estimators=100)

# Fewer boosting rounds
gb = GradientBoostingClassifier(n_estimators=50)
```

### Reduce Memory
```python
# Smaller vocabulary and fewer n-grams (either setting also helps on its own)
vectorizer_config = {'max_features': 3000, 'ngram_range': (1, 2)}
```

---

## Output Files

After Block 13 completes, you'll have:
- **train.parquet**: Training data (consolidated)
- **test.parquet**: Test data (consolidated)
- Performance metrics and visualizations
- Best model held in memory as `classifier.best_model` (not written to disk; see Save Model below)

---

## Next Steps

### Save Model
Add after Block 13:
```python
import pickle
model_data = {
    'model': classifier.best_model,
    'vectorizer': classifier.vectorizer,
    'scaler': classifier.scaler,
    'label_encoder': classifier.label_encoder
}
with open('model.pkl', 'wb') as f:
    pickle.dump(model_data, f)
```

### Batch Prediction
```python
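import pickle

# Load the saved components
with open('model.pkl', 'rb') as f:
    model_data = pickle.load(f)

model = model_data['model']
vectorizer = model_data['vectorizer']
scaler = model_data['scaler']
label_encoder = model_data['label_encoder']

new_problems = ["Solve x^2 = 16", "Find area of circle"]

# Preprocess → Extract features → Predict:
# build X_new with the same Block 9 extraction and Block 10 transforms
# (vectorizer.transform / scaler.transform, never fit again), then
# predictions = label_encoder.inverse_transform(model.predict(X_new))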
```

---

## Summary

**13 Blocks, 3 Stages**:
1. **Setup** (Blocks 1-7): One-time environment setup
2. **Training** (Blocks 8-11): Data loading and model training
3. **Evaluation** (Blocks 12-13): Comprehensive analysis

**Key Features**:
- Data leakage prevention
- 5 optimized models
- 6 visualization types
- Probability predictions
- Error analysis

**Expected Time**: 10-15 minutes total (including training)

**Expected Performance**: 85-90% F1-score on test set