# Math Question Classifier - Quick Start Guide
## Execution Order
### Setup (Blocks 1-7)
**Run once to setup environment and define classes**
1. **Block 1**: Install packages
2. **Block 2**: Import libraries
3. **Block 3**: Set data path
4. **Block 4**: Convert JSON to Parquet (one-time data preparation)
5. **Block 5**: Define MathDatasetLoader class
6. **Block 6**: Define MathFeatureExtractor class
7. **Block 7**: Define MathQuestionClassifier class
### Training & Evaluation (Blocks 8-13)
**Run to train and evaluate models**
8. **Block 8**: Load dataset from Parquet files
9. **Block 9**: Extract features (text preprocessing + math symbols + numeric)
10. **Block 10**: Vectorize features (TF-IDF + scaling)
11. **Block 11**: Train 5 models and compare performance
12. **Block 12**: Detailed evaluation of best model
13. **Block 13**: Complete test set analysis with 6 visualizations
---
## What Each Block Does
### Blocks 1-3: Environment Setup
- Installs scikit-learn, pandas, matplotlib, seaborn, nltk (a one-line install sketch follows)
- Imports all necessary libraries
- Sets path to data directory (`./math`)
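Block 1 boils down to a single install cell. A minimal sketch, assuming the notebook uses the `%pip` magic and that `pyarrow` is included as the Parquet engine (the exact package list may differ):
```python
# Block 1 (sketch): install dependencies from inside the notebook;
# pyarrow is an assumption here, added for Parquet support
%pip install scikit-learn pandas matplotlib seaborn nltk pyarrow
```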
### Block 4: Data Consolidation
**Purpose**: Convert JSON files to Parquet format
- **Input**: `./math/train/` and `./math/test/` folders with JSON files
- **Output**: `train.parquet` and `test.parquet`
- **Benefit**: 10-100x faster loading than JSON
- **Run**: Only once (skip if the Parquet files already exist); see the conversion sketch below
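A minimal conversion sketch, assuming each JSON file holds one problem record; `json_folder_to_parquet` is a hypothetical helper, not the notebook's actual code:
```python
import json
from pathlib import Path

import pandas as pd

def json_folder_to_parquet(folder, out_path):
    # Gather every per-problem JSON file into a single DataFrame
    records = []
    for path in sorted(Path(folder).rglob('*.json')):
        with open(path, encoding='utf-8') as f:
            records.append(json.load(f))
    # One Parquet file loads far faster than thousands of small JSON files
    pd.DataFrame(records).to_parquet(out_path, index=False)

json_folder_to_parquet('./math/train', 'train.parquet')
json_folder_to_parquet('./math/test', 'test.parquet')
```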
### Blocks 5-7: Class Definitions
Define three main classes:
- **MathDatasetLoader**: Loads Parquet files, shows statistics
- **MathFeatureExtractor**: Cleans LaTeX, extracts math symbols, preprocesses text
- **MathQuestionClassifier**: Trains models, evaluates performance
### Block 8: Load Data
- Loads `train.parquet` and `test.parquet`
- Shows class distribution for train and test sets
- Displays 2 bar charts (train/test distribution), as in the sketch below
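A sketch of the loading step; the label column name `type` is an assumption:
```python
import pandas as pd
import matplotlib.pyplot as plt

train_df = pd.read_parquet('train.parquet')
test_df = pd.read_parquet('test.parquet')

# Side-by-side class distributions for the two splits
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
train_df['type'].value_counts().plot.bar(ax=axes[0], title='Train distribution')
test_df['type'].value_counts().plot.bar(ax=axes[1], title='Test distribution')
plt.tight_layout()
plt.show()
```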
### Block 9: Feature Extraction
Extracts three types of features (a sketch of the symbol indicators follows the list):
1. **Text features**: Preprocessed text (LaTeX cleaning, lemmatization)
2. **Math symbol features**: 10 binary indicators (has_fraction, has_sqrt, etc.)
3. **Numeric features**: 5 statistical measures (num_count, avg_number, etc.)
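A sketch of how the binary symbol indicators can be computed; the notebook's `MathFeatureExtractor` defines ten of them, and the pattern subset below is an assumption:
```python
import re

# Hypothetical subset of the ten indicators (patterns are assumptions)
SYMBOL_PATTERNS = {
    'has_fraction': r'\\frac',
    'has_sqrt': r'\\sqrt',
    'has_integral': r'\\int',
    'has_summation': r'\\sum',
    'has_exponent': r'\^',
}

def math_symbol_features(text):
    # 1 if the LaTeX pattern occurs anywhere in the question, else 0
    return {name: int(bool(re.search(pattern, text)))
            for name, pattern in SYMBOL_PATTERNS.items()}

math_symbol_features(r"Simplify $\frac{1}{\sqrt{2}}$")
# -> {'has_fraction': 1, 'has_sqrt': 1, 'has_integral': 0, ...}
```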
### Block 10: Vectorization
- Creates TF-IDF features (5000 dimensions, unigrams through trigrams)
- Scales additional features to [0,1] using MinMaxScaler
- **Critical**: Fits ONLY on training data (prevents data leakage)
- Converts the result to CSR format for efficient row operations (see the sketch below)
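A sketch of Block 10 under assumed variable names (`train_texts`, `train_extra`, and so on); the fit-on-train-only split is the part that matters:
```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# TF-IDF: fit the vocabulary on the training texts ONLY
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
X_train_text = vectorizer.fit_transform(train_texts)
X_test_text = vectorizer.transform(test_texts)

# Scale the 15 extra features to [0, 1], again fitting on train only
scaler = MinMaxScaler()
extra_train = scaler.fit_transform(train_extra)
extra_test = scaler.transform(test_extra)

# Stack text and extra features, then convert to CSR for fast row access
X_train = hstack([X_train_text, csr_matrix(extra_train)]).tocsr()
X_test = hstack([X_test_text, csr_matrix(extra_test)]).tocsr()
```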
### Block 11: Model Training
Trains 5 optimized models:
1. **Naive Bayes** (baseline)
2. **Logistic Regression** (linear classifier)
3. **SVM** (maximum margin)
4. **Random Forest** (ensemble)
5. **Gradient Boosting** (sequential ensemble)
**Output**:
- Comparison table with Accuracy, F1-Score, Training Time
- 2 bar charts comparing performance and speed
- Selects the best model automatically (a training-loop sketch follows)
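A minimal sketch of the compare-and-select loop (three of the five models shown; `X_train`, `y_train`, `X_test`, `y_test` come from the earlier blocks, and treating the SVM as `LinearSVC` is an assumption):
```python
import time

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

models = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(C=1.0, max_iter=1000,
                                              class_weight='balanced'),
    'SVM': LinearSVC(C=1.0, class_weight='balanced'),
}
results = {}
for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results[name] = {'accuracy': accuracy_score(y_test, preds),
                     'f1': f1_score(y_test, preds, average='weighted'),
                     'train_time': time.time() - start}

# Pick the winner by F1-score
best_name = max(results, key=lambda name: results[name]['f1'])
```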
### Block 12: Detailed Evaluation
- Confusion matrix visualization
- Classification report (precision, recall, F1 per class)
- Feature importance (for tree-based models); a minimal evaluation sketch follows
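The essentials of Block 12, as a sketch (`classifier.best_model` is the attribute named under Output Files; `y_test` is assumed available):
```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix

preds = classifier.best_model.predict(X_test)

# Confusion matrix heatmap
cm = confusion_matrix(y_test, preds)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Per-class precision, recall, and F1
print(classification_report(y_test, preds))
```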
### Block 13: Complete Analysis
**Comprehensive evaluation on entire test set**
**6 Visualizations**:
1. Confusion Matrix (absolute counts)
2. Normalized Confusion Matrix (proportions)
3. F1-Score by Topic (horizontal bar chart)
4. Precision vs Recall (scatter plot, size = support)
5. Test Set Distribution (bar chart)
6. Confidence Distribution (histogram: correct vs incorrect)
**Analysis Sections**:
- Overall performance (accuracy, F1-score)
- Per-class metrics table
- Confusion pair analysis
- Summary statistics
---
## Expected Results
### Model Performance (F1-Score)
- **Gradient Boosting**: 86-90%
- **Logistic Regression**: 85-89%
- **SVM**: 84-88%
- **Naive Bayes**: 78-82%
- **Random Forest**: 75-82% (expected to underperform on sparse features)
### Training Time
- **Naive Bayes**: ~10 seconds
- **Logistic Regression**: ~30 seconds
- **SVM**: ~2 minutes
- **Random Forest**: ~3 minutes
- **Gradient Boosting**: ~5 minutes
### Per-Topic Performance
**High Performance** (F1 > 90%):
- counting_and_probability
- number_theory
**Medium Performance** (F1: 85-90%):
- geometry
- precalculus
**Challenging** (F1: 80-85%):
- algebra ↔ intermediate_algebra (similar concepts)
- prealgebra ↔ algebra (overlapping operations)
---
## Key Design Decisions
### 1. Data Leakage Prevention
**Critical**: TF-IDF vectorizer fitted ONLY on training data
```
Train/Test Split → Fit Vectorizer on Train → Transform Both
```
Without this, test vocabulary leaks into training, inflating performance by 1-3%.
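In code, the difference is a single call (a sketch with assumed names):
```python
# Right: learn the vocabulary from the training split only
vectorizer.fit(train_texts)
X_train = vectorizer.transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Wrong: fitting on the combined corpus leaks test vocabulary
# vectorizer.fit(train_texts + test_texts)
```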
### 2. Feature Engineering
**Hybrid approach**:
- TF-IDF (5000 features): Captures text content
- Math symbols (10 features): Topic indicators (e.g., integrals → calculus)
- Numeric features (5 features): Statistical properties
**Why no hand-crafted keywords?**
Topic-specific keyword lists were avoided to prevent heuristic bias; instead, the model learns discriminative vocabulary from the data.
### 3. Hyperparameter Optimization
All models use tuned parameters (see the instantiation sketch after this list):
- **C=1.0** (SVM/Logistic): Balanced regularization
- **max_depth=30** (Random Forest): Sufficient complexity
- **subsample=0.8** (Gradient Boosting): Stochastic sampling prevents overfitting
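As constructor arguments, this looks roughly as follows (a sketch; the remaining parameters are scikit-learn defaults, and treating the SVM as `LinearSVC` is an assumption):
```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import LinearSVC

svm = LinearSVC(C=1.0)                          # balanced regularization
rf = RandomForestClassifier(max_depth=30)       # sufficient complexity
gb = GradientBoostingClassifier(subsample=0.8)  # stochastic sampling
```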
### 4. Class Imbalance Handling
`class_weight='balanced'` automatically adjusts weights inversely proportional to class frequencies.
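The weight for class c is n_samples / (n_classes * count(c)); scikit-learn can compute the resulting weights directly (a sketch with an assumed `y_train`):
```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# The same formula class_weight='balanced' applies internally
classes = np.unique(y_train)
weights = compute_class_weight('balanced', classes=classes, y=y_train)
print(dict(zip(classes, weights)))  # rarer classes get weights > 1
```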
---
## Methodology
### Problem Type
**Supervised Multi-Class Text Classification**
**Why Classification (not Clustering)?**
- Categories are predefined and labeled (supervised learning)
- Objective: assign each question to a known subtopic
- Not discovering latent groups (that would be clustering)
### Pipeline
```
JSON Files
↓
Parquet Conversion (Block 4)
↓
Feature Extraction (Block 9)
↓
TF-IDF Vectorization (Block 10)
↓
Model Training (Block 11)
↓
Evaluation (Blocks 12-13)
```
### Feature Vector
```
Total: 5015 dimensions
├── TF-IDF: 5000 (unigrams, bigrams, trigrams)
├── Math Symbols: 10 (binary indicators)
└── Numeric: 5 (scaled to [0,1])
```
---
## Troubleshooting
### "No data loaded"
**Solution**: Check data path in Block 3
```python
DATA_PATH = './math' # Adjust to your path
```
### "NameError: name 'results' is not defined"
**Solution**: Run the blocks in order; Blocks 12-13 need Block 11 first.
### "ValueError: Negative values"
**Solution**: Ensure Block 10 ran to completion; MinMaxScaler maps the extra features to [0,1], and Naive Bayes rejects negative feature values.
### "TypeError: coo_matrix not subscriptable"
**Solution**: Block 10 converts the stacked matrix to CSR format (sparse `hstack` returns a non-subscriptable COO matrix); ensure it ran completely.
### Model underperforms
**Check**:
1. Data leakage prevented? (Vectorizer fitted on train only)
2. Features extracted correctly? (Block 9 output)
3. Class distribution balanced? (Block 8 charts)
---
## Performance Optimization
### Speed Up Training
```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Reduce vocabulary size
vectorizer_config = {'max_features': 2000}
# Fewer trees
rf = RandomForestClassifier(n_estimators=100)
# Fewer boosting rounds
gb = GradientBoostingClassifier(n_estimators=50)
```
### Reduce Memory
```python
# Smaller vocabulary and fewer n-grams in a single config
vectorizer_config = {'max_features': 3000, 'ngram_range': (1, 2)}
```
---
## Output Files
After Block 13 completes, you'll have:
- **train.parquet**: Training data (consolidated)
- **test.parquet**: Test data (consolidated)
- Performance metrics and visualizations
- Model saved in memory (classifier.best_model)
---
## Next Steps
### Save Model
Add after Block 13:
```python
import pickle

model_data = {
    'model': classifier.best_model,
    'vectorizer': classifier.vectorizer,
    'scaler': classifier.scaler,
    'label_encoder': classifier.label_encoder
}
with open('model.pkl', 'wb') as f:
    pickle.dump(model_data, f)
```
### Batch Prediction
```python
import pickle

# Load the saved bundle
with open('model.pkl', 'rb') as f:
    model_data = pickle.load(f)

# Predict: new text must go through the same Block 9-10 pipeline
new_problems = ["Solve x^2 = 16", "Find area of circle"]
for problem in new_problems:
    # Preprocess → extract features → vectorize with model_data['vectorizer']
    # and model_data['scaler'], exactly as in Blocks 9-10
    prediction = model_data['model'].predict(...)
```
---
## Summary
**13 Blocks, 3 Stages**:
1. **Setup** (Blocks 1-7): One-time environment setup
2. **Training** (Blocks 8-11): Data loading and model training
3. **Evaluation** (Blocks 12-13): Comprehensive analysis
**Key Features**:
- Data leakage prevention
- 5 optimized models
- 6 visualization types
- Probability predictions
- Error analysis
**Expected Time**: 10-15 minutes total (including training)
**Expected Performance**: 85-90% F1-score on test set