Math Question Classifier - Quick Start Guide
Execution Order
Setup (Blocks 1-7)
Run once to set up the environment and define the classes
- Block 1: Install packages
- Block 2: Import libraries
- Block 3: Set data path
- Block 4: Convert JSON to Parquet (one-time data preparation)
- Block 5: Define MathDatasetLoader class
- Block 6: Define MathFeatureExtractor class
- Block 7: Define MathQuestionClassifier class
Training & Evaluation (Blocks 8-13)
Run to train and evaluate models
- Block 8: Load dataset from Parquet files
- Block 9: Extract features (text preprocessing + math symbols + numeric)
- Block 10: Vectorize features (TF-IDF + scaling)
- Block 11: Train 5 models and compare performance
- Block 12: Detailed evaluation of best model
- Block 13: Complete test set analysis with 6 visualizations
What Each Block Does
Blocks 1-3: Environment Setup
- Installs scikit-learn, pandas, matplotlib, seaborn, nltk
- Imports all necessary libraries
- Sets path to the data directory (./math)
Block 4: Data Consolidation
Purpose: Convert JSON files to Parquet format
- Input: ./math/train/ and ./math/test/ folders with JSON files
- Output: train.parquet and test.parquet
- Benefit: 10-100x faster loading than JSON
- Run: Only once (skip if Parquet files already exist)
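A minimal sketch of what Block 4 might look like, assuming each JSON file holds one problem as a flat object (the field layout and folder structure are assumptions; writing Parquet requires pyarrow or fastparquet):
import json
from pathlib import Path

import pandas as pd

def consolidate(split_dir: str, out_path: str) -> None:
    # Collect every *.json file in the split (including per-topic subfolders)
    records = []
    for json_file in sorted(Path(split_dir).rglob('*.json')):
        with open(json_file) as f:
            records.append(json.load(f))
    # One row per problem; Parquet loads far faster than thousands of JSONs
    pd.DataFrame(records).to_parquet(out_path, index=False)

consolidate('./math/train', 'train.parquet')
consolidate('./math/test', 'test.parquet')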
Blocks 5-7: Class Definitions
Define three main classes:
- MathDatasetLoader: Loads Parquet files, shows statistics
- MathFeatureExtractor: Cleans LaTeX, extracts math symbols, preprocesses text
- MathQuestionClassifier: Trains models, evaluates performance
Block 8: Load Data
- Loads train.parquet and test.parquet
- Shows class distribution for train and test sets
- Displays 2 bar charts (train/test distribution)
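Loading the consolidated files is a two-liner with pandas; the label column name ('type' below) is an assumption:
import pandas as pd

train_df = pd.read_parquet('train.parquet')
test_df = pd.read_parquet('test.parquet')
print(train_df['type'].value_counts())  # class distribution ('type' is assumed)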
Block 9: Feature Extraction
Extracts three types of features:
- Text features: Preprocessed text (LaTeX cleaning, lemmatization)
- Math symbol features: 10 binary indicators (has_fraction, has_sqrt, etc.)
- Numeric features: 5 statistical measures (num_count, avg_number, etc.)
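A hedged sketch of the second and third feature groups; the full indicator set lives in MathFeatureExtractor, so only has_fraction, has_sqrt, num_count, and avg_number (named above) are shown here:
import re

def math_symbol_features(text: str) -> dict:
    # Two of the ten binary indicators; the remaining eight follow
    # the same look-for-a-LaTeX-command pattern.
    return {
        'has_fraction': int(r'\frac' in text),
        'has_sqrt': int(r'\sqrt' in text),
    }

def numeric_features(text: str) -> dict:
    # Two of the five statistical measures over numbers in the problem.
    numbers = [float(n) for n in re.findall(r'\d+\.?\d*', text)]
    return {
        'num_count': len(numbers),
        'avg_number': sum(numbers) / len(numbers) if numbers else 0.0,
    }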
Block 10: Vectorization
- Creates TF-IDF features (5000 dimensions, unigrams through trigrams)
- Scales additional features to [0,1] using MinMaxScaler
- Critical: Fits ONLY on training data (prevents data leakage)
- Converts to CSR format for efficient operations
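The core of Block 10 as a self-contained sketch, with toy stand-ins for the Block 9 outputs (variable names and toy values are illustrative):
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for Block 9 outputs
train_texts = ["solve x^2 = 16", "probability of two heads"]
test_texts = ["area of a circle"]
extra_train = np.array([[2, 16.0], [2, 2.0]])  # e.g. num_count, avg_number
extra_test = np.array([[1, 0.0]])

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
scaler = MinMaxScaler()

# Fit ONLY on training data, then transform both splits (no leakage);
# hstack returns COO, so convert to CSR for indexing and fast math.
X_train = hstack([vectorizer.fit_transform(train_texts),
                  scaler.fit_transform(extra_train)]).tocsr()
X_test = hstack([vectorizer.transform(test_texts),
                 scaler.transform(extra_test)]).tocsr()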
Block 11: Model Training
Trains 5 optimized models:
- Naive Bayes (baseline)
- Logistic Regression (linear classifier)
- SVM (maximum margin)
- Random Forest (ensemble)
- Gradient Boosting (sequential ensemble)
Output:
- Comparison table with Accuracy, F1-Score, Training Time
- 2 bar charts comparing performance and speed
- Selects best model automatically
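A condensed sketch of the comparison loop, assuming X_train, y_train, X_test, y_test come from Block 10 (three of the five models shown; hyperparameters mirror the Key Design Decisions section below):
from time import perf_counter

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

models = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(C=1.0, class_weight='balanced',
                                              max_iter=1000),
    'SVM': LinearSVC(C=1.0, class_weight='balanced'),
}

results = {}
for name, model in models.items():
    start = perf_counter()
    model.fit(X_train, y_train)
    train_time = perf_counter() - start
    pred = model.predict(X_test)
    results[name] = {'accuracy': accuracy_score(y_test, pred),
                     'f1': f1_score(y_test, pred, average='weighted'),
                     'time': train_time}

best_name = max(results, key=lambda n: results[n]['f1'])  # auto-select by F1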
Block 12: Detailed Evaluation
- Confusion matrix visualization
- Classification report (precision, recall, F1 per class)
- Feature importance (for tree-based models)
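The two core calls, assuming best_model and the test split from the earlier blocks:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

pred = best_model.predict(X_test)
print(classification_report(y_test, pred))  # precision/recall/F1 per class

# Confusion matrix heatmap; from_predictions infers the label set
ConfusionMatrixDisplay.from_predictions(y_test, pred)
plt.show()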
Block 13: Complete Analysis
Comprehensive evaluation on the entire test set
6 Visualizations:
- Confusion Matrix (absolute counts)
- Normalized Confusion Matrix (proportions)
- F1-Score by Topic (horizontal bar chart)
- Precision vs Recall (scatter plot, size = support)
- Test Set Distribution (bar chart)
- Confidence Distribution (histogram: correct vs incorrect; sketched below)
Analysis Sections:
- Overall performance (accuracy, F1-score)
- Per-class metrics table
- Confusion pair analysis
- Summary statistics
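A sketch of the confidence-distribution plot from the visualization list above; it assumes best_model exposes predict_proba (logistic regression and the tree ensembles do; LinearSVC does not) and that y_test is a NumPy array:
import matplotlib.pyplot as plt
import numpy as np

proba = best_model.predict_proba(X_test)
confidence = proba.max(axis=1)  # top-class probability per test example
correct = best_model.predict(X_test) == np.asarray(y_test)

plt.hist(confidence[correct], bins=20, alpha=0.6, label='correct')
plt.hist(confidence[~correct], bins=20, alpha=0.6, label='incorrect')
plt.xlabel('Prediction confidence')
plt.legend()
plt.show()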
Expected Results
Model Performance (F1-Score)
- Gradient Boosting: 86-90%
- Logistic Regression: 85-89%
- SVM: 84-88%
- Naive Bayes: 78-82%
- Random Forest: 75-82% (expected to underperform on sparse features)
Training Time
- Naive Bayes: ~10 seconds
- Logistic Regression: ~30 seconds
- SVM: ~2 minutes
- Random Forest: ~3 minutes
- Gradient Boosting: ~5 minutes
Per-Topic Performance
High Performance (F1 > 90%):
- counting_and_probability
- number_theory
Medium Performance (F1: 85-90%):
- geometry
- precalculus
Challenging (F1: 80-85%):
- algebra ↔ intermediate_algebra (similar concepts)
- prealgebra ↔ algebra (overlapping operations)
Key Design Decisions
1. Data Leakage Prevention
Critical: TF-IDF vectorizer fitted ONLY on training data
Train/Test Split → Fit Vectorizer on Train → Transform Both
Without this, test vocabulary leaks into training, inflating performance by 1-3%.
2. Feature Engineering
Hybrid approach:
- TF-IDF (5000 features): Captures text content
- Math symbols (10 features): Topic indicators (e.g., integrals → calculus)
- Numeric features (5 features): Statistical properties
Why no hand-crafted keywords? Topic-specific keyword lists were avoided to prevent heuristic bias; the model learns discriminative vocabulary from the data instead.
3. Hyperparameter Optimization
All models use optimized parameters:
- C=1.0 (SVM/Logistic): Balanced regularization
- max_depth=30 (Random Forest): Sufficient complexity
- subsample=0.8 (Gradient Boosting): Stochastic sampling prevents overfitting
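In scikit-learn terms, these settings might look like the following sketch (other parameters left at their defaults; n_estimators values are not specified above and are therefore omitted):
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

logreg = LogisticRegression(C=1.0, class_weight='balanced')
svm = LinearSVC(C=1.0, class_weight='balanced')
forest = RandomForestClassifier(max_depth=30, class_weight='balanced')
boosting = GradientBoostingClassifier(subsample=0.8)  # stochastic sampling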
4. Class Imbalance Handling
class_weight='balanced' automatically adjusts weights inversely proportional to class frequencies.
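The weights scikit-learn computes can be inspected directly; y_train is assumed from Block 10:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight('balanced', classes=classes, y=y_train)
# Each weight equals n_samples / (n_classes * count_of_that_class),
# so rarer topics get proportionally larger weights.
print(dict(zip(classes, weights)))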
Methodology
Problem Type
Supervised Multi-Class Text Classification
Why Classification (not Clustering)?
- Categories are predefined and labeled
- Objective: Assign to known subtopic
- Not discovering latent groups
- Supervised learning with known labels
Pipeline
JSON Files
  ↓
Parquet Conversion (Block 4)
  ↓
Feature Extraction (Block 9)
  ↓
TF-IDF Vectorization (Block 10)
  ↓
Model Training (Block 11)
  ↓
Evaluation (Blocks 12-13)
Feature Vector
Total: 5015 dimensions
├── TF-IDF: 5000 (unigrams, bigrams, trigrams)
├── Math Symbols: 10 (binary indicators)
└── Numeric: 5 (scaled to [0,1])
Troubleshooting
"No data loaded"
Solution: Check data path in Block 3
DATA_PATH = './math' # Adjust to your path
"NameError: name 'results' is not defined"
Solution: Run the blocks in order; Blocks 12-13 depend on the results object created in Block 11.
"ValueError: Negative values"
Solution: Ensure Block 10 completed successfully. Naive Bayes requires non-negative inputs, and Block 10's MinMaxScaler scales the additional features to [0,1].
"TypeError: coo_matrix not subscriptable"
Solution: Block 10 converts the feature matrix to CSR format (sparse hstack returns COO, which does not support indexing). Ensure it runs completely.
Model underperforms
Check:
- Data leakage prevented? (Vectorizer fitted on train only)
- Features extracted correctly? (Block 9 output)
- Class distribution balanced? (Block 8 charts)
Performance Optimization
Speed Up Training
# Reduce vocabulary
vectorizer_config = {'max_features': 2000}
# Fewer trees
RandomForestClassifier(n_estimators=100)
# Fewer boosting rounds
GradientBoostingClassifier(n_estimators=50)
Reduce Memory
# Smaller vocabulary
vectorizer_config = {'max_features': 3000}
# Fewer n-grams
vectorizer_config = {'ngram_range': (1, 2)}
Output Files
After Block 13 completes, you'll have:
- train.parquet: Training data (consolidated)
- test.parquet: Test data (consolidated)
- Performance metrics and visualizations
- Model saved in memory (classifier.best_model)
Next Steps
Save Model
Add after Block 13:
import pickle

model_data = {
    'model': classifier.best_model,
    'vectorizer': classifier.vectorizer,
    'scaler': classifier.scaler,
    'label_encoder': classifier.label_encoder,
}
with open('model.pkl', 'wb') as f:
    pickle.dump(model_data, f)
Batch Prediction
# Load model
import pickle
with open('model.pkl', 'rb') as f:
    model_data = pickle.load(f)

# Predict: preprocessing must mirror Blocks 9-10 (LaTeX cleaning,
# math-symbol/numeric features, TF-IDF + scaling, then hstack)
new_problems = ["Solve x^2 = 16", "Find area of circle"]
for problem in new_problems:
    # Preprocess → Extract features → Predict
    features = ...  # 5015-dimensional vector built as in Block 10
    prediction = model_data['model'].predict(features)
Summary
13 Blocks, 3 Stages:
- Setup (Blocks 1-7): One-time environment setup
- Training (Blocks 8-11): Data loading and model training
- Evaluation (Blocks 12-13): Comprehensive analysis
Key Features:
- Data leakage prevention
- 5 optimized models
- 6 visualization types
- Probability predictions
- Error analysis
Expected Time: 10-15 minutes total (including training)
Expected Performance: 85-90% F1-score on test set