
Math Question Classifier - Quick Start Guide

Execution Order

Setup (Blocks 1-7)

Run once to set up the environment and define the core classes

  1. Block 1: Install packages
  2. Block 2: Import libraries
  3. Block 3: Set data path
  4. Block 4: Convert JSON to Parquet (one-time data preparation)
  5. Block 5: Define MathDatasetLoader class
  6. Block 6: Define MathFeatureExtractor class
  7. Block 7: Define MathQuestionClassifier class

Training & Evaluation (Blocks 8-13)

Run to train and evaluate models

  1. Block 8: Load dataset from Parquet files
  2. Block 9: Extract features (text preprocessing + math symbols + numeric)
  3. Block 10: Vectorize features (TF-IDF + scaling)
  4. Block 11: Train 5 models and compare performance
  5. Block 12: Detailed evaluation of best model
  6. Block 13: Complete test set analysis with 6 visualizations

What Each Block Does

Blocks 1-3: Environment Setup

  • Installs scikit-learn, pandas, matplotlib, seaborn, nltk
  • Imports all necessary libraries
  • Sets path to data directory (./math)
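Block 1 runs the installation inside the notebook; if you need to do it manually, a cell like this should work (pyarrow is our addition here, since pandas needs it, or fastparquet, for the Parquet steps):

%pip install scikit-learn pandas matplotlib seaborn nltk pyarrow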

Block 4: Data Consolidation

Purpose: Convert JSON files to Parquet format

  • Input: ./math/train/ and ./math/test/ folders with JSON files
  • Output: train.parquet and test.parquet
  • Benefit: 10-100x faster loading than JSON
  • Run: Only once (skip if Parquet files already exist)
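The notebook performs this conversion for you, but a minimal sketch of the idea looks like this (it assumes each JSON file holds one problem record, and that pyarrow or fastparquet is installed for Parquet output):

import json
from pathlib import Path
import pandas as pd

def json_dir_to_parquet(src_dir, out_file):
    # Read every JSON file under src_dir into one DataFrame, save as Parquet
    records = []
    for path in Path(src_dir).rglob('*.json'):
        with open(path, encoding='utf-8') as f:
            records.append(json.load(f))
    pd.DataFrame(records).to_parquet(out_file, index=False)

json_dir_to_parquet('./math/train', 'train.parquet')
json_dir_to_parquet('./math/test', 'test.parquet')

Consolidating thousands of small files into one columnar file is where the loading speedup comes from.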

Blocks 5-7: Class Definitions

Define three main classes:

  • MathDatasetLoader: Loads Parquet files, shows statistics
  • MathFeatureExtractor: Cleans LaTeX, extracts math symbols, preprocesses text
  • MathQuestionClassifier: Trains models, evaluates performance

Block 8: Load Data

  • Loads train.parquet and test.parquet
  • Shows class distribution for train and test sets
  • Displays 2 bar charts (train/test distribution)
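A sketch of the equivalent standalone code (the label column name 'type' is an assumption about the dataset schema):

import pandas as pd
import matplotlib.pyplot as plt

train_df = pd.read_parquet('train.parquet')
test_df = pd.read_parquet('test.parquet')

# Per-class counts for each split
print(train_df['type'].value_counts())
train_df['type'].value_counts().plot(kind='bar', title='Train distribution')
plt.show()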

Block 9: Feature Extraction

Extracts three types of features:

  1. Text features: Preprocessed text (LaTeX cleaning, lemmatization)
  2. Math symbol features: 10 binary indicators (has_fraction, has_sqrt, etc.)
  3. Numeric features: 5 statistical measures (num_count, avg_number, etc.)
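As an illustrative sketch (the exact feature set lives in MathFeatureExtractor; has_integral, has_summation, and max_number are assumptions beyond the names listed above):

import re

def math_symbol_features(text):
    # Binary indicators for common LaTeX constructs
    return {
        'has_fraction': int(r'\frac' in text),
        'has_sqrt': int(r'\sqrt' in text),
        'has_integral': int(r'\int' in text),
        'has_summation': int(r'\sum' in text),
    }

def numeric_features(text):
    # Simple statistics over the numbers appearing in a problem
    nums = [float(n) for n in re.findall(r'\d+\.?\d*', text)]
    return {
        'num_count': len(nums),
        'avg_number': sum(nums) / len(nums) if nums else 0.0,
        'max_number': max(nums) if nums else 0.0,
    }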

Block 10: Vectorization

  • Creates TF-IDF features (5000 dimensions, trigrams)
  • Scales additional features to [0,1] using MinMaxScaler
  • Critical: Fits ONLY on training data (prevents data leakage)
  • Converts to CSR format for efficient operations
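A minimal sketch of the configuration (toy stand-in data replaces the Block 9 outputs):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for the preprocessed text and extra features from Block 9
train_texts = ['solve x squared equals 16', 'probability two dice sum to 7']
test_texts = ['find the area of a circle with radius 3']
train_extra = np.array([[2.0, 16.0], [3.0, 7.0]])  # e.g. num_count, avg_number
test_extra = np.array([[2.0, 9.0]])

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
scaler = MinMaxScaler()

# Fit on training data only, then transform both splits
X_train_text = vectorizer.fit_transform(train_texts)
X_test_text = vectorizer.transform(test_texts)
X_train_extra = scaler.fit_transform(train_extra)
X_test_extra = scaler.transform(test_extra)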

Block 11: Model Training

Trains 5 optimized models:

  1. Naive Bayes (baseline)
  2. Logistic Regression (linear classifier)
  3. SVM (maximum margin)
  4. Random Forest (ensemble)
  5. Gradient Boosting (sequential ensemble)

Output:

  • Comparison table with Accuracy, F1-Score, Training Time
  • 2 bar charts comparing performance and speed
  • Selects best model automatically
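The comparison loop, in sketch form (X_train/X_test are the combined feature matrices from Block 10 and y_train/y_test the labels from Block 8; max_iter and the LinearSVC choice are assumptions):

import time
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score

models = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVM': LinearSVC(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
}

results = {}
for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results[name] = {
        'accuracy': accuracy_score(y_test, preds),
        'f1': f1_score(y_test, preds, average='weighted'),
        'train_time': time.time() - start,
    }

# Select the best model by F1-score
best_name = max(results, key=lambda n: results[n]['f1'])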

Block 12: Detailed Evaluation

  • Confusion matrix visualization
  • Classification report (precision, recall, F1 per class)
  • Feature importance (for tree-based models)
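A sketch of the core calls (y_pred comes from the best model selected in Block 11; the variable names are assumptions):

import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

print(classification_report(y_test, y_pred))  # precision/recall/F1 per class
ConfusionMatrixDisplay.from_predictions(y_test, y_pred, xticks_rotation=45)
plt.tight_layout()
plt.show()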

Block 13: Complete Analysis

Comprehensive evaluation on entire test set

6 Visualizations:

  1. Confusion Matrix (absolute counts)
  2. Normalized Confusion Matrix (proportions)
  3. F1-Score by Topic (horizontal bar chart)
  4. Precision vs Recall (scatter plot, size = support)
  5. Test Set Distribution (bar chart)
  6. Confidence Distribution (histogram: correct vs incorrect)
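The confidence histogram (visualization 6) can be reproduced roughly as follows; it assumes the best model exposes predict_proba, which holds for most of the five models but not for every SVM variant:

import numpy as np
import matplotlib.pyplot as plt

proba = classifier.best_model.predict_proba(X_test)
confidence = proba.max(axis=1)  # highest class probability per sample
correct = np.asarray(classifier.best_model.predict(X_test) == y_test)

plt.hist(confidence[correct], bins=20, alpha=0.6, label='correct')
plt.hist(confidence[~correct], bins=20, alpha=0.6, label='incorrect')
plt.xlabel('Prediction confidence')
plt.ylabel('Count')
plt.legend()
plt.show()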

Analysis Sections:

  • Overall performance (accuracy, F1-score)
  • Per-class metrics table
  • Confusion pair analysis
  • Summary statistics

Expected Results

Model Performance (F1-Score)

  • Gradient Boosting: 86-90%
  • Logistic Regression: 85-89%
  • SVM: 84-88%
  • Naive Bayes: 78-82%
  • Random Forest: 75-82% (expected to underperform on sparse features)

Training Time

  • Naive Bayes: ~10 seconds
  • Logistic Regression: ~30 seconds
  • SVM: ~2 minutes
  • Random Forest: ~3 minutes
  • Gradient Boosting: ~5 minutes

Per-Topic Performance

High Performance (F1 > 90%):

  • counting_and_probability
  • number_theory

Medium Performance (F1: 85-90%):

  • geometry
  • precalculus

Challenging (F1: 80-85%):

  • algebra ↔ intermediate_algebra (similar concepts)
  • prealgebra ↔ algebra (overlapping operations)

Key Design Decisions

1. Data Leakage Prevention

Critical: TF-IDF vectorizer fitted ONLY on training data

Train/Test Split → Fit Vectorizer on Train → Transform Both

Without this, test vocabulary leaks into training, inflating performance by 1-3%.
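In code the distinction is a single call (a sketch; variable names are illustrative):

# Leaky: vocabulary is built from train AND test text
# X_all = vectorizer.fit_transform(train_texts + test_texts)

# Correct: vocabulary comes from training text only
X_train_text = vectorizer.fit_transform(train_texts)
X_test_text = vectorizer.transform(test_texts)  # unseen test words are ignored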

2. Feature Engineering

Hybrid approach:

  • TF-IDF (5000 features): Captures text content
  • Math symbols (10 features): Topic indicators (e.g., integrals → calculus)
  • Numeric features (5 features): Statistical properties

Why no hand-crafted keywords? Topic-specific keyword lists were avoided to prevent heuristic bias; the model learns discriminative vocabulary from the data instead.

3. Hyperparameter Optimization

All models use optimized parameters:

  • C=1.0 (SVM/Logistic): Balanced regularization
  • max_depth=30 (Random Forest): Sufficient complexity
  • subsample=0.8 (Gradient Boosting): Stochastic sampling prevents overfitting
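In scikit-learn terms, these settings map onto the constructors roughly as follows (max_iter and the LinearSVC choice are assumptions):

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

logreg = LogisticRegression(C=1.0, max_iter=1000)
svm = LinearSVC(C=1.0)
rf = RandomForestClassifier(max_depth=30)
gb = GradientBoostingClassifier(subsample=0.8)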

4. Class Imbalance Handling

class_weight='balanced' automatically adjusts weights inversely proportional to class frequencies.
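The weights scikit-learn computes follow weight_c = n_samples / (n_classes * count_c), which you can inspect directly; the toy labels below are ours:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array(['algebra'] * 90 + ['geometry'] * 10)  # toy imbalanced labels
weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), weights)))  # algebra ≈ 0.56, geometry = 5.0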


Methodology

Problem Type

Supervised Multi-Class Text Classification

Why Classification (not Clustering)?

  • Categories are predefined and labeled
  • Objective: Assign to known subtopic
  • Not discovering latent groups
  • Supervised learning with known labels

Pipeline

JSON Files
    ↓
Parquet Conversion (Block 4)
    ↓
Feature Extraction (Block 9)
    ↓
TF-IDF Vectorization (Block 10)
    ↓
Model Training (Block 11)
    ↓
Evaluation (Blocks 12-13)

Feature Vector

Total: 5015 dimensions
├── TF-IDF: 5000 (unigrams, bigrams, trigrams)
├── Math Symbols: 10 (binary indicators)
└── Numeric: 5 (scaled to [0,1])
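Assembly, reusing the names from the Block 10 sketch above:

from scipy.sparse import hstack

# 5000 TF-IDF columns + 10 symbol flags + 5 numeric stats = 5015 total
X_train = hstack([X_train_text, X_train_extra]).tocsr()
X_test = hstack([X_test_text, X_test_extra]).tocsr()
print(X_train.shape)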

Troubleshooting

"No data loaded"

Solution: Check data path in Block 3

DATA_PATH = './math'  # Adjust to your path

"NameError: name 'results' is not defined"

Solution: Run the blocks in order; Blocks 12-13 depend on the results created in Block 11.

"ValueError: Negative values"

Solution: Ensure Block 10 completed successfully. MinMaxScaler scales all extra features to [0,1]; this error is typically raised by Multinomial Naive Bayes, which requires non-negative inputs, so negative values mean the scaling step was skipped.

"TypeError: coo_matrix not subscriptable"

Solution: Ensure Block 10 ran to completion; it converts the combined sparse matrix to CSR format, which (unlike COO) supports indexing. See the snippet below.
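The underlying behavior, in a self-contained snippet:

from scipy.sparse import hstack, random as sparse_random

X = hstack([sparse_random(3, 4), sparse_random(3, 2)])  # hstack returns COO
# X[0] here would raise: 'coo_matrix' object is not subscriptable
X = X.tocsr()  # CSR supports row indexing and slicing
row = X[0]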

Model underperforms

Check:

  1. Data leakage prevented? (Vectorizer fitted on train only)
  2. Features extracted correctly? (Block 9 output)
  3. Class distribution balanced? (Block 8 charts)

Performance Optimization

Speed Up Training

# Reduce vocabulary
vectorizer_config = {'max_features': 2000}

# Fewer trees
RandomForestClassifier(n_estimators=100)

# Fewer boosting rounds
GradientBoostingClassifier(n_estimators=50)

Reduce Memory

# Smaller vocabulary
vectorizer_config = {'max_features': 3000}

# Fewer n-grams
vectorizer_config = {'ngram_range': (1, 2)}

Output Files

After Block 13 completes, you'll have:

  • train.parquet: Training data (consolidated)
  • test.parquet: Test data (consolidated)
  • Performance metrics and visualizations
  • Model saved in memory (classifier.best_model)

Next Steps

Save Model

Add after Block 13:

import pickle
model_data = {
    'model': classifier.best_model,
    'vectorizer': classifier.vectorizer,
    'scaler': classifier.scaler,
    'label_encoder': classifier.label_encoder
}
with open('model.pkl', 'wb') as f:
    pickle.dump(model_data, f)

Batch Prediction

import pickle

# Load the saved pipeline components
with open('model.pkl', 'rb') as f:
    model_data = pickle.load(f)

# Predict: preprocess → extract features → predict, mirroring Blocks 9-10
new_problems = ["Solve x^2 = 16", "Find area of circle"]
for problem in new_problems:
    # extract_features is a hypothetical helper: it must rebuild the same
    # 5015-dimension vector as Block 10 (TF-IDF + scaled extras) per problem
    features = extract_features(problem, model_data)
    prediction = model_data['model'].predict(features)
    topic = model_data['label_encoder'].inverse_transform(prediction)[0]
    print(f'{problem} -> {topic}')

Summary

13 Blocks, 3 Stages:

  1. Setup (Blocks 1-7): One-time environment setup
  2. Training (Blocks 8-11): Data loading and model training
  3. Evaluation (Blocks 12-13): Comprehensive analysis

Key Features:

  • Data leakage prevention
  • 5 optimized models
  • 6 visualization types
  • Probability predictions
  • Error analysis

Expected Time: 10-15 minutes total (including training)

Expected Performance: 85-90% F1-score on test set