Math Question Classifier - Quick Start Guide
Execution Order
Setup (Blocks 1-7)
Run once to set up the environment and define the classes
- Block 1: Install packages
- Block 2: Import libraries
- Block 3: Set data path
- Block 4: Convert JSON to Parquet (one-time data preparation)
- Block 5: Define MathDatasetLoader class
- Block 6: Define MathFeatureExtractor class
- Block 7: Define MathQuestionClassifier class
Training & Evaluation (Blocks 8-13)
Run to train and evaluate models
- Block 8: Load dataset from Parquet files
- Block 9: Extract features (text preprocessing + math symbols + numeric)
- Block 10: Vectorize features (TF-IDF + scaling)
- Block 11: Train 5 models and compare performance
- Block 12: Detailed evaluation of best model
- Block 13: Complete test set analysis with 6 visualizations
What Each Block Does
Blocks 1-3: Environment Setup
- Installs scikit-learn, pandas, matplotlib, seaborn, nltk
- Imports all necessary libraries
- Sets path to the data directory (./math)
Block 4: Data Consolidation
Purpose: Convert JSON files to Parquet format
- Input: ./math/train/ and ./math/test/ folders with JSON files
- Output: train.parquet and test.parquet
- Benefit: 10-100x faster loading than JSON
- Run: Only once (skip if Parquet files already exist)
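A minimal sketch of what Block 4 might look like, assuming each JSON file holds one problem as a flat object (the field layout and folder structure are assumptions; writing Parquet requires pyarrow or fastparquet):
import json
from pathlib import Path

import pandas as pd

def consolidate(split_dir: str, out_path: str) -> None:
    # Collect every *.json file in the split (including per-topic subfolders)
    records = []
    for json_file in sorted(Path(split_dir).rglob('*.json')):
        with open(json_file) as f:
            records.append(json.load(f))
    # One row per problem; Parquet loads far faster than thousands of JSONs
    pd.DataFrame(records).to_parquet(out_path, index=False)

consolidate('./math/train', 'train.parquet')
consolidate('./math/test', 'test.parquet')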
Blocks 5-7: Class Definitions
Define three main classes:
- MathDatasetLoader: Loads Parquet files, shows statistics
- MathFeatureExtractor: Cleans LaTeX, extracts math symbols, preprocesses text
- MathQuestionClassifier: Trains models, evaluates performance
Block 8: Load Data
- Loads train.parquet and test.parquet
- Shows class distribution for train and test sets
- Displays 2 bar charts (train/test distribution)
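Loading the consolidated files is a two-liner with pandas; the label column name ('type' below) is an assumption:
import pandas as pd

train_df = pd.read_parquet('train.parquet')
test_df = pd.read_parquet('test.parquet')
print(train_df['type'].value_counts())  # class distribution ('type' is assumed)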
Block 9: Feature Extraction
Extracts three types of features:
- Text features: Preprocessed text (LaTeX cleaning, lemmatization)
- Math symbol features: 10 binary indicators (has_fraction, has_sqrt, etc.)
- Numeric features: 5 statistical measures (num_count, avg_number, etc.)
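A hedged sketch of the second and third feature groups; the full indicator set lives in MathFeatureExtractor, so only has_fraction, has_sqrt, num_count, and avg_number (named above) are shown here:
import re

def math_symbol_features(text: str) -> dict:
    # Two of the ten binary indicators; the remaining eight follow
    # the same look-for-a-LaTeX-command pattern.
    return {
        'has_fraction': int(r'\frac' in text),
        'has_sqrt': int(r'\sqrt' in text),
    }

def numeric_features(text: str) -> dict:
    # Two of the five statistical measures over numbers in the problem.
    numbers = [float(n) for n in re.findall(r'\d+\.?\d*', text)]
    return {
        'num_count': len(numbers),
        'avg_number': sum(numbers) / len(numbers) if numbers else 0.0,
    }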
Block 10: Vectorization
- Creates TF-IDF features (5000 dimensions, unigrams through trigrams)
- Scales additional features to [0,1] using MinMaxScaler
- Critical: Fits ONLY on training data (prevents data leakage)
- Converts to CSR format for efficient operations
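The core of Block 10 as a self-contained sketch, with toy stand-ins for the Block 9 outputs (variable names and toy values are illustrative):
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for Block 9 outputs
train_texts = ["solve x^2 = 16", "probability of two heads"]
test_texts = ["area of a circle"]
extra_train = np.array([[2, 16.0], [2, 2.0]])  # e.g. num_count, avg_number
extra_test = np.array([[1, 0.0]])

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
scaler = MinMaxScaler()

# Fit ONLY on training data, then transform both splits (no leakage);
# hstack returns COO, so convert to CSR for indexing and fast math.
X_train = hstack([vectorizer.fit_transform(train_texts),
                  scaler.fit_transform(extra_train)]).tocsr()
X_test = hstack([vectorizer.transform(test_texts),
                 scaler.transform(extra_test)]).tocsr()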
Block 11: Model Training
Trains 5 optimized models:
- Naive Bayes (baseline)
- Logistic Regression (linear classifier)
- SVM (maximum margin)
- Random Forest (ensemble)
- Gradient Boosting (sequential ensemble)
Output:
- Comparison table with Accuracy, F1-Score, Training Time
- 2 bar charts comparing performance and speed
- Selects best model automatically
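A condensed sketch of the comparison loop, assuming X_train, y_train, X_test, y_test come from Block 10 (three of the five models shown; hyperparameters mirror the Key Design Decisions section below):
from time import perf_counter

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

models = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(C=1.0, class_weight='balanced',
                                              max_iter=1000),
    'SVM': LinearSVC(C=1.0, class_weight='balanced'),
}

results = {}
for name, model in models.items():
    start = perf_counter()
    model.fit(X_train, y_train)
    train_time = perf_counter() - start
    pred = model.predict(X_test)
    results[name] = {'accuracy': accuracy_score(y_test, pred),
                     'f1': f1_score(y_test, pred, average='weighted'),
                     'time': train_time}

best_name = max(results, key=lambda n: results[n]['f1'])  # auto-select by F1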
Block 12: Detailed Evaluation
- Confusion matrix visualization
- Classification report (precision, recall, F1 per class)
- Feature importance (for tree-based models)
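The two core calls, assuming best_model and the test split from the earlier blocks:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

pred = best_model.predict(X_test)
print(classification_report(y_test, pred))  # precision/recall/F1 per class

# Confusion matrix heatmap; from_predictions infers the label set
ConfusionMatrixDisplay.from_predictions(y_test, pred)
plt.show()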
Block 13: Complete Analysis
Comprehensive evaluation on the entire test set
6 Visualizations:
- Confusion Matrix (absolute counts)
- Normalized Confusion Matrix (proportions)
- F1-Score by Topic (horizontal bar chart)
- Precision vs Recall (scatter plot, size = support)
- Test Set Distribution (bar chart)
- Confidence Distribution (histogram: correct vs incorrect; sketched below)
Analysis Sections:
- Overall performance (accuracy, F1-score)
- Per-class metrics table
- Confusion pair analysis
- Summary statistics
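A sketch of the confidence-distribution plot from the visualization list above; it assumes best_model exposes predict_proba (logistic regression and the tree ensembles do; LinearSVC does not) and that y_test is a NumPy array:
import matplotlib.pyplot as plt
import numpy as np

proba = best_model.predict_proba(X_test)
confidence = proba.max(axis=1)  # top-class probability per test example
correct = best_model.predict(X_test) == np.asarray(y_test)

plt.hist(confidence[correct], bins=20, alpha=0.6, label='correct')
plt.hist(confidence[~correct], bins=20, alpha=0.6, label='incorrect')
plt.xlabel('Prediction confidence')
plt.legend()
plt.show()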
Expected Results
Model Performance (F1-Score)
- Gradient Boosting: 86-90%
- Logistic Regression: 85-89%
- SVM: 84-88%
- Naive Bayes: 78-82%
- Random Forest: 75-82% (expected to underperform on sparse features)
Training Time
- Naive Bayes: ~10 seconds
- Logistic Regression: ~30 seconds
- SVM: ~2 minutes
- Random Forest: ~3 minutes
- Gradient Boosting: ~5 minutes
Per-Topic Performance
High Performance (F1 > 90%):
- counting_and_probability
- number_theory
Medium Performance (F1: 85-90%):
- geometry
- precalculus
Challenging (F1: 80-85%):
- algebra ↔ intermediate_algebra (similar concepts)
- prealgebra ↔ algebra (overlapping operations)
Key Design Decisions
1. Data Leakage Prevention
Critical: TF-IDF vectorizer fitted ONLY on training data
Train/Test Split → Fit Vectorizer on Train → Transform Both
Without this, test vocabulary leaks into training, inflating performance by 1-3%.
2. Feature Engineering
Hybrid approach:
- TF-IDF (5000 features): Captures text content
- Math symbols (10 features): Topic indicators (e.g., integrals → calculus)
- Numeric features (5 features): Statistical properties
Why no hand-crafted keywords? Topic-specific keyword lists were avoided to prevent heuristic bias; the model learns discriminative vocabulary from the data instead.
3. Hyperparameter Optimization
All models use optimized parameters:
- C=1.0 (SVM/Logistic): Balanced regularization
- max_depth=30 (Random Forest): Sufficient complexity
- subsample=0.8 (Gradient Boosting): Stochastic sampling prevents overfitting
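In scikit-learn terms, these settings might look like the following sketch (other parameters left at their defaults; n_estimators values are not specified above and are therefore omitted):
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

logreg = LogisticRegression(C=1.0, class_weight='balanced')
svm = LinearSVC(C=1.0, class_weight='balanced')
forest = RandomForestClassifier(max_depth=30, class_weight='balanced')
boosting = GradientBoostingClassifier(subsample=0.8)  # stochastic sampling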
4. Class Imbalance Handling
class_weight='balanced' automatically adjusts weights inversely proportional to class frequencies.
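The weights scikit-learn computes can be inspected directly; y_train is assumed from Block 10:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight('balanced', classes=classes, y=y_train)
# Each weight equals n_samples / (n_classes * count_of_that_class),
# so rarer topics get proportionally larger weights.
print(dict(zip(classes, weights)))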
Methodology
Problem Type
Supervised Multi-Class Text Classification
Why Classification (not Clustering)?
- Categories are predefined and labeled
- Objective: Assign to known subtopic
- Not discovering latent groups
- Supervised learning with known labels
Pipeline
JSON Files
  ↓
Parquet Conversion (Block 4)
  ↓
Feature Extraction (Block 9)
  ↓
TF-IDF Vectorization (Block 10)
  ↓
Model Training (Block 11)
  ↓
Evaluation (Blocks 12-13)
Feature Vector
Total: 5015 dimensions
├── TF-IDF: 5000 (unigrams, bigrams, trigrams)
├── Math Symbols: 10 (binary indicators)
└── Numeric: 5 (scaled to [0,1])
Troubleshooting
"No data loaded"
Solution: Check data path in Block 3
DATA_PATH = './math' # Adjust to your path
"NameError: name 'results' is not defined"
Solution: Run the blocks in order; Blocks 12-13 depend on the results object created in Block 11.
"ValueError: Negative values"
Solution: Ensure Block 10 completed successfully. Naive Bayes requires non-negative inputs, and Block 10's MinMaxScaler scales the additional features to [0,1].
"TypeError: coo_matrix not subscriptable"
Solution: Block 10 converts the feature matrix to CSR format (sparse hstack returns COO, which does not support indexing). Ensure it runs completely.
Model underperforms
Check:
- Data leakage prevented? (Vectorizer fitted on train only)
- Features extracted correctly? (Block 9 output)
- Class distribution balanced? (Block 8 charts)
Performance Optimization
Speed Up Training
# Reduce vocabulary
vectorizer_config = {'max_features': 2000}
# Fewer trees
RandomForestClassifier(n_estimators=100)
# Fewer boosting rounds
GradientBoostingClassifier(n_estimators=50)
Reduce Memory
# Smaller vocabulary
vectorizer_config = {'max_features': 3000}
# Fewer n-grams
vectorizer_config = {'ngram_range': (1, 2)}
Output Files
After Block 13 completes, you'll have:
- train.parquet: Training data (consolidated)
- test.parquet: Test data (consolidated)
- Performance metrics and visualizations
- Model saved in memory (classifier.best_model)
Next Steps
Save Model
Add after Block 13:
import pickle

model_data = {
    'model': classifier.best_model,
    'vectorizer': classifier.vectorizer,
    'scaler': classifier.scaler,
    'label_encoder': classifier.label_encoder,
}
with open('model.pkl', 'wb') as f:
    pickle.dump(model_data, f)
Batch Prediction
# Load model
import pickle
with open('model.pkl', 'rb') as f:
    model_data = pickle.load(f)

# Predict: preprocessing must mirror Blocks 9-10 (LaTeX cleaning,
# math-symbol/numeric features, TF-IDF + scaling, then hstack)
new_problems = ["Solve x^2 = 16", "Find area of circle"]
for problem in new_problems:
    # Preprocess → Extract features → Predict
    features = ...  # 5015-dimensional vector built as in Block 10
    prediction = model_data['model'].predict(features)
Summary
13 Blocks, 3 Stages:
- Setup (Blocks 1-7): One-time environment setup
- Training (Blocks 8-11): Data loading and model training
- Evaluation (Blocks 12-13): Comprehensive analysis
Key Features:
- Data leakage prevention
- 5 optimized models
- 6 visualization types
- Probability predictions
- Error analysis
Expected Time: 10-15 minutes total (including training)
Expected Performance: 85-90% F1-score on test set