# Math Question Classifier - Quick Start Guide

## Execution Order

### Setup (Blocks 1-7)

**Run once to set up the environment and define classes**

1. **Block 1**: Install packages
2. **Block 2**: Import libraries
3. **Block 3**: Set data path
4. **Block 4**: Convert JSON to Parquet (one-time data preparation)
5. **Block 5**: Define MathDatasetLoader class
6. **Block 6**: Define MathFeatureExtractor class
7. **Block 7**: Define MathQuestionClassifier class

### Training & Evaluation (Blocks 8-13)

**Run to train and evaluate models**

8. **Block 8**: Load dataset from Parquet files
9. **Block 9**: Extract features (text preprocessing + math symbols + numeric)
10. **Block 10**: Vectorize features (TF-IDF + scaling)
11. **Block 11**: Train 5 models and compare performance
12. **Block 12**: Detailed evaluation of best model
13. **Block 13**: Complete test set analysis with 6 visualizations

---

## What Each Block Does

### Blocks 1-3: Environment Setup

- Installs scikit-learn, pandas, matplotlib, seaborn, nltk
- Imports all necessary libraries
- Sets path to data directory (`./math`)

### Block 4: Data Consolidation

**Purpose**: Convert JSON files to Parquet format

- **Input**: `./math/train/` and `./math/test/` folders with JSON files
- **Output**: `train.parquet` and `test.parquet`
- **Benefit**: 10-100x faster loading than JSON
- **Run**: Only once (skip if Parquet files already exist)

### Blocks 5-7: Class Definitions

Define three main classes:

- **MathDatasetLoader**: Loads Parquet files, shows statistics
- **MathFeatureExtractor**: Cleans LaTeX, extracts math symbols, preprocesses text
- **MathQuestionClassifier**: Trains models, evaluates performance

### Block 8: Load Data

- Loads `train.parquet` and `test.parquet`
- Shows class distribution for train and test sets
- Displays 2 bar charts (train/test distribution)

### Block 9: Feature Extraction

Extracts three types of features:

1. **Text features**: Preprocessed text (LaTeX cleaning, lemmatization)
2. **Math symbol features**: 10 binary indicators (has_fraction, has_sqrt, etc.)
3. **Numeric features**: 5 statistical measures (num_count, avg_number, etc.)

### Block 10: Vectorization

- Creates TF-IDF features (5000 dimensions, unigrams through trigrams)
- Scales additional features to [0,1] using MinMaxScaler
- **Critical**: Fits ONLY on training data (prevents data leakage)
- Converts to CSR format for efficient operations

### Block 11: Model Training

Trains 5 optimized models:

1. **Naive Bayes** (baseline)
2. **Logistic Regression** (linear classifier)
3. **SVM** (maximum margin)
4. **Random Forest** (ensemble)
5. **Gradient Boosting** (sequential ensemble)

**Output**:

- Comparison table with Accuracy, F1-Score, Training Time
- 2 bar charts comparing performance and speed
- Selects best model automatically

### Block 12: Detailed Evaluation

- Confusion matrix visualization
- Classification report (precision, recall, F1 per class)
- Feature importance (for tree-based models)

### Block 13: Complete Analysis

**Comprehensive evaluation on entire test set**

**6 Visualizations**:

1. Confusion Matrix (absolute counts)
2. Normalized Confusion Matrix (proportions)
3. F1-Score by Topic (horizontal bar chart)
4. Precision vs Recall (scatter plot, size = support)
5. Test Set Distribution (bar chart)
6. Confidence Distribution (histogram: correct vs incorrect)

**Analysis Sections**:

- Overall performance (accuracy, F1-score)
- Per-class metrics table
- Confusion pair analysis
- Summary statistics

---

## Expected Results

### Model Performance (F1-Score)

- **Gradient Boosting**: 86-90%
- **Logistic Regression**: 85-89%
- **SVM**: 84-88%
- **Naive Bayes**: 78-82%
- **Random Forest**: 75-82% (expected to underperform on sparse features)

### Training Time

- **Naive Bayes**: ~10 seconds
- **Logistic Regression**: ~30 seconds
- **SVM**: ~2 minutes
- **Random Forest**: ~3 minutes
- **Gradient Boosting**: ~5 minutes

### Per-Topic Performance

**High Performance** (F1 > 90%):

- counting_and_probability
- number_theory

**Medium Performance** (F1: 85-90%):

- geometry
- precalculus

**Challenging** (F1: 80-85%):

- algebra ↔ intermediate_algebra (similar concepts)
- prealgebra ↔ algebra (overlapping operations)

---

## Key Design Decisions

### 1. Data Leakage Prevention

**Critical**: TF-IDF vectorizer fitted ONLY on training data

```
Train/Test Split → Fit Vectorizer on Train → Transform Both
```

Without this, test vocabulary leaks into training, inflating performance by 1-3%.

### 2. Feature Engineering

**Hybrid approach**:

- TF-IDF (5000 features): Captures text content
- Math symbols (10 features): Topic indicators (e.g., integrals → calculus)
- Numeric features (5 features): Statistical properties

**Why no hand-crafted keywords?**
Topic-specific keyword lists were avoided to prevent heuristic bias. The model learns discriminative vocabulary from the data instead.

### 3. Hyperparameter Optimization

All models use optimized parameters:

- **C=1.0** (SVM/Logistic): Balanced regularization
- **max_depth=30** (Random Forest): Sufficient complexity
- **subsample=0.8** (Gradient Boosting): Stochastic sampling prevents overfitting

### 4. Class Imbalance Handling

`class_weight='balanced'` automatically adjusts weights inversely proportional to class frequencies.

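
As a rough illustration of what "inversely proportional to class frequencies" means, scikit-learn documents the balanced weight of a class as `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python. The helper name `balanced_class_weights` and the toy labels are hypothetical and are not taken from the notebook; rare classes end up with weights above 1 and frequent classes with weights below 1.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up toy labels: 6 algebra, 3 geometry, 1 number_theory
toy_labels = ["algebra"] * 6 + ["geometry"] * 3 + ["number_theory"]
weights = balanced_class_weights(toy_labels)

# The rarest class receives the largest weight
for topic, weight in sorted(weights.items()):
    print(f"{topic}: {weight:.3f}")
```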

---

## Methodology

### Problem Type

**Supervised Multi-Class Text Classification**

**Why Classification (not Clustering)?**

- Categories are predefined and labeled
- Objective: Assign each question to a known subtopic
- Not discovering latent groups
- Supervised learning with known labels

### Pipeline

```
JSON Files
    ↓
Parquet Conversion (Block 4)
    ↓
Feature Extraction (Block 9)
    ↓
TF-IDF Vectorization (Block 10)
    ↓
Model Training (Block 11)
    ↓
Evaluation (Blocks 12-13)
```

### Feature Vector

```
Total: 5015 dimensions
├── TF-IDF: 5000 (unigrams, bigrams, trigrams)
├── Math Symbols: 10 (binary indicators)
└── Numeric: 5 (scaled to [0,1])
```

---

## Troubleshooting

### "No data loaded"

**Solution**: Check data path in Block 3

```python
DATA_PATH = './math'  # Adjust to your path
```

### "NameError: name 'results' is not defined"

**Solution**: Run blocks in order. Blocks 12-13 need Block 11 first.

### "ValueError: Negative values"

**Solution**: Make sure Block 10 completed successfully. MinMaxScaler scales the additional features to [0,1], which removes the negative values that some models (for example multinomial Naive Bayes) reject.

### "TypeError: coo_matrix not subscriptable"

**Solution**: Block 10 converts the combined feature matrix to CSR format, which supports indexing. Ensure it runs completely.

### Model underperforms

**Check**:

1. Data leakage prevented? (Vectorizer fitted on train only)
2. Features extracted correctly? (Block 9 output)
3. Class distribution balanced? (Block 8 charts)

---

## Performance Optimization

### Speed Up Training

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Reduce vocabulary
vectorizer_config = {'max_features': 2000}

# Fewer trees
RandomForestClassifier(n_estimators=100)

# Fewer boosting rounds
GradientBoostingClassifier(n_estimators=50)
```

### Reduce Memory

```python
vectorizer_config = {
    'max_features': 3000,   # Smaller vocabulary
    'ngram_range': (1, 2),  # Fewer n-grams (drops trigrams)
}
```

---

## Output Files

After Block 13 completes, you'll have:

- **train.parquet**: Training data (consolidated)
- **test.parquet**: Test data (consolidated)
- Performance metrics and visualizations
- Best model held in memory (`classifier.best_model`)

---

## Next Steps

### Save Model

Add after Block 13:

```python
import pickle

# Bundle the model with the fitted preprocessing objects it depends on
model_data = {
    'model': classifier.best_model,
    'vectorizer': classifier.vectorizer,
    'scaler': classifier.scaler,
    'label_encoder': classifier.label_encoder
}

with open('model.pkl', 'wb') as f:
    pickle.dump(model_data, f)
```

### Batch Prediction

```python
import pickle

# Load model
with open('model.pkl', 'rb') as f:
    model_data = pickle.load(f)
model = model_data['model']

# Predict
new_problems = ["Solve x^2 = 16", "Find area of circle"]
for problem in new_problems:
    # Preprocess → Extract features → Predict
    # (`...` stands for the feature vector built as in Blocks 9-10)
    prediction = model.predict(...)
```

---

## Summary

**13 Blocks, 3 Stages**:

1. **Setup** (Blocks 1-7): One-time environment setup
2. **Training** (Blocks 8-11): Data loading and model training
3. **Evaluation** (Blocks 12-13): Comprehensive analysis

**Key Features**:

- Data leakage prevention
- 5 optimized models
- 6 visualization types
- Probability predictions
- Error analysis

**Expected Time**: 10-15 minutes total (including training)

**Expected Performance**: 85-90% F1-score on test set
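
To make the "data leakage prevention" feature concrete, here is a minimal sketch of the fit-on-train, transform-both pattern. It swaps in a simplified count vectorizer written in plain Python for the notebook's TF-IDF vectorizer, so the function names `fit_vocabulary` and `transform` and the example sentences are assumptions for illustration only. The point is that the vocabulary is built from training text alone, and test-only tokens are simply ignored.

```python
def fit_vocabulary(train_docs):
    """Build a token -> column index map from training documents only."""
    vocab = {}
    for doc in train_docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def transform(docs, vocab):
    """Count-vectorize docs with a fixed vocabulary; unseen tokens are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for token in doc.lower().split():
            if token in vocab:
                row[vocab[token]] += 1
        rows.append(row)
    return rows


# Made-up example documents
train_docs = ["find the area of the circle", "solve the equation"]
test_docs = ["find the perimeter of the square"]

vocab = fit_vocabulary(train_docs)      # fit on train only
X_train = transform(train_docs, vocab)  # transform train
X_test = transform(test_docs, vocab)    # transform test with the same vocabulary

print(sorted(vocab))
print(X_test)
```

Fitting on the combined train and test text would instead add columns for `perimeter` and `square`, which is the kind of leakage the guide warns about.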