# Math Question Classifier - Quick Start Guide
## Execution Order
### Setup (Blocks 1-7)
**Run once to set up the environment and define classes**
1. **Block 1**: Install packages
2. **Block 2**: Import libraries
3. **Block 3**: Set data path
4. **Block 4**: Convert JSON to Parquet (one-time data preparation)
5. **Block 5**: Define MathDatasetLoader class
6. **Block 6**: Define MathFeatureExtractor class
7. **Block 7**: Define MathQuestionClassifier class
### Training & Evaluation (Blocks 8-13)
**Run to train and evaluate the models**
8. **Block 8**: Load dataset from Parquet files
9. **Block 9**: Extract features (text preprocessing + math symbols + numeric)
10. **Block 10**: Vectorize features (TF-IDF + scaling)
11. **Block 11**: Train 5 models and compare performance
12. **Block 12**: Detailed evaluation of the best model
13. **Block 13**: Complete test set analysis with 6 visualizations
---
## What Each Block Does
### Blocks 1-3: Environment Setup
- Installs scikit-learn, pandas, matplotlib, seaborn, nltk
- Imports all necessary libraries
- Sets the path to the data directory (`./math`)
### Block 4: Data Consolidation
**Purpose**: Convert JSON files to Parquet format
- **Input**: `./math/train/` and `./math/test/` folders with JSON files
- **Output**: `train.parquet` and `test.parquet`
- **Benefit**: 10-100x faster loading than JSON
- **Run**: Only once (skip if the Parquet files already exist)
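A minimal sketch of what this conversion might look like, assuming one JSON object per file and the topic encoded in the subfolder name (the actual Block 4 may differ); writing Parquet requires `pyarrow` or `fastparquet`:
```python
import json
from pathlib import Path
import pandas as pd

def json_folder_to_parquet(folder, out_path):
    """Read every JSON file under `folder` into one DataFrame and save as Parquet."""
    records = []
    for path in Path(folder).rglob('*.json'):
        with open(path, encoding='utf-8') as f:
            record = json.load(f)           # assumed: one problem dict per file
        record['topic'] = path.parent.name  # assumed: topic = subfolder name
        records.append(record)
    pd.DataFrame(records).to_parquet(out_path, index=False)

json_folder_to_parquet('./math/train', 'train.parquet')
json_folder_to_parquet('./math/test', 'test.parquet')
```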
### Blocks 5-7: Class Definitions
Define the three main classes:
- **MathDatasetLoader**: Loads Parquet files, shows statistics
- **MathFeatureExtractor**: Cleans LaTeX, extracts math symbols, preprocesses text
- **MathQuestionClassifier**: Trains models, evaluates performance
### Block 8: Load Data
- Loads `train.parquet` and `test.parquet`
- Shows the class distribution for the train and test sets
- Displays 2 bar charts (train/test distribution), as sketched below
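A minimal sketch of this step, assuming the label column is named `topic` (in the notebook the loader class encapsulates this):
```python
import pandas as pd
import matplotlib.pyplot as plt

train_df = pd.read_parquet('train.parquet')
test_df = pd.read_parquet('test.parquet')

# Side-by-side class distributions for the two splits
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
train_df['topic'].value_counts().plot.bar(ax=axes[0], title='Train distribution')
test_df['topic'].value_counts().plot.bar(ax=axes[1], title='Test distribution')
plt.tight_layout()
plt.show()
```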
### Block 9: Feature Extraction
Extracts three types of features (sketched below):
1. **Text features**: Preprocessed text (LaTeX cleaning, lemmatization)
2. **Math symbol features**: 10 binary indicators (has_fraction, has_sqrt, etc.)
3. **Numeric features**: 5 statistical measures (num_count, avg_number, etc.)
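For illustration, a sketch of how the symbol and numeric extractors could work; `has_fraction`, `has_sqrt`, `num_count`, and `avg_number` come from the list above, while the remaining names are hypothetical:
```python
import re

def math_symbol_features(text):
    """Binary indicators for LaTeX constructs (illustrative subset of the 10)."""
    return {
        'has_fraction': int('\\frac' in text),
        'has_sqrt': int('\\sqrt' in text),
        'has_integral': int('\\int' in text),
        'has_summation': int('\\sum' in text),
    }

def numeric_features(text):
    """Simple statistics over the numbers appearing in the problem."""
    numbers = [float(n) for n in re.findall(r'-?\d+\.?\d*', text)]
    return {
        'num_count': len(numbers),
        'avg_number': sum(numbers) / len(numbers) if numbers else 0.0,
        'max_number': max(numbers) if numbers else 0.0,
    }
```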
### Block 10: Vectorization
- Creates TF-IDF features (5000 dimensions, up to trigrams)
- Scales the additional features to [0,1] using MinMaxScaler
- **Critical**: Fits ONLY on training data (prevents data leakage)
- Converts to CSR format for efficient operations (see the sketch below)
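A sketch of the fit/transform discipline and the CSR conversion, assuming `train_texts`/`test_texts` and the extra feature arrays come from Block 9 (names are illustrative):
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from scipy.sparse import hstack, csr_matrix

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
scaler = MinMaxScaler()

# Fit on training data only, then transform both splits
X_train_text = vectorizer.fit_transform(train_texts)
X_test_text = vectorizer.transform(test_texts)
X_train_extra = scaler.fit_transform(train_extra)  # 15 math-symbol + numeric columns
X_test_extra = scaler.transform(test_extra)

# Stack into one sparse matrix; CSR supports row slicing and fast products
X_train = hstack([X_train_text, csr_matrix(X_train_extra)]).tocsr()
X_test = hstack([X_test_text, csr_matrix(X_test_extra)]).tocsr()
```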
### Block 11: Model Training
Trains 5 optimized models (see the sketch after this list):
1. **Naive Bayes** (baseline)
2. **Logistic Regression** (linear classifier)
3. **SVM** (maximum margin)
4. **Random Forest** (ensemble)
5. **Gradient Boosting** (sequential ensemble)
**Output**:
- Comparison table with Accuracy, F1-Score, and Training Time
- 2 bar charts comparing performance and speed
- Automatically selects the best model
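A sketch of the comparison loop, reusing the hyperparameters from the Key Design Decisions section below; `LinearSVC`, `max_iter=1000`, and the weighted F1 average are assumptions, and `X_train`/`y_train`/`X_test`/`y_test` come from Block 10:
```python
import time
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score

models = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(C=1.0, class_weight='balanced', max_iter=1000),
    'SVM': LinearSVC(C=1.0, class_weight='balanced'),
    'Random Forest': RandomForestClassifier(max_depth=30, class_weight='balanced'),
    'Gradient Boosting': GradientBoostingClassifier(subsample=0.8),
}

results = {}
for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred, average='weighted'),
        'time': time.time() - start,
    }

# Pick the best model by F1-score
best_name = max(results, key=lambda n: results[n]['f1'])
```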
### Block 12: Detailed Evaluation
- Confusion matrix visualization
- Classification report (precision, recall, F1 per class)
- Feature importance (for tree-based models)
### Block 13: Complete Analysis
**Comprehensive evaluation on the entire test set**
**6 Visualizations**:
1. Confusion Matrix (absolute counts)
2. Normalized Confusion Matrix (proportions)
3. F1-Score by Topic (horizontal bar chart)
4. Precision vs Recall (scatter plot, point size = support)
5. Test Set Distribution (bar chart)
6. Confidence Distribution (histogram: correct vs incorrect)
**Analysis Sections**:
- Overall performance (accuracy, F1-score)
- Per-class metrics table
- Confusion pair analysis
- Summary statistics
---
## Expected Results
### Model Performance (F1-Score)
- **Gradient Boosting**: 86-90%
- **Logistic Regression**: 85-89%
- **SVM**: 84-88%
- **Naive Bayes**: 78-82%
- **Random Forest**: 75-82% (expected to underperform on sparse features)
### Training Time
- **Naive Bayes**: ~10 seconds
- **Logistic Regression**: ~30 seconds
- **SVM**: ~2 minutes
- **Random Forest**: ~3 minutes
- **Gradient Boosting**: ~5 minutes
### Per-Topic Performance
**High Performance** (F1 > 90%):
- counting_and_probability
- number_theory
**Medium Performance** (F1: 85-90%):
- geometry
- precalculus
**Challenging** (F1: 80-85%):
- algebra ↔ intermediate_algebra (similar concepts)
- prealgebra ↔ algebra (overlapping operations)
---
## Key Design Decisions
### 1. Data Leakage Prevention
**Critical**: The TF-IDF vectorizer is fitted ONLY on training data.
```
Train/Test Split → Fit Vectorizer on Train → Transform Both
```
Without this, test vocabulary leaks into training, inflating measured performance by 1-3%.
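Sketched with illustrative names, the rule in code:
```python
# Correct: vocabulary is learned from the training split only
vectorizer.fit(train_texts)
X_train = vectorizer.transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Wrong: fitting on the combined corpus leaks test vocabulary
# vectorizer.fit(train_texts + test_texts)
```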
### 2. Feature Engineering
**Hybrid approach**:
- TF-IDF (5000 features): Captures text content
- Math symbols (10 features): Topic indicators (e.g., integrals → calculus)
- Numeric features (5 features): Statistical properties
**Why no hand-crafted keywords?**
Topic-specific keyword lists were avoided to prevent heuristic bias; the model learns discriminative vocabulary from the data instead.
### 3. Hyperparameter Optimization
All models use optimized parameters:
- **C=1.0** (SVM/Logistic Regression): Balanced regularization
- **max_depth=30** (Random Forest): Sufficient complexity
- **subsample=0.8** (Gradient Boosting): Stochastic sampling prevents overfitting
### 4. Class Imbalance Handling
`class_weight='balanced'` automatically adjusts class weights inversely proportional to class frequencies, as in the snippet below.
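A minimal example of the flag; scikit-learn computes each class weight as `n_samples / (n_classes * class_count)`:
```python
from sklearn.linear_model import LogisticRegression

# Rare topics get proportionally larger weights in the loss
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
```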
---
## Methodology
### Problem Type
**Supervised Multi-Class Text Classification**
**Why Classification (not Clustering)?**
- Categories are predefined and labeled
- Objective: Assign each question to a known subtopic
- Not discovering latent groups
- Supervised learning with known labels
### Pipeline
```
JSON Files
    ↓
Parquet Conversion (Block 4)
    ↓
Feature Extraction (Block 9)
    ↓
TF-IDF Vectorization (Block 10)
    ↓
Model Training (Block 11)
    ↓
Evaluation (Blocks 12-13)
```
### Feature Vector
```
Total: 5015 dimensions
├── TF-IDF: 5000 (unigrams, bigrams, trigrams)
├── Math Symbols: 10 (binary indicators)
└── Numeric: 5 (scaled to [0,1])
```
---
## Troubleshooting
### "No data loaded"
**Solution**: Check the data path in Block 3:
```python
DATA_PATH = './math'  # Adjust to your path
```
### "NameError: name 'results' is not defined"
**Solution**: Run the blocks in order. Blocks 12-13 require Block 11 to have run first.
### "ValueError: Negative values"
**Solution**: Naive Bayes (MultinomialNB) requires non-negative features. Ensure Block 10 ran to completion: MinMaxScaler scales all features into [0,1].
### "TypeError: coo_matrix not subscriptable"
**Solution**: Block 10 converts the combined matrix to CSR format. Ensure it runs completely (see the snippet below).
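If you hit this error in your own modifications, the fix is one call (illustrative names; `scipy.sparse.hstack` returns a COO matrix by default):
```python
from scipy.sparse import hstack

# COO matrices do not support indexing; convert to CSR before slicing rows
X = hstack([X_text, X_extra]).tocsr()
```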
### Model underperforms
**Check**:
1. Is data leakage prevented? (Vectorizer fitted on train only)
2. Are features extracted correctly? (Block 9 output)
3. Is the class distribution balanced? (Block 8 charts)
| ## Performance Optimization | |
| ### Speed Up Training | |
| ```python | |
| # Reduce vocabulary | |
| vectorizer_config = {'max_features': 2000} | |
| # Fewer trees | |
| RandomForestClassifier(n_estimators=100) | |
| # Fewer boosting rounds | |
| GradientBoostingClassifier(n_estimators=50) | |
| ``` | |
| ### Reduce Memory | |
| ```python | |
| # Smaller vocabulary | |
| vectorizer_config = {'max_features': 3000} | |
| # Fewer n-grams | |
| vectorizer_config = {'ngram_range': (1, 2)} | |
| ``` | |
---
## Output Files
After Block 13 completes, you'll have:
- **train.parquet**: Consolidated training data
- **test.parquet**: Consolidated test data
- Performance metrics and visualizations
- The best model held in memory as `classifier.best_model`
| ## Next Steps | |
| ### Save Model | |
| Add after Block 13: | |
| ```python | |
| import pickle | |
| model_data = { | |
| 'model': classifier.best_model, | |
| 'vectorizer': classifier.vectorizer, | |
| 'scaler': classifier.scaler, | |
| 'label_encoder': classifier.label_encoder | |
| } | |
| with open('model.pkl', 'wb') as f: | |
| pickle.dump(model_data, f) | |
| ``` | |
### Batch Prediction
```python
import pickle

# Load the saved model components
with open('model.pkl', 'rb') as f:
    model_data = pickle.load(f)
model = model_data['model']

# Each new problem must go through the same pipeline as Blocks 9-10
new_problems = ["Solve x^2 = 16", "Find area of circle"]
for problem in new_problems:
    # Preprocess → Extract features → Vectorize (MathFeatureExtractor,
    # model_data['vectorizer'], model_data['scaler']) → Predict
    prediction = model.predict(...)  # pass the assembled feature vector here
```
---
## Summary
**13 Blocks, 3 Stages**:
1. **Setup** (Blocks 1-7): One-time environment setup
2. **Training** (Blocks 8-11): Data loading and model training
3. **Evaluation** (Blocks 12-13): Comprehensive analysis
**Key Features**:
- Data leakage prevention
- 5 optimized models
- 6 visualization types
- Probability predictions
- Error analysis
**Expected Time**: 10-15 minutes total (including training)
**Expected Performance**: 85-90% F1-score on the test set