# Math Question Classifier - Quick Start Guide

## Execution Order

### Setup (Blocks 1-7)
**Run once to set up the environment and define the classes**

1. **Block 1**: Install packages
2. **Block 2**: Import libraries  
3. **Block 3**: Set data path
4. **Block 4**: Convert JSON to Parquet (one-time data preparation)
5. **Block 5**: Define MathDatasetLoader class
6. **Block 6**: Define MathFeatureExtractor class
7. **Block 7**: Define MathQuestionClassifier class

### Training & Evaluation (Blocks 8-13)
**Run to train and evaluate models**

8. **Block 8**: Load dataset from Parquet files
9. **Block 9**: Extract features (text preprocessing + math symbols + numeric)
10. **Block 10**: Vectorize features (TF-IDF + scaling)
11. **Block 11**: Train 5 models and compare performance
12. **Block 12**: Detailed evaluation of best model
13. **Block 13**: Complete test set analysis with 6 visualizations

---

## What Each Block Does

### Block 1-3: Environment Setup
- Installs scikit-learn, pandas, matplotlib, seaborn, nltk
- Imports all necessary libraries
- Sets path to data directory (`./math`)

### Block 4: Data Consolidation
**Purpose**: Convert the JSON files to Parquet format (a conversion sketch follows this list)
- **Input**: `./math/train/` and `./math/test/` folders with JSON files
- **Output**: `train.parquet` and `test.parquet`
- **Benefit**: 10-100x faster loading than JSON
- **Run**: Only once (skip if Parquet files already exist)
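
A minimal conversion sketch, assuming each JSON file holds one problem as a flat object whose fields include the label (as in the MATH dataset's `problem`/`type` keys; adjust the keys and paths to your data). Writing Parquet requires `pyarrow` or `fastparquet`.

```python
import json
from pathlib import Path
import pandas as pd

def consolidate(split_dir: str, out_file: str) -> None:
    """Read every JSON file under split_dir into one DataFrame and write Parquet."""
    records = []
    for path in Path(split_dir).rglob('*.json'):
        with open(path, encoding='utf-8') as f:
            records.append(json.load(f))
    pd.DataFrame(records).to_parquet(out_file, index=False)

consolidate('./math/train', 'train.parquet')
consolidate('./math/test', 'test.parquet')
```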

### Block 5-7: Class Definitions
Define three main classes:
- **MathDatasetLoader**: Loads Parquet files, shows statistics
- **MathFeatureExtractor**: Cleans LaTeX, extracts math symbols, preprocesses text
- **MathQuestionClassifier**: Trains models, evaluates performance

### Block 8: Load Data
- Loads `train.parquet` and `test.parquet` (see the loading sketch below)
- Shows class distribution for train and test sets
- Displays 2 bar charts (train/test distribution)
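
A loading sketch; the label column name (`type`) is an assumption carried over from the JSON fields:

```python
import pandas as pd

train_df = pd.read_parquet('train.parquet')
test_df = pd.read_parquet('test.parquet')

# Class distribution per split (plotted as bar charts in the notebook)
print(train_df['type'].value_counts())
print(test_df['type'].value_counts())
```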

### Block 9: Feature Extraction
Extracts three types of features (a minimal sketch follows the list):
1. **Text features**: Preprocessed text (LaTeX cleaning, lemmatization)
2. **Math symbol features**: 10 binary indicators (has_fraction, has_sqrt, etc.)
3. **Numeric features**: 5 statistical measures (num_count, avg_number, etc.)
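
A minimal sketch of the symbol and numeric features. The guide names `has_fraction` and `has_sqrt`; the remaining indicators and the regexes here are illustrative stand-ins for what `MathFeatureExtractor` does:

```python
import re

def math_symbol_features(text: str) -> dict:
    """Binary indicators for common LaTeX constructs (illustrative subset)."""
    return {
        'has_fraction': int('\\frac' in text),
        'has_sqrt': int('\\sqrt' in text),
        'has_integral': int('\\int' in text),
        'has_summation': int('\\sum' in text),
        'has_exponent': int('^' in text),
    }

def numeric_features(text: str) -> dict:
    """Simple statistics over the numbers appearing in the text."""
    numbers = [float(n) for n in re.findall(r'-?\d+\.?\d*', text)]
    return {
        'num_count': len(numbers),
        'avg_number': sum(numbers) / len(numbers) if numbers else 0.0,
        'max_number': max(numbers) if numbers else 0.0,
    }
```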

### Block 10: Vectorization
- Creates TF-IDF features (5000 dimensions, trigrams)
- Scales additional features to [0,1] using MinMaxScaler
- **Critical**: Fits ONLY on training data (prevents data leakage)
- Converts to CSR format for efficient operations (see the sketch below)
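
A leakage-safe vectorization sketch. The variable names (`train_texts`, `train_extra`, etc.) are illustrative stand-ins for Block 9's outputs:

```python
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
scaler = MinMaxScaler()

# Fit on the training split only, then transform both splits.
X_train_text = vectorizer.fit_transform(train_texts)
X_test_text = vectorizer.transform(test_texts)
extra_train = scaler.fit_transform(train_extra)
extra_test = scaler.transform(test_extra)

# Stack TF-IDF with the 15 scaled extra features; hstack returns COO,
# so convert to CSR for row slicing and fast arithmetic.
X_train = hstack([X_train_text, csr_matrix(extra_train)]).tocsr()
X_test = hstack([X_test_text, csr_matrix(extra_test)]).tocsr()
```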

### Block 11: Model Training
Trains 5 optimized models (a comparison-loop sketch follows the output list):
1. **Naive Bayes** (baseline)
2. **Logistic Regression** (linear classifier)
3. **SVM** (maximum margin)
4. **Random Forest** (ensemble)
5. **Gradient Boosting** (sequential ensemble)

**Output**:
- Comparison table with Accuracy, F1-Score, Training Time
- 2 bar charts comparing performance and speed
- Selects best model automatically
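
A compact sketch of the comparison loop. Parameters stated in the guide (`C=1.0`, `max_depth=30`, `subsample=0.8`, `class_weight='balanced'`) are used where the estimator supports them; everything else is left at scikit-learn defaults, and the SVM variant is an assumption:

```python
import time
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

models = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(C=1.0, max_iter=1000, class_weight='balanced'),
    'SVM': LinearSVC(C=1.0, class_weight='balanced'),
    'Random Forest': RandomForestClassifier(max_depth=30, class_weight='balanced'),
    'Gradient Boosting': GradientBoostingClassifier(subsample=0.8),
}

results = {}
for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results[name] = {
        'accuracy': accuracy_score(y_test, preds),
        'f1': f1_score(y_test, preds, average='weighted'),
        'train_time': time.time() - start,
    }

best_name = max(results, key=lambda name: results[name]['f1'])
```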

### Block 12: Detailed Evaluation
- Confusion matrix visualization (sketched below)
- Classification report (precision, recall, F1 per class)
- Feature importance (for tree-based models)
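
A minimal sketch of the confusion-matrix plot, assuming `y_test` and the best model's predictions `preds` are available from Block 11:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix

cm = confusion_matrix(y_test, preds)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Precision, recall, and F1 per class
print(classification_report(y_test, preds))
```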

### Block 13: Complete Analysis
**Comprehensive evaluation on entire test set**

**6 Visualizations**:
1. Confusion Matrix (absolute counts)
2. Normalized Confusion Matrix (proportions)
3. F1-Score by Topic (horizontal bar chart)
4. Precision vs Recall (scatter plot, size = support)
5. Test Set Distribution (bar chart)
6. Confidence Distribution (histogram: correct vs incorrect)

**Analysis Sections**:
- Overall performance (accuracy, F1-score)
- Per-class metrics table
- Confusion pair analysis
- Summary statistics

---

## Expected Results

### Model Performance (F1-Score)
- **Gradient Boosting**: 86-90%
- **Logistic Regression**: 85-89%
- **SVM**: 84-88%
- **Naive Bayes**: 78-82%
- **Random Forest**: 75-82% (expected to underperform on sparse features)

### Training Time
- **Naive Bayes**: ~10 seconds
- **Logistic Regression**: ~30 seconds
- **SVM**: ~2 minutes
- **Random Forest**: ~3 minutes
- **Gradient Boosting**: ~5 minutes

### Per-Topic Performance
**High Performance** (F1 > 90%):
- counting_and_probability
- number_theory

**Medium Performance** (F1: 85-90%):
- geometry
- precalculus

**Challenging** (F1: 80-85%):
- algebra ↔ intermediate_algebra (similar concepts)
- prealgebra ↔ algebra (overlapping operations)

---

## Key Design Decisions

### 1. Data Leakage Prevention
**Critical**: TF-IDF vectorizer fitted ONLY on training data
```
Train/Test Split → Fit Vectorizer on Train → Transform Both
```
Without this, test vocabulary leaks into training, inflating performance by 1-3%.
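
In scikit-learn terms, the rule comes down to which split receives `fit` (variable names are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

# Correct: vocabulary and IDF weights come from the training split only.
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Wrong: fitting on the combined corpus leaks test vocabulary into training.
# vectorizer.fit(train_texts + test_texts)
```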

### 2. Feature Engineering
**Hybrid approach**:
- TF-IDF (5000 features): Captures text content
- Math symbols (10 features): Topic indicators (e.g., integrals → calculus)
- Numeric features (5 features): Statistical properties

**Why no hand-crafted keywords?**
Topic-specific keyword lists were deliberately avoided to prevent heuristic bias; the model learns discriminative vocabulary from the data instead.

### 3. Hyperparameter Optimization
All models use optimized parameters:
- **C=1.0** (SVM/Logistic): Balanced regularization
- **max_depth=30** (Random Forest): Sufficient complexity
- **subsample=0.8** (Gradient Boosting): Stochastic sampling prevents overfitting

### 4. Class Imbalance Handling
`class_weight='balanced'` automatically adjusts weights inversely proportional to class frequencies.
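
For reference, scikit-learn computes the balanced weights as `n_samples / (n_classes * count(class))`; a small sketch that reproduces them:

```python
import numpy as np

def balanced_weights(y):
    """Reproduce class_weight='balanced': n_samples / (n_classes * class count)."""
    classes, counts = np.unique(y, return_counts=True)
    return dict(zip(classes, len(y) / (len(classes) * counts)))
```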

---

## Methodology

### Problem Type
**Supervised Multi-Class Text Classification**

**Why Classification (not Clustering)?**
- Categories are predefined and labeled, so this is supervised learning
- The objective is to assign each question to a known subtopic, not to discover latent groups

### Pipeline
```
JSON Files
    ↓
Parquet Conversion (Block 4)
    ↓
Feature Extraction (Block 9)
    ↓
TF-IDF Vectorization (Block 10)
    ↓
Model Training (Block 11)
    ↓
Evaluation (Blocks 12-13)
```

### Feature Vector
```
Total: 5015 dimensions
├── TF-IDF: 5000 (unigrams, bigrams, trigrams)
├── Math Symbols: 10 (binary indicators)
└── Numeric: 5 (scaled to [0,1])
```

---

## Troubleshooting

### "No data loaded"
**Solution**: Check data path in Block 3
```python
DATA_PATH = './math'  # Adjust to your path
```

### "NameError: name 'results' is not defined"
**Solution**: Run the blocks in order; Blocks 12-13 depend on `results` from Block 11.

### "ValueError: Negative values"
**Solution**: Re-run Block 10 to completion. MinMaxScaler maps the extra features into [0,1], which matters because Naive Bayes rejects negative feature values.

### "TypeError: coo_matrix not subscriptable"
**Solution**: Block 10 converts the stacked matrix to CSR (`hstack` returns COO, which does not support indexing); make sure it runs to completion.
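
If the error persists, converting explicitly before any row indexing is a safe workaround:

```python
from scipy.sparse import issparse

if issparse(X_train):
    X_train = X_train.tocsr()
    X_test = X_test.tocsr()
```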

### Model underperforms
**Check**:
1. Data leakage prevented? (Vectorizer fitted on train only)
2. Features extracted correctly? (Block 9 output)
3. Class distribution balanced? (Block 8 charts)

---

## Performance Optimization

### Speed Up Training
```python
# Reduce vocabulary
vectorizer_config = {'max_features': 2000}

# Fewer trees
RandomForestClassifier(n_estimators=100)

# Fewer boosting rounds
GradientBoostingClassifier(n_estimators=50)
```

### Reduce Memory
```python
# Smaller vocabulary
vectorizer_config = {'max_features': 3000}

# Fewer n-grams
vectorizer_config = {'ngram_range': (1, 2)}
```

---

## Output Files

After Block 13 completes, you'll have:
- **train.parquet**: Training data (consolidated)
- **test.parquet**: Test data (consolidated)
- Performance metrics and visualizations
- Trained model held in memory (`classifier.best_model`)

---

## Next Steps

### Save Model
Add after Block 13:
```python
import pickle
model_data = {
    'model': classifier.best_model,
    'vectorizer': classifier.vectorizer,
    'scaler': classifier.scaler,
    'label_encoder': classifier.label_encoder
}
with open('model.pkl', 'wb') as f:
    pickle.dump(model_data, f)
```

### Batch Prediction
```python
import pickle

# Load the saved model bundle
with open('model.pkl', 'rb') as f:
    model_data = pickle.load(f)

model = model_data['model']

# Predict: preprocess → extract features (Block 9) → transform with the
# saved vectorizer/scaler (Block 10) → predict
new_problems = ["Solve x^2 = 16", "Find area of circle"]
for problem in new_problems:
    prediction = model.predict(...)  # build the feature vector as in Blocks 9-10
```

---

## Summary

**13 Blocks, 3 Stages**:
1. **Setup** (Blocks 1-7): One-time environment setup
2. **Training** (Blocks 8-11): Data loading and model training
3. **Evaluation** (Blocks 12-13): Comprehensive analysis

**Key Features**:
- Data leakage prevention
- 5 optimized models
- 6 visualization types
- Probability predictions
- Error analysis

**Expected Time**: 10-15 minutes total (including training)

**Expected Performance**: 85-90% F1-score on test set