# Strategy Comparison: Teacher vs Baselines

## Overview

This module compares three training strategies for the student agent:

1. **Random Strategy**: The student receives random questions from the task generator until it can confidently pass difficult questions
2. **Progressive Strategy**: The student receives questions in progressive difficulty order (Easy → Medium → Hard) within each topic family, sequentially
3. **Teacher Strategy**: An RL teacher agent learns an optimal curriculum using a UCB bandit algorithm

## Goal

Demonstrate that the **Teacher-trained student performs best** - achieving the highest accuracy on difficult questions.

## Running the Comparison

```bash
cd teacher_agent_dev
python compare_strategies.py
```

This will:

- Train all three strategies for 500 iterations
- Track accuracy on general questions and difficult questions
- Generate comparison plots showing all three strategies
- Print summary statistics

## Output

### Plot: `comparison_all_strategies.png`

The plot contains three subplots:

1. **General Accuracy Over Time**: Shows how student accuracy improves on medium-difficulty questions
2. **Difficult Question Accuracy**: **KEY METRIC** - Shows accuracy on hard questions (the most important plot for demonstrating teacher superiority)
3. **Learning Efficiency**: Bar chart showing iterations to reach the 75% target vs final performance

### Key Metrics Tracked

- **General Accuracy**: Student performance on medium-difficulty questions from all topics
- **Difficult Accuracy**: Student performance on hard-difficulty questions (target metric)
- **Iterations to Target**: How many iterations until the student reaches 75% accuracy on difficult questions
- **Final Accuracy**: Final performance after 500 iterations

## Expected Results

The Teacher strategy should show:

- ✅ **Highest final accuracy** on difficult questions
- ✅ **Efficient learning** (a good balance of speed and performance)
- ✅ **Better curriculum** (smarter topic/difficulty selection)

### Example Output

```
STRATEGY COMPARISON SUMMARY
======================================================================
Random      | ✅ Reached | Iterations: 51  | Final Acc: 0.760
Progressive | ✅ Reached | Iterations: 310 | Final Acc: 0.520
Teacher     | ✅ Reached | Iterations: 55  | Final Acc: 0.880
======================================================================
```

**Teacher wins with the highest final accuracy!**

## Strategy Details

### Random Strategy

- Completely random selection of topics and difficulties
- No curriculum structure
- Baseline for comparison
- May reach the target quickly due to luck, but doesn't optimize learning

### Progressive Strategy

- Rigid curriculum: Easy → Medium → Hard for each topic, taken sequentially
- No adaptation to student needs
- Slow to reach difficult questions
- Doesn't account for forgetting or optimal pacing

### Teacher Strategy

- **RL-based curriculum learning**
- Uses a UCB bandit to balance exploration/exploitation (see the sketch below)
- Adapts based on student improvement (reward signal)
- Optimizes for efficient learning
- Can strategically review topics to prevent forgetting
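To make the selection step concrete, here is a minimal UCB1 sketch where each bandit arm is a (topic, difficulty) pair. This is an illustration only, assuming the standard UCB1 formula; the class and method names are hypothetical and not the actual API of `train_teacher.py`:

```python
import math

class UCBCurriculumTeacher:
    """Minimal UCB1 sketch: each arm is a (topic, difficulty) pair."""

    def __init__(self, arms, exploration_c=2.0):
        self.arms = arms                       # e.g. [("algebra", "hard"), ...]
        self.c = exploration_c                 # exploration strength
        self.counts = {a: 0 for a in arms}     # times each arm was selected
        self.rewards = {a: 0.0 for a in arms}  # cumulative reward per arm

    def select_arm(self):
        # Play every arm once before applying the UCB formula
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        total = sum(self.counts.values())

        def ucb(arm):
            # Mean reward plus an exploration bonus that shrinks with visits
            mean = self.rewards[arm] / self.counts[arm]
            bonus = math.sqrt(self.c * math.log(total) / self.counts[arm])
            return mean + bonus

        return max(self.arms, key=ucb)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.rewards[arm] += reward
```

On each iteration the teacher would call `select_arm()`, assign the student a question from that (topic, difficulty), measure improvement, and feed the result back via `update()`. Arms the student improves on keep getting picked; the bonus term guarantees neglected arms are periodically revisited.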
## Visualization Features

- **Color coding**: Teacher in green (highlighted as best), Random in red, Progressive in teal
- **Line styles**: Teacher drawn with a thick solid line, baselines with dashed/dotted lines
- **Annotations**: Final accuracy values labeled on the plots
- **Target line**: 75% accuracy threshold marked on the difficult-question plot
- **Summary statistics**: Table showing which strategies reached the target and when

## Customization

You can modify parameters in `compare_strategies.py`:

```python
num_iterations = 500    # Number of training iterations
target_accuracy = 0.75  # Target accuracy on difficult questions
seed = 42               # Random seed for reproducibility
```

## Files

- `compare_strategies.py` - Main comparison script
- `comparison_all_strategies.png` - Generated comparison plot
- `train_teacher.py` - Teacher training logic
- `mock_student.py` - Student agent implementation
- `mock_task_generator.py` - Task generator

## Notes

- All strategies use the same student parameters for a fair comparison
- Evaluation uses held-out test sets
- The Teacher strategy learns from rewards based on student improvement (see the sketch below)
- Results may vary slightly due to randomness, but the Teacher should consistently outperform the baselines
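To make that reward note concrete, here is a minimal sketch of an improvement-based reward, assuming a held-out evaluation function; `evaluate_student` and the per-arm `history` dict are illustrative assumptions, not the actual interface of `train_teacher.py`:

```python
def improvement_reward(student, arm, evaluate_student, history):
    """Reward the teacher with the student's accuracy gain on this arm.

    arm is a (topic, difficulty) pair. evaluate_student returns held-out
    accuracy in [0, 1]; history maps arm -> last observed accuracy.
    """
    accuracy = evaluate_student(student, *arm)
    previous = history.get(arm, accuracy)  # zero reward on the first visit
    history[arm] = accuracy
    return accuracy - previous             # positive when the student improved
```

Rewarding the *gain* rather than the raw accuracy is what pushes the bandit toward topics where the student is currently learning fastest, instead of topics it has already mastered.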