1️⃣ Build a solid RL env with Verifiers (Prime Intellect)
2️⃣ Generate synthetic data: <200 games sampled from GPT-5-mini playing in the env
3️⃣ SFT warm-up to teach format
4️⃣ Group-based RL (CISPO) against opponents making 20-70% random moves
5️⃣ RL again with stronger opponents (0-25% random moves) + 1.25 temperature to push exploration and shake off suboptimal strategies
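
The opponent schedule in steps 4 and 5 could be sketched roughly as below. This is a minimal illustration, not the actual training code: it assumes the random-move rate is drawn uniformly from the quoted range once per game, and that some strong move picker (the hypothetical `best_move` callable) is available. The stage names and function names are made up for the example and are not part of Verifiers.

```python
import random
from typing import Callable, Sequence, Tuple, TypeVar

Move = TypeVar("Move")

# Random-move ranges quoted in the post. The stage names are hypothetical.
STAGES: dict[str, Tuple[float, float]] = {
    "rl_stage_1": (0.20, 0.70),  # weaker opponents
    "rl_stage_2": (0.00, 0.25),  # stronger opponents
}


def sample_random_move_rate(stage: str, rng: random.Random) -> float:
    """Draw one random-move rate for a game (assumption: uniform over the range)."""
    low, high = STAGES[stage]
    return rng.uniform(low, high)


def make_noisy_opponent(
    best_move: Callable[[Sequence[Move]], Move],
    random_move_rate: float,
    rng: random.Random,
) -> Callable[[Sequence[Move]], Move]:
    """Wrap a strong move picker so it sometimes plays a uniformly random legal move."""

    def opponent(legal_moves: Sequence[Move]) -> Move:
        # With probability `random_move_rate`, ignore the strong policy.
        if rng.random() < random_move_rate:
            return rng.choice(list(legal_moves))
        return best_move(legal_moves)

    return opponent
```

A new opponent would then be built per game, e.g. `make_noisy_opponent(best_move, sample_random_move_rate("rl_stage_2", rng), rng)`, so the policy sees a spread of opponent strengths within each stage rather than a single fixed one.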