Sleeping Agents FEST-Style Few-Shot RL for Reasoning 🧠 Solve math problems with step-by-step reasoning
Running Agents Implicit Memory Conflict Validator 🧠 Evaluate LLM responses for outdated memory conflicts
Sleeping Agents Sudanese CoT Reasoning Benchmark 🧠 Run Sudanese Arabic reasoning benchmark with step-by-step analysis
Sleeping Agents COPSD Sudanese Reasoning Demo 🚀 Compare Sudanese math reasoning with and without English context
Sleeping Agents PrefixGuard Demo - Agent Failure Detection 🛡 Detect potential agent failures from execution traces
Running Agents LoPE Demo - Prompt Perturbation for Reasoning Exploration 🧠 Compare baseline and perturbed reasoning for tasks
Paused Agents Lost-in-Thought Benchmark 🧠 Run a benchmark to see how reasoning steps affect retrieval accuracy
Sleeping Agents Master Key Capability Demo 🔑 Show the expected accuracy boost from steering on a math problem