Linear probes trained on diverse deception data to detect dishonest completions across model families (OLMo, Qwen, Gemma).
AI & ML interests
Frontier alignment research to ensure the safe development and deployment of advanced AI systems.
Papers
Obfuscated Policy, Obfuscated Activations, Blatant Deception, and Honest models trained in the Obfuscation Atlas paper.
- The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
  Paper • 2602.15515 • Published
- taufeeque/mbpp-hardcode
  Viewer • Updated • 974 • 989
- AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.001-det10-seed1-mbpp_probe
  Updated • 1
- AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.0001-det10-seed1-mbpp_probe
  Updated • 1
Models (629)
AlignmentResearch/diverse-deception-probe-olmo-3-32b-think
Updated
AlignmentResearch/diverse-deception-probe-gemma-3-12b-it
Updated
AlignmentResearch/diverse-deception-probe-qwen3-8b
Updated
AlignmentResearch/diverse-deception-probe-olmo-3-7b-instruct
Updated
AlignmentResearch/diverse-deception-probe-olmo-3-7b-think
Updated
AlignmentResearch/obfuscation-atlas-gemma-3-12b-it-kl0.0001-det1-seed3-mbpp_probe
Updated • 3
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl0.001-det1-seed3-mbpp_probe
Updated • 1
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl1-det1-seed3-mbpp_probe
Updated • 1
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl0.0001-det1-seed3-mbpp_probe
Updated • 3
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl0.01-det1-seed3-mbpp_probe
Updated • 2
Datasets (109)
AlignmentResearch/deceptive-followup-v25_rp
Viewer • Updated • 75.6k
AlignmentResearch/deceptive-followup-v25_r
Viewer • Updated • 69.9k • 16
AlignmentResearch/roleplay-base-examples
Viewer • Updated • 2.92k • 13
AlignmentResearch/deceptive-followup-v25
Viewer • Updated • 63.1k • 59
AlignmentResearch/deceptive-followup-v24
Viewer • Updated • 61.4k • 35
AlignmentResearch/deceptive-followup-v23
Viewer • Updated • 58.7k • 51
AlignmentResearch/deceptive-followup-v22
Viewer • Updated • 54.6k • 36
AlignmentResearch/deceptive-followup-v21
Viewer • Updated • 30.1k • 98
AlignmentResearch/model-self-knowledge-gemma27b
Viewer • Updated • 6.33k • 56
AlignmentResearch/deceptive-followup-v20
Viewer • Updated • 55.5k • 68