Evan Hubinger
Safety Researcher, Anthropic
Verified email at anthropic.com - Homepage
Title · Cited by · Year
Discovering language model behaviors with model-written evaluations
E Perez, S Ringer, K Lukosiute, K Nguyen, E Chen, S Heiner, C Pettit, ...
Findings of the Association for Computational Linguistics: ACL 2023, 13387-13434, 2023
Cited by 275 · 2023
Risks from learned optimization in advanced machine learning systems
E Hubinger, C van Merwijk, V Mikulik, J Skalse, S Garrabrant
arXiv preprint arXiv:1906.01820, 2019
Cited by 170 · 2019
Sleeper agents: Training deceptive LLMs that persist through safety training
E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ...
arXiv preprint arXiv:2401.05566, 2024
Cited by 150 · 2024
Studying large language model generalization with influence functions
R Grosse, J Bae, C Anil, N Elhage, A Tamkin, A Tajdini, B Steiner, D Li, ...
arXiv preprint arXiv:2308.03296, 2023
Cited by 139 · 2023
Measuring faithfulness in chain-of-thought reasoning
T Lanham, A Chen, A Radhakrishnan, B Steiner, C Denison, ...
arXiv preprint arXiv:2307.13702, 2023
Cited by 118 · 2023
Many-shot jailbreaking
C Anil, E Durmus, N Panickssery, M Sharma, J Benton, S Kundu, J Batson, ...
Advances in Neural Information Processing Systems 37, 129696-129742, 2024
Cited by 109 · 2024
Steering Llama 2 via contrastive activation addition
N Panickssery, N Gabrieli, J Schulz, M Tong, E Hubinger, AM Turner
arXiv preprint arXiv:2312.06681, 2023
Cited by 107 · 2023
Question decomposition improves the faithfulness of model-generated reasoning
A Radhakrishnan, K Nguyen, A Chen, C Chen, C Denison, D Hernandez, ...
arXiv preprint arXiv:2307.11768, 2023
Cited by 54 · 2023
An overview of 11 proposals for building safe advanced AI
E Hubinger
arXiv preprint arXiv:2012.07532, 2020
Cited by 30 · 2020
Sycophancy to subterfuge: Investigating reward-tampering in large language models
C Denison, M MacDiarmid, F Barez, D Duvenaud, S Kravec, S Marks, ...
arXiv preprint arXiv:2406.10162, 2024
Cited by 28 · 2024
Sleeper agents: Training deceptive LLMs that persist through safety training
E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, T Lanham, DM Ziegler, T Maxwell, N Cheng, et al.
arXiv preprint arXiv:2401.05566, 2024
Cited by 28 · 2024
Question decomposition improves the faithfulness of model-generated reasoning
A Radhakrishnan, K Nguyen, A Chen, C Chen, C Denison, D Hernandez, T Lanham, T Maxwell, V Chandrasekaran, Z Hatfield-Dodds, J Kaplan, J Brauner, SR Bowman, E Perez
arXiv preprint arXiv:2307.11768, 2023
Cited by 26 · 2023
Alignment faking in large language models
R Greenblatt, C Denison, B Wright, F Roger, M MacDiarmid, S Marks, ...
arXiv preprint arXiv:2412.14093, 2024
Cited by 22 · 2024
Uncovering deceptive tendencies in language models: A simulated company AI assistant
O Järviniemi, E Hubinger
arXiv preprint arXiv:2405.01576, 2024
Cited by 17 · 2024
Studying large language model generalization with influence functions
R Grosse, J Bae, C Anil, N Elhage, A Tamkin, A Tajdini, B Steiner, D Li, ...
https://arxiv.org/abs/2308.03296, 2023
Cited by 15
Sleeper agents: Training deceptive LLMs that persist through safety training
E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ...
arXiv preprint arXiv:2401.05566, 2024
Cited by 11 · 2024
Many-shot jailbreaking
C Anil, E Durmus, M Sharma, J Benton, S Kundu, J Batson, N Rimsky, T Lanham, K Nguyen, T Korbak, J Kaplan, D Ganguli, SR Bowman, E Perez, R Grosse, D Duvenaud
Preprint, 2024
Cited by 10 · 2024
AI safety via market making
E Hubinger
AI Alignment Forum, 2020
Cited by 10 · 2020
Simple probes can catch sleeper agents
M MacDiarmid, T Maxwell, N Schiefer, J Mu, J Kaplan, D Duvenaud, ...
https://www.anthropic.com/news/probes-catch-sleeper-agents, 2024
Cited by 10
Engineering monosemanticity in toy models
AS Jermyn, N Schiefer, E Hubinger
arXiv preprint arXiv:2211.09169, 2022
Cited by 9 · 2022