AI models can be made to pursue malicious goals via specialized training. Teaching AI models about reward hacking can lead to other bad actions. A deeper problem may be the issue of AI personas. Code ...
A person holds a smartphone displaying Claude. AI models can do scary things. There are signs that they could deceive and blackmail users. Still, a common critique is that these misbehaviors are ...
Anthropic has seen its fair share of AI models behaving strangely. However, a recent paper details an instance where an AI model turned “evil” during an ordinary training setup. A situation with a ...
Khalaf, Hadi, Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, and Flavio Calmon. "Inference-Time Reward Hacking in Large Language Models." Advances in Neural Information Processing ...