As AI models improve at holding natural conversations, we must examine how these interactions affect people and society.
Constructing on a breadth of scientific research, today, we’re releasing recent findings on the potential for AI to be misused for harmful manipulation*, specifically, its ability to change human thought and behavior in negative and deceptive ways. With this latest study, we now have created the primary empirically validated toolkit to measure this type of AI manipulation in the actual world, which we hope will help protect people and advance the sphere as a complete. We’re publicly releasing all materials mandatory to run human participant studies using the identical methodology. (Note: The behaviors observed during this study took place in a controlled lab setting, and don’t necessarily predict real-world behaviors.)
Why harmful manipulation matters
Consider two scenarios: One AI model gives you facts to make a well-informed healthcare decision that improves your well-being. One other AI model uses fear to pressure you to make an ill-informed decision that harms your health. The primary educates and helps you; the second tricks and harms you.
These scenarios highlight the difference between two varieties of persuasion in human-AI interactions (also defined in earlier research):
- Helpful (rational) persuasion: Using facts and evidence to assist people make selections that align with their very own interest
- Harmful manipulation: Exploiting emotional and cognitive vulnerabilities to trick people into making harmful selections
Our latest work helps us and the broader AI community higher understand the danger of AI developing capabilities for harmful manipulation and construct a scalable evaluation framework to measure this complex area. To do that effectively, we simulated misuse in high-stakes environments, explicitly prompting AI to attempt to negatively manipulate people’s beliefs and behaviours on key topics.
Developing recent evaluations for a posh challenge
Testing the outcomes of AI harmful manipulation
Testing for harmful manipulation is inherently difficult since it involves measuring subtle changes in how people think and act, various heavily by topic, culture and context.
That is what motivated our latest research, which involved conducting nine studies involving over 10,000 participants across the UK, the US, and India. We focused on high-stakes areas resembling finance, where we used simulated investment scenarios to check if AI could influence how people would behave in complex decision-making environments, and health, where we tracked if AI could influence which dietary supplements people preferred. Interestingly, the AI was least effective at harmfully manipulating participants on health-related topics.
Our findings show that success in a single domain doesn’t predict success in one other, validating our targeted approach to testing for harmful manipulation in specific, high-stakes environments where AI may very well be misused.
How could AI manipulate?
Along with tracking efficacy (whether the AI successfully changes minds), we also measured its propensity (how often it even tries to make use of manipulative tactics). We tested propensity in two scenarios: after we explicitly told the model to be manipulative, and after we didn’t.
As detailed in our research, we counted manipulative tactics in experimental transcripts, confirming the AI models were most manipulative when explicitly instructed to be.
Our results also suggest that certain manipulative tactics could also be more more likely to end in harmful outcomes, though further research is required to know these mechanisms intimately.
By measuring each efficacy and propensity, we are able to higher understand how AI manipulation works and construct more targeted mitigations.
