Prompt injection is persuasion, not a bug
Security communities have been warning about this for several years. Multiple OWASP Top 10 reports put prompt injection, or more recently Agent Goal Hijack, at the highest of the danger list and pair it with identity and privilege abuse and human-agent trust exploitation: an excessive amount of power within the agent, no separation between instructions and data, and no mediation of what comes out.
Guidance from the NCSC and CISA describes generative AI as a persistent social-engineering and manipulation vector that should be managed across design, development, deployment, and operations, not patched away with higher phrasing. The EU AI Act turns that lifecycle view into law for high-risk AI systems, requiring a continuous risk management system, robust data governance, logging, and cybersecurity controls.
In practice, prompt injection is best understood as a persuasion channel. Attackers don’t break the model—they persuade it. Within the Anthropic example, the operators framed each step as a part of a defensive security exercise, kept the model blind to the general campaign, and nudged it, loop by loop, into doing offensive work at machine speed.
That’s not something a keyword filter or a polite “please follow these safety instructions” paragraph can reliably stop. Research on deceptive behavior in models makes this worse. Anthropic’s research on sleeper agentsshows that when a model has learned a backdoor, then strategic pattern recognition, standard fine-tuning, and adversarial training can actually help the model hide the deception reasonably than remove it. If one tries to defend a system like that purely with linguistic rules, they’re playing on its home field.
Why it is a governance problem, not a vibe coding problem
Regulators aren’t asking for perfect prompts; they’re asking that enterprises display control.
NIST’s AI RMF emphasizes asset inventory, role definition, access control, change management, and continuous monitoring across the AI lifecycle. The UK AI Cyber Security Code of Practice similarly pushes for secure-by-design principles by treating AI like all other critical system, with explicit duties for boards and system operators from conception through decommissioning.
In other words: the foundations actually needed should not “never say X” or “at all times respond like Y,” they’re:
- Who is that this agent acting as?
- What tools and data can it touch?
- Which actions require human approval?
- How are high-impact outputs moderated, logged, and audited?
Frameworks like Google’s Secure AI Framework (SAIF) make this concrete. SAIF’s agent permissions control is blunt: agents should operate with least privilege, dynamically scoped permissions, and explicit user control for sensitive actions. OWASP’s Top 10 emerging guidance on agentic applications mirrors that stance: constrain capabilities on the boundary, not within the prose.
