In an earlier article in this series, "Rules Fail at the Prompt, Succeed at the Boundary," we examined the first documented AI-orchestrated espionage campaign and showed why security controls confined to the prompt level failed. This article is the prescription that follows. Boards of directors are now pressing CEOs with one urgent question: What specific measures must we take to mitigate the risks posed by autonomous AI agents?
A review of recent AI security guidance from standards bodies, regulators, and major technology providers reveals one recurring theme: treat AI agents as powerful, semi-autonomous users rather than passive software tools, and enforce rules at the boundaries where those agents touch identity, tools, data, and outputs. The following eight-step plan gives CEOs actionable directives they can hand to their technical teams for implementation and reporting.
The first steps define each agent's identity and strictly limit its scope.
Today, many AI agents run under vague service identities that carry far more privilege than their function requires. The fix is direct: treat every agent as a non-human principal and apply the same discipline you apply to human employees. Each agent must execute as the specific requesting user within the correct tenant, with permissions constrained to that user's role and geographic location. Organizations must prohibit shortcuts that let agents operate across tenant boundaries. For high-impact actions, the system must require explicit human approval accompanied by a recorded, auditable rationale.
This approach is the practical application of frameworks such as Google's Secure AI Framework (SAIF) and NIST's guidance on AI access control. The question a CEO must ask is: Can we, today, produce a definitive list of all active agents and state exactly what each one is permitted to do?
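As a minimal sketch of the identity controls described above, the check below denies cross-tenant access, denies anything outside the requesting user's scopes, and requires a recorded human rationale for high-impact actions. Every name here (`AgentPrincipal`, `Approval`, `authorize`, the scope strings, and the `HIGH_IMPACT` set) is hypothetical and chosen only for illustration; it is not taken from SAIF or NIST.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical set of actions that this sketch treats as high impact.
HIGH_IMPACT = {"payments:send", "records:delete"}


@dataclass(frozen=True)
class AgentPrincipal:
    agent_id: str
    acting_for_user: str   # the human on whose behalf the agent runs
    tenant: str            # the only tenant this principal may touch
    scopes: frozenset      # actions the user's role permits


@dataclass(frozen=True)
class Approval:
    approver: str
    rationale: str         # recorded so the decision can be audited later


def authorize(agent: AgentPrincipal, action: str, tenant: str,
              approval: Optional[Approval] = None) -> Tuple[bool, str]:
    """Return (allowed, reason). Anything not explicitly permitted is denied."""
    if tenant != agent.tenant:
        return False, "cross-tenant access is prohibited"
    if action not in agent.scopes:
        return False, "action is outside the requesting user's scopes"
    if action in HIGH_IMPACT:
        if approval is None or not approval.rationale.strip():
            return False, "high-impact action needs human approval with a rationale"
    return True, "allowed"
```

In a real deployment the scopes would come from the identity provider rather than being set by hand, and each decision would be written to an audit log.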
The espionage campaign that Anthropic disclosed succeeded because the attackers could wire a large AI model, such as Claude, into a flexible suite of external tools: scanners, exploit frameworks, and data parsers, connected via protocols like the Model Context Protocol. Crucially, these tools were neither pinned to specific versions nor gated by stringent security policies. The defense is to treat AI toolchains with the same rigor as a software supply chain. Organizations must pin the versions of remote tool servers and require formal approval for adding new tools, expanding access scopes, or integrating new data sources. Automatic tool-chaining must be forbidden unless a specific security policy explicitly authorizes it.
This directly addresses the "excessive agency" risk flagged by the Open Web Application Security Project (OWASP). Under regulations such as the EU AI Act, designing for robustness, cyber-resilience, and resistance to misuse is a legal obligation. The CEO question here is: Who has the authority to sign off when an agent gains a new tool or broader scope, and how is that approval tracked and verified?
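The toolchain controls above can be sketched as a small registry that rejects unapproved tools, version drift, out-of-scope calls, and any tool-to-tool chain that a policy has not explicitly allowed. `ToolGrant`, `ToolRegistry`, and the tool names in the usage note are assumptions made for this example; they are not part of the Model Context Protocol or any vendor SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolGrant:
    name: str
    pinned_version: str
    scopes: frozenset
    approved_by: str       # who signed off on this tool and scope


class ToolRegistry:
    def __init__(self) -> None:
        self._grants = {}
        self._allowed_chains = set()   # (from_tool, to_tool) pairs allowed by policy

    def approve(self, grant: ToolGrant) -> None:
        if not grant.approved_by:
            raise ValueError("every tool grant needs a named approver")
        self._grants[grant.name] = grant

    def allow_chain(self, from_tool: str, to_tool: str) -> None:
        self._allowed_chains.add((from_tool, to_tool))

    def check_call(self, name: str, version: str, scope: str,
                   called_from: Optional[str] = None) -> bool:
        grant = self._grants.get(name)
        if grant is None:
            return False               # tool was never approved
        if version != grant.pinned_version:
            return False               # version drift from the pinned release
        if scope not in grant.scopes:
            return False               # scope was not part of the approval
        if called_from is not None and (called_from, name) not in self._allowed_chains:
            return False               # chaining is off unless explicitly allowed
        return True
```

For example, after `registry.approve(ToolGrant("scanner", "1.4.2", frozenset({"scan:internal"}), "security-lead"))`, a call to version `1.4.3` of the same tool is refused until someone approves the new version.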
A common anti-pattern is to give an AI model a long-lived, powerful credential and rely on prompt engineering to keep its behavior polite. SAIF and NIST argue for the opposite. Credentials and access scopes should be bound to specific tools and tasks, rotated regularly, and fully auditable. Agents then request narrowly scoped capabilities through these controlled tools.
In practice, this becomes a rule such as: "The finance-ops-agent may read certain ledgers, but may not write to them without explicit Chief Financial Officer approval." The CEO question that validates this control is: Can we revoke a specific capability from an agent without re-architecting the whole system?
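One way to picture task-bound, revocable credentials is a broker that issues short-lived tokens tied to a single tool and action. This is a sketch under stated assumptions: `CapabilityBroker`, its method names, and the five-minute default lifetime are all hypothetical, and a production system would rely on its existing secrets manager or identity platform instead.

```python
import secrets
import time
from typing import Optional


class CapabilityBroker:
    def __init__(self) -> None:
        self._live = {}    # token -> (agent_id, tool, action, expires_at)
        self.audit = []    # append-only record of issue and revoke events

    def issue(self, agent_id: str, tool: str, action: str,
              ttl_seconds: float = 300, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(16)
        self._live[token] = (agent_id, tool, action, now + ttl_seconds)
        self.audit.append(("issue", agent_id, tool, action))
        return token

    def check(self, token: str, tool: str, action: str,
              now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        entry = self._live.get(token)
        if entry is None:
            return False                       # unknown or revoked token
        _, granted_tool, granted_action, expires_at = entry
        return granted_tool == tool and granted_action == action and now < expires_at

    def revoke(self, agent_id: str, tool: str, action: str) -> int:
        """Remove one capability for one agent; returns how many tokens were cut."""
        doomed = [tok for tok, (ag, t, a, _) in self._live.items()
                  if (ag, t, a) == (agent_id, tool, action)]
        for tok in doomed:
            del self._live[tok]
        self.audit.append(("revoke", agent_id, tool, action))
        return len(doomed)
```

Because each token names exactly one tool and one action, revoking ledger writes for the finance-ops-agent leaves its read access untouched, which is the property the CEO question above is probing.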
The next steps gate inputs and outputs and constrain the agent's overall behavior.
Most agent security incidents originate with adversarial data. This can manifest as a poisoned web page, a manipulated PDF, a deceptive email, or a corrupted code repository that smuggles malicious instructions into the AI system. Security guidelines, including OWASP's prompt-injection cheat sheet and OpenAI's developer guidance, insist on the strict separation of system instructions from user content. They also mandate treating unvetted retrieval sources as entirely untrusted.
Operationally, this means gating all data before it enters an agent's retrieval system or long-term memory. New external sources must be reviewed, tagged, and formally onboarded. Persistent memory should be disabled when untrusted context is present, and provenance data must be attached to every piece of information. The CEO question is: Can we enumerate every external content source our agents learn from, and identify the specific individual who approved each one?
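The gating steps above might look like the following sketch, which admits content only from onboarded hosts, stamps each chunk with provenance, and switches off persistent memory whenever any untrusted chunk is in context. The class and function names, and the example hosts in the usage note, are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class Source:
    host: str          # exact hostname that was reviewed and onboarded
    approved_by: str   # the individual who approved this source
    trusted: bool      # False means usable for retrieval but never for memory


@dataclass(frozen=True)
class Chunk:
    text: str
    source_host: str   # provenance travels with the content
    approved_by: str
    trusted: bool


class IngestionGate:
    def __init__(self, sources: Iterable[Source]) -> None:
        self._by_host = {s.host: s for s in sources}

    def admit(self, url: str, text: str) -> Optional[Chunk]:
        """Return a provenance-tagged chunk, or None if the source is not onboarded."""
        source = self._by_host.get(urlparse(url).hostname)
        if source is None:
            return None
        return Chunk(text, source.host, source.approved_by, source.trusted)


def may_write_memory(context: Iterable[Chunk]) -> bool:
    """Persistent memory is allowed only when every chunk in context is trusted."""
    return all(chunk.trusted for chunk in context)
```

Matching on the exact hostname rather than a string prefix is a deliberate choice in this sketch: a prefix check on `https://docs.example.com` would also accept `https://docs.example.com.attacker.test`.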
In the espionage case Anthropic reported, AI-generated exploit code and credential dumps flowed directly into execution without any intermediate checks. Any agent output that can cause a real-world side effect, such as running code, sending an email, or modifying a database, requires a security validator positioned between the agent and the action. OWASP's "insecure output handling" category is explicit on this point, and it aligns with established browser security practice on origin boundaries.
The critical question for system architects is: Where, precisely within our system's architecture, are agent outputs assessed for safety and correctness before they are executed or shipped to customers?
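To make the idea of a validator between agent and action concrete, here is a minimal sketch in which every proposed side effect passes through a per-action check, and unknown action kinds are blocked by default. The action kinds, the internal-only email rule, and the read-only SQL rule are illustrative assumptions, and the SQL check is deliberately crude; a real validator would use a proper parser and policy engine.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedAction:
    kind: str        # e.g. "send_email" or "run_sql"
    payload: dict


def validate_email(payload: dict) -> bool:
    # Assumed policy: agents may only email internal addresses.
    return str(payload.get("to", "")).endswith("@example.com")


def validate_sql(payload: dict) -> bool:
    # Assumed policy: a single read-only statement. Crude string check only.
    statement = str(payload.get("statement", "")).strip().lower()
    return statement.startswith("select") and ";" not in statement.rstrip(";")


VALIDATORS = {"send_email": validate_email, "run_sql": validate_sql}


def gate(action: ProposedAction, execute: Callable[[ProposedAction], None]) -> str:
    """Run the side effect only if a validator exists for it and approves it."""
    validator = VALIDATORS.get(action.kind)
    if validator is None or not validator(action.payload):
        return "blocked"
    execute(action)
    return "executed"
```

The important design choice is the default: an action kind with no registered validator never reaches `execute`, so adding a new capability forces someone to write its check first.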
The underlying principle here is to protect the data itself so that there is nothing dangerous for an agent to reveal by default. Both NIST and SAIF advocate for "secure-by-default" designs where sensitive values—such as social security numbers or credit card details—are tokenized or masked. These values are only re-hydrated, or revealed, for authorized users and specific, approved use cases.
In agentic systems, this requires implementing policy-controlled detokenization at the final output boundary and logging every single data reveal event. If an agent is fully compromised, the potential damage is bounded by the visibility allowed by the security policy. This control resides at the intersection of the EU AI Act, GDPR, and other sector-specific data protection regimes. The EU AI Act expects providers and deployers to manage AI-specific risks; runtime tokenization and policy-gated reveal constitute strong evidence of actively controlling those risks in production. The CEO must ask: When our agents process regulated data, is that protection enforced by the system architecture itself, or merely by promises and procedures?
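Policy-controlled detokenization with logged reveals can be sketched as a small vault. Agents only ever handle opaque tokens; clear text is returned at the output boundary for approved role and purpose pairs, and every attempt is logged whether or not it succeeds. `TokenVault`, the `tok_` prefix, and the role and purpose strings are hypothetical, and a real deployment would use a hardened tokenization service rather than an in-memory dictionary.

```python
import secrets


class TokenVault:
    MASK = "****"

    def __init__(self, reveal_policy: set) -> None:
        self._values = {}              # token -> sensitive value
        self._policy = reveal_policy   # set of (role, purpose) pairs allowed to see clear text
        self.reveal_log = []           # every reveal attempt, allowed or not

    def tokenize(self, value: str) -> str:
        token = "tok_" + secrets.token_hex(8)
        self._values[token] = value
        return token

    def reveal(self, token: str, role: str, purpose: str) -> str:
        allowed = (role, purpose) in self._policy and token in self._values
        self.reveal_log.append((token, role, purpose, allowed))
        return self._values[token] if allowed else self.MASK
```

If the agent is fully compromised, it still holds only tokens, so what an attacker can extract is bounded by the reveal policy rather than by the agent's behavior.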
The final steps demonstrate that the controls work and will keep working under pressure.
Anthropic's own research on "sleeper agents" (AI models harboring hidden, malicious behaviors) should dispel any fantasy that a single security test is sufficient. Evaluation must be continuous: instrument agents with deep observability, run regular red-team exercises with adversarial test suites, and back everything with robust, immutable logging. Every security failure should become both a regression test and a trigger for enforceable policy updates.
The CEO question is operational: Who is actively working to breach our agents every week, and how are their findings systematically used to modify our security policies?
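Turning red-team findings into regression tests can be as simple as the sketch below, where each finding becomes a stored case that is replayed against the agent on every release. The `AdversarialCase` structure and the substring-based failure check are simplifying assumptions; real suites usually need richer scoring than a forbidden marker.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class AdversarialCase:
    case_id: str            # ties the test back to the original finding
    attack_input: str       # the input the red team used
    forbidden_marker: str   # text that must never appear in the response


def run_suite(agent: Callable[[str], str],
              cases: Iterable[AdversarialCase]) -> List[str]:
    """Return the ids of cases where the agent's response contains the forbidden marker."""
    return [case.case_id for case in cases
            if case.forbidden_marker in agent(case.attack_input)]
```

A release pipeline could refuse to ship whenever `run_suite` returns a non-empty list, which keeps last month's breach from quietly returning.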
All major AI security frameworks emphasize the necessity of a comprehensive inventory and an unbroken evidence trail. An enterprise must know which models, prompts, tools, datasets, and vector stores it utilizes. It must also track ownership of each component and what risk-based decisions were made regarding them.
For agents, this requires a living catalog and unified security logs that record: which agents exist and on which platforms; what scopes, tools, and data each agent is allowed to access; and every approval, detokenization event, and high-impact action, along with who approved it and when.
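For the unified log described above, one common technique is a hash chain, in which each entry commits to the one before it so that later edits are detectable. The sketch below is an assumption about how such a log could be built, not a description of any particular product; note that it makes tampering evident rather than impossible, so the chain head still needs to be stored somewhere the agent platform cannot write.

```python
import hashlib
import json

GENESIS = "0" * 64   # placeholder hash that precedes the first entry


def _digest(prev_hash: str, event: dict) -> str:
    body = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AuditLog:
    def __init__(self) -> None:
        self._entries = []

    def append(self, event: dict) -> None:
        """Record an approval, detokenization, or high-impact action."""
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        self._entries.append({"event": event, "prev": prev, "hash": _digest(prev, event)})

    def verify(self) -> bool:
        """Recompute the chain; any edited, removed, or reordered entry breaks it."""
        prev = GENESIS
        for entry in self._entries:
            if entry["prev"] != prev or _digest(prev, entry["event"]) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

Each event would typically carry the agent, the action, the approver, and a timestamp, which is the raw material for reconstructing a decision chain later.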
The final CEO question is: If a regulator or auditor asks how an agent made a specific, consequential decision, could we reconstruct the complete decision chain with evidence? It is also vital to maintain a system-level threat model. Assume a sophisticated threat actor, like the state-sponsored GTG-1002 from the Anthropic case study, is already inside your enterprise. To round out their preparation, security teams should consider frameworks like MITRE ATLAS, which exists precisely because adversaries attack entire systems, not just isolated AI models.
Taken together, these eight controls do not render AI agents magically safe. They accomplish something more familiar and more reliable: they place AI, its access privileges, and its actions back inside the same proven security framework used for any powerful human user or critical software system.
For boards and CEOs, the defining question is no longer "Do we have good AI guardrails?" It has evolved into a demand for proof: Can we answer each of the CEO questions above with concrete evidence, rather than mere assurances?