AI coding agents enable developers to work faster by streamlining tasks and driving automated, test-driven development. However, they also introduce a significant, often overlooked, attack surface by running tools from the command line with the same permissions and entitlements as the user, making them computer use agents, with all the risks those entail.
The primary threat to these tools is indirect prompt injection, where a portion of the content ingested by the LLM driving the agent is supplied by an adversary through vectors such as malicious repositories or pull requests, git histories containing prompt injections, .cursorrules or CLAUDE/AGENT.md files that contain prompt injections, or malicious MCP responses. Such malicious instructions to the LLM can result in it taking attacker-influenced actions with hostile consequences.
Manual approval of actions performed by the agent is the most common way to manage this risk, but it also introduces ongoing developer friction, requiring developers to repeatedly return to the application to review and approve actions. This creates a risk of user habituation, where users simply approve potentially dangerous actions without reviewing them. A key requirement for agentic system security is finding the balance between hands-on user input and automation. The following controls are what the NVIDIA AI Red Team considers either required or highly recommended, but they should be implemented to reflect your specific use case and your organization's risk tolerance.
Based on the NVIDIA AI Red Team's experience, the following mandatory controls mitigate the most serious attacks that can be achieved with indirect prompt injection:
- Network egress controls: Blocking network access to arbitrary sites prevents exfiltration of data or establishment of a remote shell without additional exploits.
- Block file writes outside of the workspace: Blocking write operations to files outside of the workspace prevents a number of persistence mechanisms, sandbox escapes, and remote code execution (RCE) techniques.
- Block writes to configuration files, no matter where they are located: Blocking writes to config files prevents exploitation of hooks, skills, and local Model Context Protocol (MCP) configurations that often run outside of a sandbox context.
The following recommended controls further reduce the attack surface, making host enumeration and exploration harder, limiting risks posed by hooks, local MCP configurations, and kernel exploits, and closing other exploitation and disclosure risks:
- Prevent reads from files outside of the workspace.
- Sandbox the entire integrated development environment (IDE) and all spawned functions (e.g., hooks, MCP startup scripts, skills, and tool calls), and, where possible, run them as their own user.
- Use virtualization to isolate the sandbox kernel from the host kernel (e.g., a microVM, Kata container, or full VM).
- Require user approval for each instance of specific actions (e.g., a network connection) that otherwise violate isolation controls. Allow-once / run-many is not an adequate control.
- Use a secret injection approach to prevent secrets (e.g., in environment variables) from being shared with the agent.
- Establish lifecycle management controls for the sandbox to prevent the accumulation of code, intellectual property, or secrets.
Note: This post does not address risks arising from inaccurate or adversarially manipulated output from AI-powered tools, which are treated as user-level responsibilities.
Why implement sandbox controls at an OS level?
Agentic tools, particularly for coding, perform arbitrary code execution by design. Automating test- or specification-driven development requires that the agent create and execute code to observe the results. In addition, tool-using agents are moving toward writing and executing throwaway scripts to perform tasks.
This makes application-level controls insufficient. They can intercept tool calls and arguments before execution, but once control passes to a subprocess, the application has no visibility into or control over that subprocess. Attackers often use indirection, calling a more restricted tool through a safer, approved one, as a common way to bypass application-level controls such as allowlists. OS-level controls, like macOS Seatbelt, work beneath the application layer and cover every process in the sandbox. No matter how these processes start, they are prevented from reaching dangerous system capabilities, even through indirect paths.
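To illustrate the idea, the sketch below wraps a command in a deny-by-default Seatbelt profile via macOS's sandbox-exec utility. The profile and paths are simplified assumptions; a usable production profile would need many more allowances.

```python
import subprocess

# A minimal macOS Seatbelt (SBPL) profile: deny everything by default, then
# re-allow only what the agent's child processes need. Paths are illustrative
# assumptions, not a vetted policy; note that network access stays denied.
SEATBELT_PROFILE = """
(version 1)
(deny default)
(allow process-exec)
(allow process-fork)
(allow file-read* (subpath "/usr") (subpath "/System"))
(allow file-read* (subpath "/Users/dev/workspace"))
(allow file-write* (subpath "/Users/dev/workspace"))
"""

def run_in_sandbox(cmd: list[str]) -> subprocess.CompletedProcess:
    # Every process in the spawned tree inherits the profile, so indirection
    # (calling a restricted tool through an approved one) gains the attacker nothing.
    return subprocess.run(["sandbox-exec", "-p", SEATBELT_PROFILE, *cmd])

run_in_sandbox(["python3", "generated_test.py"])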
Mandatory sandbox security controls
This section briefly outlines controls that the Red Team considers mandatory for agentic applications and the classes of attacks they help mitigate. When implemented together, they block simple exploitation techniques observed in practice. The section concludes with guidance on layering controls in real-world deployments.
Block network egress except to known-good locations
The most obvious and direct threat of network access is remote access (a network implant, malware, or a simple reverse shell), giving an attacker access to the victim machine, where they can directly probe and enumerate controls and attempt to pivot or escape.
Another significant threat is data exfiltration. Developer machines often contain a wide range of secrets and intellectual property of value to an attacker, often even in the current workspace (e.g., .env files with API tokens). Exfiltrating the contents of directories such as ~/.ssh to gain access to other systems is a major goal, as is exfiltrating sensitive source code.
Network connections created by sandbox processes should not be permitted without manual approval. Tightly scoped allowlists enforced through HTTP proxy, IP, or port-based controls reduce user interaction and approval fatigue. Limiting DNS resolution to designated trusted resolvers to avoid DNS-based exfiltration is also recommended. A default-ask posture combined with enterprise-level denylists that cannot be overridden by local users provides a good balance between functionality and security.
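A minimal sketch of this default-ask posture as it might look in an egress proxy's decision function; the hostnames are illustrative assumptions, not a recommended policy:

```python
from urllib.parse import urlparse

# Hypothetical enterprise policy: hosts the proxy allows without asking.
EGRESS_ALLOWLIST = {"pypi.org", "files.pythonhosted.org", "github.com"}
# Hosts that can never be approved, even manually (enterprise denylist).
EGRESS_DENYLIST = {"pastebin.com"}

def egress_decision(url: str) -> str:
    # Denylist is checked first so no approval path can override it;
    # anything not explicitly allowlisted falls through to default-ask.
    host = urlparse(url).hostname or ""
    if host in EGRESS_DENYLIST:
        return "deny"
    if host in EGRESS_ALLOWLIST:
        return "allow"
    return "ask-user"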
Block file writes outside of the active workspace
Writing files outside of an active workspace is a significant risk. Files such as ~/.zshrc are executed automatically and can result in both RCE and sandbox escape. URLs in various key files, such as ~/.gitconfig or ~/.curlrc, can be overwritten to redirect sensitive data to attacker-controlled locations. Malicious files, such as a backdoored python or node binary, could be placed in ~/.local/bin to establish persistence or escape the sandbox.
Write operations outside of the active workspace must be blocked at the OS level. As with network controls, use an enterprise-level policy that blocks such operations on known-sensitive paths, regardless of whether the user manually approves the action. These protected files should include dotfiles, configuration directories, and any additional paths enumerated by enterprise policy. Any other out-of-workspace file write operations may be permitted with manual user approval.
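A sketch of such a write policy, assuming a single workspace directory and an illustrative protected-path list:

```python
from pathlib import Path

WORKSPACE = Path.home() / "workspace"       # assumed active workspace
# Paths that no approval can unlock (enterprise policy; illustrative).
PROTECTED = [Path.home() / p for p in
             (".zshrc", ".gitconfig", ".curlrc", ".local/bin")]

def write_decision(target: str) -> str:
    # Resolve symlinks first, so a link planted inside the workspace
    # cannot redirect the write to a sensitive file outside it.
    p = Path(target).resolve()
    if any(p == prot or prot in p.parents for prot in PROTECTED):
        return "deny"        # blocked even if the user approves
    if WORKSPACE in p.parents:
        return "allow"       # inside the active workspace
    return "ask-user"        # all other out-of-workspace writes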
Block all writes to any agent configuration file or extension
Many agentic systems, including agentic IDEs, permit the creation of extensions that enhance functionality and often include executable code. "Hooks" may define shell code to be executed on specific events (such as prompt submission). MCP servers using an stdio transport define shell commands required to start the server. Claude Skills can include scripts, code, or helper functions that run as soon as the skill is invoked. Files such as .cursorrules, CLAUDE.md, and copilot-instructions.md can provide adversaries with a durable way to shape the agent's behavior and, in some cases, gain full control or even arbitrary code execution.
In addition, agentic IDEs often contain global and local settings, including command allowlists and denylists, with local configuration settings stored in the active workspace. This can give attackers the ability to pivot or extend their reach if these local settings are modified. For example, adding a poisoned hooks configuration to a Git repository in a workspace can affect every user who clones it. Moreover, hooks and MCP initialization functions often run outside of a sandbox environment, offering an opportunity to escape sandbox controls.
Application-specific configuration files, including those located within the current workspace, must be protected from any modification by the agent, with no option for the IDE to offer user approval of such actions. Direct, manual modification by the user is the only acceptable modification mechanism for these sensitive files.
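One way to express this class of files is sketched below, with an illustrative (and necessarily incomplete) list of agent-steering names and directories; matches are refused outright rather than routed to user approval:

```python
from pathlib import Path

# Agent-steering files that are write-protected wherever they live, even
# inside the workspace and even with user approval (illustrative list).
CONFIG_NAMES = {".cursorrules", "CLAUDE.md", "AGENT.md",
                "copilot-instructions.md"}
CONFIG_DIRS = {".claude", ".cursor", ".vscode", ".github"}

def is_protected_config(target: str) -> bool:
    # Match on the file name or any agent config directory in the path.
    p = Path(target).resolve()
    return p.name in CONFIG_NAMES or any(d in CONFIG_DIRS for d in p.parts)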
Tiered implementation of controls
Defining universally applicable allow/denylists is difficult, given the wide range of use cases to which agentic tools may be applied. The goal should be to block exploitable behavior while preserving manual user intervention as an infrequently used fallback for unanticipated cases, using a tiered approach such as the following (a combined sketch in code follows the list):
- Establish clear enterprise-level denylists for access to critical files outside the current workspace that can't be overridden by user-level allowlists or manual approval decisions.
- Allow read-write access inside the agent’s workspace (except for configuration files) without user approval.
- Permit specific allowlisted operations (e.g., reads from ~/.ssh/gitlab-key) that may be required for the proper functionality of specific features.
- Assume default-deny for all other actions, permitting case-by-case user approval.
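Putting the tiers together in evaluation order; the ~/.ssh/gitlab-key exception comes from the list above, while all other paths are assumptions:

```python
from dataclasses import dataclass
from pathlib import Path

WORKSPACE = Path.home() / "workspace"                           # assumed
ENTERPRISE_DENY = {Path.home() / ".aws", Path.home() / ".ssh" / "id_ed25519"}
ALLOWLISTED_READS = {Path.home() / ".ssh" / "gitlab-key"}

@dataclass
class Action:
    kind: str    # "read" | "write"
    path: Path

def tiered_decision(a: Action) -> str:
    p = a.path.resolve()
    if p in ENTERPRISE_DENY or any(d in p.parents for d in ENTERPRISE_DENY):
        return "deny"         # tier 1: never overridable, even by approval
    if WORKSPACE in p.parents:
        return "allow"        # tier 2: free inside the workspace
                              # (config files excluded; see is_protected_config above)
    if a.kind == "read" and p in ALLOWLISTED_READS:
        return "allow"        # tier 3: narrow allowlisted exceptions
    return "ask-user"         # tier 4: default-deny with case-by-case approval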
This post does not specifically address command allow/denylisting, as OS-level restrictions should make command-level blocks redundant, though they can be useful as a defense-in-depth mitigation against potential sandbox misconfigurations.
Recommended sandbox security controls
The mandatory controls discussed above provide strong protection against indirect prompt injection and help reduce approval fatigue. However, several potential vulnerabilities remain, including:
- Ingestion of malicious hooks or local MCP initialization commands.
- Kernel-level vulnerabilities that result in sandbox escape and full host control.
- Agent access to secrets.
- Failure modes in product-specific caching of manual approvals.
- The accumulation of secrets, IP, or exploitable code in the sandbox.
The following additional controls and considerations help close some of these remaining gaps.
Sandbox the IDE and all spawned functions
Many agentic systems only apply sandboxing at the time of tool invocation (commonly just for shell/command-line tools). While this does prevent a wide range of abuse mechanisms, many agentic functionalities often default to running outside of the sandbox. These include hooks, MCP configurations that spawn local processes, scripts used by "skills", and other tools managed at the application layer. This is often the case when sandboxes are associated only with command-line tools, while file-editing tools or search tools execute outside of a sandbox and are controlled at the application level. These unsandboxed execution paths can make it easier for attackers to bypass sandbox controls or obtain remote code execution.
The sandbox restrictions discussed should be enforced for all agentic operations, not just command-line tool invocations. Restrictions on write operations to files outside of the current workspace and to configuration files are the most critical, while network egress from the sandbox should only be permitted for correctly configured remote MCP server calls.
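One way to enforce this is to route every process the agent spawns through a single sandboxed launcher, sketched below; the wrapping command and profile path are assumptions:

```python
import subprocess

def sandbox_wrap(cmd: list[str]) -> list[str]:
    # Stand-in for the OS mechanism in use, e.g., prepending a bubblewrap
    # or sandbox-exec invocation (the profile path here is hypothetical).
    return ["sandbox-exec", "-f", "/etc/agent-sandbox.sb", *cmd]

def spawn_sandboxed(cmd: list[str], cwd: str) -> subprocess.Popen:
    # The single choke point for every child process the agent creates:
    # hooks, stdio MCP servers, skill scripts, and shell tool calls alike,
    # so nothing runs outside the sandbox by accident.
    return subprocess.Popen(sandbox_wrap(cmd), cwd=cwd)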
Use virtualization to isolate the sandbox kernel from the host kernel
Many sandbox solutions (macOS Seatbelt, Windows AppContainer, Linux Bubblewrap, Dockerized dev containers) share the host kernel, leaving it exposed to any code executed within the sandbox. Because agentic tools often execute arbitrary code by design, kernel vulnerabilities can be directly targeted as a path to full system compromise.
To prevent these attacks at an architectural level, run agentic tools inside a fully virtualized environment isolated from the host kernel at all times, such as a VM, unikernel, or Kata container. Intermediate mitigations like gVisor, which mediates system calls via a separate user-space kernel, are preferable to fully shared solutions but offer different and potentially weaker security guarantees than full virtualization.
While virtualization typically introduces some overhead, it is often modest compared with the latency of LLM calls. The lifecycle management of the virtualized environment should be tuned against the associated overhead to minimize developer friction while preventing the accumulation of data.
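As one concrete possibility, assuming Kata Containers is installed and registered with Docker under the runtime name kata-runtime, and a hypothetical agent-toolchain image, each invocation can get its own guest kernel:

```python
import subprocess

def run_in_microvm(cmd: list[str], workspace: str) -> None:
    # Each invocation runs against its own guest kernel via the Kata runtime,
    # so a kernel exploit inside the sandbox does not reach the host kernel.
    subprocess.run([
        "docker", "run", "--rm",
        "--runtime", "kata-runtime",       # hardware-virtualized guest kernel
        "--network", "none",               # pairs with the egress controls above
        "-v", f"{workspace}:/workspace",   # only the active workspace is mounted
        "-w", "/workspace",
        "agent-toolchain:latest",          # hypothetical toolchain image
        *cmd,
    ], check=True)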
Prevent reads from files outside of the workspace
Sandbox solutions often require access to certain files outside of the workspace, such as ~/.zshrc, to reproduce the developer's environment. Unrestricted read access exposes information of value to an attacker, enabling enumeration and exploration of the user's device, secrets, and intellectual property.
Follow a tiered approach consistent with the principle of least access (a sketch follows the list):
- Use enterprise-level denylists to block reads from highly sensitive paths or patterns not required for sandbox operation.
- Limit allowlisted external read access to what is strictly necessary, ideally permitting reads only during sandbox initialization and blocking them thereafter.
- Block all other reads outside the workspace unless manually approved by the user.
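The initialization-only window is the interesting detail; a sketch, with illustrative paths:

```python
from pathlib import Path

WORKSPACE = Path.home() / "workspace"                           # assumed
# Reads needed only to reproduce the developer's environment at startup.
INIT_ONLY_READS = {Path.home() / ".zshrc", Path.home() / ".gitconfig"}

class ReadPolicy:
    def __init__(self) -> None:
        self.initializing = True      # flipped off once setup completes

    def finish_init(self) -> None:
        self.initializing = False

    def decide(self, target: str) -> str:
        p = Path(target).resolve()
        if WORKSPACE in p.parents:
            return "allow"            # workspace reads are always fine
        if p in INIT_ONLY_READS and self.initializing:
            return "allow"            # permitted only during sandbox startup
        return "ask-user"             # enterprise denylist checks omitted here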
Require manual user approval each time an action would violate default-deny isolation controls
As described in the tiered implementation approach, default-deny actions that are not allowlisted or explicitly blocked should require manual user approval before execution. Enterprise-level denylists should never be overridden by user approval.
Critically, approvals should never be cached or persisted, as a single legitimate approval immediately opens the door to future adversarial abuse. For instance, permitting modification of ~/.zshrc once to perform a legitimate function may allow later adversarial activity to implant code on a subsequent execution without requiring re-approval. Each potentially dangerous action should require fresh user confirmation.
Use a secret injection approach to prevent secrets from being exposed to the agent
Developer environments commonly contain a wide range of secrets, such as API keys in environment variables, credentials in ~/.aws, tokens in .env files, and SSH keys. These secrets are often inherited by sandboxed processes or accessible within the filesystem, even when they are not required for the task at hand. This creates unnecessary exposure.
Even with network controls in place, exposed secrets remain a risk.
Sandbox environments should rely on explicit secret injection to scope credentials to the minimum required for a given task, rather than inheriting the full set of host environment credentials. In practice:
- Start the sandbox with a minimal or empty credential set.
- Remove any secrets that are not required for the current task.
- Inject required secrets based only on the specific task or project, ideally via a mechanism that is not directly accessible to the agent (e.g., a credential broker that provides short-lived tokens on demand rather than long-lived credentials in environment variables).
- Continue enforcing standard security practices such as least privilege for all secrets.
The goal is to limit the blast radius of any compromise so that a hypothetical attacker who gains control of agent behavior can only use secrets that have been explicitly provisioned for the current task, not the full set of credentials available on the host system.
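A sketch of the pattern: the sandbox starts from an empty environment, and a hypothetical broker call injects only task-scoped, short-lived tokens:

```python
import subprocess

# Hypothetical mapping from task to the secrets it actually needs.
TASK_SECRETS = {"deploy-docs": {"DOCS_API_TOKEN": "vault:docs/deploy"}}

def fetch_short_lived(ref: str) -> str:
    # Stand-in for a real broker call (e.g., Vault) returning a token
    # scoped to this task and valid for minutes rather than months.
    return f"ephemeral-token-for-{ref}"

def start_agent_task(task: str, cmd: list[str]) -> None:
    env = {"PATH": "/usr/bin:/bin"}           # empty slate: host env never leaks in
    for name, ref in TASK_SECRETS.get(task, {}).items():
        env[name] = fetch_short_lived(ref)    # inject only what this task needs
    subprocess.run(cmd, env=env, check=True)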
Establish lifecycle management controls for the sandbox
Long-running sandbox environments can accumulate artifacts over time: downloaded dependencies, generated scripts, cached credentials, intellectual property from previous projects, and temporary files that persist longer than intended. This expands the potential attack surface and increases the value of a compromise. An attacker who gains access to an agent operating in a stale sandbox may find secrets, proprietary code, or tools from earlier work that can be repurposed.
The details of lifecycle management vary based on sandbox architecture, initialization overhead, and project complexity. The key principle is ensuring that sandbox state does not persist indefinitely, whether through the approaches below (sketched in code after the list):
- Ephemeral sandboxes: Using sandbox architectures where the environment exists only for the duration of a specific task or command (e.g., Kata containers created and destroyed per execution), preventing accumulation.
- Explicit lifecycle management: Periodically destroying and recreating the sandbox environment in a known-good state (e.g., weekly for VM-based sandboxes), ensuring accumulated state is cleared on a known schedule.
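A sketch of the ephemeral variant as a context manager; the container runtime, image name, and workspace handling are all assumptions:

```python
import contextlib, shutil, subprocess, tempfile

@contextlib.contextmanager
def ephemeral_sandbox(image: str = "agent-toolchain:latest"):
    # Fresh scratch workspace and container per task; nothing survives.
    scratch = tempfile.mkdtemp(prefix="agent-")
    cid = subprocess.run(
        ["docker", "run", "-d", "--rm",
         "-v", f"{scratch}:/workspace", image, "sleep", "infinity"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    try:
        yield cid                  # run the task's commands via `docker exec`
    finally:
        subprocess.run(["docker", "stop", cid], check=False)   # --rm removes it
        shutil.rmtree(scratch, ignore_errors=True)             # clear the scratch dir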
While the provider of the agentic tool is responsible for ensuring lifecycle management, organizations should evaluate their sandbox architecture and establish lifecycle policies that balance initialization overhead and developer friction against accumulation risk.
Learn more
Agentic tools represent a significant shift in how developers work, offering productivity gains through automated code generation, testing, and execution. However, these benefits come with a corresponding expansion of the attack surface. As agentic tools continue to evolve, gaining new capabilities, integrations, and autonomy, the attack surface evolves with them. The principles outlined in this post should be revisited as new features are released, and organizations should regularly validate that their sandbox implementations provide the isolation and security controls they expect.
Learn more about agentic security from the NVIDIA AI Red Team, including:
