Train an AI Agent for Command-Line Tasks with Synthetic Data and Reinforcement Learning



What if your computer-use agent could learn a new Command Line Interface (CLI) and operate it safely without ever writing files or free-typing shell commands?

In Part 1 of our series on building a computer-use agent, we built a custom Bash computer-use agent using NVIDIA Nemotron in just one hour. In this sequel, we’ll take it further by teaching the same reasoning model, with no prior knowledge, to safely operate the LangGraph Platform CLI. This shows how easily a large reasoning model can be specialized to perform new, agentic tasks.
Instead of simple file operations, our new agent will learn to start local servers, build containers, and generate Dockerfiles, entirely through a verifiable, human-in-the-loop command interface.

We’ll combine synthetic data generation (SDG) and Reinforcement Learning with Verifiable Rewards (RLVR), optimized via Group Relative Policy Optimization (GRPO), to make training both efficient and safe.

You’ll fine-tune an AI agent that can:

  • Propose valid LangGraph CLI commands (e.g., langgraph dev --port 8123 --no-browser)
  • Ask for explicit human confirmation before executing
  • Learn new subcommands from synthetic seed data
  • Train efficiently on a single GPU using RLVR

Here’s what a typical interaction looks like once the model is trained:

[🙂] Bring the LangGraph server online.

[🤖] I can execute:
[COMMAND]
["langgraph", "up", "--wait"]
[CONFIRM]
Run this command now? (yes/no)

▶️  Execute `langgraph up --wait`? [y/N]: y

[🤖] Result:
Server started successfully on port 8000.

This pattern generalizes: the same workflow can be extended to support new CLI tools and environments.

Why use synthetic data generation and reinforcement learning to teach a new CLI?

Teaching an AI agent to operate a specialized CLI tool presents unique challenges that traditional approaches struggle with:

The data scarcity problem: Most specialized CLI tools lack the large usage logs needed for conventional training. Unlike common shell commands, tools like LangGraph have specific syntax, flags, and workflows that aren’t well represented in general training data. Waiting to collect real-world usage examples could take months or years.

The safety-accuracy tradeoff: You want your agent to be creative in understanding user intent, but absolutely precise when generating commands. A single typo or wrong flag could cause system errors or worse. Traditional fine-tuning often produces models that are either too conservative (refusing valid requests) or too permissive (hallucinating dangerous commands).

How SDG + RL solves this:

  • Synthetic data generation lets you bootstrap high-quality training examples from just a handful of seed commands, ensuring complete coverage of the CLI’s capabilities.
  • Reinforcement learning with verifiable rewards teaches the model to consistently produce syntactically correct commands by rewarding valid outputs and penalizing errors.
  • Together, they create a virtuous cycle: SDG provides diverse training scenarios, while RLVR ensures the model learns to handle them appropriately.

This approach is especially powerful for enterprise environments where you might need to quickly adapt agents to proprietary internal tools without waiting for organic data collection.

Prerequisites

For this setup, you’ll need:

Hardware requirements:

  • Access to an NVIDIA GPU with at least 80 GB of memory (e.g., A100)
  • Minimum 32 GB system RAM
  • 100 GB free disk space for model weights and datasets

Software requirements:

  • Python 3.10 or newer
  • CUDA 12.0+ and appropriate NVIDIA drivers

Core components:

  • LangGraph – The target CLI tool our agent will learn to operate
  • NeMo Gym – For building the RL training environment with tools and verifiable rewards
  • Unsloth – For efficient GRPO-based reinforcement learning with reduced VRAM requirements
  • NeMo Data Designer – For generating synthetic training data

Base model:

  • Nemotron-Nano-9B-V2 – Available on Hugging Face
  • Installation and usage instructions are provided in the linked documentation

Check out a video version of this tutorial:

Video 1. Use SDG and RL to produce a LangGraph CLI Bash agent.

Step 1: Design a synthetic dataset with NeMo Data Designer

Before training, we need data: pairs of natural-language requests mapped to LangGraph CLI invocations.

We’ll use the NVIDIA NeMo Data Designer to programmatically generate this dataset, starting from a handful of seed examples and expanding into hundreds of verified command pairs.

Why use synthetic data generation?

Think of synthetic data generation like teaching someone a new language by showing them a pattern, then having them create variations. Instead of collecting thousands of real examples (which might not exist yet), we:

  1. Provide a few high-quality “seed” examples
  2. Use an AI model to generate diverse variations
  3. Validate each generated example against strict rules
  4. Build a comprehensive dataset in hours instead of months (a conceptual sketch of this loop follows the list)
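
Here’s a conceptual sketch of that generate-validate-accumulate loop. It is not the NeMo Data Designer pipeline (that appears later in this post); generate_variation is a purely illustrative stand-in for an LLM call, filled with a template so the example runs end to end.

import random
import re

# Only records matching this pattern survive validation (step 3).
VALID_COMMAND = re.compile(r"^langgraph\s+(dev|build|up|dockerfile)\b")

# Illustrative stand-in for an LLM call that produces a request/command variation (step 2).
def generate_variation(port: int) -> tuple[str, str]:
    request = f"Start a local dev server on port {port}."
    command = f"langgraph dev --port {port} --no-browser"
    return request, command

def bootstrap_dataset(target_size: int = 200) -> list[tuple[str, str]]:
    dataset: list[tuple[str, str]] = []
    while len(dataset) < target_size:
        request, command = generate_variation(random.randint(3000, 9000))
        if VALID_COMMAND.match(command):      # reject anything that fails validation
            dataset.append((request, command))  # grow the dataset (step 4)
    return dataset

print(len(bootstrap_dataset()))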

The dataset structure

User request: “Start a local dev server on port 8123.”
CLI command: langgraph dev --port 8123 --no-browser
Confirmation: “Proceed with this command? (yes/no)”

User request: “Build the project image for both amd64 and arm64.”
CLI command: langgraph build -t my-graph:multi --platform linux/amd64,linux/arm64
Confirmation: “Run build now?”

Each generated record includes:

  • User request: Natural language that a human might actually type
  • CLI command: The precise, syntactically correct command to execute
  • Confirmation prompt: A safety check before execution

The validation process

In Data Designer, we steer diversity with sampling parameters and reject any record that fails validation. For example, we might use a regex pattern like:
^langgraph\s+(dev|build|up|dockerfile)\b

This ensures that:

  • Every command starts with langgraph
  • Only approved subcommands are used
  • The syntax is always valid

Finally, we export the dataset in OpenAI-style messages format, which is ideal for RLVR fine-tuning with the open-source NVIDIA NeMo framework.

This validation process matters: It guarantees that the reward verifier (introduced later) will be consistent with the structure and syntax of the training data.
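
To make the export format concrete, one record might look roughly like the following; the system prompt and field values are illustrative, not the exact output of NeMo Data Designer.

# Illustrative example of one exported record in OpenAI-style messages format.
record = {
    "messages": [
        {"role": "system", "content": "You operate the LangGraph CLI. Propose one command and ask for confirmation."},
        {"role": "user", "content": "Start a local dev server on port 8123."},
        {"role": "assistant", "content": "[COMMAND]\n[\"langgraph\", \"dev\", \"--port\", \"8123\", \"--no-browser\"]\n[CONFIRM]\nRun this command now? (yes/no)"},
    ]
}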

Let’s look at the implementation in NeMo Data Designer.

# Define seed distributions
command  = Sampler(["new", "dev", "up", "build", "dockerfile"])
port     = Sampler(range(3000, 9000))
template = Sampler(["react-agent", "memory-agent", "retrieval-agent"])

# Generate natural language input
user_request = LLM(
    prompt=f"Write a request to {command} with {template} on port {port}",
    model="nemotron-3-nano-30b-a3b"
)

# Generate structured output
tool_call = LLM(
    prompt=f"Convert '{user_request}' to CLI JSON.",
    schema=CLIToolCall,
    model="nemotron-3-nano-30b-a3b"
)

Step 2: Fine-tune with RLVR (using GRPO)

With clean, verified data in hand, we move to fine-tuning using Unsloth, an open source framework for efficient reinforcement learning that integrates with NeMo Gym training environments.

Reinforcement Learning with Verifiable Rewards (RLVR)

Traditional reinforcement learning from human feedback (RLHF) is like having a panel of judges rating each output—subjective, expensive, and inconsistent. RLVR replaces human judges with deterministic code-based verification.

Instead of asking humans, “Does this command look good?”, we ask code, “Does this command pass our validation rules?”

For a CLI agent, the verifier enforces rules such as:

  1. Output must start with langgraph
  2. Only approved subcommands and flags allowed
  3. No commentary, punctuation, or unsafe tokens

The reward system:

✅ Valid command → +1 reward (encourages this behavior)
❌ Invalid command → −1 reward (discourages this behavior)
⚪ Ambiguous output → 0 reward (neutral, no reinforcement)

This consistency is crucial: The same output always yields the same reward, making training stable and predictable. And because the verifier is just code, you can adjust constraints anytime without retraining a separate reward model.
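
A minimal sketch of such a verifier, under the reward scheme above, might look like this; the allowlist regex and the unsafe-token check are illustrative, not the exact rules used in the NeMo Gym environment.

import re

# Approved subcommands and simple flag/value syntax (illustrative allowlist).
ALLOWED = re.compile(r"langgraph\s+(dev|build|up|dockerfile)(\s+-{1,2}[\w-]+(\s+[\w./:,-]+)?)*")
# Shell metacharacters and other tokens that should never appear in a proposed command.
UNSAFE = re.compile(r"[;&|`$><]")

def verify(command: str) -> float:
    command = command.strip()
    if not command:
        return 0.0   # ambiguous or empty output: no reinforcement
    if UNSAFE.search(command):
        return -1.0  # unsafe tokens: penalize
    if ALLOWED.fullmatch(command):
        return 1.0   # valid command: reward
    return -1.0      # anything else (commentary, broken syntax): penalize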

Building the training environment with NeMo Gym

NeMo Gym is an open source library for building reinforcement learning training environments for LLMs. It provides the infrastructure to define tools, execute agent actions, and compute verifiable rewards, which is exactly what we need for training a CLI agent.

The CLI agent environment is implemented as a NeMo Gym resource server, which encapsulates:

  • Tool definitions – The CLI commands the agent can propose
  • Verification logic – Rules that check command validity and correctness
  • Reward computation – Scores (0.0 to 1.0) returned to the RL training loop

When the agent proposes commands, the resource server evaluates correctness and returns reward signals for GRPO training. This clean separation between environment logic and training framework means you can iterate on your CLI tools and validation rules without touching the RL code.
To learn more about creating custom environments, see the NeMo Gym documentation and the guide on creating resource servers.
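
As a rough mental model, the resource server bundles the three pieces listed above. The sketch below is plain Python, not the actual NeMo Gym base classes or method names (see the documentation linked above for those).

from dataclasses import dataclass

@dataclass
class ToolDefinition:
    # Tool definitions: the CLI command and subcommands the agent may propose.
    name: str = "langgraph"
    subcommands: tuple = ("dev", "build", "up", "dockerfile")

class CLIResourceServer:
    def __init__(self, tool: ToolDefinition):
        self.tool = tool

    def verify(self, proposed: list[str]) -> bool:
        # Verification logic: the command must name the tool and an approved subcommand.
        return (
            len(proposed) >= 2
            and proposed[0] == self.tool.name
            and proposed[1] in self.tool.subcommands
        )

    def reward(self, proposed: list[str], expected: list[str]) -> float:
        # Reward computation: a score in [0.0, 1.0] returned to the GRPO training loop.
        if not self.verify(proposed):
            return 0.0
        overlap = len(set(proposed) & set(expected))
        return overlap / max(len(expected), 1)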

Optimization via Group Relative Policy Optimization (GRPO)

GRPO is a simpler, more memory-efficient alternative to PPO. Instead of training a separate “critic” model to estimate how good each action is, GRPO samples multiple outputs for the same prompt and uses their average reward as the baseline. This cuts the model count in half (no critic needed) and reduces variance by comparing outputs against one another rather than against a learned estimate.

Here’s how it works in practice:

Traditional RL might struggle when most attempts fail. Imagine the model generates 10 command variations for the same prompt:

  • Nine are invalid (reward = 0)
  • One is valid (reward = 1)

Standard optimization might get lost in the noise of failures. GRPO instead:

  1. Groups all responses to the same prompt together
  2. Computes relative advantages within each group
  3. Strongly reinforces that one success, making it stand out from the failures

This approach dramatically improves learning efficiency and convergence speed, helping the model quickly learn what makes a command valid.
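
Here is a tiny numeric illustration of the group-relative baseline for that 9-failures-and-1-success case (the full GRPO update then feeds these advantages into a clipped policy-gradient step):

# 10 sampled commands for one prompt: 9 invalid (reward 0), 1 valid (reward 1).
rewards = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

mean = sum(rewards) / len(rewards)                                   # group baseline = 0.1
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5  # = 0.3

advantages = [(r - mean) / (std + 1e-8) for r in rewards]
# The single valid command gets a large positive advantage (about +3.0),
# while each failure gets a small negative one (about -0.33), so the one
# success is strongly reinforced relative to its group.
print(advantages)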

Let’s see how we’d implement this with Unsloth and NeMo Gym:

# The "Verifiable Reward" Function
def compute_reward(agent_output, expected):
    try:
        cmd = json.loads(agent_output)

        # Hard Rule: Command must match expectation
        if cmd.name != expected.name:
            return -1.0  # Penalize hallucinations

        # Soft Rule: Flags have to be accurate
        accuracy = calculate_flag_accuracy(cmd.flags, expected.flags)
        return accuracy

    except JSONDecodeError:
        return -1.0  # Penalize broken syntax

# Start GRPO Training
grpo.train(
    model="nemotron-nano-9B-v2",
    algorithm="GRPO",
    env=compute_reward,
    dataset=synthetic_data
)

Step 3: Human-in-the-loop execution

Once fine-tuned, we embed the model into a runtime loop that always requests human confirmation before execution. This maintains the safety architecture introduced in Part 1, ensuring no command runs without explicit approval.

The safety architecture

subprocess.run(argv, shell=False)

This simple line embodies an important security principle. By setting shell=False, we ensure the following (a minimal confirm-and-execute sketch appears after this list):

  • Commands execute as discrete argument lists (e.g., ["langgraph", "up", "--wait"])
  • Shell metacharacters like &&, ;, or | are treated as literal text, not operators
  • Command injection attacks become impossible
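
Here is a minimal sketch of the confirm-and-execute step, assuming the model’s proposal has already been parsed into an argument list; the allowlist is illustrative and mirrors the runtime verification described below.

import subprocess

ALLOWED_SUBCOMMANDS = {"dev", "build", "up", "dockerfile"}

def confirm_and_run(argv: list[str]) -> None:
    # Runtime verification: only allowlisted langgraph subcommands may run.
    if len(argv) < 2 or argv[0] != "langgraph" or argv[1] not in ALLOWED_SUBCOMMANDS:
        print(f"Refusing to run unapproved command: {argv}")
        return

    # Human confirmation: nothing executes without an explicit "y".
    answer = input(f"Execute `{' '.join(argv)}`? [y/N]: ").strip().lower()
    if answer != "y":
        print("Skipped.")
        return

    # Execution isolation: argv is passed as a list, never through a shell.
    result = subprocess.run(argv, shell=False, capture_output=True, text=True)
    print(result.stdout or result.stderr)

confirm_and_run(["langgraph", "up", "--wait"])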

The entire safety chain

Our multi-layered approach ensures safety at every step:

  1. Training-time safety: RLVR ensures the model learns to generate valid commands.
  2. Runtime verification: A validator checks every proposed command against allowlists.
  3. Human confirmation: Users must explicitly approve each command before execution.
  4. Execution isolation: Commands run without shell interpretation, preventing injection.

Even when the model occasionally produces an invalid command despite training, the runtime policy prevents it from being executed.

Why RLVR + synthetic data work for customizing Agentic AI

This combination creates a powerful synergy:

  • NeMo Data Designer – Role: generates realistic, diverse, structured AI training data with built-in validation. Why it matters: solves the cold-start problem, so you can train without waiting for real usage data.
  • NeMo Gym – Role: provides the training environment with CLI tools and verifiable reward logic. Why it matters: defines which actions are valid and how success is measured.
  • Unsloth (RLVR + GRPO) – Role: executes efficient GRPO training with 80% less VRAM. Why it matters: makes RL training accessible on a single GPU while maintaining quality.
  • Human approval loop – Role: serves as the final safety gate, keeping users in control. Why it matters: maintains trust, since users always have the final say before any action occurs.

Table 1. The role each component of the training pipeline plays, and why it matters to the problem

The result: We can teach Nemotron-Nano-9B-V2 to precisely and safely operate a new CLI tool, all without full retraining or compromising on safety.

Closing thoughts

By extending our Bash operator into a LangGraph-aware computer-use agent, we’ve demonstrated how synthetic data generation and RLVR (with GRPO) form a powerful recipe for rapidly specializing large reasoning models to new toolchains.

The workflow generalizes cleanly to any CLI tool:

  1. Use NeMo Data Designer to define structured, verifiable examples
  2. Build a NeMo Gym environment with your CLI tools and verification logic
  3. Fine-tune efficiently with Unsloth’s GRPO
  4. Maintain human-in-the-loop execution for safety

This pattern lets you turn any capable large language model (LLM) into a domain-specific, verifiably safe computer-use agent, from LangGraph today to your proprietary internal tools tomorrow.

The implications are significant: Instead of waiting months to collect training data or accepting the risks of uncontrolled command generation, you can deploy specialized, safe CLI agents in days. Whether you’re automating DevOps workflows, creating customer support tools, or building internal productivity agents, this approach provides a fast, safe path from idea to production.

Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.


