The scientific process can be repetitive and tedious, with researchers spending hours digging through papers, managing experiment workflows, or wrangling massive multi-modal datasets. Scientific AI agents can tackle much of that busywork, acting as assistants that review literature, generate hypotheses, plan experiments, submit computational jobs, orchestrate lab operations, analyze results, and summarize findings. That frees up researchers to focus on creative thinking and scientific discovery.
But building scientific AI assistants is difficult. Agents must maintain a high-level plan over many steps of research, incorporating memory and context management, and a single mistake can derail a research task. Furthermore, domain-specific tools are difficult for general-purpose LLMs to leverage, especially in cutting-edge research areas. Verification of results with computational or real-world data can take an extended time, requiring an agent to maintain coherence over hours, days, or more.
Available as open-source libraries within the NVIDIA NeMo framework suite, NVIDIA NeMo Gym and NeMo RL offer a unified, modular reinforcement learning stack for building reliable agentic AI across any domain, including scientific research. NeMo Gym enables developers to create realistic environments where agents can interact, learn, and solve domain-specific tasks, generating high-quality, verifiable, domain-specific rollout data. This training data can then be used with NeMo RL to adapt and improve these agents efficiently at scale.
Both libraries played a key role in the post-training of the latest Nemotron-3-Nano, a cost-efficient model optimized for targeted tasks, delivering high accuracy at low inference cost.
One developer using NeMo Gym and NeMo RL is Edison Scientific, which is working on automating scientific discovery. The spinoff of nonprofit research organization FutureHouse uses the infrastructure to power Aviary, a framework of scientific RL training environments spanning biology, chemistry, and related domains.
In this blog, we show how to implement agentic training environments using NeMo Gym and how to use them for training with NeMo RL. We feature Aviary as an example of a domain-specific reinforcement-learning environment for science.
How reinforcement learning extends LLM capabilities for science
Not all LLMs can execute complex scientific workflows. Pre-training teaches a model to predict the next token, which builds broad knowledge but not domain skills. This foundation allows zero-shot performance on structured factual questions, such as gene–disease links, drug mechanisms, or clinical timelines. Post-training then teaches the model to follow instructions and reflect domain preferences through iterative tuning and alignment.
Post-training often begins with supervised fine-tuning (SFT), where the model learns from instruction-response pairs using a next-token prediction log-likelihood loss. This process depends heavily on high-quality expert or filtered synthetic data and is sensitive to errors. SFT is limited by the coverage of its datasets, and the loss function only rewards reproducing the reference answer, even when alternative correct outputs exist, such as different valid code implementations.
Training pipelines therefore add reinforcement learning (RL) to expand a model's ability to reason and act beyond supervised data. RL uses a reward function to score outputs from the model, or policy, during training. In reinforcement learning from human feedback (RLHF), humans rank responses based on their preference or a rubric. Reinforcement learning from AI feedback (RLAIF) removes the human preference step by using an LLM as a judge. Reinforcement learning with verifiable rewards (RLVR) uses computational checks, such as code execution, to provide objective and repeatable reward signals.
RLVR is especially useful for training scientific agents because it allows models to design and run experiments, evaluate outcomes, and optimize toward scientific metrics through verification design and reward shaping. Scientific RL can be run in multi-step environments where an agent takes actions, observes feedback, and continues until a task is complete. Training may use full trajectories or individual state transitions. Through RL, scientific agents can compose skills learned in pre-training and SFT to build new workflows and achieve specific scientific goals.
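To make RLVR concrete, the sketch below shows a minimal verifiable reward function: it executes a model-generated Python snippet and scores it against an expected output, so the reward comes from running the artifact rather than from matching a reference text. The function and task format are illustrative only, not part of NeMo Gym or NeMo RL:

import subprocess
import sys

def verifiable_reward(generated_code: str, expected_stdout: str, timeout_s: float = 5.0) -> float:
    """Illustrative RLVR-style check: reward 1.0 if executing the generated
    code reproduces the expected output, otherwise 0.0."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", generated_code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0
    if result.returncode != 0:
        return 0.0
    return 1.0 if result.stdout.strip() == expected_stdout.strip() else 0.0

# Example: a generated snippet is rewarded only if it prints the correct mean.
print(verifiable_reward("print(sum([1, 2, 3]) / 3)", "2.0"))  # -> 1.0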
How NeMo Gym and NeMo RL improve agentic training and evaluation
Implementing RL for LLM agents requires a training framework and an environment to define what the agent can do, what it observes, and what rewards it gets for its actions. The training framework, such as NeMo RL, runs training algorithms like group relative policy optimization (GRPO), manages compute for rollouts and verification, and orchestrates updates to the model weights. The latest NeMo RL release supports on-policy distillation, async RL, advanced RL algorithms, and end-to-end FP8 RL training.
An agent drives the interaction loop with the environment by taking actions and leveraging the necessary tools, while the environment provides observations and rewards for actions, maintains a persistent state, and determines when a task is complete. Environments may range from a simple Python execution sandbox to a full research software stack for evaluating workflows such as molecular cloning.
Training an AI scientist requires models that excel at many complex tasks. In practice, this means hundreds to thousands of diverse tasks across use cases like literature synthesis, hypothesis generation, experimental design, and data analysis, each requiring its own verification logic. As task diversity grows, managing training environment infrastructure becomes difficult due to varied dependencies and domain-specific requirements. To handle this, we created NeMo Gym, an open source framework for building RL training environments at scale.
NeMo Gym serves as the hub for RL data, environments, and reward signals used in LLM post-training. It provides the infrastructure to develop training environments, scale rollout collection, and integrate seamlessly with your preferred training framework. Environments are isolated and expose REST APIs, enabling parallel execution and scalable deployments without dependency conflicts.
NeMo Gym provides three core server abstractions. A training environment typically includes all three server types working together, as sketched in the example after this list:
- Model: Wraps OpenAI-compatible endpoints with reasoning and tool-calling support. Models can run locally or in the cloud and work with multiple backends, including OpenAI, Azure, and vLLM. This abstraction separates model deployment from agent logic.
- Resources: Provides tool implementations that can be invoked via tool calling, plus verification logic that measures task performance. This abstraction offloads heavy processing so agents can asynchronously call both models for inference and resources for tool execution and verification.
- Agents: Orchestrate interactions between models and resources—routing requests, coordinating multi-turn conversations, and formatting responses consistently.
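As a purely conceptual sketch of how these pieces fit together (the model server is genuinely OpenAI-compatible, but the `execute_tool` and `verify` helpers below are hypothetical stand-ins for resources-server calls, not NeMo Gym's actual API), an agent loop looks roughly like this:

from openai import OpenAI

# Model server: any OpenAI-compatible endpoint, for example the vLLM deployment in Step 2 below.
model = OpenAI(base_url="http://localhost:10240/v1", api_key="EMPTY")
MODEL_NAME = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"

def execute_tool(tool_call) -> str:
    """Hypothetical stand-in for invoking a tool on a resources server."""
    raise NotImplementedError

def verify(conversation) -> float:
    """Hypothetical stand-in for a resources-server verification call."""
    raise NotImplementedError

# Agent: orchestrates the multi-turn loop between the model and the resources.
messages = [{"role": "user", "content": "Analyze this dataset and report the key finding."}]
while True:
    # In a real setup, tool schemas provided by the resources server are passed via tools=.
    response = model.chat.completions.create(model=MODEL_NAME, messages=messages)
    message = response.choices[0].message
    messages.append(message)
    if not message.tool_calls:
        break  # the model produced a final answer
    for tool_call in message.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": execute_tool(tool_call),
        })
reward = verify(messages)  # reward signal used for RL training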
NeMo Gym generates rollouts and rewards from complex training environments, producing the optimization targets that RL training requires. Interoperable with existing environments, systems, and RL training frameworks, NeMo Gym lets users leverage both custom and NVIDIA-curated environments for LLM post-training. When paired with NeMo RL for training algorithms and infrastructure, the two libraries provide a scalable pipeline for agentic training and reinforcement learning.
NeMo Gym in practice: Training scientific reasoning agents at Edison Scientific
Edison Scientific is using NeMo Gym and NeMo RL to scale AI agents that automate scientific discovery. That includes Aviary, which can train agents in biology, chemistry, and related domains. It can perform tasks such as literature research, bioinformatic data analysis, laboratory tasks like solving molecular cloning problems, and multi-step scientific problem-solving.
Aviary manages state, tool execution, rewards, and observation formatting for RL environments. Its open source repository includes environments for math, scientific literature research, and data analysis. NeMo Gym runs on top of Aviary, allowing Aviary to control its environment logic while NeMo Gym provides scalable rollout collection, additional NVIDIA-curated training environments, and integration with NeMo RL for training at scale.
Each Aviary environment implements two core methods: reset() and step(). The reset method initializes the environment, returns the first observation, and lists the available tools. The step method executes an action and returns new observations, a reward, and termination or truncation signals. Actions are tool requests that may include multiple tool calls.
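As a rough illustration of that interface, here is a toy environment sketch. It is simplified, and the exact base-class, Message, and Tool signatures should be taken from the Aviary repository, but it shows the reset/step shape described above:

from aviary.core import Environment, Message, Tool, ToolRequestMessage


class CountingEnv(Environment[dict]):
    """Toy environment: the agent must call add() until a counter reaches a target."""

    async def reset(self) -> tuple[list[Message], list[Tool]]:
        # Initialize state, return the first observation and the available tools.
        self.state = {"count": 0, "target": 3}
        tools = [Tool.from_function(self.add)]
        obs = [Message(content=f"Call add() until the counter reaches {self.state['target']}.")]
        return obs, tools

    def add(self) -> str:
        """Increment the counter by one and report its value."""
        self.state["count"] += 1
        return f"count = {self.state['count']}"

    async def step(self, action: ToolRequestMessage) -> tuple[list[Message], float, bool, bool]:
        # Execute the requested tool call(s), then compute reward and termination.
        obs = await self.exec_tool_calls(action)
        done = self.state["count"] >= self.state["target"]
        reward = 1.0 if done else 0.0
        return obs, reward, done, False  # observations, reward, terminated, truncated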
Using Aviary through NeMo Gym, Edison Scientific is training a Jupyter-notebook data-analysis agent for bioinformatics tasks. At each step, the agent views the notebook and edits a cell. Notebook size can exceed the model context window, so Edison Scientific added two features to manage context growth. The company drops interaction history so the agent sees only the original instruction, all previous actions, and the current notebook, and it modified GRPO grouping to operate on individual steps rather than full trajectories. This allows training on transitions, reduces context length, and enables step-level reward signals.
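The effect of that grouping change can be pictured with a toy calculation (illustrative only, not Edison Scientific's implementation): GRPO normalizes rewards within a sampling group, and the group is simply redefined from several full trajectories of the same task to several candidate actions sampled from the same notebook state.

import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: normalize rewards within one sampling group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Trajectory-level grouping: one group = several full rollouts of the same task.
print(grpo_advantages(np.array([1.0, 0.0, 1.0, 0.0])))

# Step-level grouping: one group = several candidate actions sampled from the
# same intermediate notebook state, each scored with a step-level reward.
print(grpo_advantages(np.array([0.2, 0.9, 0.4, 0.9])))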
As a testbed, Edison Scientific built a Jupyter-based data analysis environment in Aviary, integrated it with NeMo Gym, and introduced a benchmark of verifiable bioinformatics questions called BixBench.


Building agentic environments in NeMo Gym for training or downstream use is straightforward, requiring just a few steps.
Step 1: Install NeMo Gym
Clone the NeMo Gym repo, install the uv Python package manager, and create a virtual environment:
# Clone the repository
git clone git@github.com:NVIDIA-NeMo/Gym.git
cd Gym
# Install UV (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
# Create virtual environment
uv venv --python 3.12
source .venv/bin/activate
# Install NeMo Gym
uv sync --extra dev --group docs
Step 2: Configure the model
You can use a hosted model, such as one from OpenAI, or deploy a model locally, such as through NVIDIA NIM or vLLM. In this example, we will use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 from HuggingFace and deploy the model with vLLM with tool calling enabled. For more detailed information on how to use the model with vLLM, see this cookbook.
pip install -U "vllm>=0.12.0"
wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/foremost/nano_v3_reasoning_parser.py
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
--max-num-seqs 8
--tensor-parallel-size 1
--max-model-len 262144
--port 10240
--trust-remote-code
--tool-call-parser qwen3_coder
--enable-auto-tool-choice
--reasoning-parser-plugin nano_v3_reasoning_parser.py
--reasoning-parser nano_v3
Then, create an env.yaml file in the NeMo Gym root directory:
policy_base_url: http://localhost:10240/v1
policy_api_key: EMPTY
policy_model_name: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
Step 3: Test an Aviary environment with a simple agent in NeMo Gym
Now let’s run an agent through the GSM8K environment, a math problem set where the agent can use a calculator tool.
In NeMo Gym, the `ng_run` command launches servers. To configure the servers, config files must be provided. Here, we provide two config files: `gsm8k_aviary.yaml` configures the resources server and agent server, and `vllm_model.yaml` defines the model server.
ng_run "+config_paths=[resources_servers/aviary/configs/gsm8k_aviary.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]"
Once all servers are running, you should see logs similar to the following:
All 3 / 3 servers ready! Polling every 60s
####################################################################################################
#
# Server Instances
#
####################################################################################################
[1] gsm8k_aviary_resources_server (resources_servers/aviary)
{
'process_name': 'gsm8k_aviary_resources_server',
'server_type': 'resources_servers',
'name': 'aviary',
'dir_path': (
'/home/ubuntu/Gym/resources_servers/aviary'
),
'entrypoint': 'gsm8k_app.py',
'host': '127.0.0.1',
'port': 18575,
'pid': 1582343,
'config_path': 'gsm8k_aviary_resources_server',
'url': 'http://127.0.0.1:18575',
}
[2] gsm8k_aviary_agent (responses_api_agents/aviary_agent)
{
'process_name': 'gsm8k_aviary_agent',
'server_type': 'responses_api_agents',
'name': 'aviary_agent',
'dir_path': (
'/home/ubuntu/Gym/responses_api_agents/aviary_agent'
),
'entrypoint': 'app.py',
'host': '127.0.0.1',
'port': 63115,
'pid': 1582344,
'config_path': 'gsm8k_aviary_agent',
'url': 'http://127.0.0.1:63115',
}
[3] policy_model (responses_api_models/vllm_model)
{
'process_name': 'policy_model',
'server_type': 'responses_api_models',
'name': 'vllm_model',
'dir_path': (
'/home/ubuntu/Gym/responses_api_models/vllm_model'
),
'entrypoint': 'app.py',
'host': '127.0.0.1',
'port': 55951,
'pid': 1582347,
'config_path': 'policy_model',
'url': 'http://127.0.0.1:55951',
}
####################################################################################################
Next, run the agent in the GSM8K environment. The following command runs the simple agent on the five example problems in the input file and writes the agent trajectories to the output file.
ng_collect_rollouts \
    +agent_name=gsm8k_aviary_agent \
    +input_jsonl_fpath=resources_servers/aviary/data/gsm8k_example.jsonl \
    +output_jsonl_fpath=results/gsm8k_aviary_rollouts.jsonl
You should see output showing the average reward of the trajectories:
Collecting rollouts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:18<00:00, 3.71s/it]
{
"reward": 1.0,
}
To view the trajectories, NeMo Gym provides a simple UI:
ng_viewer +jsonl_fpath=results/gsm8k_aviary_rollouts.jsonl
Step 4: Build a new environment
To create a new environment in NeMo Gym, you can first build the environment in Aviary and then easily create a new resources server through the Aviary integration, or you can create a custom environment from scratch directly in NeMo Gym. In this example, let's add the Aviary HotPotQA environment to NeMo Gym.
First, create `resources_servers/aviary/hotpotqa_app.py`, which extends the base Aviary resources server:
from pydantic import Field

from aviary.envs.hotpotqa import HotPotQADataset, HotPotQAEnv
from resources_servers.aviary.app import AviaryResourcesServer


class HotPotQAResourcesServer(AviaryResourcesServer[HotPotQAEnv, HotPotQADataset]):
    dataset: HotPotQADataset = Field(default_factory=lambda: HotPotQADataset(split="train"))


if __name__ == "__main__":
    HotPotQAResourcesServer.run_webserver()
Next, create a configuration file in `resources_servers/aviary/configs/hotpotqa_aviary.yaml`:
hotpotqa_aviary_resources_server:
  resources_servers:
    aviary:
      entrypoint: hotpotqa_app.py

hotpotqa_aviary_agent:
  responses_api_agents:
    aviary_agent:
      entrypoint: app.py
      resources_server:
        type: resources_servers
        name: hotpotqa_aviary_resources_server
      model_server:
        type: responses_api_models
        name: policy_model
datasets:
  - name: train
    type: train
    jsonl_fpath: resources_servers/aviary/data/hotpotqa_train.jsonl
    gitlab_identifier:
      dataset_name: hotpotqa_train
      version: 0.0.1
      artifact_fpath: hotpotqa_train.jsonl
    license: Apache 2.0
  - name: validation
    type: validation
    jsonl_fpath: resources_servers/aviary/data/hotpotqa_validation.jsonl
    gitlab_identifier:
      dataset_name: hotpotqa_validation
      version: 0.0.1
      artifact_fpath: hotpotqa_validation.jsonl
    license: Apache 2.0
  - name: hotpotqa_example
    type: example
    jsonl_fpath: resources_servers/aviary/data/hotpotqa_example.jsonl
    gitlab_identifier:
      dataset_name: hotpotqa_example
      version: 0.0.1
      artifact_fpath: hotpotqa_example.jsonl
    license: Apache 2.0
Then create an example dataset in `resources_servers/aviary/data/hotpotqa_example.jsonl`, which provides task indices to retrieve samples from the underlying Aviary environment dataset:
{"task_idx":0,"responses_create_params":{"input":[]}}
{"task_idx":1,"responses_create_params":{"input":[]}}
{"task_idx":2,"responses_create_params":{"input":[]}}
{"task_idx":3,"responses_create_params":{"input":[]}}
{"task_idx":4,"responses_create_params":{"input":[]}}
Lastly, update `requirements.txt` to include the `hotpotqa` extra from Aviary:
-e nemo-gym[dev] @ ../../
fhaviary[gsm8k,hotpotqa,notebook,llm]>=0.24.1
tqdm
datasets
huggingface-hub
With these four changes, we can now run the Aviary HotPotQA environment in NeMo Gym.
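To try it out, launch the servers and collect rollouts the same way as in the GSM8K example. The commands below simply reuse that pattern with the new config and agent names, so adjust them if your setup differs:

ng_run "+config_paths=[resources_servers/aviary/configs/hotpotqa_aviary.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]"

ng_collect_rollouts \
    +agent_name=hotpotqa_aviary_agent \
    +input_jsonl_fpath=resources_servers/aviary/data/hotpotqa_example.jsonl \
    +output_jsonl_fpath=results/hotpotqa_aviary_rollouts.jsonl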
Visit the NeMo Gym repository for more ready-to-use training environments. The product documentation also provides a more comprehensive overview of key concepts, how to create resources servers, and how to perform RL training. Check out the latest NeMo RL release, which supports on-policy distillation, async RL, advanced RL algorithms, and end-to-end FP8 RL training.
Best practices for building scientific agents
Building scientific agents is difficult, but the following practices can help teams make steady progress toward more capable systems.
- Start simple. Begin with a basic agent rather than a multi-agent system with many tools. Use outcome-based rewards before introducing complex reward structures, which can lead to reward hacking.
- Profile rewards. Training with GRPO-style algorithms works well when the model can produce a diverse set of solutions to a task, some of which are correct. Measuring the mean and standard deviation of reward for each task over multiple attempts can help you build a more efficient training environment for a model (see the sketch after this list).
- Monitor training metrics. Various metrics describing training stability, model behavior, and learning progress are automatically logged to Weights & Biases. For example, sampling issues, model collapse, or truncated trajectories can be detected by analyzing these metrics.
- Train longer. Training with RLVR-based methods can show little learning in early stages, followed by a steeper learning curve later in training. This can occur when the model initially struggles to find correct solutions for the tasks but later discovers a strategy that works.
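For the reward-profiling suggestion above, a small script over collected rollouts is usually enough. The sketch below assumes each rollout record carries a `reward` field, as in the ng_collect_rollouts output shown earlier, plus some task identifier; the `task_idx` key is a hypothetical example, so use whatever identifier your environment writes.

import json
from collections import defaultdict
from statistics import mean, pstdev

# Group rewards by task and report mean/std over multiple attempts.
rewards_by_task = defaultdict(list)
with open("results/gsm8k_aviary_rollouts.jsonl") as f:
    for line in f:
        record = json.loads(line)
        rewards_by_task[record.get("task_idx")].append(record.get("reward", 0.0))

# Tasks with zero spread (always right or always wrong) contribute little GRPO
# learning signal; tasks with mixed outcomes are the most useful for training.
for task, rewards in sorted(rewards_by_task.items(), key=lambda kv: str(kv[0])):
    spread = pstdev(rewards) if len(rewards) > 1 else 0.0
    print(f"task {task}: n={len(rewards)} mean={mean(rewards):.2f} std={spread:.2f}")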
These steps provide a practical path to building and training scientific agents at scale with NeMo Gym, NeMo RL, and Aviary. Start building your own scientific agent today, and check out the new NVIDIA Nemotron 3 model family, with Nano available now.
Contributors to this work included Brian Yu, Chris Wing, Elliot Eshelman, and Sylendran Arunagiri.
