Train Small Orchestration Agents to Solve Big Problems



Choosing the right tool and model for a task is a difficult and ever-present engineering problem in agent design. At NVIDIA Research, we’re making fast progress toward automating it away with an approach that trains and uses a separate model, which we call an “orchestrator,” to act as a supervisor over all other models and tools.

The orchestrator’s job is to consider the task in the context of user preferences (does the user need the result fast, at low cost, with the best possible accuracy, or some combination of those?) and then manage other models and call on tools within the task-solving conversation to reach the goal. Crucially, as it turns out, small models are already powerful enough to handle this job if tuned appropriately.
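To make the idea concrete, here is a minimal sketch of that reason-then-dispatch loop. Everything in it (the `Preference` fields, the tool registry, the routing rule) is an illustrative assumption, not ToolOrchestra’s actual API: a trained orchestrator learns this routing rather than hard-coding it.

```python
from dataclasses import dataclass


@dataclass
class Preference:
    """User preferences the orchestrator conditions on (illustrative)."""
    max_cost: float = 1.0       # dollars the user is willing to spend
    prefer_speed: bool = False  # favor latency over accuracy when True


# Hypothetical registry of callables the orchestrator can dispatch to,
# each annotated with a rough per-call dollar cost.
TOOLS = {
    "calculator": (lambda q: str(eval(q, {"__builtins__": {}})), 0.0),
    "small_llm":  (lambda q: f"[small-llm answer to: {q}]", 0.01),
    "large_llm":  (lambda q: f"[large-llm answer to: {q}]", 0.50),
}


def orchestrate(query: str, pref: Preference) -> str:
    """One reason-act turn: pick the cheapest tool that plausibly
    suffices under the user's preferences, then call it."""
    # "Reasoning" step, here a trivial hand-written routing rule:
    if any(op in query for op in "+-*/"):
        choice = "calculator"            # arithmetic needs no LLM at all
    elif pref.prefer_speed or pref.max_cost < 0.5:
        choice = "small_llm"             # cheap and fast, lower accuracy
    else:
        choice = "large_llm"             # expensive, highest accuracy
    fn, cost = TOOLS[choice]
    assert cost <= pref.max_cost, "budget exceeded"
    return fn(query)
```

A real orchestrator alternates many such turns, feeding each tool’s output back into its reasoning; this sketch shows only a single routing decision.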

While it may be surprising to place large models subordinate to small models, the arrangement plays to their respective strengths. Small models are unburdened by excessive knowledge; their limited size forces them to capture the essence of problem-solving.

To construct orchestrators, we introduce ToolOrchestra, our flagship method, which involves data preparation, synthetic data generation, multi-objective reinforcement-learning training, and comprehensive evaluation of orchestration methods and models.

Diagram showing how an AI orchestrator coordinates tools and models to answer a user’s query efficiently. The Orchestrator uses multi-turn reasoning and calls basic tools, specialized LLMs, and generalist LLMs, optimizing for outcome, efficiency, and cost preference through reinforcement learning.
Figure 1. Overview of the orchestrator: when given a task, it alternates between reasoning and tool calling over multiple turns to resolve it

Why train an orchestrator?

You might be wondering: “Using an orchestrator is an intriguing concept, but why should I train a model for it? Wouldn’t it be enough to simply edit the prompts of my agent to act as an orchestrator?” The short answer is no. The reason ToolOrchestra-trained orchestrators outperform other methods lies in the training objectives. During training, the orchestrator generates experimental trajectories. Some solve the problem better than others. Some reach the correct solution cheaply and quickly, while others make extensive use of expensive tools and take a long time to arrive at a conclusion. ToolOrchestra’s reinforcement-learning setup explicitly rewards high problem-solving accuracy, low cost, and short time-to-solution based on the cost preferences for the given problem.
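The paragraph above describes a multi-objective reward. A minimal sketch of one, under assumptions of our own: a linear combination of a correctness term with cost and latency penalties, with made-up weights and normalization scales. ToolOrchestra’s actual reward is described in the paper; this only illustrates the shape of the trade-off.

```python
def trajectory_reward(correct: bool, cost_usd: float, latency_s: float,
                      w_cost: float = 0.5, w_latency: float = 0.2) -> float:
    """Score one trajectory: reward a correct outcome, penalize dollar
    cost and wall-clock latency. Weights reflect the user's preferences;
    the normalization scales (10 USD, 60 s) are illustrative assumptions."""
    accuracy_term = 1.0 if correct else 0.0
    cost_term = min(cost_usd / 10.0, 1.0)      # clip so cost can't dominate
    latency_term = min(latency_s / 60.0, 1.0)  # likewise for latency
    return accuracy_term - w_cost * cost_term - w_latency * latency_term
```

Under this reward, a trajectory that solves the task cheaply and quickly scores higher than one that solves it with expensive tools, which in turn scores higher than a fast but incorrect one — exactly the ordering the RL training reinforces.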

What are the outcomes of using an orchestrator?

To demonstrate the effectiveness of ToolOrchestra, we trained a small model, Orchestrator-8B, to tackle some of the most difficult tasks available, including the problems of Humanity’s Last Exam, FRAMES, and τ²-Bench.

We then gave out-of-the-box monolithic LLMs, prompted orchestrators running on frontier LLMs, and Orchestrator-8B access to the same tools, and measured their performance. The results are shown in Table 1. In summary, Orchestrator-8B outperforms all its competitors regardless of their size or advertised level of capability, while incurring the smallest cost and problem-solving latency.

| Tools | Model(s) | HLE (↑) | FRAMES (↑) | τ²-Bench (↑) | Cost (↓) | Latency (↓) |
|---|---|---|---|---|---|---|
| Existing reported SOTA | GPT-5 | 35.2 | 84.2‡ | | | |
| | o3 | 24.3 | 68.4 | | | |
| | GPT-4o | 5.3 | 43.8 | | | |
| No tool | Qwen3-8B | 3.2 | 24.2 | –* | 0.2 | 0.6 |
| | Llama-Nemotron-49B | 3.6 | 25.6 | –* | 0.4 | 1.1 |
| | Llama-3.3-70B | 3.8 | 32.4 | –* | 0.5 | 1.4 |
| | Qwen3-235B-A22B | 5.2 | 34.3 | –* | 2.6 | 3.3 |
| | Claude Opus 4.1 | 11.7 | 58.2 | –* | 27.4 | 8.2 |
| | GPT-5 | 23.4 | 66.3 | –* | 6.2 | 4.1 |
| Basic tools | Qwen3-8B | 4.7 | 26.5 | 40.7 | 1.3 | 2.2 |
| | Llama-Nemotron-49B | 6.8 | 28.2 | 23.2 | 2.5 | 3.5 |
| | Llama-3.3-70B | 4.6 | 42.3 | 17.6 | 2.8 | 4.3 |
| | Qwen3-235B-A22B | 14.0 | 39.5 | 52.9 | 12.3 | 10.2 |
| | Claude Opus 4.1 | 19.8 | 63.5 | 46.0 | 76.2 | 32.5 |
| | GPT-5 | 35.1 | 74.0 | 77.7 | 30.2 | 19.8 |
| Basic tools, specialized LLMs, generalist LLMs | Qwen3-8B | 30.6 | 68.9 | 72.3 | 27.6 | 18.3 |
| | Llama-Nemotron-49B | 25.8 | 57.9 | 66.7 | 25.6 | 17.1 |
| | Llama-3.3-70B | 19.7 | 52.4 | 55.8 | 19.7 | 13.4 |
| | Qwen3-235B-A22B | 32.8 | 74.2 | 75.6 | 29.7 | 21.2 |
| | Claude Opus 4.1 | 34.6 | 72.8 | 76.8 | 52.5 | 25.6 |
| | GPT-5 | 21.2 | 57.5 | 62.3 | 17.8 | 13.6 |
| | Orchestrator-8B | 37.1 | 76.3 | 80.2 | 9.2 | 8.2 |
Table 1. A comparison of Orchestrator-8B with baselines

To drive the point of Orchestrator-8B’s efficiency home, we measured the accuracy and cost of leading frontier models and Orchestrator-8B while restricting each model’s reasoning and acting to 10, 20, 50, and 100 conversational turns. The result is visualized in the figure below. We observed that regardless of the conversational length limit imposed on the competing systems, Orchestrator-8B always outperforms its competition while maintaining a lower dollar cost.

Scatter plot showing HLE Accuracy (%) versus Cost ($) for multiple LLMs. Orchestrator-8B achieves higher accuracy than other models at the same cost and maintains the same quality at a lower cost. GPT-5 and Grok-4 perform well but are more expensive, while Claude Opus 4.1, Qwen3-235B-A22B, and Llama-3.3-70B have lower accuracy. The plot highlights Orchestrator-8B’s superior performance-cost efficiency compared to SOTA baselines.
Figure 2. Orchestrator-8B compared with several advanced LLMs in terms of cost and HLE accuracy

How to train an orchestrator?

To train an orchestrator for your own purposes following the ToolOrchestra method, you’ll need a model, some data, and our training code.

To show how little is required to build an orchestrator for difficult tasks, such as the hard benchmarks we tested Orchestrator-8B on, we used Qwen3-8B as our underlying model, generated only 552 synthetic problems, and used only 1,296 prompts in training.

Step 1: Select the underlying model

The choice of model to train into an efficient orchestrator is entirely up to you. We recommend you choose the smallest language model aligned with the nature of your agent. NVIDIA Nemotron Nano, the Qwen 3 family, or the xLAM family are just a few of the options.

Step 2: Prepare and generate data

The good news about the data for ToolOrchestra is that you don’t need much to get started. The tool assumes that much of the data will be synthetically generated. We describe the data generation process in detail in our paper. In broad terms, you’ll want to start with a description or a few examples of your agent solving problems with its preferred tools. Using large models, you can then generate many more similar synthetic tasks.

The following is a sketch of the code that can be used to generate samples similar to those used to train Orchestrator-8B.

def generate_samples(domain):
    # Each step below stands in for a model-driven generation stage.
    subjects = generate_subjects(domain)
    schema = generate_schema(subjects)
    data_model = generate_datamodel(schema)
    database = generate_database(domain, schema, data_model)
    tools = generate_tools(domain, database)
    tasks = generate_tasks(database, tools)
    return tasks

samples = generate_samples(domain)  # pass the domain your agent operates in
...

You can jump right in and experience the real data-generation magic.
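To see the data flow end to end, here is a toy, fully offline stand-in for that pipeline. In ToolOrchestra each of these functions would call a large LLM; the deterministic stubs below (a hypothetical “retail” domain, a two-row order database) exist only to illustrate the interfaces of the sketch above.

```python
def generate_subjects(domain):
    # In practice: ask an LLM for subject areas within the domain.
    return [f"{domain}: order lookup", f"{domain}: refund status"]

def generate_schema(subjects):
    # In practice: derive a schema from the subjects with an LLM.
    return {"order_id": "str", "status": "str"}

def generate_datamodel(schema):
    # A row is just a plain dict in this toy version.
    return dict

def generate_database(domain, schema, data_model):
    # In practice: populate the schema with LLM-generated records.
    return [data_model(order_id="A1", status="shipped"),
            data_model(order_id="B2", status="pending")]

def generate_tools(domain, database):
    # In practice: synthesize tool implementations over the database.
    def lookup_order(order_id):
        return next((r for r in database if r["order_id"] == order_id), None)
    return {"lookup_order": lookup_order}

def generate_tasks(database, tools):
    # Pair each record with a question answerable via the tools.
    return [{"prompt": f"What is the status of order {row['order_id']}?",
             "answer": row["status"]} for row in database]

def generate_samples(domain):
    subjects = generate_subjects(domain)
    schema = generate_schema(subjects)
    data_model = generate_datamodel(schema)
    database = generate_database(domain, schema, data_model)
    tools = generate_tools(domain, database)
    return generate_tasks(database, tools)

samples = generate_samples("retail")
```

Each generated sample carries a prompt plus a verifiable answer, which is what makes the trajectories scorable during reinforcement learning.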

Step 3: Start training

Once equipped with your model choice and some data, you can directly use or adapt ToolOrchestra’s released code to train your own orchestrator. This sketch can get you started (more details can be found in the repository README).

# Build the dataset and reward, then launch distributed RL training.
train_dataset = prepare_data(raw_examples, tools)
train_dataloader = DataLoader(train_dataset)
reward_model = RewardManager(config)        # scores outcome, cost, and latency
trainer = RayTrainer(config, reward_model)  # Ray-based distributed trainer
trainer.init_workers()
trainer.start()
...

You can kick off your own training run and watch your orchestrator come to life!

Step 4: Visualize your progress

ToolOrchestra’s training code supports direct logging through wandb. The following shows example visualizations from Orchestrator-8B’s runs.

Side-by-side line charts of training metrics. The left chart shows actor policy gradient loss decreasing and stabilizing around -2.5 over 150 steps. The right chart shows critic mean score increasing and plateauing around 2.0, indicating training convergence and performance improvement.
Figure 3. Training loss and critic score of Orchestrator-8B

The advantages of orchestration

Engineering efficient, high-performance agents today involves a constant struggle to balance capability and cost. Developers must manually weigh every choice (model size, tool use, query length, reasoning depth), knowing that one wrong call can push costs skyward or compromise the quality of the result. This complexity scales unforgivingly as the number of queries that must be engineered grows, making cost-aware agent optimization one of the most difficult and time-intensive aspects of building real-world AI systems.

ToolOrchestra changes that. By training small orchestrators to direct large models and tools with surgical precision and based on need, we automate this balancing act in a way that outperforms monolithic LLMs and prompted orchestrator setups across accuracy, latency, and dollar cost.

Orchestrator-8B is a concrete demonstration that the right strategy can beat brute-force model-size scaling or prompt-engineering dexterity. It delivers state-of-the-art performance on hard benchmarks while using resources far more efficiently. In short, orchestration enables agents to be both powerful and nimble.

Looking ahead: The rise of compound AI systems

The dominant paradigm in AI over the past few years has been that intelligence is first built into large foundation models through training and then specialized for real-world use cases through in-context learning. This belief is increasingly under attack, as the AI community continues to produce more and more examples of compound AI systems that outperform monolithic LLMs while being safer, faster, and cheaper.

ToolOrchestra represents our first step toward fundamentally intelligent compound AI systems as the paradigm emerging to replace AI monoliths. It is further aligned with our long-term position that small language models are ultimately the key to scalable agentic AI.

To learn more:



