Data scientists spend significant time cleaning and preparing large, unstructured datasets before analysis can begin, often requiring strong programming and statistical expertise. Managing feature engineering, model tuning, and consistency across workflows is complex and error-prone. These challenges are amplified by the slow, sequential nature of CPU-based ML workflows, which makes experimentation and iteration painfully inefficient.
Accelerated data science ML agent
We prototyped a data science agent that can interpret user intent and orchestrate repetitive tasks in an ML workflow to simplify data science and ML experimentation. With GPU acceleration, the agent can process datasets with tens of millions of samples using NVIDIA CUDA-X Data Science libraries. It showcases NVIDIA Nemotron Nano-9B-v2, a compact, powerful open-source language model designed to translate the intent of a data scientist into an optimized workflow.
With this setup, developers can explore large datasets, train models, and evaluate results just by chatting with the agent. It bridges the gap between natural language and high-performance computing, enabling users to go from raw data to business insights in minutes. We encourage you to use this as a starting point to build your own agent with different LLMs, tools, and storage solutions tailored to your specific needs. Explore the Python scripts for this agent on GitHub.
Data science agent orchestration
The agent’s architecture is designed for modularity, scalability, and GPU acceleration. It’s organized into five core layers and one temporary data store that work together to translate natural language prompts into executable data processing and ML workflows. Figure 1 shows the high-level workflow of how each layer interacts.


Let’s take a closer look at how the layers work together.
Layer 1: User interface
The user interface is a Streamlit-based conversational chatbot that lets users interact with the agent in plain English.
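To make this concrete, here is a minimal sketch of a Streamlit chat loop, assuming a simple agent object with a chat() method; the class and method names are illustrative placeholders, not the actual interface in user_interface.py.

```python
# Minimal Streamlit chat UI sketch (hypothetical names; see user_interface.py for the real implementation)
import streamlit as st


class EchoAgent:
    """Stand-in for the real orchestrator; replace with the agent from the repo."""
    def chat(self, prompt: str) -> str:
        return f"(agent response to: {prompt})"


agent = EchoAgent()
st.title("Data Science Agent")

# Keep the conversation in session state so it survives Streamlit reruns
if "messages" not in st.session_state:
    st.session_state.messages = []

for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

if prompt := st.chat_input("Ask about your dataset..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)
    reply = agent.chat(prompt)  # placeholder for the orchestrator call
    st.session_state.messages.append({"role": "assistant", "content": reply})
    with st.chat_message("assistant"):
        st.markdown(reply)
```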
Layer 2: Agent orchestrator
This is the central controller that coordinates all the layers. It interprets user prompts, delegates execution to the LLM for intent understanding, calls the appropriate GPU-accelerated functions from the tool layer, and responds in natural language. Each orchestrator method is a lightweight wrapper around a GPU function; for instance, _describe_data calls basic_eda(), while _optimize_ridge calls optimize_ridge_regression().
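The sketch below illustrates this wrapper pattern; the method names mirror those mentioned above, but the constructor and function signatures are assumptions rather than the repository's actual API.

```python
# Hypothetical sketch of the orchestrator-to-tool delegation pattern
class AgentOrchestrator:
    def __init__(self, tools, llm_client, experiment_store):
        self.tools = tools              # tool layer: GPU-accelerated functions
        self.llm = llm_client           # LLM layer: intent understanding
        self.store = experiment_store   # memory layer: experiment metadata

    def _describe_data(self, df):
        """Lightweight wrapper: an EDA request delegates to basic_eda() in the tool layer."""
        return self.tools.basic_eda(df)

    def _optimize_ridge(self, df, target, n_trials=50):
        """Lightweight wrapper: a ridge HPO request delegates to optimize_ridge_regression()."""
        result = self.tools.optimize_ridge_regression(df, target, n_trials=n_trials)
        self.store.log_experiment(result)  # persist metrics so show_history() can retrieve them
        return result
```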


Layer 3: LLM layer
The LLM layer serves as the reasoning engine of the agent, initializing the language model client to communicate with Nemotron Nano-9B-v2 using the NVIDIA NIM API. This layer enables the agent to interpret natural language inputs and translate them into structured, executable actions through four key mechanisms: the LLM model, a retry strategy for resilient communication, function calling for structured tool invocation, and a function calling window.
- LLM model
The LLM layer architecture is LLM-agnostic and can work with any language model that supports function calling. For this application, we used Nemotron Nano-9B-v2, which supports both function calling and advanced reasoning. Being smaller in size, the model offers an optimal balance between efficiency and capability, and can be deployed on a single GPU for inference. It delivers up to 6x higher token generation throughput than other leading models in its size class, while the thinking budget feature lets developers control how many “thinking” tokens are used, reducing reasoning costs by up to 60%. This combination of performance and cost efficiency enables real-time conversational workflows that are economically viable for production deployment.
- Retry strategy for resilient communication
The LLM client implements an exponential backoff retry mechanism to handle transient network failures and API rate limits, ensuring reliable communication even under adverse network conditions or high API load.
- Function calling for structured tool invocation
Function calling bridges natural language and code execution by enabling the LLM to translate user intent into structured tool invocations in the agent orchestrator. The agent defines available tools using OpenAI-compatible function schemas that specify each tool’s name, purpose, parameters, and constraints (an illustrative schema is sketched after this list).
- Function calling window
Function calling transforms the LLM from a text generator into a reasoning engine capable of API orchestration. The model, Nemotron Nano-9B-v2, is supplied with a structured “API specification” of the available tools, which it uses to understand user intent, select appropriate functions, extract parameters with the proper types, and coordinate multi-step data processing and ML operations. All of this happens through natural language, eliminating the need for users to understand API syntax or write code. The complete function-calling flow in Figure 3 shows how natural language is transformed into executable code. Refer to the chat_agent.py and llm.py scripts in the GitHub code for the operations listed in Figure 3.
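As an illustration of what an OpenAI-compatible function schema looks like, here is a hedged sketch for a ridge-optimization tool; the tool name and parameters are assumptions for illustration, not the exact schemas defined in chat_agent.py.

```python
# Illustrative OpenAI-compatible tool schema (names and parameters are assumptions)
RIDGE_HPO_TOOL = {
    "type": "function",
    "function": {
        "name": "optimize_ridge_regression",
        "description": "Run GPU-accelerated hyperparameter optimization for ridge regression.",
        "parameters": {
            "type": "object",
            "properties": {
                "target_column": {
                    "type": "string",
                    "description": "Name of the column to predict.",
                },
                "n_trials": {
                    "type": "integer",
                    "description": "Number of HPO trials to run.",
                    "default": 50,
                },
            },
            "required": ["target_column"],
        },
    },
}

# The schema is passed with the chat completion request so the model can emit
# a structured tool invocation instead of free-form text, for example:
# response = client.chat.completions.create(
#     model="nvidia/nvidia-nemotron-nano-9b-v2",  # model ID may differ; check the NIM API catalog
#     messages=messages,
#     tools=[RIDGE_HPO_TOOL],
# )
```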


Layer 4: Memory layer
The memory layer (ExperimentStore) stores experiment metadata, including model configurations, performance metrics, and evaluation results such as accuracy and F1 scores. This metadata is saved in standard JSONL format in a session-specific file, allowing for in-session tracking and retrieval via functions like get_recent_experiments() and show_history().
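The following is a minimal sketch of a JSONL-backed store; apart from get_recent_experiments(), the method names and file layout are illustrative assumptions rather than the repository's exact implementation.

```python
# Minimal JSONL-backed experiment store sketch (illustrative, not the repo's exact class)
import json
from pathlib import Path


class ExperimentStore:
    def __init__(self, session_id: str, base_dir: str = "."):
        # One metadata file per session
        self.path = Path(base_dir) / f"experiments_{session_id}.jsonl"

    def log_experiment(self, record: dict) -> None:
        """Append one experiment's metadata (model config, metrics) as a JSON line."""
        with self.path.open("a") as f:
            f.write(json.dumps(record) + "\n")

    def get_recent_experiments(self, n: int = 5) -> list[dict]:
        """Return the n most recent experiments for in-session retrieval."""
        if not self.path.exists():
            return []
        with self.path.open() as f:
            records = [json.loads(line) for line in f if line.strip()]
        return records[-n:]
```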
Layer 5: Temporary data storage
The temporary data storage layer stores session-specific output files (best_model.joblib and predictions.csv) in the system’s temporary directory and surfaces them in the user interface for immediate download and use. These files are automatically deleted when the agent shuts down.
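A brief sketch of this pattern, assuming a session-scoped directory with cleanup on exit; the directory prefix and cleanup hook are illustrative, and the real implementation may differ.

```python
# Session-scoped temporary artifacts (illustrative pattern, not the repo's exact code)
import atexit
import shutil
import tempfile

# Create a session-specific directory under the system temp location
session_dir = tempfile.mkdtemp(prefix="ds_agent_")

best_model_path = f"{session_dir}/best_model.joblib"
predictions_path = f"{session_dir}/predictions.csv"

# Remove all session artifacts when the agent process exits
atexit.register(shutil.rmtree, session_dir, ignore_errors=True)
```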
Layer 6: Tool layer
The tool layer is the computational core of the agent and is responsible for executing data science functions such as data loading, exploratory data analysis (EDA), model training and evaluation, and hyperparameter optimization (HPO). The function chosen for execution depends on the user query. Several optimization strategies are used, including:
- Consistency and Repeatability
The agent uses abstractions from scikit-learn (a popular open-source library) to ensure consistent data preprocessing and model training across training, testing, and production environments. This design prevents common ML pitfalls such as data leakage and inconsistent preprocessing by automatically applying the exact same transformations (imputation values, scaling parameters, and encoding mappings) learned during training to all inference data; a minimal pipeline sketch follows this list.
- Memory Management
To handle large datasets, we use several memory optimization strategies: Float32 conversion reduces memory use, GPU memory management releases cached GPU memory, and dense output configuration is faster on GPUs compared to sparse formats.
- Function Execution
The tool execution agent uses CUDA-X Data Science libraries such as cuDF and cuML to deliver GPU-accelerated performance while maintaining the same syntax as pandas and scikit-learn. This zero-code-change acceleration is achieved through Python’s module preloading mechanism, enabling developers to run existing CPU code on GPUs without refactoring. The cudf.pandas accelerator replaces pandas operations with GPU equivalents, while cuml.accel automatically substitutes scikit-learn models with cuML’s GPU implementations.
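To illustrate the consistency point, here is a minimal scikit-learn pipeline sketch, assuming pipelines are the abstraction in use; the column names and estimator choice are hypothetical. When launched under cuml.accel, compatible estimators in such a pipeline are transparently replaced by cuML's GPU implementations.

```python
# Minimal preprocessing + model pipeline sketch (assumes pipelines are the abstraction used)
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # imputation values learned on training data only
    ("scale", StandardScaler()),                   # scaling parameters learned on training data only
])
categorical_prep = OneHotEncoder(handle_unknown="ignore", sparse_output=False)  # dense output is faster on GPU

preprocess = ColumnTransformer([
    ("num", numeric_prep, ["num_col_a", "num_col_b"]),  # hypothetical column names
    ("cat", categorical_prep, ["cat_col"]),
])

model = Pipeline([("prep", preprocess), ("ridge", Ridge(alpha=1.0))])

# fit() learns the transformations on training data; predict() reapplies the exact
# same transformations to inference data, preventing leakage and drift, e.g.:
# model.fit(X_train, y_train); preds = model.predict(X_test)
```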
The following command launches the Streamlit interface with GPU acceleration enabled for both the data processing and machine learning components:
```bash
python -m cudf.pandas -m cuml.accel -m streamlit run user_interface.py
```
Acceleration, modularity, and extension of the ML agent
The agent is built with a modular design for easy extension through new function calls, experiment stores, LLM integrations, and other enhancements. Its layered architecture supports the incorporation of additional capabilities over time. Out of the box, it includes support for popular machine learning algorithms, exploratory data analysis (EDA), and hyperparameter optimization (HPO).
Using CUDA-X Data Science libraries, the agent accelerates data processing and machine learning workflows end to end. This GPU-based acceleration delivers performance gains ranging from 3x to 43x, depending on the specific operation. Table 1 highlights the speedups achieved across several key tasks, including ML operations, data processing, and HPO.
| Agent Task | CPU (sec) | GPU (sec) | Speedup | Details |
| --- | --- | --- | --- | --- |
| Classification ML task | 21,410 | 6,886 | ~3x | Using logistic regression, random forest classification, and linear support vector classification with 1 million samples |
| Regression ML task | 57,040 | 8,947 | ~6x | Using ridge regression, random forest regression, and linear support vector regression with 1 million samples |
| Hyperparameter optimization for ML algorithm | 18,447 | 906 | ~20x | cuBLAS-accelerated matrix operations (QR decomposition, SVD) dominate; the regularization path is computed in parallel |
Get started with Nemotron models and CUDA-X Data Science libraries
The open-source data science agent is available on GitHub and ready to integrate with your datasets for end-to-end ML experimentation. Download the agent and let us know which datasets you tried, how much speedup you achieved, and what customizations you made.
Learn more:
