Agentic AI for Modern Deep Learning Experimentation

Imagine an agent that reads your metrics, detects anomalies, applies predefined tuning rules, restarts jobs when necessary, and logs every decision, without you watching loss curves at 2 a.m.

In this article, I’ll walk through a lightweight agent designed for deep learning researchers and ML engineers that can:

• Detect failures automatically
• Visually reason over performance metrics
• Apply your predefined hyperparameter strategies
• Relaunch jobs
• Document every action and outcome

No architecture search. No AutoML. No invasive rewrites of your codebase.

The implementation is intentionally minimal: containerize your training script, add a small LangChain-based agent, define hyperparameters in YAML, and express preferences in markdown. You’re probably doing 50% of this already.

Drop this agent into your manual train.py workflow and go from 0️⃣ to 💯 in a single day.

The problem with your existing experiments

🤔 You agonize endlessly over hyperparameters.

▶️ You run train.py.

🐛 You fix the bug in train.py.

🔁 You rerun train.py.

👀 You stare at TensorBoard.

🫠 You question reality.

🔄 You repeat.

Every practicing deep learning/machine learning engineer in the field does this. Don’t be ashamed. Original photo by MART PRODUCTION via Pexels. GIF imagined by Grok.

Stop watching your model spit out numbers

You are not a Jedi. No amount of staring will magically make your loss move in the direction you wish.

Babysitting a model past midnight over a vanishing/exploding gradient in a deep transformer-based network that you can’t track down, and which may never even appear? Also a hard no.

How are you supposed to solve real research problems when most of your time is spent on work that technically needs to be done, yet contributes little or nothing to actual insight?

If 70% of your day is consumed by operational drag, when does the thinking happen?

Shift to agentic-driven experiments

Many of the deep learning engineers and researchers I work with still run experiments manually. A good portion of the day goes to scanning Weights & Biases or TensorBoard for last night’s run, comparing runs, exporting metrics, adjusting hyperparameters, logging notes, restarting jobs. Then repeating the cycle.

It’s dry, tedious, and repetitive work.

We’re going to offload these repetitive tasks so you can shift your focus to high-value work.

The concept of AutoML is, frankly, laughable.

Your agent is not going to make decisions about how to change your network topology or add complex features; that’s your job. It will replace the repetitive glue work that eats valuable time while adding little value.

Agent-Driven Experiments (ADEs)

Switching from manual experiments to an agent-driven workflow is simpler than it initially seems. No rewriting your stack, no heavy systems, no tech debt.

Image by Author

At its core, an ADE requires three steps:

  1. Containerize your existing training script
    • Wrap your current train.py in a Docker container. No refactoring of model logic. No architectural changes. Just a reproducible execution boundary.
  2. Add a lightweight agent
    • Introduce a small LangChain-based script that reads metrics from your dashboard, applies your preferences, and decides when to relaunch, halt, or document. Schedule it with cron or any job scheduler.
  3. Define behavior and preferences with natural language
    • Use a YAML file for configuration and hyperparameters
    • Use a Markdown document to communicate with your agent

That’s the whole system. Now, let’s review each step.

Containerize your training script

One could argue you should be doing this anyway. It makes restarting and scheduling much easier, and, if you later move to a Kubernetes cluster for training, the disruption to your existing process is much lower.

If you’re already doing this, skip to the next section. If not, here’s some helpful code you can use to get started.

First, let’s define a project structure that can work with Docker.

your_experiment/
├── scripts/
│   ├── train.py                 # Main training script
│   └── health_server.py         # Health check server
├── requirements.txt             # Python dependencies
├── Dockerfile                   # Container definition
└── run.sh                       # Script to start training + health check

We want to make sure that your train.py script can load a configuration file from the cloud, allowing the agent to edit it if needed.

I recommend using GitHub for this. Here’s an example of how to read a remote config file. The agent will have a corresponding tool to read and modify this config file.

import os
import requests
import yaml
from box import Box

# add this to `train.py`
GITHUB_RAW = (
    "https://raw.githubusercontent.com/"
    "{owner}/{repo}/{ref}/{path}"
)

def load_config_from_github(owner, repo, path, ref="main", token=None):
    url = GITHUB_RAW.format(owner=owner, repo=repo, ref=ref, path=path)

    headers = {}
    if token:
        headers["Authorization"] = f"Bearer {token}"

    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()

    return Box(yaml.safe_load(r.text))


config = load_config_from_github(...)

# use params throughout your `train.py` script
optimizer = Adam(lr=config.lr)
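
The agent’s modify_config tool needs the matching write path. Below is a minimal sketch of one way to do that via GitHub’s contents API; push_config_to_github is a hypothetical helper, and it assumes a token with write access to the repo.

```python
import base64
import requests

GITHUB_API = "https://api.github.com/repos/{owner}/{repo}/contents/{path}"

def push_config_to_github(owner, repo, path, new_text, token, ref="main"):
    """Commit an updated config file back to the repo."""
    url = GITHUB_API.format(owner=owner, repo=repo, path=path)
    headers = {"Authorization": f"Bearer {token}"}

    # The contents API requires the current blob SHA when updating a file
    current = requests.get(url, headers=headers, params={"ref": ref}, timeout=10)
    current.raise_for_status()

    payload = {
        "message": "agent: update experiment config",
        "content": base64.b64encode(new_text.encode()).decode(),
        "sha": current.json()["sha"],
        "branch": ref,
    }
    requests.put(url, headers=headers, json=payload, timeout=10).raise_for_status()
```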

We also include a health check server to run alongside the main process. This allows container managers, such as Kubernetes, or your agent, to monitor the job’s status without inspecting logs.

If the container’s state changes unexpectedly, it can be automatically restarted. This also simplifies agent inspection, as reading and summarizing log files can be more costly in tokens than simply checking the health of a container.

# health_server.py
import time
from pathlib import Path
from fastapi import FastAPI, Response

app = FastAPI()

HEARTBEAT = Path("/tmp/heartbeat")
STATUS = Path("/tmp/status.json")  # optional richer state
MAX_AGE = 300  # seconds

def last_heartbeat_age():
    if not HEARTBEAT.exists():
        return float("inf")
    return time.time() - float(HEARTBEAT.read_text())

@app.get("/health")
def health():
    age = last_heartbeat_age()

    # stale -> training likely hung
    if age > MAX_AGE:
        return Response("stalled", status_code=500)

    # optional: detect NaNs or failure flags written by trainer
    if STATUS.exists() and "failed" in STATUS.read_text():
        return Response("failed", status_code=500)

    return {"status": "okay", "heartbeat_age": age}

A small shell script, run.sh, starts the health_server process alongside train.py:

#!/bin/bash

# Start the health check server in the background
python scripts/health_server.py &
# Capture its PID so we can terminate it later
HEALTH_PID=$!
# Run the main training script in the foreground
python scripts/train.py
# Stop the health server once training exits
kill $HEALTH_PID

And of course, our Dockerfile, which is built on NVIDIA’s base image so your container can use the host’s accelerator with zero friction. This example is for PyTorch, but you can easily extend it to JAX or TensorFlow if needed.

FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu20.04

RUN apt-get update && apt-get install -y \
    python3 python3-pip git

RUN python3 -m pip install --upgrade pip

# Install PyTorch with CUDA support
RUN pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121

WORKDIR /app

COPY . /app

# Install project dependencies (fastapi, uvicorn, requests, python-box, ...)
RUN pip3 install -r requirements.txt

CMD ["sh", "run.sh"]
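
To sanity-check the image locally, something like the following should work; the image name and health port are assumptions matching the health server above.

```bash
docker build -t vqvae-train .

# Run with GPU access and the health endpoint exposed
docker run -d --gpus all -p 8000:8000 --name vqvae-train vqvae-train

# Should return {"status": "okay", ...} once training is under way
curl http://localhost:8000/health
```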

✅ You’re containerized. Simple and minimal.

Add a lightweight agent

There are many agent frameworks to choose from. For this agent, I like LangChain.

LangChain is a framework for building LLM-driven systems that combine reasoning and execution. It simplifies chaining model calls, managing memory, and integrating external capabilities so your LLM can do more than generate text.

In LangChain, Tools are explicitly defined, schema-bound functions the model can call. Each tool is an idempotent skill or task (e.g., reading a file, querying an API, modifying state).

For our agent to work, we first need to define the tools it can use to achieve our objective.

Tool definitions

  1. read_preferences
    • Reads in user preferences and experiment notes from a markdown document
  2. check_tensorboard
    • Uses Selenium with a Chrome webdriver to screenshot metrics
  3. analyze_metric
    • Uses multimodal LLM reasoning to understand what’s happening in the screenshot
  4. check_container_health
    • Checks our containerized experiment using a health check
  5. restart_container
    • Restarts the experiment if it’s unhealthy or a hyperparameter needs to be modified
  6. modify_config
    • Modifies a remote config file and commits it to GitHub
  7. write_memory
    • Writes a series of actions to a persistent memory (markdown)

This set of tools defines our agent’s operational boundaries. All interaction with our experiment happens through these tools, making behavior controllable and, hopefully, predictable.

Instead of providing these tools inline, here’s a GitHub gist containing all of the tools described above. You can plug these into your agent or modify them as you see fit.
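
To make the shape concrete, here’s a minimal sketch of what one of these tools might look like; the container name and health port are assumptions matching the setup above.

```python
import subprocess

import requests
from langchain.tools import tool

@tool
def check_container_health(container_name: str = "vqvae-train") -> str:
    """Report whether the training container is running and its /health status."""
    # Ask Docker for the container's current state
    proc = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Status}}", container_name],
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        return f"container '{container_name}' not found"
    state = proc.stdout.strip()
    if state != "running":
        return f"container state: {state}"

    # Query the health server running alongside train.py
    try:
        resp = requests.get("http://localhost:8000/health", timeout=5)
        return f"running, health: {resp.text}"
    except requests.RequestException as exc:
        return f"running, but health check failed: {exc}"
```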

The agent

To be quite honest, the first time I tried to grok the official LangChain documentation, I was immediately turned off the idea altogether.

It’s overly verbose and more complex than necessary. If you’re new to agents, or simply don’t want to navigate the labyrinth that is the LangChain documentation, keep reading below.

LangSmith? Random asides? Little tooltips everywhere? I’ll pass on smiting this worthy foe. Imagined by Grok

In a nutshell, this is how LangChain agents work:

Our agent uses a prompt to determine what to do at each step.

Steps are dynamically created by filling in the prompt with the current context and previous outputs. Each LLM call (plus an optional tool invocation) is a step, and its output feeds into the next, forming a chain.

Using this conceptually recursive loop, the agent can reason and carry out the intended action across all the steps required. How many steps it takes depends on the agent’s ability to reason and how clearly the termination condition is defined.

It’s a Lang-chain. Get it? 🤗 
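
If it helps, here’s the loop as conceptual pseudocode, not LangChain’s actual implementation; parse_action stands in for LangChain’s output parsing and is hypothetical.

```python
# Conceptual pseudocode of a ReAct-style agent loop
def run_agent(llm, tools, prompt, task, max_steps=15):
    context = ""
    for _ in range(max_steps):
        # Fill the prompt with the current context and ask for the next step
        decision = llm.invoke(prompt.format(chat_history=context, input=task))
        if "Final Answer" in decision:
            return decision  # termination condition reached
        tool_name, tool_input = parse_action(decision)  # hypothetical parser
        observation = tools[tool_name].run(tool_input)
        context += f"\n{decision}\nObservation: {observation}"
```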

The prompt

As noted, the prompt is the recursive glue that maintains context across LLM and tool invocations. You’ll see placeholders (defined below) filled in when the agent is first initialized.

We use a bit of LangChain’s built-in memory abstractions, included with each tool call. Apart from that, the agent fills in the gaps, deciding both the next step and which tool to call.

For readability, the main prompt is below. You can either plug it directly into the agent script or load it from the filesystem before running.

"You're an experiment automation agent answerable for monitoring 
and maintaining ML experiments.

Current context:
{chat_history}

Your workflow:
1. First, read preferences from preferences.md to understand thresholds and settings
2. Check TensorBoard at the specified URL and capture a screenshot
3. Analyze key metrics (validation loss, training loss, accuracy) from the screenshot
4. Check Docker container health for the training container
5. Take corrective actions based on analysis:
   - Restart unhealthy containers
   - Adjust hyperparameters based on user preferences 
     and anomalous patterns, restarting the experiment if necessary
6. Log all observations and actions to memory

Important guidelines:
- Always read preferences first to get the current configuration
- Use visual analysis to understand metric trends
- Be conservative with config changes (only adjust if clearly needed)
- Write detailed memory entries for future reference
- Check container health before and after any restart
- When modifying config, use appropriate values from preferences

Available tools: {tool_names}
Tool descriptions: {tools}

Current task: {input}

Think step by step and use tools to complete the workflow.
"""

Now, with roughly 100 lines, we have our agent. The agent is initialized, then we define a series of steps. For each step, the current task directive is populated in our prompt, and each tool updates a shared memory instance, ConversationSummaryBufferMemory.

We’re going to use OpenAI for this agent; however, LangChain provides alternatives, including hosting your own model. If cost is an issue, there are open-source models that can be used here.
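
For example, here’s one possible swap to a locally hosted model, assuming the langchain-community package and a running Ollama server:

```python
from langchain_community.chat_models import ChatOllama

# Drop-in replacement for ChatOpenAI in the agent below
llm = ChatOllama(model="llama3", temperature=0.2)
```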

import os
from langchain.agents import AgentExecutor, create_react_agent
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationSummaryBufferMemory

# Import tools from tools.py
from tools import (
    read_preferences,
    check_tensorboard,
    analyze_metric,
    check_container_health,
    restart_container,
    modify_config,
    write_memory
)

PROMPT = open("prompt.txt").read()


class ExperimentAutomation:
    def __init__(self, openai_key=None):
        """Initialize the agent"""
        self.llm = ChatOpenAI(
            temperature=0.8,
            model="gpt-4-turbo-preview",
            api_key=openai_key or os.getenv('OPENAI_API_KEY')
        )

        # Initialize memory for conversation context
        self.memory = ConversationSummaryBufferMemory(
            llm=self.llm,
            max_token_limit=32000,
            memory_key="chat_history",
            return_messages=True
        )

    def create_agent(self):
        """Create LangChain agent with imported tools"""
        # create_react_agent expects Tool objects rather than bare lambdas,
        # so the functions in tools.py are assumed to be @tool-decorated;
        # shared context flows through self.memory via the AgentExecutor.
        tools = [
            read_preferences,
            check_tensorboard,
            analyze_metric,
            check_container_health,
            restart_container,
            modify_config,
            write_memory,
        ]

        # Create the prompt template
        prompt = PromptTemplate.from_template(PROMPT)

        agent = create_react_agent(
            llm=self.llm,
            tools=tools,
            prompt=prompt
        )

        # Create agent executor with memory
        return AgentExecutor(
            agent=agent,
            tools=tools,
            memory=self.memory,
            verbose=True,
            max_iterations=15,
            handle_parsing_errors=True,
            return_intermediate_steps=True
        )

    def run_automation_cycle(self):
        """Execute the total automation cycle step-by-step"""
        write_memory(
            entry="Automation cycle began",
            category="SYSTEM",
            memory=self.memory
        )

        try:
            agent = self.create_agent()

            # Define the workflow as individual steps
            workflow_steps = [
                "Read preferences from preferences.md to capture thresholds and settings",
                "Check TensorBoard at the specified URL and capture a screenshot",
                "Analyze validation loss, training loss, and accuracy from the screenshot",
                "Check Docker container health for the training container",
                "Restart unhealthy containers if needed",
                "Adjust hyperparameters according to preferences and restart container if necessary",
                "Write all observations and actions to memory"
            ]

            # Execute each step individually
            for step in workflow_steps:
                result = agent.invoke({"input": step})

                # Write step output to memory
                if result.get("output"):
                    memory_summary = f"Step: {step}\nOutput: {result['output']}"
                    write_memory(entry=memory_summary, category="STEP", memory=self.memory)

            write_memory(
                entry="Automation cycle accomplished successfully",
                category="SYSTEM",
                memory=self.memory
            )

            return result

        except Exception as e:
            error_msg = f"Automation cycle failed: {str(e)}"
            write_memory(entry=error_msg, category="ERROR", memory=self.memory)
            raise


def main():
    try:
        automation = ExperimentAutomation(openai_key=os.environ["OPENAI_API_KEY"])
        result = automation.run_automation_cycle()

        if result.get('output'):
            print(f"\nFinal Output:\n{result['output']}")

        if result.get('intermediate_steps'):
            print(f"\nSteps Executed: {len(result['intermediate_steps'])}")

        print("\n✓ Automation cycle completed successfully")

    except Exception as e:
        print(f"n✗ Automation failed: {e}")
        write_memory(entry=f"Critical failure: {str(e)}", category="ERROR")
        import sys
        sys.exit(1)


if __name__ == "__main__":
    main()

Now that we have our agent and tools, let’s discuss how we actually express our intent as a researcher, the most important piece.

Define behavior and preferences with natural language

As described, defining what we’re looking for when we start an experiment is essential to getting the right behavior from an agent.

Although image reasoning models have come quite far and have a good bit of context, they still have a way to go before they can understand what a good policy loss curve looks like in Hierarchical Policy Optimization, or what the perplexity of the codebook should look like in a Vector Quantized Variational Autoencoder, something I’ve been optimizing over the past week.

For this, we initialize any automated reasoning with a preferences.md.

Let’s start with some general settings:

# Experiment Preferences

This file defines my preferences for this experiment.
The agent should always read this first before taking any action.

---

## General Settings

- experiment_name: vqvae
- container_name: vqvae-train
- tensorboard_url: http://localhost:6006
- memory_file: memory.md
- maximum_adjustments_per_run: 4
---
## More details
You can always add more sections here. The read_preferences task will parse
and reason over each section.

Now, let’s define metrics of interest. This is especially important in the case of visual reasoning.

Within the markdown document, define YAML blocks that will be parsed by the agent using the read_preferences tool. Adding this bit of structure is useful for passing preferences as arguments to other tools.

```yaml
metrics:
  - name: perplexity
    pattern: should remain high over the course of training
    restart_condition: premature collapse to zero
    hyperparameters: |
        if collapse, increase `perplexity_weight` from current value to 0.2
  - name: prediction_loss
    pattern: should decrease over the course of training
    restart_condition: increases or stalls
    hyperparameters: |
        if increases, increase the `prediction_weight` value from current to 0.4
  - name: codebook_usage
    pattern: should remain fixed at > 90%
    restart_condition: drops below 90% for many epochs
    hyperparameters: |
        decrease the `codebook_size` param from 512 to 256. 

```
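
Here’s a minimal sketch of how read_preferences might pull these blocks out; parse_preference_blocks is a hypothetical helper matching the file above.

```python
import re
import yaml

FENCE = "`" * 3  # the metric blocks are fenced with triple backticks

def parse_preference_blocks(md_path="preferences.md"):
    """Extract and parse every fenced yaml block from the preferences file."""
    text = open(md_path).read()
    pattern = FENCE + r"yaml\n(.*?)" + FENCE
    blocks = re.findall(pattern, text, re.DOTALL)
    return [yaml.safe_load(b) for b in blocks]

# e.g., parse_preference_blocks()[0]["metrics"][0]["name"] -> "perplexity"
```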

The key idea is that preferences.md should provide enough structured and descriptive detail so the agent can:

• Compare its analysis against your intent, e.g., if the agent sees validation loss = 0.6 but preferences say val_loss_threshold should be 0.5, it knows what the corrective action should be

• Read the thresholds and constraints (YAML or key-value) for metrics, hyperparameters, and container management

• Understand intent patterns described in human-readable sections, like “only adjust learning rate if validation loss exceeds threshold and accuracy is stagnating.”

Wiring it all together

Now that we have a containerized experiment and an agent, we need to schedule the agent. This is as simple as running the agent process via a cron task. The entry below runs our agent once every hour, trading off cost (in tokens) against operational efficiency.

0 * * * * /usr/bin/python3 /path/to/agent.py >> /var/log/agent.log 2>&1

I’ve found that this agent doesn’t need the latest reasoning model and performs fine with previous generations of models from Anthropic and OpenAI.

Wrapping up

If research time is finite, it should be spent on research, not babysitting experiments.

Your agent should handle monitoring, restarts, and parameter adjustments without constant supervision. When the drag disappears, what remains is the actual work: forming hypotheses, designing better models, and testing ideas that matter.

Hopefully, this agent will free you up a bit to dream up the next big idea. Enjoy.

