Picture an agent that reads your metrics, detects anomalies, applies predefined tuning rules, restarts jobs when necessary, and logs every decision, all without you watching loss curves at 2 a.m.
In this article, I’ll walk through a lightweight agent designed for deep learning researchers and ML engineers that can:
• Detect failures automatically
• Visually reason over performance metrics
• Apply your predefined hyperparameter strategies
• Relaunch jobs
• Document every action and final result
No architecture search. No AutoML. No invasive rewrites of your codebase.
The implementation is intentionally minimal: containerize your training script, add a small LangChain-based agent, define hyperparameters in YAML, and express preferences in markdown. You’re probably doing 50% of this already.
Drop this agent into your manual train.py workflow and go from 0️⃣ to 💯 in a single day.
The problem with your existing experiments
🤔 You endlessly ponder hyperparameters.
▶️ You run train.py.
🐛 You fix the bug in train.py.
🔁 You rerun train.py.
👀 You stare at TensorBoard.
🫠 You question reality.
🔄 You repeat.
Stop watching your model spit out numbers
You are not a Jedi. No amount of staring will magically make your loss curve move in the direction you want.
Babysitting a model past midnight, hunting for a vanishing or exploding gradient in a deep transformer network that you can’t track down, and which may never even appear? Also a hard no.
How are you supposed to solve real research problems when most of your time is spent on work that technically needs to be done, yet contributes little or nothing to actual insight?
If 70% of your day is consumed by operational drag, when does the thinking happen?
Shift to agent-driven experiments
Many of the deep learning engineers and researchers I work with still run experiments manually. A good portion of the day goes to: scanning Weights & Biases or TensorBoard for last night’s run, comparing runs, exporting metrics, adjusting hyperparameters, logging notes, restarting jobs. Then repeating the cycle.
It’s dry, tedious, and repetitive work.
We’re going to offload these repetitive tasks so you can shift your focus to high-value work.
The concept of AutoML is, frankly, laughable.
Your agent is not going to make decisions about how to change your network topology or add complex features; that’s your job. It is going to replace the repetitive glue work that eats valuable time with little added value.
Agent Driven Experiments (ADEs)
Switching from manual experiments to an agent-driven workflow is simpler than it initially seems. No rewriting your stack, no heavy systems, no tech debt.

At its core, an ADE requires three steps:
- Containerize your existing training script
  - Wrap your current train.py in a Docker container. No refactoring of model logic. No architectural changes. Only a reproducible execution boundary.
- Add a lightweight agent
  - Introduce a small LangChain-based script that reads metrics from your dashboard, applies your preferences, decides when to relaunch, halt, or document, and is scheduled with cron or any job scheduler.
- Define behavior and preferences with natural language
  - Use a YAML file for configuration and hyperparameters.
  - Use a Markdown document to communicate with your agent.
That’s the whole system. Now, let’s review each step.
Containerize your training script
One could argue you should be doing this anyway. It makes restarting and scheduling much easier, and, if you move to a Kubernetes cluster for training, the disruption to your existing process is far lower.
If you’re already doing this, skip to the next section. If not, here’s some helpful code you can use to get started.
First, let’s define a project structure that can work with Docker.
your-experiment/
├── scripts/
│   ├── train.py             # Main training script
│   └── health_server.py     # Health check server
├── requirements.txt         # Python dependencies
├── Dockerfile               # Container definition
└── run.sh                   # Script to start training + health check
We want to make sure that your train.py script can load a configuration file from the cloud, allowing the agent to edit it if needed.
I recommend using GitHub for this. Here’s an example of how to read a remote config file. The agent will have a corresponding tool to read and modify this config file.
import os
import requests
import yaml
from box import Box

# add this to `train.py`
GITHUB_RAW = (
    "https://raw.githubusercontent.com/"
    "{owner}/{repo}/{ref}/{path}"
)

def load_config_from_github(owner, repo, path, ref="main", token=None):
    url = GITHUB_RAW.format(owner=owner, repo=repo, ref=ref, path=path)
    headers = {}
    if token:
        # a token is only needed for private repos
        headers["Authorization"] = f"Bearer {token}"
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()
    return Box(yaml.safe_load(r.text))

config = load_config_from_github(...)

# use params throughout your `train.py` script
optimizer = Adam(lr=config.lr)
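On the agent side, the corresponding modify_config tool has to commit changes back to the repo. Here’s a hedged sketch using GitHub’s contents API; the helper name and commit message are assumptions, not the gist’s exact code. Note that updating a file requires its current blob SHA:

```python
import base64
import requests

GITHUB_API = "https://api.github.com/repos/{owner}/{repo}/contents/{path}"

def commit_config(owner, repo, path, new_yaml_text, token, branch="main"):
    """Update a config file in the repo via GitHub's contents API."""
    url = GITHUB_API.format(owner=owner, repo=repo, path=path)
    headers = {"Authorization": f"Bearer {token}"}

    # Fetch the current file to obtain its blob SHA (required for updates)
    current = requests.get(url, headers=headers,
                           params={"ref": branch}, timeout=10)
    current.raise_for_status()

    body = {
        "message": "agent: update hyperparameters",  # assumed message
        "content": base64.b64encode(new_yaml_text.encode()).decode(),
        "sha": current.json()["sha"],
        "branch": branch,
    }
    r = requests.put(url, headers=headers, json=body, timeout=10)
    r.raise_for_status()
    return r.json()["commit"]["sha"]
```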
We also include a health check server to run alongside the main process. This lets container managers, such as Kubernetes, or your agent, monitor the job’s status without inspecting logs.
If the container’s state changes unexpectedly, it can be automatically restarted. This also simplifies agent inspection, since reading and summarizing log files can be more costly in tokens than simply checking the health of a container.
# health_server.py
import time
from pathlib import Path

from fastapi import FastAPI, Response

app = FastAPI()

HEARTBEAT = Path("/tmp/heartbeat")
STATUS = Path("/tmp/status.json")  # optional richer state
MAX_AGE = 300  # seconds

def last_heartbeat_age():
    if not HEARTBEAT.exists():
        return float("inf")
    return time.time() - float(HEARTBEAT.read_text())

@app.get("/health")
def health():
    age = last_heartbeat_age()
    # stale -> training likely hung
    if age > MAX_AGE:
        return Response("stalled", status_code=500)
    # optional: detect NaNs or failure flags written by the trainer
    if STATUS.exists() and "failed" in STATUS.read_text():
        return Response("failed", status_code=500)
    return {"status": "ok", "heartbeat_age": age}

if __name__ == "__main__":
    # serve the endpoint so run.sh can launch this script directly
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8080)
A small shell script, run.sh, starts the health_server process alongside train.py:
#!/bin/bash
# Start the health check server in the background
python scripts/health_server.py &

# Capture its PID in case you want to terminate it later
HEALTH_PID=$!

# Start the main training script
python scripts/train.py
And of course, our Dockerfile, which is built on NVIDIA’s base image so your container can use the host’s accelerator with zero friction. This example is for PyTorch, but you can easily extend it to JAX or TensorFlow if needed.
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu20.04

# avoid interactive prompts (e.g., tzdata) during image builds
ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    python3 python3-pip git

RUN python3 -m pip install --upgrade pip

# Install PyTorch with CUDA support
RUN pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121

WORKDIR /app

# Install project dependencies first for better layer caching
COPY requirements.txt /app/
RUN pip3 install -r requirements.txt

COPY . /app

CMD ["sh", "run.sh"]
✅ You’re containerized. Easy and minimal.
Add a lightweight agent
There are many agent frameworks to choose from. For this agent, I like LangChain.
LangChain is a framework for building LLM-driven systems that combine reasoning and execution. It simplifies chaining model calls, managing memory, and integrating external capabilities so your LLM can do more than generate text.
In LangChain, Tools are explicitly defined, schema-bound functions the model can call. Each tool is an idempotent skill or task (e.g., reading a file, querying an API, modifying state).
For our agent to work, we first need to define the tools it can use to achieve our objective.
Tool definitions
- read_preferences
  - Reads user preferences and experiment notes from a markdown document
- check_tensorboard
  - Uses Selenium with a Chrome webdriver to screenshot metrics
- analyze_metric
  - Uses multimodal LLM reasoning to understand what’s happening in the screenshot
- check_container_health
  - Checks our containerized experiment using its health endpoint
- restart_container
  - Restarts the experiment if it is unhealthy or a hyperparameter must be modified
- modify_config
  - Modifies a remote config file and commits it to GitHub
- write_memory
  - Writes a series of actions to persistent memory (markdown)
This set of tools defines our agent’s operational boundaries. All interaction with our experiment flows through these tools, making behavior controllable and, hopefully, predictable.
Instead of providing all of these tools inline, here’s a GitHub gist containing the tools described above. You can plug these into your agent or modify them as you see fit.
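To make the shape concrete, here are hedged sketches of two of these tools, check_tensorboard and analyze_metric. The headless Chrome flags, the screenshot path, and the gpt-4o model choice are assumptions, not the gist’s exact code; each function accepts a memory kwarg so the agent script below can bind shared state:

```python
# Hypothetical sketches, not the gist's exact code.
import base64
import time

from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def check_tensorboard(url: str = "http://localhost:6006",
                      out_path: str = "/tmp/tensorboard.png",
                      memory=None) -> str:
    """Screenshot the TensorBoard dashboard for later visual analysis."""
    opts = Options()
    opts.add_argument("--headless=new")
    opts.add_argument("--window-size=1600,1000")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        time.sleep(5)  # give the dashboard time to render its charts
        driver.save_screenshot(out_path)
    finally:
        driver.quit()
    return out_path

def analyze_metric(image_path: str = "/tmp/tensorboard.png",
                   memory=None) -> str:
    """Ask a multimodal model to describe metric trends in the screenshot."""
    b64 = base64.b64encode(open(image_path, "rb").read()).decode()
    llm = ChatOpenAI(model="gpt-4o", max_tokens=500)  # model choice is an assumption
    msg = HumanMessage(content=[
        {"type": "text",
         "text": "Describe the trend of each metric chart in this dashboard."},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ])
    return llm.invoke([msg]).content
```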
The agent
To be quite honest, the first time I tried to grok the official LangChain documentation, I was immediately turned off of the idea altogether.
It’s overly verbose and more complex than necessary. If you’re new to agents, or simply don’t want to navigate the labyrinth that is the LangChain documentation, please continue reading below.

In a nutshell, this is how LangChain agents work:
Our agent uses a prompt to decide what to do at each step.
Steps are dynamically created by filling in the prompt with the current context and previous outputs. Each LLM call [+ optional tool invocation] is a step, and its output feeds into the next, forming a chain.
Using this conceptually recursive loop, the agent can reason its way to the correct action across all the steps required. How many steps that takes depends on the agent’s ability to reason and how clearly the termination condition is defined.
It’s a Lang-chain. Get it? 🤗
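Concretely, each step of that loop follows LangChain’s ReAct format: the model emits a Thought and an Action, the executor runs the named tool, and the Observation is fed back into the prompt. A hypothetical trace for our agent:

```
Thought: I need the user's thresholds before touching anything.
Action: read_preferences
Action Input: preferences.md
Observation: perplexity should remain high; restart on collapse to zero ...
Thought: Now capture the current dashboard state.
Action: check_tensorboard
Action Input: http://localhost:6006
Observation: /tmp/tensorboard.png
...
Final Answer: No anomalies detected; no corrective action taken.
```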
The prompt
As noted, the prompt is the recursive glue that maintains context across LLM and tool invocations. You’ll see placeholders (defined below) that are filled in when the agent is first initialized, plus the {agent_scratchpad} slot that LangChain’s ReAct agent requires in order to accumulate intermediate Thought/Action/Observation steps.
We use a bit of LangChain’s built-in memory abstractions, included with each tool call. Apart from that, the agent fills in the gaps, deciding both the next step and which tool to call.
For readability, the main prompt is below. You can either plug it directly into the agent script or load it from the filesystem before running.
"You're an experiment automation agent answerable for monitoring
and maintaining ML experiments.
Current context:
{chat_history}
Your workflow:
1. First, read preferences from preferences.md to know thresholds and settings
2. Check TensorBoard at the desired URL and capture a screenshot
3. Analyze key metrics (validation loss, training loss, accuracy) from the screenshot
4. Check Docker container health for the training container
5. Take corrective actions based on evaluation:
- Restart unhealthy containers
- Adjust hyperparameters based on user preferences
and anomalous patterns, restarting the experiment if essential
6. Log all observations and actions to memory
Essential guidelines:
- All the time read preferences first to get current configuration
- Use visual evaluation to know metric trends
- Be conservative with config changes (only adjust if clearly needed)
- Write detailed memory entries for future reference
- Check container health before and after any restart
- When modifying config, use appropriate values from preferences
Available tools: {tool_names}
Tool descriptions: {tools}
Current task: {input}
Think step-by-step and use tools to finish the workflow.
"""
Now, with ~100ish lines, we have our agent. The agent is initialized, then we define a series of steps. For each step, the current task is populated into the prompt’s {input} slot, and each tool updates a shared ConversationSummaryBufferMemory instance.
We’re going to use OpenAI for this agent; however, LangChain provides alternatives, including hosting your own model. If cost is a concern, there are open-source models that can be used here.
import os
import sys
from functools import partial

from langchain.agents import AgentExecutor, Tool, create_react_agent
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationSummaryBufferMemory

# Import tools from tools.py
from tools import (
    read_preferences,
    check_tensorboard,
    analyze_metric,
    check_container_health,
    restart_container,
    modify_config,
    write_memory
)

PROMPT = open("prompt.txt").read()

class ExperimentAutomation:
    def __init__(self, openai_key=None):
        """Initialize the agent"""
        self.llm = ChatOpenAI(
            temperature=0.8,
            model="gpt-4-turbo-preview",
            api_key=openai_key or os.getenv('OPENAI_API_KEY')
        )
        # Initialize memory for conversation context
        self.memory = ConversationSummaryBufferMemory(
            llm=self.llm,
            max_token_limit=32000,
            memory_key="chat_history",
            return_messages=True
        )

    def create_agent(self):
        """Create a LangChain agent with the imported tools"""
        # Wrap each plain function from tools.py as a named Tool,
        # binding the shared memory so every call can read and
        # append context
        fns = [
            read_preferences,
            check_tensorboard,
            analyze_metric,
            check_container_health,
            restart_container,
            modify_config,
            write_memory,
        ]
        tools = [
            Tool(
                name=fn.__name__,
                func=partial(fn, memory=self.memory),
                description=fn.__doc__ or fn.__name__,
            )
            for fn in fns
        ]

        # Create the prompt template
        prompt = PromptTemplate.from_template(PROMPT)

        agent = create_react_agent(
            llm=self.llm,
            tools=tools,
            prompt=prompt
        )

        # Create the agent executor with memory
        return AgentExecutor(
            agent=agent,
            tools=tools,
            memory=self.memory,
            verbose=True,
            max_iterations=15,
            handle_parsing_errors=True,
            return_intermediate_steps=True
        )

    def run_automation_cycle(self):
        """Execute the full automation cycle step by step"""
        write_memory(
            entry="Automation cycle started",
            category="SYSTEM",
            memory=self.memory
        )
        try:
            agent = self.create_agent()
            # Define the workflow as individual steps
            workflow_steps = [
                "Read preferences from preferences.md to capture thresholds and settings",
                "Check TensorBoard at the specified URL and capture a screenshot",
                "Analyze validation loss, training loss, and accuracy from the screenshot",
                "Check Docker container health for the training container",
                "Restart unhealthy containers if needed",
                "Adjust hyperparameters according to preferences and restart container if necessary",
                "Write all observations and actions to memory"
            ]
            # Execute each step individually
            for step in workflow_steps:
                result = agent.invoke({"input": step})
                # Write step output to memory
                if result.get("output"):
                    memory_summary = f"Step: {step}\nOutput: {result['output']}"
                    write_memory(entry=memory_summary, category="STEP", memory=self.memory)
            write_memory(
                entry="Automation cycle completed successfully",
                category="SYSTEM",
                memory=self.memory
            )
            return result
        except Exception as e:
            error_msg = f"Automation cycle failed: {str(e)}"
            write_memory(entry=error_msg, category="ERROR", memory=self.memory)
            raise

def main():
    try:
        automation = ExperimentAutomation(openai_key=os.environ["OPENAI_API_KEY"])
        result = automation.run_automation_cycle()
        if result.get('output'):
            print(f"\nFinal Output:\n{result['output']}")
        if result.get('intermediate_steps'):
            print(f"\nSteps Executed: {len(result['intermediate_steps'])}")
        print("\n✓ Automation cycle completed successfully")
    except Exception as e:
        print(f"\n✗ Automation failed: {e}")
        # no shared memory is available if initialization failed
        write_memory(entry=f"Critical failure: {str(e)}", category="ERROR", memory=None)
        sys.exit(1)

if __name__ == "__main__":
    main()
Now that we have our agent and tools, let’s discuss how we actually express our intent as a researcher: the most important piece.
Define behavior and preferences with natural language
As described, defining what we’re looking for when we start an experiment is essential to getting the right behavior from an agent.
Although image reasoning models have come quite far, and have a good bit of context, they still have a ways to go before they can understand what a good policy loss curve looks like in Hierarchical Policy Optimization, or what the perplexity of the codebook should look like in a Vector Quantized Variational Autoencoder, something I’ve been optimizing over the past week.
For this, we initialize any automated reasoning with a preferences.md.
Let’s start with some general settings:
# Experiment Preferences
This file defines my preferences for this experiment.
The agent should always read this first before taking any action.
---
## General Settings
- experiment_name: vqvae
- container_name: vqvae-train
- tensorboard_url: http://localhost:6006
- memory_file: memory.md
- maximum_adjustments_per_run: 4
---
## More details
You can always add more sections here. The read_preferences tool will parse
and reason over each section.
Now, let’s define the metrics of interest. This is especially important in the case of visual reasoning.
Within the markdown document, define YAML blocks that will be parsed by the agent using the read_preferences tool (a parsing sketch follows the block below). Adding this bit of structure is useful for using preferences as arguments to other tools.
```yaml
metrics:
  - name: perplexity
    pattern: should remain high during the course of training
    restart_condition: premature collapse to zero
    hyperparameters: |
      if collapse, increase `perplexity_weight` from current value to 0.2
  - name: prediction_loss
    pattern: should decrease over the course of training
    restart_condition: increases or stalls
    hyperparameters: |
      if it increases, raise the `prediction_weight` value from current to 0.4
  - name: codebook_usage
    pattern: should remain fixed at > 90%
    restart_condition: drops below 90% for many epochs
    hyperparameters: |
      decrease the `codebook_size` param from 512 to 256.
```
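For illustration, here’s a minimal sketch of how read_preferences might pull those YAML blocks out of the markdown; the function name and merge behavior are assumptions, not the gist’s exact code:

```python
import re
import yaml

def parse_preference_blocks(md_path="preferences.md"):
    """Extract and merge every fenced YAML block in the preferences file."""
    text = open(md_path).read()
    blocks = re.findall(r"`{3}yaml\n(.*?)`{3}", text, re.DOTALL)
    merged = {}
    for block in blocks:
        merged.update(yaml.safe_load(block))  # later blocks override earlier keys
    return merged

prefs = parse_preference_blocks()
print(prefs["metrics"][0]["restart_condition"])  # "premature collapse to zero"
```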
The key idea is that preferences.md should provide enough structured and descriptive detail so the agent can:
- Compare its analysis against your intent, e.g., if the agent sees validation loss = 0.6 but preferences say val_loss_threshold should be 0.5, it knows what the corrective action should be
- Read the thresholds and constraints (YAML or key-value) for metrics, hyperparameters, and container management
- Understand intent or intent patterns described in human-readable sections, like “only adjust learning rate if validation loss exceeds threshold and accuracy is stagnating”
Wiring it all together
Now that we have a containerized experiment plus an agent, we need to schedule the agent. This is as simple as running the agent process via a cron task. The entry below runs our agent once every hour, a tradeoff between cost (in tokens) and operational responsiveness.
0 * * * * /usr/bin/python3 /path/to/agent.py >> /var/log/agent.log 2>&1
I’ve found that this agent doesn’t need the latest reasoning model and performs fine with previous-generation models from Anthropic and OpenAI.
Wrapping up
If research time is finite, it ought to be spent on research, not babysitting experiments.
Your agent should handle monitoring, restarts, and parameter adjustments without constant supervision. When the drag disappears, what stays is the actual work: forming hypotheses, designing higher models, and testing ideas that matter.
Hopefully, this agent will free you up a bit to dream up the next big idea. Enjoy.