LangChain for EDA: Construct a CSV Sanity-Check Agent in Python

LLMs answer questions and generate text; agents perform actions.

That's exactly what we're going to look at in today's article.

In this article, we'll use LangChain and Python to build our own CSV sanity-check agent. With this agent, we'll automate typical exploratory data analysis (EDA) tasks such as displaying columns, detecting missing values (NaNs), and retrieving descriptive statistics.

Agents determine step by step which tool to call and when in order to answer a question about our data. This is a big difference from an application in the traditional sense, where the developer defines how the process works (e.g., via if-else logic). It also goes far beyond simple prompting because we're building a system that acts (albeit in a simple way) and doesn't just talk.

This article is for you if you:

  • …work with Pandas and want to automate EDA.
  • …find LLMs exciting but have little experience with LangChain so far.
  • …want to understand how agents really work (from setup to mini-evaluation) using a simple example.

What we build & why

An agent is a system to which we assign tasks. The system then decides for itself which tools to use to solve these tasks.

This requires three components:

Agent = LLM + Tools + Control logic

Let's take a closer look at the three components:

  • The LLM provides the intelligence: It understands the question, plans steps, and decides what to do.
  • The tools are small Python functions that the agent is allowed to call (e.g., schema or describe): They provide specific information from the data, such as column names or statistics.
  • The control logic (policy) ensures that the LLM doesn't respond immediately but first decides whether it should use a tool. It thinks step by step: First, the question is analyzed, then the appropriate tool is chosen, then the result is interpreted and, if necessary, a next step is chosen, and finally a response is returned.

Instead of manually describing all the data as in classic prompting, we transfer the responsibility to the agent: The system should act on its own, but only with the tools provided.
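Conceptually, this control logic boils down to a small loop. The sketch below is rough pseudocode; the plan() call and decision object are hypothetical placeholders, not LangChain's actual internals:

# Rough pseudocode of an agent loop (not LangChain's real implementation)
def run_agent(question, llm, tools, max_steps=3):
    history = []
    for _ in range(max_steps):
        decision = llm.plan(question, history)        # hypothetical: answer directly or pick a tool?
        if decision.is_final_answer:
            return decision.answer
        result = tools[decision.tool_name](decision.tool_input)
        history.append((decision.tool_name, result))  # feed the tool result back to the LLM
    return "Stopped after max_steps without a final answer."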

Let's look at a simple example:

A user asks: "What's the average age in the CSV?"

At this point, the agent calls the describe tool we've defined. The output is a clearly structured value (e.g., "mean": 29.7). Here we can also see how this reduces or minimizes hallucinations: the system knows which tool to use and can't return an answer such as "Probably between 20 and 40."
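Under the hood, the tool's answer is just a plain pandas computation. Here is a minimal sketch of what the describe tool effectively computes for that question, using the Titanic data introduced later in this article:

import pandas as pd

df = pd.read_csv("titanic.csv")
# The tool would report something like {"mean": 29.7, ...} for the age column
print(round(df["age"].mean(), 1))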

LangChain as a framework

We use the LangChain framework for the agent. It allows us to connect LLMs with tools and build systems with defined behavior. The system can perform actions instead of just providing answers or generating text. A detailed explanation would make this article too long, but in a previous article you can find an explanation of LangChain and a comparison with Langflow: LangChain vs Langflow: Build a Simple LLM App with Code or Drag & Drop.

What the agent does for us

When we receive a new CSV, we usually ask ourselves the following questions first (the start of exploratory data analysis):

  • What columns are there?
  • Where is data missing?
  • What do the descriptive statistics look like?

This is exactly what we want the agent to do automatically.

Tools we define for the agent

For the agent to work, it needs clearly defined tools. It's best to define them as small, specific, and controlled as possible. This way, we avoid errors, hallucinations, or unclear outputs, because such tools make the output deterministic. They also make the agent reproducible and testable, because the same input should produce a consistent result.

In our example, we define three tools:

  • schema: Returns column names and data types.
  • nulls: Shows columns with missing values (including number).
  • describe: Provides descriptive statistics for numeric columns.

Later, we will add a small mini-evaluation to make sure that our agent is working correctly.

Why is this an agent and not an app?

We are not building a classic program with a fixed sequence (e.g., using if-else), but rather the model plans for itself based on the question, selects the appropriate tool, and combines steps as necessary to arrive at an answer:

Visualization by the author.

Hands-On Example: CSV Sanity-Check Agent with LangChain

1) Setup

Prerequisite: Python 3.10 or higher must be installed. Many packages in the AI tooling world require ≥ 3.10. You can find the code and the link to the repo below.

With the code below, we first create a new project, create an isolated Python environment, and activate it. We do this so that packages and versions are reproducible and don't conflict with other projects.


mkdir csv-agent
cd csv-agent
python -m venv .venv
.venv\Scripts\activate

Then we install the necessary packages:

pip install "langchain>=0.2,<0.3" "langchain-openai>=0.1.7" "langchain-community>=0.2" pandas seaborn

With this command, we pin LangChain to the 0.2 line and install the OpenAI connection and the community package. We also install pandas for the EDA functions and seaborn for loading the Titanic sample dataset.

The image shows creating an environment and installing packages.
Screenshot taken by the author.


2) Prepare the dataset in prepare_data.py

Next, we create a Python file called prepare_data.py. I use Visual Studio Code for this, but you can also use another IDE. In this file, we load the Titanic dataset, as it is publicly available.

# prepare_data.py
import seaborn as sns
df = sns.load_dataset("titanic")
df.to_csv("titanic.csv", index=False)
print("Saved titanic.csv")

With sns.load_dataset("titanic"), we load the public dataset (891 rows plus a header row with column names) directly into memory and save it as titanic.csv. The dataset contains only numeric, Boolean, and categorical columns, making it ideal for an EDA agent.


In the terminal, we execute the Python file with the following command so that the titanic.csv file is placed in the project:

python prepare_data.py

We then see in the terminal that the CSV has been saved, and we find the titanic.csv file in the folder:

The image shows the result in the terminal after the csv is saved.
Screenshot taken by the author.
The image shows the folder structure of the project.
Screenshot taken by the author.

Side Note – Titanic dataset

The analysis is based on the Titanic dataset (OpenML ID 40945), which is marked as public on OpenML.

When we open the file, we see the following 14 columns and 891 rows of data. The Titanic dataset is a classic example for exploratory data analysis (EDA). It contains information on 891 passengers of the Titanic and is often used to analyze the relationship between characteristics (e.g., gender, age, ticket class) and survival.

The image shows the Titanic dataset in Excel.
Screenshot taken by the author.

Here are the 14 columns with a brief explanation:

  • survived: Survived (1) or didn’t survive (0).
  • pclass: Ticket class (1 = 1st class, 2 = 2nd class, 3 = 3rd class).
  • sex: Gender of the passenger.
  • age: Age of the passenger (in years, may be missing).
  • sibsp: Number of siblings/spouses on board.
  • parch: Number of parents/children on board.
  • fare: Fare paid by the passenger.
  • embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
  • class: Ticket class as text (First, Second, Third). Corresponds to pclass.
  • who: Categorization “man,” “woman,” “child.”
  • adult_male: Boolean field: Was the passenger an adult male (True/False)?
  • deck: Cabin deck (often missing).
  • embark_town: City of port of embarkation (Cherbourg, Queenstown, Southampton).
  • alone: Boolean field: Did the passenger travel alone (True/False)?

3) Define tools in main.py

Next, we define the various tools. To do this, we create a new Python file called main.py and save it in the csv-agent folder as well. We add the following code to it:

# main.py
import os, json
import pandas as pd

# --- 0) Loading CSV ---
DF_PATH = "titanic.csv"
df = pd.read_csv(DF_PATH)

# --- 1) Defining tools as small, concise commands ---
# IMPORTANT: Tools return strings (in this case, JSON strings) so that the LLM sees clearly structured responses.

from langchain_core.tools import tool

@tool
def tool_schema(dummy: str) -> str:
    """Returns column names and data types as JSON."""
    schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
    return json.dumps(schema)

@tool
def tool_nulls(dummy: str) -> str:
    """Returns columns with the variety of missing values as JSON (only columns with >0 missing values)."""
    nulls = df.isna().sum()
    result = {col: int(n) for col, n in nulls.items() if n > 0}
    return json.dumps(result)

@tool
def tool_describe(input_str: str) -> str:
    """
    Returns describe() statistics.
    Optional: input_str can contain a comma-separated list of columns, e.g. "age, fare".
    """
    cols = None
    if input_str and input_str.strip():
        cols = [c.strip() for c in input_str.split(",") if c.strip() in df.columns]
    stats = df[cols].describe() if cols else df.describe()
    # Flatten the describe() output into CSV text to keep it readable for the LLM:
    return stats.to_csv(index=True)

After importing the necessary packages, we load titanic.csv into df once and define three small, narrowly defined tools. Let's take a closer look at each of these tools:

  • tool_schema returns the column names and data types as JSON. This gives us an overview of what we're dealing with and is usually the first step in any data analysis. Even if a tool doesn't need input (like schema), it must still accept one argument, because the agent always passes a string. We simply ignore it.
  • tool_nulls counts missing values per column and returns only the columns with missing values.
  • tool_describe calls df.describe(). It is important to note that this tool only works for numeric columns; strings or Booleans, on the other hand, are ignored. This is an important step in the sanity check or EDA, because it lets us quickly see the mean, min, max, etc. of different columns. For large CSVs, describe() can take a long time; in that case, you could integrate sampling logic, for instance (see the sketch after this list).
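For illustration, here is a minimal sketch of such sampling logic. The threshold, sample size, and function name are assumptions for this sketch, not part of the tutorial code:

import pandas as pd

MAX_ROWS = 100_000  # arbitrary threshold: above this, work on a sample

def describe_with_sampling(df: pd.DataFrame, cols=None) -> str:
    # Use a random sample for very large CSVs so describe() stays fast
    data = df if len(df) <= MAX_ROWS else df.sample(n=MAX_ROWS, random_state=42)
    stats = data[cols].describe() if cols else data.describe()
    return stats.to_csv(index=True)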

These tools are the controlled interfaces through which the LLM is allowed to access the data. They are deterministic and therefore reproducible. Tools should ideally be clear and limited: in other words, they should have just one function or task.


Why do we need tools at all?

Without tools, the LLM would have to guess about the data. Tools give it controlled, deterministic access to the actual CSV, which keeps the answers grounded and reduces hallucinations.


What exactly does the code do?

With the @tool decorator, LangChain automatically infers the tool's name, description, and argument schema from the function signature and docstring. This means we only need to write the function itself; LangChain takes care of the rest.

  • The model passes arguments that match the tool's schema (often JSON). In this tutorial we keep things simple and accept a single string argument (e.g., input_str: str or a dummy string we ignore).
  • Tools always return a string (text). JSON is ideal for structured data, which we produce with json.dumps().
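If you want to see what the decorator inferred, you can print the tool's metadata (an optional check using standard LangChain tool attributes):

# Optional: inspect what @tool inferred from the function and its docstring
print(tool_schema.name)         # "tool_schema"
print(tool_schema.description)  # the docstring text
print(tool_schema.args)         # the inferred argument schema (here: one string field)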
This image shows how the agent uses multi-step reasoning with tools.
Visualization by the author.

This is a multi-step thought process. The LLM plans iteratively: Instead of responding directly, it thinks step by step. It decides which tool to call, interprets the result, and may continue until it has enough information to answer.

4) Registering tools for LangChain in main.py

We add the code below to the same file to register the previously defined tools for the agent:

# --- 2) Registering tools for LangChain ---

tools = [tool_schema, tool_nulls, tool_describe]

With this code, we simply collect the decorated functions into a list. Each function has already been converted into a LangChain tool by the @tool decorator.

5) Configuring the LLM in main.py

Next, we configure the LLM that the agent uses. Here, you can either use the OpenAI variant or an open-source model run locally with Ollama.

I used OpenAI, which is why we first need to set the API key:

At OpenAI, we create a new API key:

The image shows how to create an API-Key in OpenAI.
Screenshot taken by the author.

We then copy it directly (it won't be displayed again later) and set it as an environment variable in the terminal with the following command.

setx OPENAI_API_KEY "your_key"

It is important to restart cmd and reactivate .venv afterwards. We can use echo to check whether the API key has been saved.
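In the Windows command prompt, that check looks like this, for example:

echo %OPENAI_API_KEY%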

The image shows how to check in the terminal, if the API-Key was saved.
Screenshot taken by the author.

Now we add the following code to the end of main.py:

# --- 3) Configure LLM ---
# Option A: OpenAI (easy)
#   export OPENAI_API_KEY=...    # Windows: setx OPENAI_API_KEY "YOUR_KEY"
#   Use a lower temperature for more stable tool usage
USE_OPENAI = bool(os.getenv("OPENAI_API_KEY"))

if USE_OPENAI:
    from langchain_openai import ChatOpenAI
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)
else:
    # Option B: Local with Ollama (make sure to pull the model first, e.g. 'ollama run llama3')
    from langchain_community.chat_models import ChatOllama
    llm = ChatOllama(model="llama3.1:8b", temperature=0.1)

The code uses OpenAI if an OPENAI_API_KEY is available; otherwise, it uses Ollama locally.

We set the temperature to 0.1. This makes the responses more deterministic, which is especially important for the mini-evaluation later.

We also use gpt-4o-mini as the LLM. This is a lightweight model from OpenAI with a focus on tool usage.

6) Defining the agent's behavior in main.py using the policy

In this step, we define how the agent should behave. The system prompt sets the policy.

# --- 4) Narrow Policy/Prompt (Agent Behavior) ---
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

SYSTEM_PROMPT = (
    "You might be a data-focused assistant. "
    "If an issue requires information from the CSV, first use an appropriate tool. "
    "Use just one tool call per step if possible. "
    "Answer concisely and in a structured way. "
    "If no tool suits, briefly explain why.nn"
    "Available tools:n{tools}n"
    "Use only these tools: {tool_names}."
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", SYSTEM_PROMPT),
        ("human", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)

_tool_desc = "\n".join(f"- {t.name}: {t.description}" for t in tools)
_tool_names = ", ".join(t.name for t in tools)
prompt = prompt.partial(tools=_tool_desc, tool_names=_tool_names)

First, we import ChatPromptTemplate and MessagesPlaceholder to structure our agent's prompt. The most important part of the code is the system prompt: it defines the policy, i.e., the "rules of the game" for the agent. In it, we define that the agent should use only one tool call per step, answer concisely, and use only the tools we've defined.

With the last two lines in the system prompt, we make sure that {tools} lists all available tools with their descriptions, and with {tool_names}, we make sure that the agent can only use these names and can't invent fantasy tools.

In addition, we use the agent_scratchpad placeholder. This is where the agent stores intermediate steps: which tools it has called and which results it has received. This allows it to continue its own chain of reasoning until it arrives at a final answer.
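If you want to see how the placeholders are resolved, you can render the prompt once yourself (an optional check; the dummy values are only for illustration):

# Optional: render the prompt with a dummy question and an empty scratchpad
msgs = prompt.format_messages(input="test question", agent_scratchpad=[])
print(msgs[0].content)  # the system message with {tools} and {tool_names} filled in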

7) Create the tool-calling agent in main.py

In the last step, we define the agent:

# --- 5) Create & Run Tool-Calling Agent ---
from langchain.agents import create_tool_calling_agent, AgentExecutor

agent = create_tool_calling_agent(llm=llm, tools=tools, prompt=prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    verbose=False,   # optional: True for debug logs
    max_iterations=3,
)

if __name__ == "__main__":
    user_query = "Which columns have missing values? List 'Column: Count'."
    result = agent_executor.invoke({"input": user_query})
    print("n=== AGENT ANSWER ===")
    print(result["output"])

With create_tool_calling_agent, we connect our LLM, the tools, and the prompt to form a tool-calling agent.

To make sure that the process runs smoothly, we use the AgentExecutor. It takes care of the so-called agent loop: The agent first plans what needs to be done, then calls a tool, receives the result, and decides whether another tool is required or whether it can provide the final answer. This cycle repeats until the result is ready.

With verbose=True, we can view the intermediate steps in the terminal, which is extremely helpful for debugging. For example, we can see which tool was called when, or what data was returned. If everything is running smoothly, we can set it to False to keep the output cleaner.

With max_iterations=3, we limit how many reasoning–tool–response cycles the agent may perform. This helps prevent infinite loops or excessive tool calls. In our example, the agent might reasonably call schema → nulls → describe before answering.

With the last part of the code, the agent is executed with the sample input "Which columns have missing values?". The result is printed in the terminal.

8) Run the script: Run the file main.py in the terminal

Now we enter python main.py in the terminal to start the agent. We then see the final answer in the terminal:

The image shows the result that the agent shows in the terminal (how many missing values).
Screenshot taken by the author.

Mini-Evaluation

Finally, we want to check our agent, which we do with a small evaluation. This ensures that the agent behaves correctly and that we don't introduce any "regressions" when we modify something in the code later.

At the end of main.py, we add the code below:

def ask_agent(query: str) -> str:
    return agent_executor.invoke({"input": query})["output"]

With ask_agent, we encapsulate the agent call in a function that simply returns a string. This allows us to call the agent later from other files.

The if __name__ == "__main__" block ensures that a test run is performed when main.py is called directly. If, on the other hand, we import main into another file, only the function is provided.

Now we create the file mini_eval.py and insert the following code:

# mini_eval.py

from main import ask_agent

tests = [
    ("Which columns have missing values?", ["age", "embarked", "deck", "embark_town"]),
    ("Show me the primary 3 columns with their data types.", ["survived", "pclass", "sex"]),
    ("Give me a statistical summary of the 'age' column.", ["mean", "min", "max"]),
]

def passed(q, out, must_include):
    text = out.lower()
    return all(any(tok in text for tok in (m.lower(), str(m).lower())) for m in must_include)

if __name__ == "__main__":
    okay = 0
    for q, must in tests:
        out = ask_agent(q)
        result = passed(q, out, must)
        print(f"[{'OK' if result else 'FAIL'}] {q}n{out}n")
        okay += int(result)
    print(f"Passed {okay}/{len(tests)}")

In the code, we define three test cases. Each test consists of a question for the agent and a list of keywords that must appear in the answer. The passed function checks whether these keywords are included.

Expected test results

  • Test 1: “Which columns have missing values?”
    Expected: Output mentions age, deck, embarked, embark_town.
  • Test 2: "Show me the first 3 columns with their data types." Expected: Output contains survived, pclass, sex with types such as int64 or object.
  • Test 3: "Give me a statistical summary of the 'age' column." Expected: Output contains mean ≈ 29.7, min = 0.42, max = 80.

If everything runs correctly, the script reports "Passed 3/3" at the end.

We get this output in the terminal, so the test works:

The image shows the result of the mini-evaluation.
Screenshot taken by the author.

You can find the code & the CSV in the repo on GitHub.


Final Thoughts – Pitfalls, tips, and next steps

LangChain is very practical for this example because it already includes, and nicely illustrates, the complete agent loop (planning, tool calling, control). For small or clearly structured tasks, however, alternatives such as pure function calling (e.g., via the OpenAI API) or classic data validation frameworks like Great Expectations might be sufficient. That said, LangChain does add some overhead. If you only need fixed EDA checks, a plain Python script would be leaner and faster. LangChain is especially worthwhile when you want to extend things flexibly or orchestrate multiple tools and agents.

When working with agents, there are a few things you should keep in mind:

One common pitfall is unclear tool descriptions: If the descriptions are too vague, the model can easily choose the wrong tool (misrouting). With precise and concrete descriptions, we can greatly reduce this.
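As a simple illustration, compare a vague docstring with a precise one (hypothetical examples, not from the tutorial code):

# Vague - invites misrouting:
#   """Gives information about the data."""
# Precise - easy for the model to route correctly:
#   """Returns columns with the number of missing values as JSON
#   (only columns with >0 missing values)."""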

Another important point is testing: Even a small mini-evaluation with three simple tests helps detect regressions (errors that would otherwise go unnoticed after later changes) at an early stage.

It's also worth starting small: In our example, we only worked with three clearly defined tools, but now we know that they work reliably.

With regard to this agent, it may also be useful to incorporate sampling, for instance for very large CSV files, to avoid performance issues. Bear in mind that LLM agents may become costly if every query triggers multiple tool calls.

In this article, we built a single agent that checks CSV files. In practice, multiple agents would often work together: For example, one agent could ensure data quality while a second agent creates visualizations. Such multi-agent systems are the next step toward solving more complex tasks.

As a next step, we could also incorporate LangGraph to extend the agent loop with states and orchestration. This would allow us to assemble agents as in a flowchart, including interruptions, memory, or more flexible control logic.

Finally, in our example, we manually defined the three tools schema, nulls, and describe. With the Model Context Protocol (MCP), we could connect tools in a standardized way. For instance, we could connect databases, APIs or IDEs.

Where Can You Continue Learning?
