I’ve been fascinated by debates: the strategic framing, the sharp retorts, and the carefully timed comebacks. Debates aren’t just entertaining; they’re structured battles of ideas, driven by logic and evidence. Recently, I began wondering: could we replicate that dynamic using AI agents, having them debate one another autonomously, complete with real-time fact-checking and moderation? The result was Deb8flow, an autonomous AI debating environment powered by LangGraph, OpenAI’s GPT-4o model, and the brand new integrated Web Search feature.
In Deb8flow, two agents, Pro and Con, square off on a given topic while a Moderator manages turn-taking. A dedicated Fact Checker reviews every claim in real time using GPT-4o’s new browsing capabilities, and a final Judge evaluates the arguments for quality and coherence. If an agent repeatedly makes factual errors, it is automatically disqualified, ensuring the debate stays grounded in reality.
This article offers an in-depth look at the architecture and dynamic workflows that power autonomous AI debates. I’ll walk you through how Deb8flow’s modular design leverages LangGraph’s state management and conditional routing, alongside GPT-4o’s capabilities.
Even if you’re new to AI agents or LangGraph (see resources [1] and [2] for primers), I’ll explain the key concepts clearly. And if you’d like to explore further, the complete project is available on GitHub: iason-solomos/Deb8flow.
Ready to see how AI agents can debate autonomously in practice?
Let’s dive in.
High-Level Overview: Autonomous Debates with Multiple Agents
In Deb8flow, we orchestrate a formal debate between two AI agents, one arguing Pro and one Con, complete with a Moderator, a Fact Checker, and a final Judge. The debate unfolds autonomously, with each agent playing a role in a structured format.
At its core, Deb8flow is a LangGraph-powered agent system, built atop LangChain, using GPT-4o to power each role: Pro, Con, Judge, and beyond. We use GPT-4o’s preview model with browsing capabilities to enable real-time fact-checking. In essence, the Pro and Con agents debate; after each statement, a fact-checker agent uses GPT-4o’s web search to catch any hallucinations or inaccuracies in that statement in real time. The debate only continues once the statement is verified. The entire process is coordinated by a LangGraph-defined workflow that ensures proper turn-taking and conditional logic.
Image generated by the author with DALL-E
The debate workflow goes through these stages:
- Topic Generation: A Topic Generator agent produces a nuanced, debatable topic for the session (e.g. ).
- Opening: The Pro Argument Agent makes an opening statement in favor of the topic, kicking off the debate.
- Rebuttal: The Debate Moderator then gives the floor to the Con Argument agent, who rebuts the Pro’s opening statement.
- Counter: The Moderator gives the floor back to the Pro agent, who counters the Con agent’s points.
- Closing: The Moderator hands the floor to the Con agent one last time for a closing argument.
- Judgment: Finally, the Judge agent reviews the complete debate history and evaluates both sides based on argument quality, clarity, and persuasiveness. The most convincing side wins.
After each speech, the Fact Checker agent steps in to confirm the factual accuracy of that statement. If a debater’s claim doesn’t hold up (e.g. it cites an incorrect statistic or “hallucinates” a fact), the workflow triggers a retry: the speaker has to correct or modify their statement. (If either debater accumulates 3 fact-check failures, they’re automatically disqualified for repeatedly spreading inaccuracies, and their opponent wins by default.) This mechanism keeps our AI debaters honest and grounded in reality!
Prerequisites and Setup
Before diving into the code, make sure you have the following in place:
- Python 3.12+ installed.
- An OpenAI API key with access to the GPT-4o model. You can create your own API key here: https://platform.openai.com/settings/organization/api-keys
- Project Code: Clone the Deb8flow repository from GitHub:
  git clone https://github.com/iason-solomos/Deb8flow.git
  The repo includes a requirements.txt for all required packages. Key dependencies include LangChain/LangGraph (for building the agent graph) and the OpenAI Python client.
- Install Dependencies: In your project directory, run the following to install the needed libraries:
  pip install -r requirements.txt
- Create a .env file in the project root to hold your OpenAI API credentials. It should be of the form:
  OPENAI_API_KEY_GPT4O = "sk-…"
- If you simply want to run the finished app, you can also consult the README at any time: https://github.com/iason-solomos/Deb8flow
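As a minimal sketch of the credentials step above, the snippet below fails early with a readable error when the key is missing. The helper name require_api_key is hypothetical and not part of the Deb8flow repository; only the variable name OPENAI_API_KEY_GPT4O comes from the .env example. Note that Python does not read .env files by itself, so this assumes the variable has already been loaded into the process environment (for instance by a loader such as python-dotenv, which is an assumption about the setup).

```python
import os

# Variable name taken from the .env example above.
ENV_VAR = "OPENAI_API_KEY_GPT4O"


def require_api_key(env=None) -> str:
    """Return the API key, or raise a clear error if it is missing.

    Hypothetical helper for illustration; `env` defaults to os.environ
    and can be replaced with a plain dict in tests.
    """
    env = os.environ if env is None else env
    # Tolerate surrounding whitespace or quotes left over from the .env line.
    key = (env.get(ENV_VAR) or "").strip().strip('"')
    if not key:
        raise RuntimeError(
            f"{ENV_VAR} is not set; add it to your .env file or shell environment."
        )
    return key
```

Calling require_api_key() once at startup surfaces a configuration problem before any agent is constructed, rather than midway through a debate.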
Once dependencies are installed and the environment variable is set, you should be ready to run the app. The project structure is organized for clarity:
Deb8flow/
├── configurations/
│ ├── debate_constants.py
│ └── llm_config.py
├── nodes/
│ ├── base_component.py
│ ├── topic_generator_node.py
│ ├── pro_debater_node.py
│ ├── con_debater_node.py
│ ├── debate_moderator_node.py
│ ├── fact_checker_node.py
│ ├── fact_check_router_node.py
│ └── judge_node.py
├── prompts/
│ ├── topic_generator_prompts.py
│ ├── pro_debater_prompts.py
│ ├── con_debater_prompts.py
│ └── … (prompts for other agents)
├── tests/ (contains unit and whole workflow tests)
└── debate_workflow.py
A quick tour of this structure:
- configurations/ holds constant definitions and LLM configuration classes.
- nodes/ contains the implementation of each agent or functional node in the debate (each of these is a module defining one agent’s behavior).
- prompts/ stores the prompt templates for the language model (so each agent knows how to prompt GPT-4o for its specific task).
- debate_workflow.py ties everything together by defining the LangGraph workflow (the graph of nodes and transitions).
- debate_state.py defines the shared data structure that the agents will be using on each run.
- tests/ includes some basic tests and example runs to help you confirm everything is working.
Under the Hood: State Management and Workflow Setup
To coordinate a complex multi-turn debate, we need a shared state and a well-defined flow. We’ll start by looking at how Deb8flow defines the debate state and constants, and then see how the LangGraph workflow is constructed.
Defining the Debate State Schema (debate_state.py)
Deb8flow uses a shared state (https://langchain-ai.github.io/langgraph/concepts/low_level/#state) in the form of a Python TypedDict that all agents can read from and update. This state tracks the debate’s progress and context: things like the topic, the history of messages, whose turn it is, etc. By centralizing this information, each agent node can make decisions based on the current state of the debate.
Link: debate_state.py
from typing import TypedDict, List, Dict, Literal

DebateStage = Literal["opening", "rebuttal", "counter", "final_argument"]


class DebateMessage(TypedDict):
    speaker: str  # e.g. "pro" or "con"
    content: str  # The message each speaker produced
    validated: bool  # Whether the FactChecker approved this message
    stage: DebateStage  # The stage of the debate when this message was produced


class DebateState(TypedDict):
    debate_topic: str
    positions: Dict[str, str]
    messages: List[DebateMessage]
    opening_statement_pro_agent: str
    stage: str  # "opening", "rebuttal", "counter", "final_argument"
    speaker: str  # "pro" or "con"
    times_pro_fact_checked: int  # The number of times the pro agent has been fact-checked. If it reaches 3, the pro agent is disqualified.
    times_con_fact_checked: int  # The number of times the con agent has been fact-checked. If it reaches 3, the con agent is disqualified.
Key fields that we need to have in the DebateState include:
- debate_topic (str): The topic being debated.
- messages (List[DebateMessage]): A list of all messages exchanged so far. Each message is a dictionary with fields for speaker (e.g. "pro" or "con" or "fact_checker"), the message content (text), a validated flag (whether it passed fact-check), and the stage of the debate when it was produced.
- stage (str): The current debate stage (one of "opening", "rebuttal", "counter", "final_argument").
- speaker (str): Whose turn it is currently ("pro" or "con").
- times_pro_fact_checked / times_con_fact_checked (int): Counters for how many times each side has been caught with a false claim. (In our rules, if a debater fails fact-check 3 times, they could be disqualified or automatically lose.)
- positions (Dict[str, str]): (Optional) A mapping of each side’s general stance (e.g., "pro": "In favor of the topic").
By structuring the debate’s state, agents can easily access the conversation history or check the current stage, and the control logic can update the state between turns. The state is essentially the memory of the debate.
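To make the shape of this state concrete, here is a small self-contained sketch of how a message might be appended and looked up. The helpers add_message and last_message_by are hypothetical names used for illustration (the repository has its own utilities); the field names mirror the DebateMessage definition above.

```python
from typing import List, Literal, TypedDict

DebateStage = Literal["opening", "rebuttal", "counter", "final_argument"]


class DebateMessage(TypedDict):
    speaker: str
    content: str
    validated: bool
    stage: DebateStage


def add_message(
    messages: List[DebateMessage], speaker: str, content: str, stage: DebateStage
) -> List[DebateMessage]:
    """Return a new list with one unvalidated message appended (hypothetical helper)."""
    new_message: DebateMessage = {
        "speaker": speaker,
        "content": content,
        "validated": False,  # every message starts unvalidated until fact-checked
        "stage": stage,
    }
    # Build a new list instead of mutating the input, so the caller's state is untouched.
    return messages + [new_message]


def last_message_by(messages: List[DebateMessage], speaker: str) -> str:
    """Return the content of the most recent message from `speaker`, or "" if none."""
    for message in reversed(messages):
        if message["speaker"] == speaker:
            return message["content"]
    return ""
```

Returning a new list rather than mutating in place matches the way each node hands back a partial state update for the graph to merge.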
Constants and Configuration
To avoid “magic strings” scattered in the code, we define some constants in debate_constants.py: constants for stage names (STAGE_OPENING = "opening", etc.), speaker identifiers (SPEAKER_PRO = "pro", SPEAKER_CON = "con", etc.), and node names (NODE_PRO_DEBATER = "pro_debater_node", etc.). These make the code easier to maintain and read.
# Stage names
STAGE_OPENING = "opening"
STAGE_REBUTTAL = "rebuttal"
STAGE_COUNTER = "counter"
STAGE_FINAL_ARGUMENT = "final_argument"
STAGE_END = "end"
# Speakers
SPEAKER_PRO = "pro"
SPEAKER_CON = "con"
SPEAKER_JUDGE = "judge"
# Node names
NODE_PRO_DEBATER = "pro_debater_node"
NODE_CON_DEBATER = "con_debater_node"
NODE_DEBATE_MODERATOR = "debate_moderator_node"
NODE_JUDGE = "judge_node"
We also set up LLM configuration in llm_config.py. Here, we define classes for OpenAI or Azure OpenAI configs and then create a dictionary llm_config_map mapping model names to their config. For instance, we map "gpt-4o" to an OpenAILLMConfig that holds the model name and API key. This way, whenever we need to initialize a GPT-4o agent, we can just do llm_config_map["gpt-4o"] to get the right config. All our main agents (debaters, topic generator, judge) use this same GPT-4o configuration.
import os
from dataclasses import dataclass
from typing import Union


@dataclass
class OpenAILLMConfig:
    """
    A data class to store configuration details for OpenAI models.

    Attributes:
        model_name (str): The name of the OpenAI model to use.
        openai_api_key (str): The API key for authenticating with the OpenAI service.
    """
    model_name: str
    openai_api_key: str


llm_config_map = {
    "gpt-4o": OpenAILLMConfig(
        model_name="gpt-4o",
        openai_api_key=os.getenv("OPENAI_API_KEY_GPT4O"),
    )
}
Building the LangGraph Workflow (debate_workflow.py)
With state and configs in place, we construct the debate workflow graph. LangGraph’s StateGraph is the backbone that connects all our agent nodes in the order they need to execute. Here’s how we set it up:
from langgraph.graph import StateGraph, END


class DebateWorkflow:
    def _initialize_workflow(self) -> StateGraph:
        workflow = StateGraph(DebateState)

        # Nodes
        workflow.add_node("generate_topic_node", GenerateTopicNode(llm_config_map["gpt-4o"]))
        workflow.add_node("pro_debater_node", ProDebaterNode(llm_config_map["gpt-4o"]))
        workflow.add_node("con_debater_node", ConDebaterNode(llm_config_map["gpt-4o"]))
        workflow.add_node("fact_check_node", FactCheckNode())
        workflow.add_node("fact_check_router_node", FactCheckRouterNode())
        workflow.add_node("debate_moderator_node", DebateModeratorNode())
        workflow.add_node("judge_node", JudgeNode(llm_config_map["gpt-4o"]))

        # Entry point
        workflow.set_entry_point("generate_topic_node")

        # Flow
        workflow.add_edge("generate_topic_node", "pro_debater_node")
        workflow.add_edge("pro_debater_node", "fact_check_node")
        workflow.add_edge("con_debater_node", "fact_check_node")
        workflow.add_edge("fact_check_node", "fact_check_router_node")
        workflow.add_edge("judge_node", END)
        return workflow

    async def run(self):
        workflow = self._initialize_workflow()
        graph = workflow.compile()
        # graph.get_graph().draw_mermaid_png(output_file_path="workflow_graph.png")
        initial_state = {
            "debate_topic": "",  # key name matches the DebateState schema
            "positions": {}
        }
        final_state = await graph.ainvoke(initial_state, config={"recursion_limit": 50})
        return final_state
Let’s break down what’s happening:
- We initialize a new StateGraph with our DebateState type as the state schema.
- We add each node (agent) to the graph with a name. For nodes that need an LLM, we pass in the GPT-4o config. For instance, "pro_debater_node" is added as ProDebaterNode(llm_config_map["gpt-4o"]), meaning the Pro debater agent will use GPT-4o as its underlying model.
- We set the entry point of the graph to "generate_topic_node". This means the first step of the workflow is to generate a debate topic.
- Then we add directed edges to connect nodes. The edges above encode the primary sequence: topic -> pro’s turn -> fact-check -> (then a routing decision) -> … eventually -> judge -> END. We don’t add static outgoing edges from the Moderator or Fact Check Router, since these nodes use dynamic commands to redirect the flow. The final edge connects the judge to an END marker to terminate the graph.
When the workflow runs, control passes along these edges in order, but whenever we hit a router or moderator node, that node outputs a command telling the graph which node to go to next. That is how we create conditional loops: the fact_check_router_node might send us back to a debater node for a retry, instead of following a straight line. LangGraph supports this by allowing nodes to return a special Command object with goto instructions.
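The routing idea can be illustrated without LangGraph at all. The sketch below is a toy dispatcher written for this explanation, not LangGraph’s API: the Goto class, the run function, and the END sentinel are stand-ins invented here. Each node returns a state update plus the name of the next node, which is enough to reproduce the retry loop described above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

END = "__end__"  # stand-in sentinel for this toy runner


@dataclass
class Goto:
    """Toy stand-in for a routing command: the next node plus a state update."""
    goto: str
    update: dict = field(default_factory=dict)


def run(nodes: Dict[str, Callable[[dict], Goto]], start: str, state: dict,
        max_steps: int = 50) -> dict:
    """Run nodes until one routes to END; guard against endless loops."""
    current, steps = start, 0
    while current != END:
        if steps >= max_steps:
            raise RuntimeError("step limit reached")
        command = nodes[current](state)
        state = {**state, **command.update}  # merge the partial update into the state
        current = command.goto
        steps += 1
    return state


def debater(state: dict) -> Goto:
    # Pretend the statement only passes the fact-check on the second attempt.
    attempt = state.get("attempts", 0) + 1
    return Goto("router", {"attempts": attempt, "validated": attempt >= 2})


def router(state: dict) -> Goto:
    # Loop back to the debater until the last statement is validated.
    return Goto(END if state["validated"] else "debater")


final_state = run({"debater": debater, "router": router}, "debater", {})
```

Here the debater runs twice before the router lets the flow end, which is the same loop shape the fact-check router creates in the real graph. The max_steps guard plays a role similar to the recursion_limit passed to ainvoke above.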
In summary, at a high level we’ve defined an agentic workflow: a graph of autonomous agents where control can branch and loop based on the agents’ outputs. Now, let’s explore what each of those agent nodes actually does.
Agent Nodes Breakdown
Each stage or role in the debate is encapsulated in a node (agent). In LangGraph, nodes are often simple functions, but I wanted a more object-oriented approach for clarity and reusability. So in Deb8flow, every node is a class with a __call__ method. All of the main agent classes inherit from a common BaseComponent for shared functionality. This design makes the system modular: we can easily swap out or extend agents by modifying their class definitions, and each agent class is responsible for its piece of the workflow.
Let’s go through the key agents one by one.
BaseComponent – A Reusable Agent Base Class
Most of our agent nodes (like the debaters and judge) share common needs: they use an LLM to generate output, they may need to retry on errors, and they should track token usage. The BaseComponent class (defined in nodes/base_component.py) provides these common features so we don’t repeat code.
class BaseComponent:
    """
    A foundational class for managing LLM-based workflows with token tracking.
    Can handle both Azure OpenAI (AzureChatOpenAI) and OpenAI (ChatOpenAI).
    """

    def __init__(
        self,
        llm_config: Optional[LLMConfig] = None,
        temperature: float = 0.0,
        max_retries: int = 5,
    ):
        """
        Initializes the BaseComponent with optional LLM configuration and temperature.

        Args:
            llm_config (Optional[LLMConfig]): Configuration for either Azure or OpenAI.
            temperature (float): Controls the randomness of LLM outputs. Defaults to 0.0.
            max_retries (int): How many times to retry on 429 errors.
        """
        logger = logging.getLogger(self.__class__.__name__)
        tracer = trace.get_tracer(__name__, tracer_provider=get_tracer_provider())
        self.logger = logger
        self.tracer = tracer
        self.llm: Optional[ChatOpenAI] = None
        self.output_parser: Optional[StrOutputParser] = None
        self.state: Optional[DebateState] = None
        self.prompt_template: Optional[ChatPromptTemplate] = None
        self.chain: Optional[RunnableSequence] = None
        self.documents: Optional[List] = None
        self.prompt_tokens = 0
        self.completion_tokens = 0
        self.max_retries = max_retries
        if llm_config is not None:
            self.llm = self._init_llm(llm_config, temperature)
            self.output_parser = StrOutputParser()

    def _init_llm(self, config: LLMConfig, temperature: float):
        """
        Initializes an LLM instance for either Azure OpenAI or standard OpenAI.
        """
        if isinstance(config, AzureOpenAILLMConfig):
            # If it's Azure, use the AzureChatOpenAI class
            return AzureChatOpenAI(
                deployment_name=config.deployment_name,
                azure_endpoint=config.azure_endpoint,
                openai_api_version=config.openai_api_version,
                openai_api_key=config.openai_api_key,
                temperature=temperature,
            )
        elif isinstance(config, OpenAILLMConfig):
            # If it's standard OpenAI, use the ChatOpenAI class
            return ChatOpenAI(
                model_name=config.model_name,
                openai_api_key=config.openai_api_key,
                temperature=temperature,
            )
        else:
            raise ValueError("Unsupported LLMConfig type.")

    def validate_initialization(self) -> None:
        """
        Ensures we have an LLM and an output parser.
        """
        if not self.llm:
            raise ValueError("LLM is not initialized. Ensure `llm_config` is provided.")
        if not self.output_parser:
            raise ValueError("Output parser is not initialized.")

    def execute_chain(self, inputs: Any) -> Any:
        """
        Executes the LLM chain, tracks token usage, and retries on 429 errors.
        """
        if not self.chain:
            raise ValueError("No chain is initialized for execution.")
        retry_wait = 1  # Initial wait time in seconds
        for attempt in range(self.max_retries):
            try:
                with get_openai_callback() as cb:
                    result = self.chain.invoke(inputs)
                    self.logger.info("Prompt Token usage: %s", cb.prompt_tokens)
                    self.logger.info("Completion Token usage: %s", cb.completion_tokens)
                    self.prompt_tokens = cb.prompt_tokens
                    self.completion_tokens = cb.completion_tokens
                return result
            except Exception as e:
                # If the error mentions 429, do exponential backoff and retry
                if "429" in str(e):
                    self.logger.warning(
                        f"Rate limit reached. Retrying in {retry_wait} seconds... "
                        f"(Attempt {attempt + 1}/{self.max_retries})"
                    )
                    time.sleep(retry_wait)
                    retry_wait *= 2
                else:
                    self.logger.error(f"Unexpected error: {str(e)}")
                    raise e
        raise Exception("API request failed after maximum number of retries")

    def create_chain(
        self, system_template: str, human_template: str
    ) -> RunnableSequence:
        """
        Creates a chain for unstructured outputs.
        """
        self.validate_initialization()
        self.prompt_template = ChatPromptTemplate.from_messages(
            [
                ("system", system_template),
                ("human", human_template),
            ]
        )
        self.chain = self.prompt_template | self.llm | self.output_parser
        return self.chain

    def create_structured_output_chain(
        self, system_template: str, human_template: str, output_model: Type[BaseModel]
    ) -> RunnableSequence:
        """
        Creates a chain that yields structured outputs (parsed into a Pydantic model).
        """
        self.validate_initialization()
        self.prompt_template = ChatPromptTemplate.from_messages(
            [
                ("system", system_template),
                ("human", human_template),
            ]
        )
        self.chain = self.prompt_template | self.llm.with_structured_output(output_model)
        return self.chain

    def build_return_with_tokens(self, node_specific_data: dict) -> dict:
        """
        Convenience method to add token usage info into the return values.
        """
        return {
            **node_specific_data,
            "prompt_tokens": self.prompt_tokens,
            "completion_tokens": self.completion_tokens,
        }

    def __call__(self, state: DebateState) -> None:
        """
        Updates the node's local copy of the state.
        """
        self.state = state
        for key, value in state.items():
            setattr(self, key, value)
Key features of BaseComponent:
- It stores an LLM client (e.g. an OpenAI ChatOpenAI instance) initialized with a given model and API key, as well as an output parser.
- It provides a method create_chain(system_template, human_template) which sets up a LangChain prompt chain (a RunnableSequence) combining a system prompt and a human prompt. This chain is what actually generates outputs when run.
- It has an execute_chain(inputs) method that invokes the chain and includes logic to retry if the OpenAI API returns a rate-limit error (HTTP 429). This is done with exponential backoff up to a max_retries count.
- It keeps track of token usage (prompt tokens and completion tokens) for logging or evaluation.
- The __call__ method of BaseComponent (which each subclass calls via super().__call__(state)) performs setup before the node’s main logic runs: in the code above, it stores the incoming state and copies each state key onto the instance as an attribute.
By building on BaseComponent, each agent class can focus on its unique logic (like what prompt to use and how to handle the state), while inheriting the heavy lifting of interacting with GPT-4o reliably.
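The retry behaviour of execute_chain can be isolated into a few lines. The following is a simplified sketch of the same exponential-backoff idea, not the repository’s code: call_with_backoff and its sleep parameter are names introduced here so the logic can be exercised without a real API call or real waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    initial_wait: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying with doubling waits when the error mentions 429."""
    wait = initial_wait
    for _ in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            # Mirror the article's check: only rate-limit errors are retried.
            if "429" not in str(exc):
                raise
            sleep(wait)
            wait *= 2  # 1s, 2s, 4s, ...
    raise RuntimeError("API request failed after maximum number of retries")
```

Matching on the string "429" follows the original method; catching a typed rate-limit exception would likely be sturdier, but which exception class applies depends on the client library version, so it is left out of this sketch.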
Topic Generator Agent (GenerateTopicNode)
The Topic Generator (topic_generator_node.py) is the first agent in the graph. Its job is to come up with a debatable topic for the session. We give it a prompt that instructs it to output a nuanced topic that could reasonably have a pro and a con side.
This agent inherits from BaseComponent and uses a prompt chain (system + human prompt) to generate one piece of text: the debate topic. When called, it executes the chain (with no special input, just using the prompt) and gets back a topic_text. It then updates the state with:
- debate_topic: the generated topic (stripped of any extra whitespace),
- positions: a dictionary assigning the pro and con stances (by default we use "In favor of the topic" and "Against the topic"),
- stage: set to "opening",
- speaker: set to "pro" (so the Pro side will speak first).
In code, the return might look like:
return {
    "debate_topic": debate_topic,
    "positions": positions,
    "stage": "opening",
    "speaker": first_speaker  # "pro"
}
Here are the prompts for the topic generator:
SYSTEM_PROMPT = """
You are a brainstorming AI that suggests debate topics.
You will provide a single, interesting or timely topic that can have two opposing views.
"""
HUMAN_PROMPT = """
Please suggest one debate topic for two AI agents to debate.
For example, it could be about technology, politics, philosophy, or any interesting domain.
Just provide the topic in a concise sentence.
"""
Then we pass these prompts in the constructor of the class itself.
class GenerateTopicNode(BaseComponent):
    def __init__(self, llm_config, temperature: float = 0.7):
        super().__init__(llm_config, temperature)
        # Create the prompt chain.
        self.chain: RunnableSequence = self.create_chain(
            system_template=SYSTEM_PROMPT,
            human_template=HUMAN_PROMPT
        )

    def __call__(self, state: DebateState) -> Dict[str, str]:
        """
        Generates a debate topic and assigns positions to the two debaters.
        """
        super().__call__(state)
        topic_text = self.execute_chain({})
        # Store the topic and assign stances in the DebateState
        debate_topic = topic_text.strip()
        positions = {
            "pro": "In favor of the topic",
            "con": "Against the topic"
        }
        first_speaker = "pro"
        self.logger.info("Welcome to our debate panel! Today's debate topic is: %s", debate_topic)
        return {
            "debate_topic": debate_topic,
            "positions": positions,
            "stage": "opening",
            "speaker": first_speaker
        }
It’s a pattern we’ll repeat for all classes except those that don’t use LLMs and the fact checker.
Now we can implement the two stars of the show, the Pro and Con argument agents!
Debater Agents (Pro and Con)
Link: pro_debater_node.py
The two debater agents are very similar in structure, but each uses different prompt templates tailored to their role (pro vs con) and the stage of the debate.
The Pro debater, for instance, has to handle an opening statement and a counter-argument (countering the Con’s rebuttal). We also need logic for retries in case a statement fails fact-check. In code, the ProDebater class sets up multiple prompt chains:
- opening_chain and an opening_retry_chain (using slightly different human prompts: the retry prompt might instruct it to try again without repeating any factually dubious claims).
- counter_chain and counter_retry_chain for the counter-argument stage.
class ProDebaterNode(BaseComponent):
    def __init__(self, llm_config, temperature: float = 0.7):
        super().__init__(llm_config, temperature)
        self.opening_chain = self.create_chain(SYSTEM_PROMPT, OPENING_HUMAN_PROMPT)
        self.opening_retry_chain = self.create_chain(SYSTEM_PROMPT, OPENING_RETRY_HUMAN_PROMPT)
        self.counter_chain = self.create_chain(SYSTEM_PROMPT, COUNTER_HUMAN_PROMPT)
        self.counter_retry_chain = self.create_chain(SYSTEM_PROMPT, COUNTER_RETRY_HUMAN_PROMPT)

    def __call__(self, state: DebateState) -> Dict[str, Any]:
        super().__call__(state)
        debate_topic = state.get("debate_topic")
        messages = state.get("messages", [])
        stage = state.get("stage")
        speaker = state.get("speaker")
        # Check if retrying (last message was by pro and not validated)
        last_msg = messages[-1] if messages else None
        retrying = last_msg and last_msg["speaker"] == SPEAKER_PRO and not last_msg["validated"]
        if stage == STAGE_OPENING and speaker == SPEAKER_PRO:
            # Select which chain to trigger: the normal one or the fact-check retry one
            chain = self.opening_retry_chain if retrying else self.opening_chain
            result = chain.invoke({
                "debate_topic": debate_topic
            })
        elif stage == STAGE_COUNTER and speaker == SPEAKER_PRO:
            opponent_msg = self._get_last_message_by(SPEAKER_CON, messages)
            debate_history = get_debate_history(messages)
            chain = self.counter_retry_chain if retrying else self.counter_chain
            result = chain.invoke({
                "debate_topic": debate_topic,
                "opponent_statement": opponent_msg,
                "debate_history": debate_history
            })
        else:
            raise ValueError(f"Unknown turn for ProDebater: stage={stage}, speaker={speaker}")
        new_message = create_debate_message(speaker=SPEAKER_PRO, content=result, stage=stage)
        self.logger.info("Speaker: %s, Stage: %s, Retry: %s\nMessage:\n%s", speaker, stage, retrying, result)
        return {
            "messages": messages + [new_message]
        }

    def _get_last_message_by(self, speaker_prefix, messages):
        # Walk the history backwards to find the most recent message by this speaker
        for m in reversed(messages):
            if m.get("speaker") == speaker_prefix:
                return m["content"]
        return ""
When the ProDebater’s __call__ runs, it looks at the current stage and speaker in the state to decide what to do:
- If it’s the opening stage and the speaker is “pro”, it uses the opening_chain to generate an opening argument. If the last message from Pro was marked invalid (not validated), it knows this is a retry, so it uses the opening_retry_chain instead.
- If it’s the counter stage and the speaker is “pro”, it generates a counter-argument to whatever the opponent (Con) just said. It fetches the last message by the Con from the messages history and feeds that into the prompt (so that the Pro can directly counter it). Again, if the last Pro message was invalid, it switches to the retry chain.
After generating its argument, the Debater agent creates a new message entry (with speaker="pro", the content text, validated=False initially, and the stage) and appends it to the state’s message list. That becomes the output of the node (LangGraph will merge this partial state update into the global state).
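The retry decision described above boils down to a check on the last message plus a lookup by stage. Here is a minimal sketch of that selection logic on plain dictionaries; is_retry and pick_chain_name are hypothetical helpers for illustration, and the chain names are returned as strings rather than real LangChain chains. Like the class above, it assumes the debater’s own rejected statement is the last entry in the history.

```python
from typing import List


def is_retry(messages: List[dict], speaker: str) -> bool:
    """True when the last message is this speaker's own unvalidated statement."""
    if not messages:
        return False
    last = messages[-1]
    return last["speaker"] == speaker and not last["validated"]


def pick_chain_name(stage: str, retrying: bool) -> str:
    """Map the Pro debater's stage and retry flag to one of its four chains."""
    if stage not in ("opening", "counter"):
        raise ValueError(f"Unknown stage for the Pro debater: {stage}")
    suffix = "_retry_chain" if retrying else "_chain"
    return stage + suffix


# Example: Pro's opening statement was rejected by the fact checker.
history = [
    {"speaker": "pro", "content": "A shaky claim", "validated": False, "stage": "opening"},
]
chosen = pick_chain_name("opening", is_retry(history, "pro"))
```

In this example the history ends with Pro’s unvalidated opening, so the retry variant of the opening chain is selected.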
The Con Debater agent mirrors this logic for its stages:
- It has a rebuttal stage and a closing (final argument) stage, each with a normal and a retry chain.
- It checks whether it’s the rebuttal stage (speaker “con”) or the final argument stage (speaker “con”) and invokes the appropriate chain, possibly using the last Pro message for context when rebutting.
- It similarly appends its message to the state.
With a class-based implementation, our debaters’ code is easier to maintain. We can clearly separate what the Pro does from what the Con does, even though they share structure. Also, by encapsulating prompt chains inside the class, each debater can manage multiple possible outputs (regular vs retry) cleanly.
Prompt design: The actual prompts (in prompts/pro_debater_prompts.py and con_debater_prompts.py) guide the GPT-4o model to take on a persona (“You are a debater arguing the topic…”) and produce the argument. They also instruct the model to keep statements factual and logical. If a fact check fails, the retry prompt may say something like: “Your previous statement had an unverified claim. Revise your argument to be factually correct while maintaining your position.” This encourages the model to correct itself.
With this, our AI debaters can engage in a multi-turn duel, and even recover from factual missteps.
Fact Checker Agent (FactCheckNode)
After each debater speaks, the Fact Checker agent swoops in to verify their claims. This agent is implemented in fact_checker_node.py, and interestingly, it uses the GPT-4o model’s browsing ability rather than our own custom prompt chains. Essentially, we delegate the fact-checking to OpenAI’s GPT-4o with web search.
How does this work? The OpenAI Python client lets us send a user message to the search-enabled GPT-4o model and get a structured response. In FactCheckNode.__call__, we do something like:
completion = self.client.beta.chat.completions.parse(
    model="gpt-4o-search-preview",
    web_search_options={},
    messages=[{
        "role": "user",
        "content": (
            f"Consider the following statement from a debate. "
            f"If the statement contains numbers, or figures from studies, fact-check it online.\n\n"
            f"Statement:\n\"{claim}\"\n\n"
            f"Reply clearly whether any numbers or studies might be inaccurate or hallucinated, and why."
            f"\n"
            f"If the statement doesn't contain references to studies or numbers cited, don't go online to fact-check, and just consider it successfully fact-checked, with a 'yes' score.\n\n"
        )
    }],
    response_format=FactCheck
)
If the result is “yes” (meaning the claim seems truthful or at least not factually incorrect), the Fact Checker marks the last message’s validated field as True in the state and outputs {"validated": True} with no further changes. This signals that the debate can proceed normally.
If the result is “no” (meaning it found the claim to be incorrect or dubious), the Fact Checker appends a new message to the state with speaker="fact_checker" describing the finding (we could simply mark the message, but providing a brief note can be useful). It also sets validated: False and increments a counter for whichever side made the claim. The output state from this node includes validated: False and an updated times_pro_fact_checked or times_con_fact_checked count.
We also use a Pydantic BaseModel to control the output of the LLM:
from pydantic import BaseModel, Field


class FactCheck(BaseModel):
    """
    Pydantic model for fact-checking the claims made by debaters.

    Attributes:
        binary_score (str): 'yes' if the claim is verifiable and truthful, 'no' otherwise.
        justification (str): Explanation of the reasoning behind the score.
    """
    binary_score: str = Field(
        description="Indicates if the claim is verifiable and truthful. 'yes' or 'no'."
    )
    justification: str = Field(
        description="Explanation of the reasoning behind the score."
    )
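The two outcomes described above ("yes" validates the last message, "no" increments the speaker’s counter) can be sketched as a pure function over the state. This is a simplified illustration under stated assumptions rather than the node’s actual code: apply_fact_check is a hypothetical name, the state is a plain dict, and the optional fact_checker note message is omitted for brevity.

```python
def apply_fact_check(state: dict, binary_score: str) -> dict:
    """Return a partial state update reflecting a fact-check verdict.

    Assumes the last entry in state["messages"] is the statement that was
    just checked and that its speaker is "pro" or "con".
    """
    # Copy the messages so the incoming state is not mutated.
    messages = [dict(message) for message in state["messages"]]
    last = messages[-1]
    passed = binary_score.strip().lower() == "yes"
    last["validated"] = passed

    update = {"messages": messages}
    if not passed:
        # Counter names follow the DebateState fields: times_pro_fact_checked, etc.
        counter = f"times_{last['speaker']}_fact_checked"
        update[counter] = state.get(counter, 0) + 1
    return update
```

Normalising the score with strip() and lower() is a defensive choice made for this sketch, since a free-form string field could come back as "Yes" or with stray whitespace.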
Debate Moderator Agent (DebateModeratorNode)
The Debate Moderator is the conductor of the debate. Instead of producing lengthy text, this agent’s job is to manage turn-taking and stage progression. In the workflow, after a statement is validated by the Fact Checker, control passes to the Moderator node. The Moderator then issues a Command that updates the state for the next turn and directs the flow to the appropriate next agent.
The logic in DebateModeratorNode.__call__ (see nodes/debate_moderator_node.py) goes roughly like this:
if stage == STAGE_OPENING and speaker == SPEAKER_PRO:
    return Command(
        update={"stage": STAGE_REBUTTAL, "speaker": SPEAKER_CON},
        goto=NODE_CON_DEBATER
    )
elif stage == STAGE_REBUTTAL and speaker == SPEAKER_CON:
    return Command(
        update={"stage": STAGE_COUNTER, "speaker": SPEAKER_PRO},
        goto=NODE_PRO_DEBATER
    )
elif stage == STAGE_COUNTER and speaker == SPEAKER_PRO:
    return Command(
        update={"stage": STAGE_FINAL_ARGUMENT, "speaker": SPEAKER_CON},
        goto=NODE_CON_DEBATER
    )
elif stage == STAGE_FINAL_ARGUMENT and speaker == SPEAKER_CON:
    return Command(
        update={},
        goto=NODE_JUDGE
    )
raise ValueError(f"Unexpected stage/speaker combo: stage={stage}, speaker={speaker}")
Each conditional corresponds to a point in the debate where a turn has just ended, and sets up the next turn. For example, after the opening (Pro just spoke), it sets stage to rebuttal, switches speaker to Con, and directs the workflow to the Con debater node. After the final_argument (Con’s closing), it directs to the Judge with no further update (the debate stage effectively ends).
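The same turn order can also be read as a small transition table. The sketch below is only an illustration of that idea, not how the project implements it: it avoids LangGraph entirely, the Transition tuple stands in for a Command, and the stage, speaker, and node strings are assumed values rather than the project's real constants.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Transition(NamedTuple):
    next_stage: Optional[str]     # None means "leave the stage unchanged"
    next_speaker: Optional[str]   # None means "leave the speaker unchanged"
    goto: str                     # name of the node to run next


# Assumed string values; the project defines its own constants.
TRANSITIONS: Dict[Tuple[str, str], Transition] = {
    ("opening", "pro"): Transition("rebuttal", "con", "con_debater"),
    ("rebuttal", "con"): Transition("counter", "pro", "pro_debater"),
    ("counter", "pro"): Transition("final_argument", "con", "con_debater"),
    ("final_argument", "con"): Transition(None, None, "judge"),
}


def next_turn(stage: str, speaker: str) -> Transition:
    """Look up what happens after `speaker` finishes `stage`."""
    try:
        return TRANSITIONS[(stage, speaker)]
    except KeyError:
        raise ValueError(
            f"Unexpected stage/speaker combo: stage={stage}, speaker={speaker}"
        ) from None


def walk_debate() -> List[str]:
    """Follow the table from the opening until the judge is reached."""
    stage, speaker = "opening", "pro"
    visited: List[str] = []
    while True:
        step = next_turn(stage, speaker)
        visited.append(step.goto)
        if step.next_stage is None or step.next_speaker is None:
            return visited
        stage, speaker = step.next_stage, step.next_speaker
```

Here walk_debate() visits con_debater, pro_debater, con_debater, and finally judge, which mirrors the four branches above. A table like this keeps the debate format in one place, at the cost of being less explicit than the if/elif chain.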
Fact Check Router (FactCheckRouterNode)
This is another control node (like the Moderator) that introduces conditional logic. The Fact Check Router sits right after the Fact Checker agent in the flow. Its purpose is to branch the workflow depending on the fact-check result.
In nodes/fact_check_router_node.py, the logic is:
if pro_fact_checks >= 3 or con_fact_checks >= 3:
    disqualified = SPEAKER_PRO if pro_fact_checks >= 3 else SPEAKER_CON
    winner = SPEAKER_CON if disqualified == SPEAKER_PRO else SPEAKER_PRO
    verdict_msg = {
        "speaker": "moderator",
        "content": (
            f"Debate ended early due to excessive factual inaccuracies.\n\n"
            f"DISQUALIFIED: {disqualified.upper()} (exceeded fact check limit)\n"
            f"WINNER: {winner.upper()}"
        ),
        "validated": True,
        "stage": "verdict"
    }
    return Command(
        update={"messages": messages + [verdict_msg]},
        goto=END
    )
if last_message.get("validated"):
    return Command(goto=NODE_DEBATE_MODERATOR)
elif speaker == SPEAKER_PRO:
    return Command(goto=NODE_PRO_DEBATER)
elif speaker == SPEAKER_CON:
    return Command(goto=NODE_CON_DEBATER)
raise ValueError("Unable to determine routing in FactCheckRouterNode.")
First, the Fact Check Router checks whether either side’s fact-check count has reached 3. If so, it creates a Moderator-style message announcing an early end: the offending side is disqualified and the other side is the winner. It appends this verdict to the messages and returns a Command that jumps to END, effectively terminating the debate without going to the Judge (because we already know the outcome).
If we’re not ending the debate early, it then looks at the Fact Checker’s result for the last message (stored as validated on that message). If validated is True, we go to the debate moderator: Command(goto=NODE_DEBATE_MODERATOR).
Otherwise, if the statement fails the fact check, the workflow goes back to the debater to produce a revised statement (with the state counters updated to reflect the failure). This loop can occur multiple times if needed (up to the disqualification limit).
This dynamic control is the heart of Deb8flow’s “agentic” nature: the ability to adapt the path of execution based on the content of the agents’ outputs. It showcases LangGraph’s strength: combining control flow with state. We’re essentially encoding debate rules (like allowing retries for false claims, or ending the debate if someone cheats too often) directly into the workflow graph.
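The retry loop and the disqualification rule can be exercised without running any model. The following is a dependency-free sketch under stated assumptions: route_after_fact_check and simulate_retries are hypothetical helpers, the node names are placeholder strings, and the router returns a node name instead of a LangGraph Command.

```python
from typing import Any, Dict, List

MAX_FAILED_CHECKS = 3  # same threshold as the router excerpt above


def route_after_fact_check(state: Dict[str, Any]) -> str:
    """Return the name of the next node (placeholder strings, not real constants)."""
    if (state.get("times_pro_fact_checked", 0) >= MAX_FAILED_CHECKS
            or state.get("times_con_fact_checked", 0) >= MAX_FAILED_CHECKS):
        return "END"
    if state["messages"][-1].get("validated"):
        return "debate_moderator"
    if state["speaker"] == "pro":
        return "pro_debater"
    if state["speaker"] == "con":
        return "con_debater"
    raise ValueError("Unable to determine routing.")


def simulate_retries(outcomes: List[bool]) -> List[str]:
    """Drive the router with scripted fact-check outcomes for Pro's turn."""
    state: Dict[str, Any] = {
        "speaker": "pro",
        "messages": [],
        "times_pro_fact_checked": 0,
        "times_con_fact_checked": 0,
    }
    path: List[str] = []
    for passed in outcomes:
        state["messages"].append({"speaker": "pro", "validated": passed})
        if not passed:
            state["times_pro_fact_checked"] += 1
        node = route_after_fact_check(state)
        path.append(node)
        if node != "pro_debater":   # the turn is over once we leave the retry loop
            break
    return path
```

With one failed attempt followed by a valid one, simulate_retries([False, True]) routes back to pro_debater and then on to debate_moderator. Three failures in a row end the run at END on the third check.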
Judge Agent (JudgeNode)
Last but not least, the Judge agent delivers the final verdict based on rhetorical skill, clarity, structure, and overall persuasiveness. Its system prompt and human prompt make this explicit:
- System Prompt: “You are an impartial debate judge AI. … Evaluate which debater presented their case more clearly, persuasively, and logically. You must focus on communication skills, structure of argument, rhetorical strength, and overall coherence.”
- Human Prompt: “Here is the full debate transcript. Please analyze the performance of both debaters, PRO and CON. Evaluate rhetorical performance (clarity, structure, persuasion, and relevance) and decide who presented their case more effectively.”
When the Judge node runs, it receives the full debate transcript (all validated messages) along with the original topic. It then uses GPT-4o to examine how each side framed their arguments, handled counterpoints, and supported (or didn’t support) claims with examples or logic. Crucially, the Judge is forbidden from deciding which position is correct (or which it thinks is more likely to be correct); it evaluates only how effectively each side argued.
Below is an example final verdict from a Deb8flow run on the topic:
“Should governments implement a universal basic income in response to increasing automation within the workforce?”
WINNER: PRO
REASON: The PRO debater presented a more compelling and rhetorically effective case for universal basic income. Their arguments were well-structured, starting with a clear statement of the problem and the need for UBI in response to automation. They effectively addressed potential counterarguments by highlighting the unprecedented speed and scope of current technological changes, which distinguishes the present situation from past technological shifts. The PRO also provided empirical evidence from UBI pilot programs to counter the CON's claims about work disincentives and economic inefficiencies, reinforcing their argument with real-world examples.
In contrast, the CON debater, while presenting valid concerns about UBI, relied heavily on historical analogies and assumptions about workforce adaptability without adequately addressing the unique challenges posed by modern automation. Their arguments concerning the fiscal burden and potential inefficiencies of UBI were less supported by specific evidence compared to the PRO's rebuttals.
Overall, the PRO's arguments were more coherent, persuasive, and backed by empirical evidence, making their case more convincing to a neutral observer.
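If downstream code needs the outcome programmatically, a verdict in the WINNER/REASON layout shown above can be split into fields. This is a small sketch under the assumption that the Judge always emits those two labels in that order; parse_verdict and Verdict are hypothetical names, not part of the project.

```python
import re
from typing import NamedTuple


class Verdict(NamedTuple):
    winner: str   # "pro" or "con"
    reason: str   # free-text justification following the REASON label


# WINNER must be PRO or CON; everything after REASON: is kept as the reason.
_VERDICT_RE = re.compile(
    r"WINNER:\s*(PRO|CON)\b.*?REASON:\s*(.+)",
    re.DOTALL | re.IGNORECASE,
)


def parse_verdict(text: str) -> Verdict:
    """Extract the winner and the reasoning from a judge verdict string."""
    match = _VERDICT_RE.search(text)
    if match is None:
        raise ValueError("Verdict text does not contain WINNER/REASON fields.")
    return Verdict(winner=match.group(1).lower(), reason=match.group(2).strip())
```

A structured-output model, as used for the Fact Checker, would be a sturdier alternative to regex parsing if the verdict format ever varies.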
LangSmith Tracing
Throughout Deb8flow’s development, I relied on LangSmith (LangChain’s tracing and observability toolkit) to make sure the whole debate pipeline was behaving correctly. Because we have multiple agents passing control between themselves, it’s easy for unexpected loops or misrouted states to occur. LangSmith provides a convenient way to:
- Visualize Execution Flow: You can see each agent’s prompt, the tokens consumed (so you can also track costs), and any intermediate states. This makes it much easier to verify that, say, the Con Debater is correctly referencing the Pro Debater’s last message, or that the Fact Checker is accurately receiving the claim to verify.
- Debug State Updates: If the Moderator or Fact Check Router is sending the flow to the wrong node, the trace will highlight that mismatch. You can trace which agent was invoked at each step and why, helping you spot stage or speaker misalignments early.
- Track Prompt and Completion Tokens: With multiple GPT-4o calls, it’s useful to see how many tokens each stage is using, which LangSmith logs automatically if you enable tracing.
Integrating LangSmith is surprisingly easy. You just need to provide these three keys in your .env file: LANGCHAIN_API_KEY, LANGCHAIN_TRACING_V2, and LANGCHAIN_PROJECT.
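A quick startup check can catch a missing key before a run silently goes untraced. The sketch below assumes the .env file has already been loaded into the process environment by whatever loader you use; missing_langsmith_keys is a hypothetical helper, and the values in the comment are placeholders.

```python
import os
from typing import List, Mapping

# Example .env contents (placeholder values):
#   LANGCHAIN_API_KEY=<your LangSmith API key>
#   LANGCHAIN_TRACING_V2=true
#   LANGCHAIN_PROJECT=<a project name of your choice>
REQUIRED_KEYS = ("LANGCHAIN_API_KEY", "LANGCHAIN_TRACING_V2", "LANGCHAIN_PROJECT")


def missing_langsmith_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required keys that are absent or blank in `env`."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_langsmith_keys()
    if missing:
        print("LangSmith tracing is not fully configured; missing:", ", ".join(missing))
```

Passing a plain dictionary instead of os.environ makes the check easy to test in isolation.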
You can then open the LangSmith UI to see a structured trace of each run. This greatly reduces the guesswork involved in debugging multi-agent systems and is, in my experience, essential for more complex AI orchestration like ours. Example of a single run:

Reflections and Next Steps
Building Deb8flow was an eye-opening exercise in orchestrating autonomous agent workflows. We didn’t just chain a single model call; we created an entire debate simulation with AI agents, each with a specific role, and allowed them to interact according to a set of rules. LangGraph provided a clear framework to define how data and control flow between agents, making the complex sequence manageable in code. By using class-based agents and a shared state, we maintained modularity and clarity, which pays off for any software engineering project in the long run.
An exciting aspect of this project was seeing emergent behavior. Although each agent follows a script (a prompt), the unscripted combination (a debater attempting to deceive, a fact-checker catching it, the debater rephrasing) felt surprisingly realistic! It’s a small step toward more agentic AI systems that can perform non-trivial multi-step tasks with oversight of one another.
There are plenty of ideas for improvement:
- User Interaction: Currently it’s fully autonomous, but one could add a mode where a human provides the topic or even takes the role of one side against an AI opponent.
- We can switch the order in which the Debaters speak.
- We can change the prompts, and thus to a large degree the behavior of the agents, and experiment with different prompts.
- Make the debaters also perform web search before producing their statements, thus providing them with the latest information.
The broader implication of Deb8flow is how it showcases a pattern for composable AI agents. By defining clear boundaries and interactions (similar to microservices in software), we can have complex AI-driven processes that remain interpretable and controllable. Each agent is like a cog in a machine, and LangGraph is the gear system making them work in unison.
I found this project energizing, and I hope it inspires you to explore multi-agent workflows. Whether it’s debating, collaborating on writing, or solving problems from different expert angles, the combination of GPT, tools, and structured agentic workflows opens up a new world of possibilities for AI development. Happy hacking!
References
[1] D. Bouchard, “From Basics to Advanced: Exploring LangGraph,” Nov. 22, 2023. [Online]. Available: https://medium.com/data-science/from-basics-to-advanced-exploring-langgraph-e8c1cf4db787. [Accessed: Apr. 1, 2025].
[2] A. W. T. Ng, “Building a Research Agent that Can Write to Google Docs: Part 1,” Jan. 11, 2024. [Online]. Available: https://towardsdatascience.com/building-a-research-agent-that-can-write-to-google-docs-part-1-4b49ea05a292/. [Accessed: Apr. 1, 2025].