Deb8flow: Orchestrating Autonomous AI Debates with LangGraph and GPT-4o


I’ve been fascinated by debates—the strategic framing, the sharp retorts, and the carefully timed comebacks. Debates aren’t just entertaining; they’re structured battles of ideas, driven by logic and evidence. Recently, I began wondering: could we replicate that dynamic using AI agents—having them debate one another autonomously, complete with real-time fact-checking and moderation? The result was Deb8flow, an autonomous AI debating environment powered by LangGraph, OpenAI’s GPT-4o model, and the newly integrated Web Search feature.

In Deb8flow, two agents—Pro and Con—square off on a given topic while a Moderator manages turn-taking. A dedicated Fact Checker reviews every claim in real time using GPT-4o’s recent browsing capabilities, and a final Judge evaluates the arguments for quality and coherence. If an agent repeatedly makes factual errors, they’re automatically disqualified—ensuring the debate stays grounded in reality.

This article offers an in-depth look at the architecture and dynamic workflows that power these autonomous AI debates. I’ll walk you through how Deb8flow’s modular design leverages LangGraph’s state management and conditional routing, alongside GPT-4o’s capabilities.

Even if you’re new to AI agents or LangGraph (see resources [1] and [2] for primers), I’ll explain the key concepts clearly. And if you’d like to explore further, the full project is available on GitHub: iason-solomos/Deb8flow.

Ready to see how AI agents can debate autonomously in practice?

Let’s dive in.

High-Level Overview: Autonomous Debates with Multiple Agents

In Deb8flow, we orchestrate a formal debate between two AI agents – one arguing Pro and one Con – complete with a Moderator, a Fact Checker, and a final Judge. The debate unfolds autonomously, with each agent playing a role in a structured format.

At its core, Deb8flow is a LangGraph-powered agent system, built atop LangChain, using GPT-4o to power each role—Pro, Con, Judge, and beyond. We use GPT-4o’s preview model with browsing capabilities to enable real-time fact-checking. In essence, the Pro and Con agents debate; after each statement, a fact-checker agent uses GPT-4o’s web search to catch any hallucinations or inaccuracies in that statement in real time. The debate only continues once the statement is verified. The whole process is coordinated by a LangGraph-defined workflow that ensures proper turn-taking and conditional logic.

Image generated by the author with DALL-E

The debate workflow goes through these stages:

  • Topic Generation: A Topic Generator agent produces a nuanced, debatable topic for the session.
  • Opening: The Pro Argument agent makes an opening statement in favor of the topic, kicking off the debate.
  • Rebuttal: The Debate Moderator then gives the floor to the Con Argument agent, who rebuts the Pro’s opening statement.
  • Counter: The Moderator gives the floor back to the Pro agent, who counters the Con agent’s points.
  • Closing: The Moderator passes the floor to the Con agent one last time for a closing argument.
  • Judgment: Finally, the Judge agent reviews the full debate history and evaluates both sides based on argument quality, clarity, and persuasiveness. The most convincing side wins.

After each speech, the Fact Checker agent steps in to verify the factual accuracy of that statement. If a debater’s claim doesn’t hold up (e.g. it cites an incorrect statistic or “hallucinates” a fact), the workflow triggers a retry: the speaker has to correct or modify their statement. (If either debater accumulates 3 fact-check failures, they’re automatically disqualified for repeatedly spreading inaccuracies, and their opponent wins by default.) This mechanism keeps our AI debaters honest and grounded in reality!

Prerequisites and Setup

Before diving into the code, make sure you have the following in place:

  • Python 3.12+ installed.
  • An OpenAI API key with access to the GPT-4o model. You can create your own API key here: https://platform.openai.com/settings/organization/api-keys
  • Project Code: Clone the Deb8flow repository from GitHub (git clone https://github.com/iason-solomos/Deb8flow.git). The repo includes a requirements.txt for all required packages. Key dependencies include LangChain/LangGraph (for building the agent graph) and the OpenAI Python client.
  • Install Dependencies: In your project directory, run pip install -r requirements.txt to install the needed libraries.
  • Create a .env file in the project root to hold your OpenAI API credentials. It should be of the form: OPENAI_API_KEY_GPT4O="sk-…"
  • You can also check the README at any time: https://github.com/iason-solomos/Deb8flow if you simply want to run the finished app.

Once dependencies are installed and the environment variable is set, you should be ready to run the app. The project structure is organized for clarity:

Deb8flow/
├── configurations/
│ ├── debate_constants.py
│ └── llm_config.py
├── nodes/
│ ├── base_component.py
│ ├── topic_generator_node.py
│ ├── pro_debater_node.py
│ ├── con_debater_node.py
│ ├── debate_moderator_node.py
│ ├── fact_checker_node.py
│ ├── fact_check_router_node.py
│ └── judge_node.py
├── prompts/
│ ├── topic_generator_prompts.py
│ ├── pro_debater_prompts.py
│ ├── con_debater_prompts.py
│ └── … (prompts for other agents)
├── tests/ (contains unit and whole-workflow tests)
├── debate_state.py
└── debate_workflow.py

A fast tour of this structure:

configurations/ holds constant definitions and LLM configuration classes.

nodes/ contains the implementation of each agent or functional node in the debate (each of these is a module defining one agent’s behavior).

prompts/ stores the prompt templates for the language model (so each agent knows how to prompt GPT-4o for its specific task).

debate_workflow.py ties everything together by defining the LangGraph workflow (the graph of nodes and transitions).

debate_state.py defines the shared data structure that the agents will be using on each run.

tests/ includes some basic tests and example runs to help you confirm everything is working.

Under the Hood: State Management and Workflow Setup

To coordinate a complex multi-turn debate, we need a shared state and a well-defined flow. We’ll start by looking at how Deb8flow defines the debate state and constants, and then see how the LangGraph workflow is constructed.

Defining the Debate State Schema (debate_state.py)

Deb8flow uses a shared state (https://langchain-ai.github.io/langgraph/concepts/low_level/#state ) in the form of a Python TypedDict that all agents can read from and update. This state tracks the debate’s progress and context – things like the topic, the history of messages, whose turn it is, etc. By centralizing this information, each agent node can make decisions based on the current state of the debate.

Link: debate_state.py

from typing import TypedDict, List, Dict, Literal


DebateStage = Literal["opening", "rebuttal", "counter", "final_argument"]

class DebateMessage(TypedDict):
    speaker: str  # e.g. pro or con
    content: str  # The message each speaker produced
    validated: bool  # Whether the FactChecker okay’d this message
    stage: DebateStage # The stage of the debate when this message was produced

class DebateState(TypedDict):
    debate_topic: str
    positions: Dict[str, str]
    messages: List[DebateMessage]
    opening_statement_pro_agent: str
    stage: str  # "opening", "rebuttal", "counter", "final_argument"
    speaker: str  # "pro" or "con"
    times_pro_fact_checked: int # The number of times the pro agent has failed fact-checking. If it reaches 3, the pro agent is disqualified.
    times_con_fact_checked: int # The number of times the con agent has failed fact-checking. If it reaches 3, the con agent is disqualified.

Key fields that we need in the DebateState include:

  • debate_topic (str): The topic being debated.
  • messages (List[DebateMessage]): A list of all messages exchanged so far. Each message is a dictionary with fields for the speaker (e.g. "pro", "con", or "fact_checker"), the message content (text), a validated flag (whether it passed fact-check), and the stage of the debate when it was produced.
  • stage (str): The current debate stage (one of "opening", "rebuttal", "counter", "final_argument").
  • speaker (str): Whose turn it is currently ("pro" or "con").
  • times_pro_fact_checked / times_con_fact_checked (int): Counters for how many times each side has been caught with a false claim. (In our rules, if a debater fails fact-check 3 times, they can be disqualified or automatically lose.)
  • positions (Dict[str, str]): (Optional) A mapping of each side’s general stance (e.g., "pro": "In favor of the topic").

By structuring the debate’s state this way, agents can easily access the conversation history or check the current stage, and the control logic can update the state between turns. The state is essentially the memory of the debate.
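For illustration, here is a minimal sketch (with made-up values) of what the shared state might look like right after the Pro agent’s opening statement has passed fact-checking and the Moderator has handed the floor to the Con agent:

example_state: DebateState = {
    "debate_topic": "Should governments implement a universal basic income?",  # hypothetical topic
    "positions": {"pro": "In favor of the topic", "con": "Against the topic"},
    "messages": [
        {
            "speaker": "pro",
            "content": "Universal basic income is necessary because ...",
            "validated": True,  # the Fact Checker approved this statement
            "stage": "opening",
        }
    ],
    "opening_statement_pro_agent": "Universal basic income is necessary because ...",
    "stage": "rebuttal",  # the Moderator has already advanced the stage
    "speaker": "con",     # it is now the Con agent's turn
    "times_pro_fact_checked": 0,
    "times_con_fact_checked": 0,
}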

Constants and Configuration

To avoid “magic strings” scattered throughout the code, we define some constants in debate_constants.py. For example, constants for stage names (STAGE_OPENING = "opening", etc.), speaker identifiers (SPEAKER_PRO = "pro", SPEAKER_CON = "con", etc.), and node names (NODE_PRO_DEBATER = "pro_debater_node", etc.). These make the code easier to maintain and read.

debate_constants.py:

# Stage names
STAGE_OPENING = "opening"
STAGE_REBUTTAL = "rebuttal"
STAGE_COUNTER = "counter"
STAGE_FINAL_ARGUMENT = "final_argument"
STAGE_END = "end"

# Speakers
SPEAKER_PRO = "pro"
SPEAKER_CON = "con"
SPEAKER_JUDGE = "judge"

# Node names
NODE_PRO_DEBATER = "pro_debater_node"
NODE_CON_DEBATER = "con_debater_node"
NODE_DEBATE_MODERATOR = "debate_moderator_node"
NODE_JUDGE = "judge_node"

We also set up the LLM configuration in llm_config.py. Here, we define classes for OpenAI or Azure OpenAI configs and then create a dictionary llm_config_map that maps model names to their config. For instance, we map "gpt-4o" to an OpenAILLMConfig holding the model name and API key. This way, whenever we need to initialize a GPT-4o agent, we can just do llm_config_map["gpt-4o"] to get the right config. All our main agents (debaters, topic generator, judge) use this same GPT-4o configuration.

import os
from dataclasses import dataclass
from typing import Union

@dataclass
class OpenAILLMConfig:
    """
    A data class to store configuration details for OpenAI models.

    Attributes:
        model_name (str): The name of the OpenAI model to use.
        openai_api_key (str): The API key for authenticating with the OpenAI service.
    """
    model_name: str
    openai_api_key: str


llm_config_map = {
    "gpt-4o": OpenAILLMConfig(
        model_name="gpt-4o",
        openai_api_key=os.getenv("OPENAI_API_KEY_GPT4O"),
    )
}

Constructing the LangGraph Workflow (debate_workflow.py)

With state and configs in place, we construct the debate workflow graph. LangGraph’s StateGraph is the backbone that connects all our agent nodes in the order they should execute. Here’s how we set it up:

# Imports inferred from the project layout above (a sketch; the repo's own import block may differ).
from langgraph.graph import StateGraph, END

from configurations.llm_config import llm_config_map
from debate_state import DebateState
from nodes.topic_generator_node import GenerateTopicNode
from nodes.pro_debater_node import ProDebaterNode
from nodes.con_debater_node import ConDebaterNode
from nodes.fact_checker_node import FactCheckNode
from nodes.fact_check_router_node import FactCheckRouterNode
from nodes.debate_moderator_node import DebateModeratorNode
from nodes.judge_node import JudgeNode


class DebateWorkflow:

    def _initialize_workflow(self) -> StateGraph:
        workflow = StateGraph(DebateState)
        # Nodes
        workflow.add_node("generate_topic_node", GenerateTopicNode(llm_config_map["gpt-4o"]))
        workflow.add_node("pro_debater_node", ProDebaterNode(llm_config_map["gpt-4o"]))
        workflow.add_node("con_debater_node", ConDebaterNode(llm_config_map["gpt-4o"]))
        workflow.add_node("fact_check_node", FactCheckNode())
        workflow.add_node("fact_check_router_node", FactCheckRouterNode())
        workflow.add_node("debate_moderator_node", DebateModeratorNode())
        workflow.add_node("judge_node", JudgeNode(llm_config_map["gpt-4o"]))

        # Entry point
        workflow.set_entry_point("generate_topic_node")

        # Flow
        workflow.add_edge("generate_topic_node", "pro_debater_node")
        workflow.add_edge("pro_debater_node", "fact_check_node")
        workflow.add_edge("con_debater_node", "fact_check_node")
        workflow.add_edge("fact_check_node", "fact_check_router_node")
        workflow.add_edge("judge_node", END)
        return workflow



    async def run(self):
        workflow = self._initialize_workflow()
        graph = workflow.compile()
        # graph.get_graph().draw_mermaid_png(output_file_path="workflow_graph.png")
        initial_state = {
            "topic": "",
            "positions": {}
        }
        final_state = await graph.ainvoke(initial_state, config={"recursion_limit": 50})
        return final_state

Let’s break down what’s happening:

  • We initialize a new StateGraph with our DebateState type as the state schema.
  • We add each node (agent) to the graph with a name. For nodes that need an LLM, we pass in the GPT-4o config. For example, "pro_debater_node" is added as ProDebaterNode(llm_config_map["gpt-4o"]), meaning the Pro debater agent will use GPT-4o as its underlying model.
  • We set the entry point of the graph to "generate_topic_node". This means the first step of the workflow is to generate a debate topic.
  • Then we add directed edges to connect nodes. The edges above encode the primary sequence: topic -> pro’s turn -> fact-check -> (then a routing decision) -> … eventually -> judge -> END. We don’t connect the Moderator or Fact Check Router with static edges, since these nodes use dynamic commands to redirect the flow. The final edge connects the judge to an END marker to terminate the graph.

When the workflow runs, control will pass along these edges in order, but whenever we hit a router or moderator node, that node will output a command telling the graph which node to go to next (overriding the default edge). This is how we create conditional loops: the fact_check_router_node might send us back to a debater node for a retry, instead of following a straight line. LangGraph supports this by allowing nodes to return a special Command object with goto instructions.
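As a minimal sketch (the real Moderator and Fact Check Router implementations are shown later), a routing node that overrides the default edges might look like the following; the node names and the condition here are purely illustrative:

from langgraph.types import Command

def example_router(state: DebateState) -> Command:
    # If the last statement passed fact-checking, hand control to the moderator;
    # otherwise send the same speaker back to try again.
    last_message = state["messages"][-1]
    if last_message["validated"]:
        return Command(goto="debate_moderator_node")
    return Command(goto=f"{last_message['speaker']}_debater_node")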

In summary, at a high level we have defined an agentic workflow: a graph of autonomous agents where control can branch and loop based on the agents’ outputs.
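Before moving on, here is a minimal sketch of how you might invoke the workflow end to end (the repository’s own entry point may differ):

import asyncio

from debate_workflow import DebateWorkflow

async def main():
    workflow = DebateWorkflow()
    final_state = await workflow.run()
    # Print the transcript that accumulated in the shared state.
    for message in final_state.get("messages", []):
        print(f"[{message['stage']}] {message['speaker']}: {message['content']}\n")

if __name__ == "__main__":
    asyncio.run(main())

Now, let’s explore what each of these agent nodes actually does.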

Agent Nodes Breakdown

Each stage or role in the debate is encapsulated in a node (agent). In LangGraph, nodes are often simple functions, but I wanted a more object-oriented approach for clarity and reusability. So in Deb8flow, every node is a class with a __call__ method. All the main agent classes inherit from a common BaseComponent for shared functionality. This design makes the system modular: we can easily swap out or extend agents by modifying their class definitions, and each agent class is responsible for its piece of the workflow.

Let’s go through the key agents one by one.

BaseComponent – A Reusable Agent Base Class

Most of our agent nodes (like the debaters and judge) share common needs: they use an LLM to generate output, they may need to retry on errors, and they should track token usage. The BaseComponent class (defined in nodes/base_component.py) provides these common features so we don’t repeat code.

class BaseComponent:
    """
    A foundational class for managing LLM-based workflows with token tracking.
    Can handle both Azure OpenAI (AzureChatOpenAI) and OpenAI (ChatOpenAI).
    """

    def __init__(
        self,
        llm_config: Optional[LLMConfig] = None,
        temperature: float = 0.0,
        max_retries: int = 5,
    ):
        """
        Initializes the BaseComponent with optional LLM configuration and temperature.

        Args:
            llm_config (Optional[LLMConfig]): Configuration for either Azure or OpenAI.
            temperature (float): Controls the randomness of LLM outputs. Defaults to 0.0.
            max_retries (int): How many times to retry on 429 errors.
        """
        logger = logging.getLogger(self.__class__.__name__)
        tracer = trace.get_tracer(__name__, tracer_provider=get_tracer_provider())

        self.logger = logger
        self.tracer = tracer
        self.llm: Optional[ChatOpenAI] = None
        self.output_parser: Optional[StrOutputParser] = None
        self.state: Optional[DebateState] = None
        self.prompt_template: Optional[ChatPromptTemplate] = None
        self.chain: Optional[RunnableSequence] = None
        self.documents: Optional[List] = None
        self.prompt_tokens = 0
        self.completion_tokens = 0
        self.max_retries = max_retries

        if llm_config is not None:
            self.llm = self._init_llm(llm_config, temperature)
            self.output_parser = StrOutputParser()

    def _init_llm(self, config: LLMConfig, temperature: float):
        """
        Initializes an LLM instance for either Azure OpenAI or standard OpenAI.
        """
        if isinstance(config, AzureOpenAILLMConfig):
            # If it's Azure, use the AzureChatOpenAI class
            return AzureChatOpenAI(
                deployment_name=config.deployment_name,
                azure_endpoint=config.azure_endpoint,
                openai_api_version=config.openai_api_version,
                openai_api_key=config.openai_api_key,
                temperature=temperature,
            )
        elif isinstance(config, OpenAILLMConfig):
            # If it's standard OpenAI, use the ChatOpenAI class
            return ChatOpenAI(
                model_name=config.model_name,
                openai_api_key=config.openai_api_key,
                temperature=temperature,
            )
        else:
            raise ValueError("Unsupported LLMConfig type.")

    def validate_initialization(self) -> None:
        """
        Ensures we have an LLM and an output parser.
        """
        if not self.llm:
            raise ValueError("LLM just isn't initialized. Ensure `llm_config` is provided.")
        if not self.output_parser:
            raise ValueError("Output parser just isn't initialized.")

    def execute_chain(self, inputs: Any) -> Any:
        """
        Executes the LLM chain, tracks token usage, and retries on 429 errors.
        """
        if not self.chain:
            raise ValueError("No chain is initialized for execution.")

        retry_wait = 1  # Initial wait time in seconds

        for attempt in range(self.max_retries):
            try:
                with get_openai_callback() as cb:
                    result = self.chain.invoke(inputs)
                    self.logger.info("Prompt Token usage: %s", cb.prompt_tokens)
                    self.logger.info("Completion Token usage: %s", cb.completion_tokens)
                    self.prompt_tokens = cb.prompt_tokens
                    self.completion_tokens = cb.completion_tokens

                return result

            except Exception as e:
                # If the error mentions 429, do exponential backoff and retry
                if "429" in str(e):
                    self.logger.warning(
                        f"Rate limit reached. Retrying in {retry_wait} seconds... "
                        f"(Attempt {attempt + 1}/{self.max_retries})"
                    )
                    time.sleep(retry_wait)
                    retry_wait *= 2
                else:
                    self.logger.error(f"Unexpected error: {str(e)}")
                    raise e

        raise Exception("API request failed after maximum variety of retries")

    def create_chain(
        self, system_template: str, human_template: str
    ) -> RunnableSequence:
        """
        Creates a chain for unstructured outputs.
        """
        self.validate_initialization()
        self.prompt_template = ChatPromptTemplate.from_messages(
            [
                ("system", system_template),
                ("human", human_template),
            ]
        )
        self.chain = self.prompt_template | self.llm | self.output_parser
        return self.chain

    def create_structured_output_chain(
        self, system_template: str, human_template: str, output_model: Type[BaseModel]
    ) -> RunnableSequence:
        """
        Creates a chain that yields structured outputs (parsed into a Pydantic model).
        """
        self.validate_initialization()
        self.prompt_template = ChatPromptTemplate.from_messages(
            [
                ("system", system_template),
                ("human", human_template),
            ]
        )
        self.chain = self.prompt_template | self.llm.with_structured_output(output_model)
        return self.chain

    def build_return_with_tokens(self, node_specific_data: dict) -> dict:
        """
        Convenience method to add token usage info into the return values.
        """
        return {
            **node_specific_data,
            "prompt_tokens": self.prompt_tokens,
            "completion_tokens": self.completion_tokens,
        }

    def __call__(self, state: DebateState) -> None:
        """
        Updates the node's local copy of the state.
        """
        self.state = state
        for key, value in state.items():
            setattr(self, key, value)

Key features of BaseComponent:

  • It stores an LLM client (e.g. an OpenAI ChatOpenAI instance) initialized with a given model and API key, as well as an output parser.
  • It provides a method create_chain(system_template, human_template) which sets up a LangChain prompt chain (a RunnableSequence) combining a system prompt and a human prompt. This chain is what actually generates outputs when run.
  • It has an execute_chain(inputs) method that invokes the chain and includes logic to retry if the OpenAI API returns a rate-limit error (HTTP 429). This is done with exponential backoff up to a max_retries count.
  • It keeps track of token usage (prompt tokens and completion tokens) for logging or evaluation.
  • The __call__ method of BaseComponent (which each subclass calls via super().__call__(state)) performs any setup needed before the node’s main logic runs (here, it stores the incoming state on the instance).

By building on BaseComponent, each agent class can focus on its unique logic (like what prompt to use and how to handle the state), while inheriting the heavy lifting of interacting with GPT-4o reliably.

Topic Generator Agent (GenerateTopicNode)

The Topic Generator (topic_generator_node.py) is the first agent in the graph. Its job is to come up with a debatable topic for the session. We give it a prompt that instructs it to output a nuanced topic that could reasonably have a pro and a con side.

This agent inherits from BaseComponent and uses a prompt chain (system + human prompt) to generate one piece of text – the debate topic. When called, it executes the chain (with no special input, just using the prompt) and gets back a topic_text. It then updates the state with:

  • debate_topic: the generated topic (stripped of any extra whitespace),
  • positions: a dictionary assigning the pro and con stances (by default we use "In favor of the topic" and "Against the topic"),
  • stage: set to "opening",
  • speaker: set to "pro" (so the Pro side will speak first).

In code, the return might look like:

return {
    "debate_topic": debate_topic,
    "positions": positions,
    "stage": "opening",
    "speaker": first_speaker  # "pro"
}

Here are the prompts for the topic generator:

SYSTEM_PROMPT = """
You are a brainstorming AI that suggests debate topics.
You will provide a single, interesting or timely topic that can have two opposing views.
"""

HUMAN_PROMPT = """
Please suggest one debate topic for two AI agents to debate.
For example, it could be about technology, politics, philosophy, or any interesting domain.
Just provide the topic in a concise sentence.
"""

Then we pass these prompts in the constructor of the class itself.

class GenerateTopicNode(BaseComponent):
    def __init__(self, llm_config, temperature: float = 0.7):
        super().__init__(llm_config, temperature)
        # Create the prompt chain.
        self.chain: RunnableSequence = self.create_chain(
            system_template=SYSTEM_PROMPT,
            human_template=HUMAN_PROMPT
        )

    def __call__(self, state: DebateState) -> Dict[str, str]:
        """
        Generates a debate topic and assigns positions to the two debaters.
        """
        super().__call__(state)

        topic_text = self.execute_chain({})

        # Store the topic and assign stances in the DebateState
        debate_topic = topic_text.strip()
        positions = {
            "pro": "In favor of the subject",
            "con": "Against the subject"
        }

        
        first_speaker = "pro"
        self.logger.info("Welcome to our debate panel! Today's debate topic is: %s", debate_topic)
        return {
            "debate_topic": debate_topic,
            "positions": positions,
            "stage": "opening",
            "speaker": first_speaker
        }

It’s a pattern we’ll repeat for all the classes, apart from those that don’t use LLMs and from the fact checker.

Now we can implement the two stars of the show, the Pro and Con argument agents!

Debater Agents (Pro and Con)

Link: pro_debater_node.py

The two debater agents are very similar in structure, but each uses different prompt templates tailored to its role (pro vs con) and the stage of the debate.

The Pro debater, for example, has to handle an opening statement and a counter-argument (countering the Con’s rebuttal). We also need logic for retries in case a statement fails fact-check. In code, the ProDebater class sets up multiple prompt chains:

  • opening_chain and an opening_retry_chain (using slightly different human prompts – the retry prompt might instruct it to try again without repeating any factually dubious claims).
  • counter_chain and counter_retry_chain for the counter-argument stage.
class ProDebaterNode(BaseComponent):
    def __init__(self, llm_config, temperature: float = 0.7):
        super().__init__(llm_config, temperature)
        self.opening_chain = self.create_chain(SYSTEM_PROMPT, OPENING_HUMAN_PROMPT)
        self.opening_retry_chain = self.create_chain(SYSTEM_PROMPT, OPENING_RETRY_HUMAN_PROMPT)
        self.counter_chain = self.create_chain(SYSTEM_PROMPT, COUNTER_HUMAN_PROMPT)
        self.counter_retry_chain = self.create_chain(SYSTEM_PROMPT, COUNTER_RETRY_HUMAN_PROMPT)

    def __call__(self, state: DebateState) -> Dict[str, Any]:
        super().__call__(state)

        debate_topic = state.get("debate_topic")
        messages = state.get("messages", [])
        stage = state.get("stage")
        speaker = state.get("speaker")

        # Check if retrying (last message was by pro and never validated)
        last_msg = messages[-1] if messages else None
        retrying = last_msg and last_msg["speaker"] == SPEAKER_PRO and not last_msg["validated"]

        if stage == STAGE_OPENING and speaker == SPEAKER_PRO:
            chain = self.opening_retry_chain if retrying else self.opening_chain # choose which chain we're triggering: the normal one or the fact-check retry one
            result = chain.invoke({
                "debate_topic": debate_topic
            })
        elif stage == STAGE_COUNTER and speaker == SPEAKER_PRO:
            opponent_msg = self._get_last_message_by(SPEAKER_CON, messages)
            debate_history = get_debate_history(messages)
            chain = self.counter_retry_chain if retrying else self.counter_chain
            result = chain.invoke({
                "debate_topic": debate_topic,
                "opponent_statement": opponent_msg,
                "debate_history": debate_history
            })
        else:
            raise ValueError(f"Unknown turn for ProDebater: stage={stage}, speaker={speaker}")
        new_message = create_debate_message(speaker=SPEAKER_PRO, content=result, stage=stage)
        self.logger.info("Speaker: %s, Stage: %s, Retry: %s\nMessage:\n%s", speaker, stage, retrying, result)
        return {
            "messages": messages + [new_message]
        }

    def _get_last_message_by(self, speaker_prefix, messages):
        for m in reversed(messages):
            if m.get("speaker") == speaker_prefix:
                return m["content"]
        return ""

When the ProDebater’s __call__ runs, it looks at the present stage and speaker within the state to choose what to do:

  • If it’s the opening stage and the speaker is “pro”, it uses the opening_chain to generate an opening argument. If the last message from Pro was marked invalid (not validated), it knows this is a retry, so it uses the opening_retry_chain instead.
  • If it’s the counter stage and the speaker is “pro”, it generates a counter-argument to whatever the opponent (Con) just said. It fetches the last message by the Con from the messages history and feeds that into the prompt (so that the Pro can directly counter it). Again, if the last Pro message was invalid, it switches to the retry chain.

After generating its argument, the Debater agent creates a new message entry via the create_debate_message helper (with speaker="pro", the content text, validated=False initially, and the stage) and appends it to the state’s message list. That becomes the output of the node (LangGraph will merge this partial state update into the global state).
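The create_debate_message and get_debate_history helpers used above aren’t listed in this article; a minimal sketch consistent with how they are called (my own reconstruction, assuming the typing imports and DebateMessage type from debate_state, not the repo’s exact code) could look like this:

def create_debate_message(speaker: str, content: str, stage: str) -> DebateMessage:
    # Every new statement starts out unvalidated; the Fact Checker flips the flag later.
    return {
        "speaker": speaker,
        "content": content,
        "validated": False,
        "stage": stage,
    }

def get_debate_history(messages: List[DebateMessage]) -> str:
    # Flatten the message list into a readable transcript for the prompts.
    return "\n\n".join(
        f"{m['speaker'].upper()} ({m['stage']}): {m['content']}" for m in messages
    )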

The Con Debater agent mirrors this logic for its own stages:

  • It has a rebuttal and a closing argument (final argument) stage, each with a normal and a retry chain.
  • It checks whether it’s the rebuttal stage or the final argument stage (speaker “con”) and invokes the appropriate chain, possibly using the last Pro message for context when rebutting.
  • It similarly appends its message to the state.

con_debater_node.py

By using a class-based implementation, our debaters’ code is easier to maintain. We can clearly separate what the Pro does vs. what the Con does, even though they share structure. Also, by encapsulating the prompt chains inside each class, each debater can manage multiple possible outputs (regular vs. retry) cleanly.

Prompt design: The actual prompts (in prompts/pro_debater_prompts.py and con_debater_prompts.py) guide the GPT-4o model to take on a persona (“You are a debater arguing the topic…”) and produce the argument. They also instruct the model to keep statements factual and logical. If a fact check fails, the retry prompt may say something like: “Your previous statement had an unverified claim. Revise your argument to be factually correct while maintaining your position.” – encouraging the model to correct itself.

With this, our AI debaters can engage in a multi-turn duel, and even recover from factual missteps.

Fact Checker Agent (FactCheckNode)

After each debater speaks, the Fact Checker agent swoops in to verify their claims. This agent is implemented in fact_checker_node.py, and interestingly, it uses the GPT-4o model’s browsing ability rather than our own custom prompt chains. Essentially, we delegate the fact-checking to OpenAI’s GPT-4o with web search.

How does this work? The OpenAI Python client for GPT-4o (with browsing) allows us to send a user message and get a structured response. In FactCheckNode.__call__, we do something like:

completion = self.client.beta.chat.completions.parse(
    model="gpt-4o-search-preview",
    web_search_options={},
    messages=[{
        "role": "user",
        "content": (
            f"Consider the following statement from a debate. "
            f"If the statement contains numbers, or figures from studies, fact-check it online.\n\n"
            f"Statement:\n\"{claim}\"\n\n"
            f"Reply clearly whether any numbers or studies might be inaccurate or hallucinated, and why.\n"
            f"If the statement doesn't contain references to studies or numbers cited, don't go online to fact-check, "
            f"and just consider it successfully fact-checked, with a 'yes' score.\n\n"
        )
    }],
    response_format=FactCheck
)

If the result is “yes” (meaning the claim seems truthful or at least not factually incorrect), the Fact Checker marks the last message’s validated field as True in the state and outputs {"validated": True} with no further changes. This signals that the debate can proceed normally.

If the result is “no” (meaning it found the claim to be incorrect or dubious), the Fact Checker appends a new message to the state with speaker="fact_checker" describing the finding (we could simply mark it, but a brief note on what failed is useful). It also sets validated: False and increments a counter for whichever side made the claim. The output state from this node includes validated: False and an updated times_pro_fact_checked or times_con_fact_checked count.

We also use a Pydantic BaseModel to control the output of the LLM:

class FactCheck(BaseModel):
    """
    Pydantic model for fact-checking the claims made by the debaters.

    Attributes:
        binary_score (str): 'yes' if the claim is verifiable and truthful, 'no' otherwise.
    """

    binary_score: str = Field(
        description="Indicates if the claim is verifiable and truthful. 'yes' or 'no'."
    )
    justification: str = Field(
        description="Explanation of the reasoning behind the rating."
    )
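Putting the pieces together, the rest of FactCheckNode.__call__ can then act on the parsed result. The snippet below is a simplified sketch (the variable names, the reuse of create_debate_message and SPEAKER_PRO, and the exact return shape are my assumptions, not the verbatim repo code):

fact_check: FactCheck = completion.choices[0].message.parsed
last_message = messages[-1]  # the statement that was just checked

if fact_check.binary_score.lower() == "yes":
    # The claim holds up: mark the statement as validated so the router lets the debate continue.
    last_message["validated"] = True
    return {"messages": messages}

# The claim failed: keep validated=False, record the justification as a fact-checker note,
# and bump the failure counter for whichever side made the claim.
counter_key = (
    "times_pro_fact_checked" if last_message["speaker"] == SPEAKER_PRO
    else "times_con_fact_checked"
)
note = create_debate_message(
    speaker="fact_checker", content=fact_check.justification, stage=last_message["stage"]
)
return {
    "messages": messages + [note],
    counter_key: state.get(counter_key, 0) + 1,
}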

Debate Moderator Agent (DebateModeratorNode)

The Debate Moderator is the conductor of the debate. Instead of producing lengthy text, this agent’s job is to manage turn-taking and stage progression. In the workflow, after a statement is validated by the Fact Checker, control passes to the Moderator node. The Moderator then issues a Command that updates the state for the next turn and directs the flow to the appropriate next agent.

The logic in DebateModeratorNode.__call__ (see nodes/debate_moderator_node.py) goes roughly like this:

if stage == STAGE_OPENING and speaker == SPEAKER_PRO:
    return Command(
        update={"stage": STAGE_REBUTTAL, "speaker": SPEAKER_CON},
        goto=NODE_CON_DEBATER
    )
elif stage == STAGE_REBUTTAL and speaker == SPEAKER_CON:
    return Command(
        update={"stage": STAGE_COUNTER, "speaker": SPEAKER_PRO},
        goto=NODE_PRO_DEBATER
    )
elif stage == STAGE_COUNTER and speaker == SPEAKER_PRO:
    return Command(
        update={"stage": STAGE_FINAL_ARGUMENT, "speaker": SPEAKER_CON},
        goto=NODE_CON_DEBATER
    )
elif stage == STAGE_FINAL_ARGUMENT and speaker == SPEAKER_CON:
    return Command(
        update={},
        goto=NODE_JUDGE
    )

raise ValueError(f"Unexpected stage/speaker combo: stage={stage}, speaker={speaker}")

Each conditional corresponds to a point in the debate where a turn just ended, and it sets up the next turn. For example, after the opening (Pro just spoke), it sets the stage to rebuttal, switches the speaker to Con, and directs the workflow to the Con debater node. After the final_argument (Con’s closing), it directs to the Judge with no further update (the debate stage effectively ends).

Fact Check Router (FactCheckRouterNode)

This is another control node (like the Moderator) that introduces conditional logic. The Fact Check Router sits right after the Fact Checker agent in the flow. Its purpose is to branch the workflow depending on the fact-check result.

In nodes/fact_check_router_node.py, the logic is:

if pro_fact_checks >= 3 or con_fact_checks >= 3:
    disqualified = SPEAKER_PRO if pro_fact_checks >= 3 else SPEAKER_CON
    winner = SPEAKER_CON if disqualified == SPEAKER_PRO else SPEAKER_PRO

    verdict_msg = {
        "speaker": "moderator",
        "content": (
            f"Debate ended early due to excessive factual inaccuracies.\n\n"
            f"DISQUALIFIED: {disqualified.upper()} (exceeded fact check limit)\n"
            f"WINNER: {winner.upper()}"
        ),
        "validated": True,
        "stage": "verdict"
    }
    return Command(
        update={"messages": messages + [verdict_msg]},
        goto=END
    )
if last_message.get("validated"):
    return Command(goto=NODE_DEBATE_MODERATOR)
elif speaker == SPEAKER_PRO:
    return Command(goto=NODE_PRO_DEBATER)
elif speaker == SPEAKER_CON:
    return Command(goto=NODE_CON_DEBATER)
raise ValueError("Unable to determine routing in FactCheckRouterNode.")

First, the Fact Check Router checks whether either side’s fact-check count has reached 3. If so, it creates a Moderator-style message announcing an early end: the offending side is disqualified and the other side is the winner. It appends this verdict to the messages and returns a Command that jumps to END, effectively terminating the debate without going to the Judge (because we already know the outcome).

If we’re not ending the debate early, it then looks at the Fact Checker’s result for the last message (stored as validated on that message). If validated is True, we go to the Debate Moderator: Command(goto=NODE_DEBATE_MODERATOR).

If the statement fails fact-check, the workflow routes back to the debater to produce a revised statement (with the state counters updated to reflect the failure). This loop can occur multiple times if needed (up to the disqualification limit).

This dynamic control is the heart of Deb8flow’s “agentic” nature – the ability to adapt the path of execution based on the content of the agents’ outputs. It showcases LangGraph’s strength: combining control flow with state. We’re essentially encoding debate rules (like allowing retries for false claims, or ending the debate if someone cheats too often) directly into the workflow graph.

Judge Agent (JudgeNode)

Last but not least, the Judge agent delivers the final verdict based on rhetorical skill, clarity, structure, and overall persuasiveness. Its system prompt and human prompt make this explicit:

  • System Prompt: “You are an impartial debate judge AI. … Evaluate which debater presented their case more clearly, persuasively, and logically. You must focus on communication skills, structure of argument, rhetorical strength, and overall coherence.”
  • Human Prompt: “Here is the full debate transcript. Please analyze the performance of both debaters—PRO and CON. Evaluate rhetorical performance—clarity, structure, persuasion, and relevance—and decide who presented their case more effectively.”

When the Judge node runs, it receives the full debate transcript (all validated messages) alongside the original topic. It then uses GPT-4o to examine how each side framed their arguments, handled counterpoints, and supported (or failed to support) claims with examples or logic. Crucially, the Judge is not asked to decide which position is factually correct (or which side it personally agrees with); it judges only which side argued more effectively.
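In code, the Judge follows the same class pattern as the other LLM agents. A condensed sketch (the prompt variable names, the reuse of the helpers and constants from earlier, and the exact return shape are assumptions on my part, not the repo’s verbatim code) could look like this:

class JudgeNode(BaseComponent):
    def __init__(self, llm_config, temperature: float = 0.0):
        super().__init__(llm_config, temperature)
        self.chain = self.create_chain(SYSTEM_PROMPT, HUMAN_PROMPT)

    def __call__(self, state: DebateState) -> Dict[str, Any]:
        super().__call__(state)
        # Feed the topic and the full transcript to GPT-4o and get the verdict back.
        verdict = self.execute_chain({
            "debate_topic": state.get("debate_topic"),
            "debate_history": get_debate_history(state.get("messages", [])),
        })
        self.logger.info("Judge verdict:\n%s", verdict)
        return {
            "messages": state.get("messages", [])
            + [create_debate_message(speaker=SPEAKER_JUDGE, content=verdict, stage="verdict")]
        }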

Below is an example final verdict from a Deb8flow run on the topic:
“Should governments implement a universal basic income in response to increasing automation in the workforce?”

WINNER: PRO

REASON: The PRO debater presented a more compelling and rhetorically effective case for universal basic income. Their arguments were well-structured, starting with a clear statement of the issue and the necessity of UBI in response to automation. They effectively addressed potential counterarguments by highlighting the unprecedented speed and scope of current technological changes, which distinguishes the current situation from past technological shifts. The PRO also provided empirical evidence from UBI pilot programs to counter the CON's claims about work disincentives and economic inefficiencies, reinforcing their argument with real-world examples.

In contrast, the CON debater, while presenting valid concerns about UBI, relied heavily on historical analogies and assumptions about workforce adaptability without adequately addressing the unique challenges posed by modern automation. Their arguments about the fiscal burden and potential inefficiencies of UBI were less supported by specific evidence compared to the PRO's rebuttals.

Overall, the PRO's arguments were more coherent, persuasive, and backed by empirical evidence, making their case more convincing to a neutral observer.

LangSmith Tracing

Throughout Deb8flow’s development, I relied on LangSmith (LangChain’s tracing and observability toolkit) to make sure the entire debate pipeline was behaving correctly. Because we have multiple agents passing control between themselves, it’s easy for unexpected loops or misrouted states to occur. LangSmith provides a convenient way to:

  • Visualize Execution Flow: You can see each agent’s prompt, the tokens consumed (so you can also track costs), and any intermediate states. This makes it much easier to verify that, say, the Con Debater is correctly referencing the Pro Debater’s last message, or that the Fact Checker is receiving the right claim to verify.
  • Debug State Updates: If the Moderator or Fact Check Router is sending the flow to the wrong node, the trace will highlight that mismatch. You can trace which agent was invoked at each step and why, helping you spot stage or speaker misalignments early.
  • Track Prompt and Completion Tokens: With multiple GPT-4o calls, it’s useful to see how many tokens each stage is using, which LangSmith logs automatically if you enable tracing.

Integrating LangSmith is surprisingly easy. You just need to provide these three keys in your .env file: LANGCHAIN_API_KEY, LANGCHAIN_TRACING_V2, and LANGCHAIN_PROJECT.
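For example, the .env entries might look like this (the key value is a placeholder and the project name is arbitrary):

LANGCHAIN_API_KEY="lsv2_..."
LANGCHAIN_TRACING_V2="true"
LANGCHAIN_PROJECT="deb8flow"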

Then you can open the LangSmith UI to see a structured trace of each run. This greatly reduces the guesswork involved in debugging multi-agent systems and is, in my experience, essential for more complex AI orchestration like ours. Example of a single run:

The trace in waterfall mode in LangSmith of one run, showing how the whole flow ran. Source: Generated by the author using LangSmith.

Reflections and Next Steps

Building Deb8flow was an eye-opening exercise in orchestrating autonomous agent workflows. We didn’t just chain a single model call – we created an entire debate simulation with AI agents, each with a specific role, and allowed them to interact according to a set of rules. LangGraph provided a clear framework to define how data and control flow between agents, making the complex sequence manageable in code. By using class-based agents and a shared state, we maintained modularity and clarity, which will pay off in the long run for any software engineering project.

An exciting aspect of this project was seeing emergent behavior. Although each agent follows a script (a prompt), the unscripted combination – a debater attempting to deceive, a fact-checker catching it, the debater rephrasing – felt surprisingly realistic! It’s a small step toward more agentic AI systems that can perform non-trivial multi-step tasks with oversight over one another.

There are plenty of ideas for improvement:

  • User Interaction: Currently it’s fully autonomous, but one could add a mode where a human provides the topic or even takes the role of one side against an AI opponent.
  • We could switch the order in which the debaters speak.
  • We could change the prompts, and thus to a large degree the behavior of the agents, and experiment with different prompts.
  • We could make the debaters also perform web search before producing their statements, providing them with the latest information.

The broader implication of Deb8flow is how it showcases a pattern for composable AI agents. By defining clear boundaries and interactions (similar to microservices in software), we can build complex AI-driven processes that remain interpretable and controllable. Each agent is like a cog in a machine, and LangGraph is the gear system making them work in unison.

I found this project energizing, and I hope it inspires you to explore multi-agent workflows. Whether it’s debating, collaborating on writing, or solving problems from different expert angles, the combination of GPT, tools, and structured agentic workflows opens up a new world of possibilities for AI development. Happy hacking!

References

[1] D. Bouchard, “From Basics to Advanced: Exploring LangGraph,” Medium, Nov. 22, 2023. [Online]. Available: https://medium.com/data-science/from-basics-to-advanced-exploring-langgraph-e8c1cf4db787. [Accessed: Apr. 1, 2025].

[2] A. W. T. Ng, “Building a Research Agent that Can Write to Google Docs: Part 1,” Towards Data Science, Jan. 11, 2024. [Online]. Available: https://towardsdatascience.com/building-a-research-agent-that-can-write-to-google-docs-part-1-4b49ea05a292/. [Accessed: Apr. 1, 2025].
