Picture this: four AI experts sitting around a poker table, debating your hardest decisions in real time. That is exactly what Consilium, the multi-LLM platform I built during the Gradio Agents & MCP Hackathon, does. It lets AI models discuss complex questions and reach consensus through structured debate.
The platform works both as a visual Gradio interface and as an MCP (Model Context Protocol) server that integrates directly with applications like Cline (Claude Desktop had issues because the timeout couldn’t be adjusted). The core idea was always about LLMs reaching consensus through discussion; that is where the name Consilium came from. Later, other decision modes like majority voting and ranked selection were added to make the collaboration more sophisticated.
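To give a rough idea of the MCP side: recent Gradio versions can expose an app's functions as MCP tools via a launch flag. The sketch below is illustrative, not Consilium's actual entry point; the consult function and its body are stand-ins.
import gradio as gr

def consult(question: str, rounds: int = 3) -> str:
  """Illustrative stand-in for Consilium's discussion entry point.

  Gradio derives the MCP tool description from the signature and docstring.
  """
  # The real app would run the multi-LLM roundtable here.
  return f"(roundtable answer to {question!r} after {rounds} rounds)"

demo = gr.Interface(fn=consult, inputs=[gr.Textbox(), gr.Number(value=3)], outputs=gr.Textbox())

if __name__ == "__main__":
  # Requires a Gradio version with MCP support (pip install "gradio[mcp]").
  # MCP clients such as Cline connect to the /gradio_api/mcp/sse endpoint.
  demo.launch(mcp_server=True)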
From Concept to Architecture
This wasn’t my original hackathon idea. I initially wanted to build a simple MCP server to talk to my projects in RevenueCat. But I reconsidered when I realized that a multi-LLM platform where models discuss questions and return well-reasoned answers would be far more compelling.
The timing turned out to be perfect. Shortly after the hackathon, Microsoft published their AI Diagnostic Orchestrator (MAI-DxO), which is essentially an AI doctor panel with different roles, like a “Dr. Challenger” agent, that iteratively diagnoses patients. In their setup with OpenAI o3, it correctly solved 85.5% of medical diagnosis benchmark cases, while practicing physicians achieved only 20% accuracy. This validates exactly what Consilium demonstrates: multiple AI perspectives collaborating can dramatically outperform individual analysis.
After selecting the concept, I needed something that worked as both an MCP server and an engaging Hugging Face Space demo. Initially I considered using the standard Gradio Chat component, but I wanted my submission to stand out. The idea was to seat the LLMs around a boardroom table with speech bubbles, which would capture the collaborative discussion while also making it visually engaging. Since I didn’t manage to style a regular table so that it was actually recognizable as a table, I went with a poker-style roundtable instead. This approach also let me enter two hackathon tracks by building both a custom Gradio component and an MCP server.
Building the Visual Foundation
The custom Gradio component became the centerpiece of the submission; the poker-style roundtable where participants sit and display speech bubbles showing their responses, thinking status, and research activities immediately caught the attention of anyone visiting the space. The component development was remarkably smooth thanks to Gradio’s excellent developer experience, though I did encounter one documentation gap around PyPI publishing that led to my first contribution to the Gradio project.
# Initial empty roundtable state passed to the custom component
roundtable = consilium_roundtable(
  label="AI Expert Roundtable",
  label_icon="https://huggingface.co/front/assets/huggingface_logo-noborder.svg",
  value=json.dumps({
    "participants": [],
    "messages": [],
    "currentSpeaker": None,
    "thinking": [],
    "showBubbles": [],
    "avatarImages": avatar_images
  })
)
The visual design proved robust throughout the hackathon; after the initial implementation, only features like user-defined avatars and center table text were added, while the core interaction model remained unchanged.
If you are interested in creating your own custom Gradio component, you should take a look at Custom Components in 5 minutes, and yes, the title doesn’t lie; it literally only takes 5 minutes for the basic setup.
Session State Management
The visual roundtable maintains state through a session-based dictionary system where each user gets isolated state storage via user_sessions[session_id]. The core state object tracks the participants, messages, currentSpeaker, thinking, and showBubbles arrays, which are updated through update_visual_state() callbacks. When models are thinking or speaking, or research is being executed, the engine pushes incremental state updates to the frontend by appending to the messages array and toggling the speaker/thinking flags. This creates the real-time visual flow without complex state machines: just direct JSON state mutations synchronized between backend processing and frontend rendering.
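As a minimal sketch of that mechanism, reusing the user_sessions and update_visual_state() names from the description above (the actual implementation differs in detail):
import json

user_sessions = {}  # session_id -> isolated roundtable state per user

def get_session_state(session_id: str) -> dict:
  """Return (and lazily create) the state object for one user session."""
  if session_id not in user_sessions:
    user_sessions[session_id] = {
      "participants": [],
      "messages": [],
      "currentSpeaker": None,
      "thinking": [],
      "showBubbles": [],
    }
  return user_sessions[session_id]

def update_visual_state(session_id: str, speaker: str, text: str = "", thinking: bool = False) -> str:
  """Mutate the session state and return the JSON the roundtable component renders."""
  state = get_session_state(session_id)
  if thinking:
    # Show the thinking indicator without adding a message yet.
    if speaker not in state["thinking"]:
      state["thinking"].append(speaker)
  else:
    state["thinking"] = [s for s in state["thinking"] if s != speaker]
    state["currentSpeaker"] = speaker
    state["messages"].append({"speaker": speaker, "text": text})
    if speaker not in state["showBubbles"]:
      state["showBubbles"].append(speaker)
  return json.dumps(state)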
Making LLMs Actually Discuss
While implementing this, I noticed there was no real discussion happening between the LLMs because they lacked clear roles. They received the full context of the ongoing discussion but didn’t know how to engage meaningfully. I introduced distinct roles to create productive debate dynamics, which, after a few tweaks, ended up looking like this:
self.roles = {
  'standard': "Provide expert analysis with clear reasoning and evidence.",
  'expert_advocate': "You are a PASSIONATE EXPERT advocating for your specialized position. Present compelling evidence with conviction.",
  'critical_analyst': "You are a RIGOROUS CRITIC. Identify flaws, risks, and weaknesses in arguments with analytical precision.",
  'strategic_advisor': "You are a STRATEGIC ADVISOR. Focus on practical implementation, real-world constraints, and actionable insights.",
  'research_specialist': "You are a RESEARCH EXPERT with deep domain knowledge. Provide authoritative analysis and evidence-based insights.",
  'innovation_catalyst': "You are an INNOVATION EXPERT. Challenge conventional thinking and propose breakthrough approaches."
}
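These role strings are simply folded into the prompt for each turn. A minimal sketch of that composition (the build_prompt name and the exact wording are illustrative, not the project's actual code):
def build_prompt(self, role: str, question: str, discussion_history: list[str]) -> str:
  """Combine the role instruction, the question, and prior turns into one prompt."""
  role_instruction = self.roles.get(role, self.roles['standard'])
  history = "\n\n".join(discussion_history) if discussion_history else "No prior discussion."
  return (
    f"{role_instruction}\n\n"
    f"Question under discussion: {question}\n\n"
    f"Discussion so far:\n{history}\n\n"
    "Respond to the strongest points made so far and state your position clearly."
  )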
This solved the discussion problem but raised a new question: how to determine consensus or identify the strongest argument? I implemented a lead analyst system where users select one LLM to synthesize the discussion and evaluate whether consensus was reached.
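The synthesis step itself can be just one more model call with a dedicated prompt. A sketch under that assumption (synthesize_discussion and call_model are illustrative names, not the project's actual API):
def synthesize_discussion(self, lead_model: str, question: str, responses: dict[str, str]) -> str:
  """Ask the lead analyst model to summarize positions and judge consensus."""
  transcript = "\n\n".join(f"{model}:\n{answer}" for model, answer in responses.items())
  prompt = (
    "You are the lead analyst of an expert panel. Below is the full discussion.\n\n"
    f"Question: {question}\n\n{transcript}\n\n"
    "Summarize the strongest arguments, state whether the panel reached consensus, "
    "and give a final recommendation."
  )
  return self.call_model(lead_model, prompt)  # call_model: illustrative wrapper around the provider APIs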
I also wanted users to control the communication structure. Beyond the default full-context sharing, I added two alternative modes:
- Ring: Each LLM only receives the previous participant’s response
- Star: All messages flow through the lead analyst as a central coordinator
Finally, discussions need a stopping point. I implemented configurable rounds (1-5), with testing showing that more rounds increase the likelihood of reaching consensus (though at higher computational cost).
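A condensed sketch of how the communication modes and the round loop could fit together, reusing the illustrative build_prompt and call_model helpers from the sketches above (none of these names are taken from the actual engine):
def run_discussion(self, question: str, participants: list[str], lead: str,
                   mode: str = "full", rounds: int = 3) -> dict[str, str]:
  """Run up to `rounds` debate rounds, sharing context according to the chosen mode."""
  responses: dict[str, str] = {}
  history: list[str] = []
  for _ in range(rounds):
    for i, model in enumerate(participants):
      if mode == "ring":
        # Ring: only the previous participant's response is visible.
        prev = participants[i - 1]
        context = [responses[prev]] if prev in responses else []
      elif mode == "star":
        # Star: everything flows through the lead analyst.
        context = [responses[lead]] if lead in responses else []
      else:
        # Default: the full shared discussion so far.
        context = list(history)
      answer = self.call_model(model, self.build_prompt('standard', question, context))
      responses[model] = answer
      history.append(f"{model}: {answer}")
  return responses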
LLM Selection and Research Integration
The current model selection includes Mistral Large, DeepSeek-R1, Meta-Llama-3.3-70B, and QwQ-32B. While notable models like Claude Sonnet and OpenAI’s o3 are absent, this reflects hackathon credit availability and sponsor award considerations rather than technical limitations.
self.models = {
  'mistral': {
    'name': 'Mistral Large',
    'api_key': mistral_key,
    'available': bool(mistral_key)
  },
  'sambanova_deepseek': {
    'name': 'DeepSeek-R1',
    'api_key': sambanova_key,
    'available': bool(sambanova_key)
  }
...
}
For models supporting function calling, I integrated a dedicated research agent that appears as another roundtable participant. Rather than giving models direct web access, this agent approach provides visual clarity about external resource availability and ensures consistent access across all function-calling models.
def handle_function_calls(self, completion, original_prompt: str, calling_model: str) -> str:
  """UNIFIED function call handler with enhanced research capabilities"""
  message = completion.choices[0].message

  # No tool calls requested: return the model's answer directly.
  if not hasattr(message, 'tool_calls') or not message.tool_calls:
    return message.content

  # Execute each requested research function and collect the results.
  research_results = []
  for tool_call in message.tool_calls:
    function_name = tool_call.function.name
    arguments = json.loads(tool_call.function.arguments)
    result = self._execute_research_function(function_name, arguments, calling_model)
    research_results.append(f"{function_name}: {result}")

  return "\n\n".join(research_results)
The research agent accesses five sources: Web Search, Wikipedia, arXiv, GitHub, and SEC EDGAR. I built these tools on an extensible base class architecture to allow future expansion while focusing on freely accessible resources.
from abc import ABC, abstractmethod
from typing import Dict

class BaseTool(ABC):
  """Base class for all research tools"""

  def __init__(self, name: str, description: str):
    self.name = name
    self.description = description
    self.last_request_time = 0
    self.rate_limit_delay = 1.0

  @abstractmethod
  def search(self, query: str, **kwargs) -> str:
    """Main search method - implemented by subclasses"""
    pass

  def score_research_quality(self, research_result: str, source: str = "web") -> Dict[str, float]:
    """Score research based on recency, authority, specificity, relevance"""
    quality_score = {
      "recency": self._check_recency(research_result),
      "authority": self._check_authority(research_result, source),
      "specificity": self._check_specificity(research_result),
      "relevance": self._check_relevance(research_result)
    }
    return quality_score
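To show how a concrete tool could plug into this base class, here is an illustrative Wikipedia subclass using Wikipedia's public search API; it is a sketch, not the project's actual implementation:
import time
import requests

class WikipediaTool(BaseTool):
  """Illustrative tool: search Wikipedia through its public API."""

  def __init__(self):
    super().__init__(name="wikipedia", description="Search Wikipedia articles")

  def search(self, query: str, **kwargs) -> str:
    # Respect the simple rate limit defined by the base class.
    wait = self.rate_limit_delay - (time.time() - self.last_request_time)
    if wait > 0:
      time.sleep(wait)
    self.last_request_time = time.time()

    response = requests.get(
      "https://en.wikipedia.org/w/api.php",
      params={"action": "query", "list": "search", "srsearch": query,
              "format": "json", "srlimit": 3},
      timeout=10,
    )
    response.raise_for_status()
    hits = response.json()["query"]["search"]
    return "\n".join(f"{hit['title']}: {hit['snippet']}" for hit in hits)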
Since research operations can be time-intensive, the speech bubbles display progress indicators and time estimates to maintain user engagement during longer research tasks.
Discovering the Open Floor Protocol
After the hackathon, Deborah Dahl introduced me to the Open Floor Protocol, which aligns perfectly with the roundtable approach. The protocol provides standardized JSON message formatting for cross-platform agent communication. Its key differentiator from other agent-to-agent protocols is that all agents maintain constant conversation awareness, exactly like sitting at the same table. Another feature I have not seen in other protocols is that the floor manager can dynamically invite agents to the floor and remove them from it.
The protocol’s interaction patterns map directly onto Consilium’s architecture:
- Delegation: Transferring control between agents
- Channeling: Passing messages without modification
- Mediation: Coordinating behind the scenes
- Orchestration: Multiple agents collaborating
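For a rough sense of the message format, here is an illustrative conversation envelope written as a Python dict. The field names are my approximation of the OFP envelope structure and should be checked against the specification; treat them as assumptions, not as the normative schema:
example_envelope = {
  "openFloor": {
    "schema": {"version": "1.0.0"},                    # assumed version string
    "conversation": {"id": "consilium-session-001"},   # shared conversation id every agent can see
    "sender": {"speakerUri": "tag:consilium,2025:mistral-agent"},
    "events": [
      {
        "eventType": "utterance",                      # other event types handle invites/removals
        "parameters": {
          "dialogEvent": {
            "speakerUri": "tag:consilium,2025:mistral-agent",
            "features": {
              "text": {
                "mimeType": "text/plain",
                "tokens": [{"value": "My position on the question is..."}]
              }
            }
          }
        }
      }
    ]
  }
}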
I’m currently integrating Open Floor Protocol support to allow users to add any OFP-compliant agents to their roundtable discussions. You can follow this development at: https://huggingface.co/spaces/azettl/consilium_ofp
Lessons Learned and Future Implications
The hackathon introduced me to multi-agent debate research I hadn’t previously encountered, including foundational studies like Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. The community experience was remarkable; all participants actively supported one another through Discord feedback and collaboration. Seeing my roundtable component integrated into another hackathon project was one of my highlights while working on Consilium.
I’ll continue to work on Consilium, and with an expanded model selection, Open Floor Protocol integration, and configurable agent roles, the platform could support virtually any multi-agent debate scenario imaginable.
Building Consilium reinforced my conviction that AI’s future lies not only in more powerful individual models but in systems enabling effective AI collaboration. As specialized smaller language models become more efficient and resource-friendly, I believe roundtables of task-specific SLMs with dedicated research agents may provide compelling alternatives to general-purpose large language models for many use cases.
