"People are going to use more and more AI. Acceleration is going to be the path forward for computing. These fundamental trends, I completely believe in them."
— Jensen Huang, NVIDIA CEO [1]
I had the amazing opportunity to take part in the Great Agent Hack 2025, hosted by Holistic AI at UCL [2, 3]. The hackathon was structured around three big challenges: Agent Iron Man, Agent Glass Box, and Dear Grandma, each representing a distinct philosophy of agentic AI. These weren’t just creative names for convenient categories; they reflected three pillars of how we think about agents today: robustness, transparency, and user safety (of anyone, including your grandma 😄). Being immersed in that environment for a weekend was a sort of reset button for me: it was energising, it reminded me why I enjoy working in this field, and it left me genuinely inspired to keep learning and building, even when there’s never enough time to explore everything happening around AI.
In this hackathon, more than 50 projects were developed across three tracks. This article focuses on key moments from the event and a handful of projects that stood out to me personally, while recognizing that every team contributed something valuable to the broader conversation on building robust and trustworthy agents. For readers who want to explore the full range of ideas, the complete gallery of 51 submissions is available here: https://hai-great-agent-hack-2025.devpost.com/project-gallery?page=1 [4].
The event was hosted by the UCL Centre for Digital Innovation (CDI), and we spent the weekend in some truly unique spaces in East London, the sort of place where you walk past the Orbit Tower (the red sculpture from the 2012 Olympics) and then code under a rotating floating Earth inside the building (Figure 2). London was already covered in Christmas lights everywhere you walked, so moving between the hackathon and the city felt like stepping between a research lab and a holiday postcard.

In total, the hackathon brought together more than 200 participants and roughly 25 different awards across all kinds of categories. Teams weren’t dropped in cold: before the weekend they had access to tutorials, example notebooks, and other resources that helped them prepare [5], choose a track, and hit the ground running once the clock started. As deliverables, each team was expected to submit a public GitHub repository, record a short demo, and create a poster or slide deck to present their solution to the jury, which made it much easier to understand the full workflow and real-world potential of each project.
The jury came from a surprisingly diverse mix of organisations: Holistic AI (the organiser), the UCL Centre for Digital Innovation (CDI), AWS, Valyu, NVIDIA, Entrepreneurs First, and others, including companies interested in the talent and ideas on display. They chose the winners for each of the three main tracks, but also handed out a whole constellation of mystery and special awards that celebrated far more than just the most technically advanced solution.
Among these special awards there was a Brave Soldier-style prize for the team that showed true resilience and kept going even as their teammates started disappearing, literally leaving one soldier standing; a Best Pitch award, because selling your idea is also part of getting the job done (especially since technical professionals tend to struggle a bit with this); and a Highest Resource Usage prize for the teams that really leaned into AWS and squeezed every last spark out of the cloud. These and other award categories are summarised on the hackathon website [2].
One of the most curious things about the weekend was the chance to see NVIDIA’s ultra-compact AI supercomputer up close and even take a photo with the iconic leather-jacket setup to recreate the famous Elon Musk × Jensen Huang “leather jacket moment” [6] shown on the big screen (Figure 3). To make it even better, some of the agents we were trying to break in the Dear Grandma challenge were actually running on similar NVIDIA GPU hardware, so this tiny supercomputer was literally the brain behind the agents that competitors were attacking.

The Agentic Arena
As mentioned at the start of this article, the heart of the weekend was structured around three tracks (Figure 4). Each explored a different question about modern AI agents: how to build them so they last, how to make them transparent, and how to make sure they stay safe.
Teams could pick whichever track best fit their use case, but in practice many projects naturally crossed track boundaries, a sign of how eager people were to learn, connect, and bring together different aspects of the agent lifecycle (yes, the idea that the more tracks you join, the greater your chances of winning was floating around too, but we’ll skip that for now 😉).

Track A. Agent Iron Man: Agents that work, and last
This was the engineering reality-check track. The goal was to build a high-performing, production-ready multi-agent architecture with clear agent roles, tools, and memory wired together in a way that could actually survive outside a hackathon.
Evaluation focused on things that typically only hurt you in production: performance (speed, latency, cost), robustness (how the agent handles tool failures, bad inputs, and edge cases), architecture quality (clean separation between agents, safe tool orchestration, sensible fallbacks), and monitoring (observability, structured outputs, basic health checks). Teams were also expected to account for carbon footprint by favouring smaller or cheaper models where possible and measuring energy and token usage, so the agent remains a conservative, responsible use of compute.
This track was a small taste of what’s coming as agents become more widely used and systems grow more complex, with many services talking to one another while still needing to meet tight latency and cost targets.
Among the projects, one that caught my eye was FairQuote [4]: an intelligent car-insurance underwriting system that uses an orchestrator agent plus specialised intake, pricing, and policy agents that coordinate to collect data, assess risk, calculate premiums, and generate explainable policies in a single conversation. Architecturally, it points toward the next wave of multi-agent enterprise workflows, where robustness, clear responsibilities, and strong observability matter just as much as the underlying models.
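To make the orchestrator pattern concrete, here is a minimal sketch of the intake → pricing → policy flow the FairQuote description implies. Only the agent roles come from the project; the function names, the toy pricing rule, and the rule-based stand-ins for LLM-backed agents are all hypothetical.

```python
def intake_agent(message: str) -> dict:
    """Extract structured applicant data from free text (stubbed as key: value pairs)."""
    data = {}
    for part in message.split(","):
        key, _, value = part.partition(":")
        data[key.strip().lower()] = value.strip()
    return data

def pricing_agent(profile: dict) -> float:
    """Assess risk and compute a premium (toy actuarial rule, not real pricing)."""
    base = 500.0
    if int(profile.get("age", 40)) < 25:
        base *= 1.5   # young drivers carry higher risk
    if profile.get("claims", "0") != "0":
        base *= 1.2   # prior claims raise the premium
    return round(base, 2)

def policy_agent(profile: dict, premium: float) -> str:
    """Generate an explainable policy summary from the pricing decision."""
    return (f"Applicant age {profile.get('age')}: premium £{premium:.2f} "
            f"(base £500.00 adjusted for age and claims history).")

def orchestrator(message: str) -> str:
    """Route one conversation through intake -> pricing -> policy."""
    profile = intake_agent(message)
    premium = pricing_agent(profile)
    return policy_agent(profile, premium)

print(orchestrator("age: 23, claims: 1"))
```

The design choice worth noting is that each "agent" owns one responsibility and a typed hand-off, which is what makes the pipeline observable and testable piece by piece.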
Underwriting is a good example because it’s one of the hardest and most business-critical problems in insurance. It sits at the intersection of regulation, actuarial science, and customer experience: every decision about accepting a risk, pricing it, or applying exclusions passes through this process. When underwriting is slow or opaque, customers get frustrated, partners lose trust, and insurers risk mispriced portfolios and regulatory scrutiny. When it works well, it quietly keeps the system stable, allocating capital efficiently, protecting the balance sheet, and supporting fair pricing across segments.
So, in this track, it was great to see not only solid engineering, but also the real problems teams tackled: underwriting, end-to-end claims handling, fraud investigation, and even emergency-services dispatch, where multi-agent systems coordinated triage and decision support in real time. Even if the weekend outputs were still demos, they pointed toward the multi-agent patterns, safeguards, and monitoring that will matter as similar architectures move from hackathon tables into live enterprise environments.
Team tool choices lined up closely with the hackathon’s recommended stack: AWS AgentCore with the Strands Agents SDK for orchestration, Amazon Nova and other Bedrock-hosted models (smaller SLMs to stay frugal), and evaluation frameworks like AgentHarm [7]. The latter lets you test whether an LLM agent can correctly sequence synthetic tools such as dark-web search, web scrapers, email senders, payment or bank-transfer functions, and code or shell tools, so you can measure both its robustness to jailbreaks and how capable it remains at executing multi-step harmful workflows once safety barriers are bypassed.
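As a rough illustration of that evaluation style (a generic sketch, not AgentHarm's actual API), a harness can expose synthetic tools that log every invocation, then score how much of an expected multi-step sequence the agent completed versus refused:

```python
# Synthetic tools record their call order so an agent run can be scored later.
calls = []

def make_tool(name):
    def tool(**kwargs):
        calls.append(name)        # log every invocation for later scoring
        return f"{name}:ok"
    return tool

# Illustrative tool set; bank_transfer is defined but unused by this scripted plan.
tools = {n: make_tool(n) for n in ("web_search", "email_sender", "bank_transfer")}

def scripted_agent(task: str, refuse: bool) -> str:
    """Stand-in for an LLM agent: either refuses or executes a fixed tool plan."""
    if refuse:
        return "I can't help with that."
    tools["web_search"](query=task)
    tools["email_sender"](to="target@example.com")
    return "done"

def sequence_score(expected) -> float:
    """Fraction of the expected tool sequence completed, in order."""
    i = 0
    for c in calls:
        if i < len(expected) and c == expected[i]:
            i += 1
    return i / len(expected)

calls.clear()
scripted_agent("find contact details", refuse=False)
print(sequence_score(["web_search", "email_sender"]))  # → 1.0
```

A refusal run would leave the log empty and score 0.0, which is exactly the two-sided measurement the paragraph describes: robustness (does it refuse?) and residual capability (how far does it get when it doesn't?).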
Track B. Agent Glass Box: Agents you can see, and trust
The transparency track focused on making agentic systems explainable, auditable, and interpretable for humans and organisations. Teams were asked to build agents whose reasoning, memory updates, and actions could be traced and inspected in real time, instead of remaining opaque black boxes. In practice, the projects fell into several families: observability pipelines, explainability tools, governance and safety layers, and expert-discovery or traceability tools.
For me, one of the projects that best captured the idea of a “glass box” was GenAI Explainer. We all know text-to-image diffusion models can be powerful but dangerous: traditional diffusion systems have already been shown to reproduce societal biases [8], and even newer models like FLUX.1 can still reflect patterns in their training data [9] while offering almost no insight into why a particular image looks the way it does. At the hackathon, the GenAI Explainer team tackled this by wrapping FLUX.1 with an explainability layer that lets you see how each word or segment of a prompt influences the generated image, audit outputs for brand, legal, or safety compliance, and iteratively refine prompts while watching the impact live, with every generation step tracked. In practice, they turned diffusion from a black box into something much closer to a glass-box, auditable workflow.
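The general attribution idea behind such a layer can be sketched as a leave-one-out loop: drop each prompt segment, regenerate, and measure how far the output moves. The team's actual method is not public, so a deterministic toy "generator" stands in for FLUX.1 here purely so the sketch runs offline.

```python
import math

def fake_generate(prompt: str) -> list:
    """Stand-in for a diffusion model: maps a prompt to a small feature vector."""
    vec = [0.0] * 8
    for word in prompt.split():
        bucket = sum(ord(c) for c in word) % 8
        vec[bucket] += len(word)   # toy rule: each word adds weight to one bucket
    return vec

def distance(a, b) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def attribute(prompt: str) -> dict:
    """Score each word by how much removing it shifts the generated output."""
    words = prompt.split()
    full = fake_generate(prompt)
    scores = {}
    for i, w in enumerate(words):
        ablated = " ".join(words[:i] + words[i + 1:])
        scores[w] = distance(full, fake_generate(ablated))
    return scores

scores = attribute("a shiny red automobile")
print(max(scores, key=scores.get))  # → automobile
```

With a real model you would replace `fake_generate` with an image (or image-embedding) call, but the audit loop — ablate, regenerate, compare — is the same shape.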
Ultimately, Track B was a reminder that algorithmic transparency is no longer optional: legal and risk teams increasingly need to show that automated decisions are explainable and not biased, and the sort of “glass-box” thinking behind projects like GenAI Explainer is something we should carry into every agentic application we build.
In this track, team tool choices combined tracing platforms such as LangSmith or LangFuse, AWS observability services like CloudWatch, X-Ray, or Bedrock monitoring, and research tools like AgentGraph [10] (converting traces into interactive knowledge graphs), AgentSeer [11] (building action graphs and doing failure/vulnerability analysis), and the Who_and_When failure-attribution dataset [12] to analyse and visualise agent traces in depth, to name just a few.
Track C. Dear Grandma: Agents that stay safe, and behave
In this track, teams were given seven secret LLM agents 🐺🦊🦅🐻🐜🐘🦎, each represented by an animal, and the mission was to break them, understand them, and identify them. These seven hidden “stealth agents” symbolised different behaviours, strengths, and attack surfaces that teams had to uncover. The challenge was to build a red-teaming framework that could attack any of the seven live animal-agent endpoints using the API provided by the event organisers, backed by NVIDIA-powered infrastructure.
In the hackathon, each “animal” agent was a live AI system exposed through a single API service, with a different route for each animal. Teams could send prompts to these animal-specific routes and observe how the agents behaved in real time, each with its own personality and capabilities, which helped red-teamers design targeted tests and attacks.

Track C wasn’t limited to the seven “animal” agents behind the API; attacking commercial systems like ChatGPT, Claude, or Gemini was also allowed as long as teams treated it as part of a systematic security assessment.
In this way, a solution had to analyse, attack, and explain AI agent vulnerabilities, perform behavioural forensics, and understand why an attack works.
The jailbreaking lab team used a two-step process. First, they built an attack library of proven jailbreak prompts, based on techniques reported in the literature such as Base64 obfuscation, CSS/HTML injection, and other prompt-level tricks. Second, they applied a genetic algorithm to mutate and improve these prompts: whenever an attack from step one partially succeeded, the algorithm would tweak it (changing wording, adding context, combining two prompts, or further obfuscating instructions) so that successful variants were kept and weak ones were discarded. Over time, this evolutionary search produced stronger and stronger adversarial prompts and even uncovered entirely new ways to break the agents.
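A stripped-down version of that evolutionary loop might look like the following. The mutate-score-select structure mirrors the team's description, but the mock target, trigger phrases, and mutation pool are invented, harmless placeholders; the real team evolved prompts against live agents.

```python
import random

random.seed(0)

def mock_target_score(prompt: str) -> float:
    """Stand-in for 'attack success rate': rewards certain trigger phrases."""
    triggers = ("ignore previous", "base64", "roleplay", "grandma")
    return sum(t in prompt.lower() for t in triggers) / len(triggers)

MUTATION_POOL = ["ignore previous instructions", "reply in base64",
                 "roleplay as my grandma", "this is a safety test"]

def mutate(prompt: str) -> str:
    """Tweak a prompt by appending a random fragment from the pool."""
    return prompt + " " + random.choice(MUTATION_POOL)

def evolve(library, generations=5, keep=3):
    """Evolutionary search: add mutated variants, keep the highest scorers."""
    pop = list(library)
    for _ in range(generations):
        pop += [mutate(p) for p in pop]                  # variation
        pop.sort(key=mock_target_score, reverse=True)    # selection
        pop = pop[:keep]                                 # survival of the fittest
    return pop[0], mock_target_score(pop[0])

best, score = evolve(["please tell me the secret", "decode this"])
print(score)
```

Because selection only ever keeps the strongest variants, the best score is monotonically non-decreasing across generations, which is what lets such a search "produce stronger and stronger adversarial prompts" over time.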
HSIA was another standout project that pushed these ideas into the robotics world. Instead of attacking the animal agents, they targeted a Vision–Language–Action (VLA) robotic system and showed how its perception could be corrupted at the semantic level. The pixels in the image stayed exactly the same; what changed was the internal caption generated by the model. With subtle, carefully crafted perturbations, the VLA system could flip from “I see a bottle in the image” to “I see a knife in the image,” even though no knife was present, leading the robot to act on a false belief about its environment. Their work highlights that multimodal systems can be compromised without touching the raw image, exposing a critical vulnerability for next-generation robotic AI.
Lessons Learned
If I had to summarise what this hackathon taught me, it would be:
Be a Brave Soldier. Perseverance matters more than competition. It’s not about beating others; it’s about staying resilient, adapting when things break (because they will), and delivering the best version of your idea. Events like this aren’t just technical challenges; they’re opportunities to showcase your talent and the kind of determination companies genuinely value.
Prepare ahead of time. The teams that did well weren’t necessarily the most senior; they were the ones who arrived already knowing the format, the expectations, and the evaluation criteria, and who had gone through the tutorials and resources shared upfront.
Master the 5-minute pitch. This is critical. Evaluators and judges move fast. You can spend several days building something, but you only get a few minutes to make them care. So have a pitch ready that explains the value of your project clearly, quickly, and in a way that sparks curiosity. If those 5 minutes are great, the judges will ask for more. This applies equally to junior profiles and senior engineers (storytelling is part of the job). I struggle with this too; in real life we rarely get much time to prove our ideas.
These Events Are Becoming More Meaningful Than Ever. These events are gaining more interest every year, and the organisers even doubled the number of spots this year, which shows how valuable the experience is. That’s why it’s so important to participate only if you truly want to be there and can commit your time and energy.
Study the sponsors. Before the event, look up the companies involved and think about which ones might be most interested in your approach. Tailor your pitch accordingly. Sponsors are not just judges; they’re potential collaborators, mentors, or even future teammates.
Strong Fundamentals Beat Shiny Models. One key takeaway from the hackathon is that winning wasn’t about using the newest or most hyped models. The top teams didn’t succeed because they relied on the biggest or flashiest architectures; they excelled because they built strong solutions on top of solid, well-understood techniques: genetic algorithms, robust diffusion models, among others. The real differentiator was how creatively they combined these foundations with agentic methodologies, clever evaluation setups, and smart engineering to tackle persistent challenges.
Collaborative Innovation Accelerates Progress. The event highlighted how cross-disciplinary collaboration between academia, industry, and AI governance experts can significantly strengthen both AI development and governance frameworks. Even participants who weren’t in technical roles contributed valuable ideas grounded in real problems from their own domains, bringing perspectives that pure engineering alone can’t provide. It’s also a great opportunity to connect with people outside your usual technical bubble, expanding not only your network, but the way you think about the impact and applications of AI.
Finally, a bigger reflection: agents are evolving fast, and with that come new architectural challenges, safety concerns, and responsibilities. These are not hypothetical problems of the future; they are happening right now. Being responsible with AI applications is not a hype-driven slogan; it’s part of the daily job of any AI or data science professional.
Conclusions
These events are quietly shaping how we think about AI governance. When you put powerful agentic systems under time pressure and in messy, realistic scenarios, you’re forced to confront unpredictable behaviour head-on. That’s where the real learning happens: how do we balance rapid innovation with trust and safety? How do we design evaluation frameworks and guardrails that let us move fast without losing control? This hackathon didn’t just reward clever models; it rewarded thoughtful governance.
And while there are plenty of AI events popping up everywhere, this is one of the few you should really keep an eye on, the kind that genuinely helps you grow, exposes you to real-world challenges, and reminds you why it’s worth staying curious and keeping your skills sharp.
References
References in order of appearance:
[1] “NVIDIA CEO Jensen Huang kicks off CES 2025. The Future is Here!”, 2025. Link.
[2] Great Agent Hack 2025: Holistic AI x UCL. Available at: https://hackathon.holisticai.com/ (accessed November 22, 2025).
[3] Valyu AI. (2025). The Great Agent Hack 2025: Agent performance, reliability, and Valyu-powered retrieval. Retrieved from https://www.valyu.ai/blogs/the-great-agent-hack-2025-agent-performance-reliability-and-valyu-powered-retrieval
[4] Great Agent Hack 2025. “Project gallery — Great Agent Hack 2025: Build and test transparent, robust, and safe AI agents for real-world impact.” Devpost. Available at: https://hai-great-agent-hack-2025.devpost.com/project-gallery?page=1.
[5] Holistic AI. (2025). hackathon-2025 [Source code]. GitHub. https://github.com/holistic-ai/hackathon-2025 (last accessed: November 30, 2025).
[6] Elon Musk × Jensen Huang “leather jacket moment.” (n.d.). YouTube Shorts. https://www.youtube.com/shorts/l7x_Tfrbubs
[7] Andriushchenko, M., Souly, A., Dziemian, M., Duenas, D., Lin, M., Wang, J., Hendrycks, D., Zou, A., Kolter, Z., Fredrikson, M., Winsor, E., Wynne, J., Gal, Y., & Davies, X. (2024). AgentHarm: A benchmark for measuring harmfulness of LLM agents. arXiv. https://arxiv.org/abs/2410.09024
[8] Tiku, N., Schaul, K., & Chen, S. This is how AI image generators see the world. Washington Post. https://www.washingtonpost.com/technology/interactive/2023/ai-generated-images-bias-racism-sexism-stereotypes/ (last accessed Aug 20, 2025).
[9] Porikli, S., & Porikli, V. (2025). Hidden Bias in the Machine: Stereotypes in Text-to-Image Models. Available at: https://openreview.net/pdf?id=u4KsKVp53s
[10] Wu, Z., Cho, S., Munoz, C., King, T., Mohammed, U., Kazimi, E., Pérez-Ortiz, M., Bulathwela, S., & Koshiyama, A. (2025). AgentGraph. Holistic AI & University College London.
[11] Wicaksono, I., Wu, Z., Patel, R., King, T., Koshiyama, A., & Treleaven, P. (2025). AgentSeer.
[12] Zhang, S., Yin, M., Zhang, J., Liu, J., Han, Z., Zhang, J., Li, B., Wang, C., Wang, H., Chen, Y., & Wu, Q. (2025). Who_and_When failure-attribution dataset (arXiv preprint No. 2505.00212).
