DISPATCH

Inside the First Autonomous Red-Team Wargame Run Without Humans

The most interesting cybersecurity research demonstration of the last twelve months did not happen at Black Hat. It happened on a private testbed run by a consortium of three US banks, two large European telecoms, and one cloud hyperscaler whose name appears nowhere in the post-exercise writeup. The exercise ran for six days in February 2026. Both the red team and the blue team were autonomous agent fleets. No humans were in the operational loop except as referees.

The published summary, distributed to consortium members in late March and partially excerpted in a CSIS panel discussion in April, is the first detailed public account of what an agent-vs-agent security exercise actually looks like at scale. The results are interesting on the technical merits and unsettling on the strategic ones.

The Lineage

The road to this exercise starts at the final of DARPA's AI Cyber Challenge at DEF CON 33 in August 2025. The AIxCC required teams to build fully autonomous systems that could find and patch vulnerabilities in open-source code without human intervention. The winning entry, Team Atlanta's "Atlantis" system, found 22 distinct zero-days across a hardened test corpus and shipped working patches for 18 of them, all within the 96-hour competition window. The runner-up, Trail of Bits' "Buttercup", had a slightly worse find rate but a slightly better patch rate, and the engineering distance between the top five teams was, by the end, smaller than the run-to-run variance.

What AIxCC demonstrated was that the defensive half of cybersecurity — vulnerability discovery and patching — was already at the threshold of full autonomy. The unanswered question was the offensive half. Could an autonomous agent fleet conduct a sustained, adaptive penetration campaign against a hardened target, with multi-stage objectives and active countermeasures? And could it do it without the kind of pre-canned exploit chain that turns most red-team exercises into demonstrations of a known playbook?

The Setup

The February exercise used a synthetic environment modelling a mid-sized financial services company: roughly 2,000 simulated endpoints, a representative cloud footprint (mixed AWS, Azure, GCP), a SaaS layer (Okta, Salesforce, ServiceNow analogues), and an instrumented data-loss-prevention pipeline. Both red and blue teams had access to the same set of underlying models — three frontier general-purpose models and two specialised code-reasoning models — but were given different scaffolding and different objectives.

Red team objective: exfiltrate a specific tagged dataset from a specific simulated database within the six-day window. Blue team objective: prevent exfiltration while maintaining business operations (synthetic transaction throughput, simulated employee productivity metrics). The referees scored on a continuous basis. Both teams could query a referee oracle for ground truth about whether their actions had been detected, but at a cost — each query consumed budget against a fixed compute allocation.
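The oracle mechanic is easier to reason about with a toy model. The sketch below is not the consortium's referee code; the class name, the flat per-query cost, and the action identifiers are all assumptions made for illustration. It shows only the trade-off described above: every ground-truth query spends part of a fixed allocation, so a team has to choose which actions are worth checking.

```python
from dataclasses import dataclass, field


class BudgetExhausted(Exception):
    """Raised when a team queries the oracle without enough budget left."""


@dataclass
class RefereeOracle:
    # Hypothetical ground truth: action id -> whether the other side detected it.
    detected: dict
    budget: float        # remaining compute allocation, in arbitrary units
    query_cost: float    # assumed flat cost per query
    log: list = field(default_factory=list)

    def was_detected(self, action_id: str) -> bool:
        """Answer one ground-truth query and charge it against the budget."""
        if self.budget < self.query_cost:
            raise BudgetExhausted(
                f"need {self.query_cost}, have {self.budget}"
            )
        self.budget -= self.query_cost
        self.log.append(action_id)
        # Actions the referees have no record of are treated as undetected.
        return self.detected.get(action_id, False)


oracle = RefereeOracle(detected={"beacon-1": True}, budget=2.0, query_cost=1.0)
first = oracle.was_detected("beacon-1")   # spends 1.0 of the 2.0 budget
second = oracle.was_detected("probe-7")   # spends the remaining 1.0
```

A third query against this oracle would raise BudgetExhausted, which is the pressure the exercise design puts on both fleets: certainty about detection competes with everything else the compute could buy.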

What Happened

The first 18 hours were unremarkable. Red team did opportunistic reconnaissance, identified an unpatched edge service, and got initial access through a known CVE. Blue team's first-line agents flagged anomalous beacon traffic within 40 minutes and the access was burned. So far, so SOC.

The interesting part started on day two. Red team's planner agent, having lost initial access, did something the human authors of the agent scaffolding had not anticipated: it pivoted to a phishing campaign targeting the simulated employee population. The campaign was crafted, end-to-end, by a sub-agent that had been given browsing access to the simulated company's public-facing content. The pretext was a fabricated all-hands invitation from the simulated CFO, with a landing page that mimicked the company's actual SSO portal down to the favicon. Twelve simulated employees clicked. Three submitted credentials.

Blue team caught the campaign, eventually. The login attempts with the phished credentials triggered conditional-access policy violations, and blue team's agents contained the affected accounts within minutes. But by then red team had already established two persistence footholds via stolen session tokens, and the next 36 hours were a slow, grinding chess match through the synthetic Active Directory environment, with red team probing and blue team containing and patching at a roughly even rate.

The breakthrough for red team came on day four. The red team's reasoning agent noticed that one of the simulated DevOps employees had stored an AWS access key in a Slack DM (a deliberately seeded artefact, per the exercise design). The agent recognised the credential format, exfiltrated it, and used it to enumerate the cloud environment. From there it found a misconfigured S3 bucket, also a seeded artefact, containing a backup of the target dataset. Exfiltration completed at hour 96, with two of the six days still on the clock. Red team won with margin.
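Recognising the credential format is mechanically simple, which is why the same check is a routine defensive control. The sketch below is a minimal illustration, not the exercise's tooling: it scans a list of message strings for strings shaped like AWS access key IDs (20 characters, beginning with a prefix such as AKIA for long-term keys or ASIA for temporary ones). The function name and the sample messages are assumptions, and the sample key is the placeholder AWS uses in its own documentation.

```python
import re

# AWS access key IDs are 20 characters: a 4-letter prefix such as AKIA
# (long-term) or ASIA (temporary), then 16 uppercase letters or digits.
# The lookarounds stop the pattern matching inside a longer token.
AWS_KEY_ID = re.compile(
    r"(?<![A-Z0-9])(?:AKIA|ASIA)[A-Z0-9]{16}(?![A-Z0-9])"
)


def find_aws_key_ids(messages):
    """Return (message_index, key_id) pairs for every key-shaped string."""
    hits = []
    for index, text in enumerate(messages):
        for match in AWS_KEY_ID.finditer(text):
            hits.append((index, match.group(0)))
    return hits


# Hypothetical chat export; the key is AWS's documented example value.
messages = [
    "lunch at noon?",
    "temp creds: AKIAIOSFODNN7EXAMPLE pls rotate after deploy",
    "AKIAshort is not a key",
]
hits = find_aws_key_ids(messages)
```

A pattern match like this only says a string looks like a key ID. It says nothing about whether the key is live, so in practice a hit is a prompt to rotate and investigate, not proof of exposure.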

What the Postmortem Actually Said

The published writeup is restrained but the technical findings are clear:

  1. The agents found vulnerabilities the human red teams in prior exercises missed. Two of the foothold paths were combinations of misconfigurations that human pentesters had walked past in the four prior dry runs of the same environment. The agents systematically enumerated the combinatorics in a way humans do not.

  2. The agents also missed things humans would have caught. Specifically, the red team agents over-indexed on technical exploitation and under-indexed on social-engineering signals that a human red team would have read straight off the simulated employee data. The phishing campaign worked, but it was a heavier lift than it needed to be.

  3. Blue team's detection latency was the dominant variable. When blue team detected within 30 minutes, containment was almost always successful. When detection slipped past 90 minutes, red team almost always escalated. The agents were good. The detection pipeline was the bottleneck.

  4. The synthetic employees were the soft underbelly. Twelve out of 200 simulated employees clicking on a well-crafted phishing email is a 6% click rate, roughly what human studies have reported for decades. Synthetic populations behave like real ones when the agents driving them are seeded with realistic behavioural priors. This is, by some distance, the most uncomfortable finding for the consortium members.
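Two of those findings reduce to simple arithmetic. The sketch below uses only the figures quoted in this article (12 clicks and 3 credential submissions out of 200 simulated employees, and the 30-minute and 90-minute detection thresholds). The function name and the "uncertain" label for the band between the two thresholds are assumptions; the writeup, as excerpted, does not characterise that middle band.

```python
def likely_outcome(detection_minutes: float) -> str:
    """Bucket a detection latency using the thresholds from finding 3."""
    if detection_minutes <= 30:
        return "contained"   # containment almost always succeeded
    if detection_minutes > 90:
        return "escalated"   # red team almost always escalated
    return "uncertain"       # assumed label: the summary is silent here


# Phishing funnel from the exercise: 200 targets, 12 clicks, 3 submissions.
targets, clicks, submissions = 200, 12, 3
click_rate = clicks / targets            # share of targets who clicked
submit_rate = submissions / targets      # share who gave up credentials
submit_given_click = submissions / clicks

print(f"click rate: {click_rate:.1%}")
print(f"submit rate: {submit_rate:.1%}")
print(f"submitted after clicking: {submit_given_click:.0%}")
print(likely_outcome(40))
```

The last line uses the 40-minute detection of the first foothold: it falls between the two thresholds, and in that instance blue team still burned the access.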

The Asymmetry

The exercise was structurally fair: same models, same compute budget, same six days. The outcome was not. Red team won, by the consortium's own scoring rubric, with margin. That asymmetry tracks something real about the offence-defence balance in autonomous cybersecurity: the attacker needs to find one path; the defender needs to close all of them. Agents amplify both sides of that equation, but they amplify the attacker's side more.

This is not a new observation. It has been the structural truth of cybersecurity since the 1970s. What is new is the rate at which the imbalance compounds. A human red team running the same exercise would, on prior baselines, have needed four people and 14 days, or 56 person-days of skilled labour. The agent red team did it in four days with no humans. The cost-per-attempt for sophisticated, multi-stage attacks just dropped by roughly an order of magnitude, and it will drop again as the underlying models improve.

The defensive implication is straightforward and unpleasant. SOCs that are still running on human-paced workflows — ticket queues, on-call rotations, MTTR measured in hours — are going to be operating at a fundamentally different tempo from the threats they face. The blue team agents in the February exercise responded in minutes because they had to. They were keeping pace with adversaries that did not sleep.

What Happens Next

The consortium has scheduled a follow-on exercise for September 2026, this time with adversary agents drawn from a different vendor pool to test whether the dynamics replicate across model providers. A handful of national-security-adjacent organisations are reportedly running parallel exercises behind closed doors. The first commercial "autonomous red-team-as-a-service" offerings (Bishop Fox announced one at RSA in May, and similar products from Mandiant and Trail of Bits are rumoured for Q3) are going to commoditise this capability within twelve months.

The defensive side will commoditise too. The question is whether it will commoditise fast enough. The published exercise data suggests that the answer for most organisations is no, and that the gap between attacker capability and defender capability is going to widen sharply before it stabilises.

The era of human-paced cybersecurity is ending. The era of autonomous-paced cybersecurity has, very quietly, already begun. The boards and CISOs who have not yet internalised that fact are going to find out the hard way, and probably during a quarterly earnings call.