16APR

AISI: GPT-5.5 matches Mythos on 32-step attack


13:29UTC

The UK AI Security Institute published its evaluation of OpenAI's GPT-5.5 on 1 May, finding the model scored 71.4 per cent on expert-level capture-the-flag tasks and cleared AISI's 32-step enterprise-network attack range, becoming the second model after Anthropic's Mythos to do so.


Economic · Developing

Key takeaway

Two frontier AI models can now autonomously execute 32-step attack chains, and the supervisory framework was built for one.

The UK AI Security Institute (AISI) published its evaluation of OpenAI's GPT-5.5 on 1 May 2026¹. The model scored 71.4 per cent on expert-level capture-the-flag tasks against Mythos's 73 per cent, and completed AISI's 32-step "The Last Ones" enterprise-network attack range end-to-end, becoming the second model after Anthropic's Claude Mythos Preview to clear the threshold. The agentic capability AISI estimated at 20 hours of trained-human work in its earlier Mythos evaluation is no longer exclusive to one frontier laboratory.

The supervisory consequence runs straight into existing rules. The Bank of England Financial Policy Committee directive in April on agentic AI risk in payments and financial markets was scoped around a single frontier model. Treasury Secretary Scott Bessent and Federal Reserve Chair Jerome Powell convened five Wall Street CEOs at Treasury on 8 April over Mythos's capabilities. The Glasswing restricted-access architecture, where Anthropic distributed Mythos to 17 partners under coordinated-disclosure terms, has no equivalent for GPT-5.5. Financial firms that built risk frameworks around Mythos's specific behavioural profile must now extend them to a model with different safety training and a different deployment surface.

That a second model cleared AISI's threshold within roughly four weeks suggests the 32-step capability rests on underlying compute and post-training approach rather than a unique architectural breakthrough. Expect a third frontier model to clear it within two quarters; AISI's evaluation cadence is the constraint, not lab capacity. The supervisory premise the BoE FPC framed in April is one month old and already outdated by a model release.

For the workforce displacement argument, the 32-step autonomous capability matches the operational profile of a junior analyst, paralegal, or software engineer. Jamie Dimon told JPMorgan's February investor meeting the bank had "displaced people from AI"; $600 million annually now goes to retraining. AISI has now confirmed two firms can sell that capability into the same financial-supervisory void. For account holders and pension contributors, the practical question is whether the FCA can supervise a payments system in which two competing AI models can autonomously execute 32-step operations when its April directive was scoped around just one.

Deep Analysis

In plain English

The UK's AI Security Institute is a government body that tests how capable AI models are at potentially dangerous tasks, including hacking into computer networks. In May 2026, it confirmed that OpenAI's newest model, GPT-5.5, can autonomously complete a 32-step process to attack and compromise an enterprise computer network. The model also scored 71.4% on expert-level capture-the-flag tests. The only previous model that could do this was Anthropic's Claude Mythos, which scored 73%. Bank of England and FCA rules issued in April to manage AI risk in financial firms were written assuming only Anthropic's Mythos had cleared this capability threshold. GPT-5.5 cleared the same threshold on 1 May, making both sets of rules outdated within weeks of publication. For the AI jobs beat, the point is that the agentic capability that makes AI useful for complex multi-step work is the same feature that makes it capable of network attacks, and it is now available from at least two competing suppliers.


Root Causes

The AISI benchmark was designed in Q3 2025 when Anthropic's Mythos was the only model approaching the 32-step capability threshold. The evaluation framework was calibrated to that frontier, using a custom enterprise network range ('The Last Ones') built to challenge Mythos specifically.

OpenAI's GPT-5.5 clearing the same benchmark within weeks of Mythos is not coincidental: frontier model capability timelines have compressed from 18-24 months per generation to 6-9 months, driven by the same $190-200 billion capex programmes at Microsoft, Amazon, and Google. The benchmark proliferation is a direct output of the infrastructure race described in events 2, 3, and 5 of this update.

The regulatory lag is structural: governments commission safety evaluations on a quarterly cycle, but capability jumps now occur on a monthly cycle. AISI published its Mythos evaluation in April 2026; GPT-5.5 cleared the same threshold by 1 May, a six-week interval between regulatory assessment and frontier proliferation.

What could happen next?

Consequence

The Bank of England FPC and FCA will be required to revise their April AI directives to address multi-model capability rather than single-frontier-model risk, adding regulatory complexity and likely delaying implementation timelines.

Immediate · 0.8

Risk

Financial institutions holding Glasswing-level AI access to either model face a materially different threat model than the single-supplier architecture regulators assumed in April; internal AI governance frameworks built around that assumption are now inadequate.

Short term · 0.72

Precedent

The six-week gap between the AISI Mythos evaluation and GPT-5.5 clearing the same threshold establishes that capability-based AI regulation is structurally unable to keep pace with frontier development under current evaluation timelines.

Medium term · 0.85

Source Landscape

This story draws on neutral-leaning sources

Primary parallel: The 1998 proliferation of 128-bit SSL encryption from PGP (one supplier) to Netscape Navigator (mass market) created a structurally identical problem for US export controls. The Arms Export Control Act had classified 128-bit encryption as a munition; once Netscape shipped it to millions of consumers, the legal framework was overtaken by facts.

The Clinton administration revised the rules in 1999. The AISI threshold operates analogously: it was written around frontier capability held by one lab, and proliferation to a second lab changes the enforcement architecture overnight.

Counter-parallel: In 2017, when multiple nation-states gained access to NSA-derived WannaCry-class cyber tools simultaneously, the global response was not immediate regulatory revision but rather a series of uncoordinated national responses that took three years to harmonise under the Budapest Convention framework. The GPT-5.5 proliferation risks the same fragmented response.

Consensus view: RUSI's cyber research group (director Ciaran Martin, former NCSC head) and Cambridge University's Centre for the Study of Existential Risk (CSER, researcher Shahar Avin) both assessed the proliferation from one to two frontier models as the critical inflection in agentic capability risk.

Martin's specific concern: the Bank of England FPC directive issued in April was calibrated to a capability held by a single firm, which regulators could engage directly. With two frontier labs clearing the threshold, regulatory containment requires either binding international standards or pre-deployment evaluation mandates, neither of which is currently in place.

Counter-view: The Information Technology and Innovation Foundation (ITIF, Alan McQuinn) published a counter-assessment arguing that agentic capability benchmarks are divorced from real-world deployment constraints.

GPT-5.5's 71.4% score on AISI's expert-level capture-the-flag tasks, like its completion of the 'The Last Ones' range, reflects performance under laboratory conditions; real enterprise networks have asset-specific configurations, detection layers, and human-in-the-loop responses that reduce effective attack success rates substantially. McQuinn cited Mandiant incident response data showing AI-augmented attacks currently have a 15-23% success rate on defended corporate networks.

Key tension: Whether the FCA's supervisory architecture, which requires firms to notify regulators before deploying agentic AI above certain capability thresholds, can be operationalised fast enough now that two models clear the threshold simultaneously.

First Reported In

Update #8 · Beijing court bans AI sackings as Big Tech burns cash

AISI · 2 May 2026


Causes and effects

Caused by

AISI confirms Mythos 20-hour attack chain

AISI's April evaluation of Mythos established the 32-step autonomous benchmark; the GPT-5.5 evaluation confirms the same threshold is now cleared by a second frontier lab.

Occurred 15 Apr 2026


BoE flags agentic AI systemic risk

Bank of England FPC's April directive was scoped to a single frontier model; GPT-5.5 clearing the same threshold within weeks makes that directive immediately outdated.

Occurred 10 Apr 2026


Dimon: JPMorgan displaced workers from AI

Jamie Dimon's admission that JPMorgan had displaced workers through AI provides the corporate context against which AISI's GPT-5.5 proliferation is most consequential for financial-sector supervision.

Occurred 24 Feb 2026


This Event

AISI: GPT-5.5 matches Mythos on 32-step attack

The autonomous capability that took financial regulators by surprise three weeks ago is no longer exclusive to one frontier laboratory; the supervisory architecture is one model behind.

Led to

Intuit cuts 3,000, licenses its data

Anthropic's Project Glasswing established the model of frontier-AI firms accumulating proprietary sector data; Intuit's multi-year deal extends that accumulation to consumer tax and financial data.

Occurred 20 May 2026


Washington pulls a live AI model

AISI confirmed GPT-5.5 cleared the same 32-step attack chain on 1 May, the capability comparison Anthropic cited in disputing the selective application of the directive.

Occurred 12 Jun 2026


Different Perspectives

Barclays

Barclays economist Pooja Sriram flagged a 28,000-a-month bleed in finance and information roles the same week Microsoft disputed that AI drove its own 4,800 cuts. The bank treats Challenger's AI-attribution share as a lagging indicator against faster erosion visible in raw labour-market data.

European Commission

Brussels deferred the Digital Omnibus's Annex III employment-compliance deadline from 2 August 2026 to December 2027, even as California advanced three binding AI-hiring bills the same week. The 17-month delay leaves EU workers without the algorithmic-hiring safeguards the regulation already promises.

OpenAI

OpenAI proposed a 5% US government equity stake worth $42.6bn, structured as a public wealth fund modelled on the Alaska Permanent Fund, with Sam Altman pitching it directly to Trump, Bessent and Lutnick. The offer pre-empts Sanders' rival one-time 50% AI-stock tax, which has not yet reached committee.

India's IT and outsourcing sector

BAT's transfer of 3,500 roles to Accenture on 29 June fits a delivery model Indian IT firms increasingly run: consultancies win Western contracts, then execute through offshore centres. The sector expects more Fit2Win-style transfers, not straight redundancies, as employers absorb AI without cutting outsourced headcount.

European Trade Union Confederation

ETUC says the Council's shift from 'ensure' to 'support' in the AI-literacy duty, confirmed in the Digital Omnibus's final adoption on 29 June, is a collapse of the legal threshold, not a drafting tidy-up. It expects EU workers to face AI-driven hiring and monitoring decisions with a statutory right to explanation that exists in name only.

British American Tobacco's Fit2Win workforce

BAT is cutting 9,000 roles under Fit2Win, transferring 3,500 to Accenture rather than making them redundant, to reach roughly £500m in AI-driven savings by 2027. For affected staff, that distinction decides whether they keep a job at all, even if not at BAT.