
AI Safety Level

Anthropic's former AI risk classification system based on capability thresholds; abandoned in April 2026 in favour of autonomy-focused threat models.

Last refreshed: 16 April 2026 · Appears in 1 active topic

Key Question

What did Anthropic replace its AI safety thresholds with, and why does it matter for regulation?

Timeline for AI Safety Level

#6 · 16 Apr

Capability-based threshold framework abandoned by Anthropic in favour of autonomy threat models

AI: Jobs, Power & Money: Anthropic drops ASL, expands Glasswing partners
Common Questions
What is Anthropic's AI Safety Level framework and why was it dropped?
ASL was Anthropic's system of capability thresholds governing model deployment. It was abandoned on 7 April 2026 in favour of autonomy-focused threat models after Mythos's 32-step attack-chaining capability was confirmed not to register on single-task benchmarks. Source: Anthropic (Alignment Risk Update)
Why did Anthropic move from capability thresholds to autonomy-focused AI safety?
AISI confirmed that Mythos's real advantage is autonomous multi-step execution (20 human-hour equivalent) rather than single-task performance. ASL thresholds measured the latter; autonomy-focused threat models measure the former. Source: Anthropic / UK AI Security Institute
Can regulators still verify Anthropic's safety claims after ASL was dropped?
Verification is harder: ASL capability thresholds were binary and testable, while autonomy-focused threat models are contextual and require detailed scenario modelling. The change came just as AISI confirmed Mythos can sustain 20 hours of autonomous attack work. Source: Anthropic / AISI

Background

AI Safety Level (ASL) was the risk classification system Anthropic used to govern which of its AI models could be deployed and under what conditions. The framework used discrete capability thresholds — defined benchmarks a model would need to surpass before triggering stricter deployment constraints or safety reviews. ASL-1 through ASL-4 covered a spectrum from minimal risk to potentially catastrophic capability. The system was designed to be externally verifiable: observers and policymakers could in principle test whether a model had crossed a threshold.
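The threshold mechanism described above can be sketched as a simple classifier. This is an illustrative sketch only: the benchmark scores and cutoffs below are hypothetical placeholders, not Anthropic's actual ASL criteria.

```python
# Hypothetical sketch of a discrete capability-threshold scheme like ASL.
# A model's benchmark score is checked against fixed, externally testable
# cutoffs; crossing one triggers the corresponding safety level.
# All cutoff values here are invented for illustration.

ASL_THRESHOLDS = [
    # (level, minimum benchmark score that triggers this level)
    (4, 0.90),  # potentially catastrophic capability
    (3, 0.70),
    (2, 0.40),
    (1, 0.00),  # minimal risk
]

def classify_asl(benchmark_score: float) -> int:
    """Return the highest level whose cutoff the score meets or exceeds."""
    for level, cutoff in ASL_THRESHOLDS:
        if benchmark_score >= cutoff:
            return level
    return 1

print(classify_asl(0.85))  # -> 3
```

The point of this shape is the one the article makes: the check is binary and reproducible, so an external observer running the same benchmark can verify which level a model falls under.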

In a 244-page Alignment Risk Update published 7 April 2026, Anthropic abandoned the ASL framework for Claude Mythos Preview, replacing it with autonomy-focused threat models. The new approach assesses risk based on what a model can do over sustained, multi-step autonomous operation rather than what it can do on a single task in isolation. The change was driven partly by the AISI evaluation's confirmation that Mythos's genuine capability advantage lies in attack chaining — 32-step autonomous operations estimated at 20 hours of trained-human equivalent work — rather than single-task benchmark superiority.
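To contrast with the threshold sketch, the autonomy-focused approach keys risk to sustained multi-step operation rather than a single score. The function below is a hypothetical illustration: the step counts and hour figures echo the Mythos example in the text (32 steps, ~20 human-hours of work), but the cutoffs and risk labels are invented.

```python
# Hypothetical sketch of an autonomy-focused assessment: risk depends on
# how long a chain of actions the model can sustain unassisted, not on
# any single-task benchmark score. Cutoffs and labels are illustrative.

def autonomy_risk(max_chain_steps: int, human_hours_equivalent: float) -> str:
    """Grade risk by sustained autonomous execution capacity."""
    if max_chain_steps >= 30 or human_hours_equivalent >= 15:
        return "high"
    if max_chain_steps >= 10 or human_hours_equivalent >= 4:
        return "elevated"
    return "baseline"

# The Mythos figures cited in the text: 32-step chains, ~20 human-hours.
print(autonomy_risk(32, 20.0))  # -> high
```

Note what the contrast implies for verification: the inputs here (maximum sustained chain length, human-hour equivalence) can only be estimated through scenario modelling, which is the externally-harder-to-check property the article discusses next.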

The shift from ASL thresholds to autonomy-focused models has a governance consequence: capability thresholds are binary and testable, whereas autonomy-focused threat models are contextual and harder to verify externally. Project Glasswing partners, now expanded to include Broadcom, CrowdStrike, Nvidia, Palo Alto Networks and Cisco, must absorb this methodology change mid-deployment. It also means the framework that gave regulators a comparator for Anthropic's safety claims no longer exists, just as AISI has confirmed Mythos's autonomous execution capability.