
Introducing FailSafe SWARM
What is SWARM? Short for Systemic Weakness Analysis and Remediation Model, SWARM is FailSafe's proprietary security architecture, designed to mimic the rigor of an elite human audit team. It operates as a multi-agent pipeline that systematically maps a codebase's architecture, invariants, and trust boundaries using specialized LLMs. Rather than relying on a single pass, SWARM uses these deep, structured threat models to guide autonomous agents in executing targeted attacks, closing the coverage gaps a single pass would leave.
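To make the two-stage idea concrete, here is a minimal Python sketch: build a structured threat model first, then let it direct targeted attack agents. Everything below is illustrative only; the function names, data shapes, and placeholder values are our assumptions for the sketch, not SWARM's actual internals.

```python
# Illustrative sketch of a SWARM-style two-stage pipeline: Stage 1 builds a
# structured threat model; Stage 2 uses it to direct targeted attack agents
# instead of one unguided pass. All names and values here are hypothetical.

from dataclasses import dataclass

@dataclass
class ThreatModel:
    components: list[str]        # mapped architecture
    invariants: list[str]        # properties that must always hold
    trust_boundaries: list[str]  # where untrusted input or value flows in

def build_threat_model(codebase: str) -> ThreatModel:
    # In a real system, each field would come from a specialist LLM pass
    # over the codebase; placeholders keep the sketch runnable.
    return ThreatModel(
        components=["Vault", "Oracle", "Router"],
        invariants=["total shares match total assets", "only owner can pause"],
        trust_boundaries=["external deposit() calls", "oracle price feed"],
    )

def spawn_attack_agent(codebase: str, boundary: str, invariant: str) -> list[str]:
    # Stand-in for an autonomous agent that tries to violate `invariant`
    # by driving inputs across `boundary`.
    return [f"probe: break '{invariant}' via {boundary}"]

def run_swarm(codebase: str) -> list[str]:
    model = build_threat_model(codebase)
    findings = []
    for boundary in model.trust_boundaries:
        for invariant in model.invariants:
            findings.extend(spawn_attack_agent(codebase, boundary, invariant))
    return findings

print(len(run_swarm("contracts/")))  # 2 boundaries x 2 invariants = 4 probes
```

The ordering is the point of the structure: no attack agent runs until the threat model exists, which is what keeps exploration targeted rather than opportunistic.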
While SWARM's code-agnostic architecture supports a wide array of languages, including Rust, Go, and Python, we are battle-testing it on blockchain codebases: their open-source, highly adversarial nature, with massive financial incentives at stake, makes them the ultimate proving ground for security tooling.
Today, we're excited to share groundbreaking results: SWARM has achieved 69.2% vulnerability detection recall (83/120 high-severity bugs) on EVMbench, the industry-standard benchmark from OpenAI, Paradigm, and OtterSec.
Building on the remarkable foundational capabilities of frontier models, our specialized multi-agent architecture set a new high-water mark, outperforming single-agent configurations of Claude Opus 4.6 (46.7%), Gemini 3 Pro (20.8%), and GPT-5.2 (37.5%) in the Detect setting.
What is EVMbench?
EVMbench is an industry-standard evaluation framework for AI agents on smart contract security tasks. It features:
- 117+ high-severity vulnerabilities drawn from real audits of Code4rena contests and Paradigm's Tempo blockchain
- 40 diverse codebases
- Three evaluation modes: Detect (finding bugs), Patch (fixing them), and Exploit (proving them)
- Network-isolated environment to prevent cheating, with LLM-judged scoring
Unlike synthetic benchmarks or proprietary evaluations, EVMbench uses actual audit data and an open methodology, making it the gold standard for measuring real-world AI security capabilities. As OpenAI notes, the Detect setting remains challenging because agents often stop after identifying a single issue.
Why Open Benchmarks Matter
Recently, the blockchain security space has seen a surge of self-reported AI benchmarks with varying methodologies. For example, Ackee Blockchain's Wake Arena benchmark evaluated 14 self-selected protocols, and presented EVMbench, an evaluation framework, as if it were a competing tool. Similarly, platforms like SCABench rely on proprietary matching algorithms.
At FailSafe, we believe security requires transparency. By adopting EVMbench, an independent, open-source framework built by OpenAI and Paradigm, we ensure our 69.2% recall is verifiable, reproducible, and evaluated against the same high standards set for the industry's best foundation models.
SWARM's Methodology and Results
We ran SWARM against all 40 EVMbench tasks using our production configuration:
- Multi-agent architecture: 6 parallel generators (3 specialist types across 2 runs) explore orthogonal attack surfaces (see the sketch after this list)
- Cross-validation: Claude and Gemini judges filter false positives
- Clustering: Handles large codebases (>3,500 LOC) by breaking them into manageable chunks
- Runtime: completed all tasks well under EVMbench's 3-hour limit
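As a rough illustration of that configuration, the sketch below shows a generate-then-judge loop: 3 specialist generator types x 2 runs each fan out concurrently, and only findings confirmed by both judges survive. The fan-out numbers and the two-judge filter come from the list above; the helper functions are hypothetical stand-ins for the actual model calls.

```python
# Illustrative sketch of the production fan-out: 3 specialist generator
# types x 2 runs each = 6 parallel generators, followed by a two-judge
# cross-validation filter. Helpers are stand-ins for the real LLM calls.

from concurrent.futures import ThreadPoolExecutor

SPECIALISTS = ["access-control", "accounting", "external-calls"]
RUNS_PER_SPECIALIST = 2

def run_generator(specialist: str, run_id: int, code: str) -> list[str]:
    # Stand-in: each generator explores one attack surface.
    return [f"{specialist}: candidate finding (run {run_id})"]

def judge_confirms(judge: str, finding: str, code: str) -> bool:
    # Stand-in: an independent model reviews the candidate finding.
    return True

def detect(code: str) -> list[str]:
    # Fan out all 6 generators concurrently.
    with ThreadPoolExecutor(max_workers=6) as pool:
        futures = [
            pool.submit(run_generator, s, r, code)
            for s in SPECIALISTS
            for r in range(RUNS_PER_SPECIALIST)
        ]
        candidates = [f for fut in futures for f in fut.result()]
    # Keep only findings that both judges confirm, filtering false positives.
    return [
        f for f in candidates
        if judge_confirms("claude", f, code) and judge_confirms("gemini", f, code)
    ]

print(detect("contracts/Vault.sol"))
```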
Key Results
- Overall Recall: 83/120 vulnerabilities detected (69.2%)
- Performance by Contest Size:
  - Small contests (<5K LOC): 68% recall
  - Large contests (>10K LOC): 58% recall (clustering maintains effectiveness)
- Notable Wins:
  - 100% detection rate on 22 out of 40 EVMbench contests
  - Deep threat modeling that uncovers medium-severity bugs outside EVMbench's scope (e.g., finding 9 out of 14 total confirmed bugs in the Curves codebase)
For comparison (Detect Mode):
| Tool/Model | Detect Recall |
|---|---|
| SWARM | 69.2% |
| Claude Opus 4.6 (Claude Code) | 46.7% |
| GPT-5.2 (Codex CLI) | 37.5% |
| Gemini 3 Pro (Gemini CLI) | 20.8% |
(Note: Competitor baselines are Detect-mode recall figures derived from our internal runs of each model's single-agent scaffold within the EVMbench harness.)
SWARM's 69.2% recall highlights the power of multi-agent orchestration over even the strongest single agent.
SWARM's multi-agent design directly overcomes the "stop early" problem identified by OpenAI in single-agent approaches. By constructing a deep threat model before execution, SWARM guides autonomous agents toward specific gaps in coverage, delivering unmatched performance on bug-rich codebases.
Why SWARM Excels
- Structured Threat Modeling: SWARM maps architecture, invariants, and trust boundaries with multiple specialist LLMs before diving into targeted agentic exploration, directing agents to every attack surface the threat model identifies.
- Parallel Exploration: Rather than relying on a single prompt, SWARM runs 6 specialized generator agents concurrently, each probing a different attack surface, widening coverage well beyond a single pass of the underlying LLM.
- Comprehensive Coverage Beyond High-Severity: Because SWARM produces full threat models rather than isolated bug reports, its confirmed findings naturally extend into medium-severity issues that fall outside EVMbench's scope.
- Code-Agnostic Architecture: SWARM's pluggable analyzers work across languages (Rust, Go, Python, Solidity), adapting to enterprise and blockchain needs alike.
- Enterprise Features: Integrates with CI/CD, handles massive repos via clustering (see the sketch below), and provides detailed threat models with cross-model validation, using both Claude and Gemini to verify findings.
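The clustering step is simple enough to sketch. Below is a hedged illustration, assuming a greedy first-fit grouping and reusing the 3,500-LOC figure from the methodology section as the per-chunk budget; SWARM's actual partitioning strategy may differ.

```python
# Hedged sketch of LOC-based clustering: codebases over the threshold are
# split into chunks that each fit in one analysis pass. The greedy
# first-fit grouping and the chunk budget are illustrative assumptions.

MAX_CHUNK_LOC = 3_500

def count_loc(path: str) -> int:
    # Count non-blank lines in a source file.
    with open(path, encoding="utf-8", errors="ignore") as f:
        return sum(1 for line in f if line.strip())

def cluster_files(paths: list[str]) -> list[list[str]]:
    sized = sorted(((count_loc(p), p) for p in paths), reverse=True)
    if sum(loc for loc, _ in sized) <= MAX_CHUNK_LOC:
        return [paths]  # small enough to analyze in a single pass

    chunks: list[tuple[int, list[str]]] = []
    for loc, path in sized:
        # First-fit: place each file into the first chunk with room left.
        for i, (used, members) in enumerate(chunks):
            if used + loc <= MAX_CHUNK_LOC:
                chunks[i] = (used + loc, members + [path])
                break
        else:
            chunks.append((loc, [path]))
    return [members for _, members in chunks]
```

First-fit over files sorted by size keeps chunks dense while respecting the budget, which is one plausible way to keep each analysis pass within a model's effective context window.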
Implications for the Industry
This result validates SWARM as a production-ready agentic security testing platform that builds deep, structured threat models across entire codebases. It highlights the incredible potential of combining state-of-the-art foundation models with multi-agent orchestration for comprehensive vulnerability detection.
We're committed to transparency: all EVMbench runs are reproducible, and we've open-sourced our evaluation harness and full results. Researchers and competitors can verify our findings and benchmark against SWARM by visiting our official repository: failsafe-swarm-evmbench.
Future Plans
- Expanded Benchmarking: We'll run SWARM on SWE-bench (general coding) and CyberSecEval to demonstrate multi-language capabilities.
- Enterprise Adoption: SWARM is now available for beta testing. Contact us for access.
Ready to secure your project?
Get in touch with our security experts for a comprehensive audit.
Contact Us