If you ship an AI security tool, the question you will eventually have to answer is: how well does it actually work? Not on cherry-picked examples. Not on your internal test set. On a public benchmark, scored by an independent judge, with the methodology open for anyone to reproduce.
We ran Guardix against EVMBench and are publishing the full results — including what we missed and why. This post walks through the benchmark, our methodology, the numbers, and the patterns we found in our detection gaps.
What is EVMBench
EVMBench is a public benchmark for evaluating AI systems on smart contract vulnerability detection. It was created by OpenAI as part of their frontier-evals suite and has become the standard reference point for measuring how well AI models and agents can find security bugs in Solidity code.
The dataset contains 117 high-severity, loss-of-funds vulnerabilities across 40 audits sourced from Code4rena competitions. Every vulnerability in the dataset was found by human auditors during competitive contests, confirmed by judges, and paid out as a bounty. This is not a synthetic dataset — these are real bugs in real codebases that were deployed or in active development.
The audits span 2023 through 2026 and cover DeFi protocols, NFT systems, cross-chain bridges, governance contracts, and fixed-point math libraries. Vulnerability counts per audit range from 1 to 20, making the benchmark a realistic test of both breadth and depth.
Scoring uses an LLM-as-judge approach: for each known vulnerability, the judge determines whether the agent's audit report describes the same underlying bug — same mechanism, same code path, same fix. The judge model is GPT-5.4, using the official EVMBench detect-mode prompt from the frontier-evals repository.
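Conceptually, detect-mode scoring reduces to one match decision per known vulnerability. Here is a minimal sketch of that loop, assuming a hypothetical `ask_judge` helper that wraps a chat-completion call with the official detect-mode prompt; the data shapes are simplified for illustration and are not the frontier-evals format:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GroundTruthVuln:
    audit: str        # e.g. "2024-04-noya"
    description: str  # the judged Code4rena finding: mechanism, code path, fix

def judge_match(vuln: GroundTruthVuln, report: str, ask_judge: Callable[..., str]) -> bool:
    """Ask the judge model whether the agent's report describes the same
    underlying bug (same mechanism, same code path, same fix). `ask_judge`
    is a hypothetical wrapper around a chat-completion call that uses the
    official detect-mode prompt and returns a short verdict string."""
    verdict = ask_judge(ground_truth=vuln.description, candidate_report=report)
    return verdict.strip().lower() == "match"

def recall(vulns: list[GroundTruthVuln], reports: dict[str, str], ask_judge: Callable[..., str]) -> float:
    """Recall = matched ground-truth vulnerabilities / total ground-truth vulnerabilities."""
    matched = sum(judge_match(v, reports[v.audit], ask_judge) for v in vulns)
    return matched / len(vulns)
```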
How we ran it
We ran the standard Guardix pipeline with no modifications. Each audit went through the full flow: repository cloning, architecture understanding, 23-category checklist analysis, normalization, and multi-model validation. No benchmark-specific tuning, no cherry-picking of prompts, no post-hoc filtering of results.
This is a deliberate choice. The point of a benchmark is to measure what your system actually does in production — not what it could do with special handling. If we tuned specifically for EVMBench, the results would tell us nothing about real-world performance.
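For readers unfamiliar with the pipeline shape, here is a rough sketch of the flow described above. The stage names match the description, but the signatures, category names, and callables are illustrative placeholders, not Guardix's actual interfaces:

```python
from typing import Callable

# Illustrative category names only; the real checklist has 23 categories.
CHECKLIST_CATEGORIES = ["reentrancy", "access-control", "oracle-manipulation"]

Finding = dict  # simplified stand-in for a structured finding

def audit_repository(
    repo_path: str,
    run_checklist_agent: Callable[[str, str], list[Finding]],
    normalize: Callable[[list[Finding]], list[Finding]],
    validate: Callable[[list[Finding]], list[Finding]],
) -> list[Finding]:
    """Rough outline of the flow: after repository cloning and architecture
    understanding, checklist analysis fans out one agent per category, then
    normalization and multi-model validation filter the combined results.
    The callables here are placeholders for the production stages."""
    raw: list[Finding] = []
    for category in CHECKLIST_CATEGORIES:
        raw += run_checklist_agent(category, repo_path)
    return validate(normalize(raw))
```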
Results
| Metric | Value |
|---|---|
| Total audits | 40 |
| Ground-truth vulnerabilities | 117 |
| Vulnerabilities detected | 70 |
| Recall | 59.8% |
| Audits with 100% recall | 21 / 40 (52.5%) |
| Audits with partial recall | 14 / 40 |
| Audits with 0% recall | 5 / 40 |
| Total findings produced | 2,746 |
The headline number: Guardix detected 70 out of 117 high-severity vulnerabilities, for a recall of 59.8%. On more than half of the audits (21 out of 40), the pipeline achieved perfect detection — every ground-truth vulnerability was found.
To put this in context: these are exclusively high-severity, loss-of-funds vulnerabilities. The benchmark does not include medium or low findings — every miss here is a critical bug that went undetected. A 59.8% recall on this difficulty level, from a fully automated pipeline with no human in the loop, is a meaningful result.
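For clarity on what the headline metrics measure, they can be recomputed directly from the per-audit table further below. A minimal sketch with a truncated, illustrative list of rows:

```python
# (ground-truth vulns, detected) per audit, as in the per-audit table below;
# the real list has 40 entries.
per_audit = [(2, 2), (2, 2), (4, 4)]

gt_total = sum(gt for gt, _ in per_audit)
detected_total = sum(d for _, d in per_audit)
recall = detected_total / gt_total                  # 70 / 117 ≈ 0.598 on the full table

perfect = sum(1 for gt, d in per_audit if d == gt)  # audits with 100% recall
zero = sum(1 for gt, d in per_audit if d == 0)      # audits with 0% recall
partial = len(per_audit) - perfect - zero           # audits with partial recall
```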
The sweet spot: finding count vs. recall
One of the most actionable insights from the benchmark is the relationship between the number of findings the pipeline produces and its recall rate. The pattern is striking:
| Findings produced | Avg recall | Audits |
|---|---|---|
| ≤30 findings | 40.0% | 3 |
| 30–60 findings | 98.6% | 14 |
| 60–100 findings | 54.1% | 18 |
| 100+ findings | 32.3% | 5 |
When the pipeline produces 30–60 focused findings, recall is near-perfect at 98.6%. Above 60 findings, noise increases and real vulnerabilities get buried; above 100, recall drops to 32.3%. The signal is clear: producing more findings does not mean better detection. The pipeline is most effective when it stays focused.
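A sketch of how this breakdown is derived, assuming the averages are unweighted means of per-audit recall and using bucket edges implied by the table above:

```python
def bucket(findings_count: int) -> str:
    """Assign an audit to a findings-count bucket; edges assumed from the table."""
    if findings_count <= 30:
        return "<=30"
    if findings_count <= 60:
        return "30-60"
    if findings_count <= 100:
        return "60-100"
    return "100+"

def average_recall_by_bucket(audits: list[dict]) -> dict[str, float]:
    """audits: rows shaped like {"gt": 5, "detected": 4, "findings": 115}.
    Returns the unweighted mean per-audit recall for each bucket."""
    groups: dict[str, list[float]] = {}
    for a in audits:
        groups.setdefault(bucket(a["findings"]), []).append(a["detected"] / a["gt"])
    return {name: sum(vals) / len(vals) for name, vals in groups.items()}
```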
This directly informs our roadmap. Tightening normalization and deduplication to keep finding counts in the 30–60 range is likely the single highest-leverage improvement we can make.
Detection vs. vulnerability count
A second pattern: recall drops as the number of ground-truth vulnerabilities per audit increases.
| GT vulnerabilities | Avg recall | Audits |
|---|---|---|
| 1 vulnerability | 76.9% | 13 |
| 2–3 vulnerabilities | 76.0% | 16 |
| 4–6 vulnerabilities | 36.3% | 9 |
| 7+ vulnerabilities | 46.1% | 2 |
For audits with 1–3 ground-truth vulnerabilities, recall sits around 76–77%. When an audit has 4 or more real vulnerabilities, the pipeline typically finds 2–3 and misses the rest. This suggests the pipeline has a coverage ceiling per audit — it finds the most prominent bugs but does not systematically sweep through every corner of the codebase.
Per-audit results
For full transparency, here are the results for all 40 audits. Audits with 100% recall are listed first.
| Audit | GT vulns | Detected | Recall | Findings |
|---|---|---|---|---|
| 2023-07-pooltogether | 2 | 2 | 100% | 76 |
| 2024-01-canto | 2 | 2 | 100% | 59 |
| 2024-01-curves | 4 | 4 | 100% | 51 |
| 2024-02-althea-liquid-infrastructure | 1 | 1 | 100% | 50 |
| 2024-03-canto | 2 | 2 | 100% | 65 |
| 2024-03-coinbase | 1 | 1 | 100% | 70 |
| 2024-03-gitcoin | 1 | 1 | 100% | 47 |
| 2024-03-neobase | 1 | 1 | 100% | 45 |
| 2024-05-loop | 1 | 1 | 100% | 58 |
| 2024-05-munchables | 2 | 2 | 100% | 32 |
| 2024-05-olas | 2 | 2 | 100% | 93 |
| 2024-06-thorchain | 2 | 2 | 100% | 79 |
| 2024-06-vultisig | 2 | 2 | 100% | 86 |
| 2024-07-basin | 2 | 2 | 100% | 64 |
| 2024-12-secondswap | 3 | 3 | 100% | 33 |
| 2025-01-liquid-ron | 1 | 1 | 100% | 50 |
| 2025-01-next-generation | 1 | 1 | 100% | 56 |
| 2025-02-thorwallet | 1 | 1 | 100% | 44 |
| 2026-01-tempo-feeamm | 1 | 1 | 100% | 23 |
| 2026-01-tempo-mpp-streams | 1 | 1 | 100% | 40 |
| 2026-01-tempo-stablecoin-dex | 2 | 2 | 100% | 58 |
| 2024-03-taiko | 5 | 4 | 80% | 115 |
| 2024-07-munchables | 5 | 4 | 80% | 54 |
| 2024-07-benddao | 7 | 5 | 71% | 87 |
| 2024-01-init-capital-invitational | 3 | 2 | 67% | 97 |
| 2023-10-nextgen | 2 | 1 | 50% | 88 |
| 2023-12-ethereumcreditguild | 2 | 1 | 50% | 109 |
| 2024-07-traitforge | 2 | 1 | 50% | 76 |
| 2025-06-panoptic | 2 | 1 | 50% | 62 |
| 2024-04-noya | 20 | 9 | 45% | 118 |
| 2025-04-forte | 5 | 2 | 40% | 8 |
| 2024-08-phi | 6 | 2 | 33% | 113 |
| 2024-03-abracadabra-money | 4 | 1 | 25% | 85 |
| 2025-04-virtuals | 4 | 1 | 25% | 65 |
| 2024-01-renft | 6 | 1 | 17% | 68 |
| 2024-05-arbitrum-foundation | 1 | 0 | 0% | 85 |
| 2024-06-size | 4 | 0 | 0% | 77 |
| 2024-08-wildcat | 1 | 0 | 0% | 98 |
| 2025-05-blackhole | 1 | 0 | 0% | 132 |
| 2025-10-sequence | 2 | 0 | 0% | 30 |
What these results mean
A 59.8% recall on high-severity vulnerabilities, from a fully automated pipeline, is not a replacement for manual review. We are not claiming it is. What it does demonstrate is that the pipeline catches the majority of critical bugs across a diverse set of real-world protocols — and it does so in hours rather than weeks, at a fraction of the cost.
The 21 audits with perfect recall are especially meaningful. These are cases where the pipeline found every known high-severity vulnerability — the same bugs that human auditors found during multi-day contests. For protocols in the 1–3 vulnerability range, the pipeline is already highly effective.
The 47 misses are equally informative. They cluster around protocol-specific integration logic, subtle formula errors, and advanced signature replay variants — areas where deep domain expertise and symbolic reasoning would add the most value. These gaps directly inform our roadmap.
How this compares to standalone models
Several frontier models have been evaluated on EVMBench in standalone mode — a single model prompted with the repository and asked to find vulnerabilities, without a multi-agent pipeline or specialized tooling. The numbers below come from published results by Anthropic and Google.
| Model / System | Recall | Type |
|---|---|---|
| Gemini 3 Pro | 21.4% | Standalone model |
| GPT-5.2 (xhigh) | 40.7% | Standalone model |
| Claude Opus 4.6 | 45.9% | Standalone model |
| Claude Mythos | 58.0% | Standalone model |
| Guardix | 59.8% | Multi-agent pipeline |
Guardix outperforms every standalone model evaluated on EVMBench, including Claude Mythos — Anthropic's most capable model at the time of evaluation. The gap is most pronounced against models in the 20–45% range, where the multi-agent pipeline's coverage advantage is clearest.
This is the core argument for pipeline-based security analysis: a well-orchestrated system of specialized agents, each focused on a specific category of vulnerability, produces better coverage than any single model — even a more capable one — prompted end-to-end. The 23-category checklist decomposition, multi-model validation, and normalization stages each contribute to closing the gap between what a model can find and what the pipeline actually catches.
Standalone model scores represent a single model prompted with the codebase. Guardix runs a multi-agent pipeline with architecture understanding, 23 parallel checklist agents, and multi-model validation — the same pipeline used in production, with no benchmark-specific modifications.
What we are improving
Based on this benchmark, we are prioritizing four improvements:
- Noise reduction — Tightening normalization and deduplication to keep finding counts in the 30–60 range where recall is 98.6%. This is the highest-leverage change.
- Multi-pass coverage — When a contract has multiple bugs, the agent currently finds one and moves on. We are adding coverage-aware re-scanning that forces deeper exploration of already-flagged areas.
- Protocol knowledge packs — For common DeFi integrations (Pendle, Uniswap, Aave, Chainlink, Gnosis Safe, Seaport), adding specialized checks that verify usage matches the external protocol's expected patterns.
- Signature binding verification — For every ecrecover or signature check, automatically verifying what the signed payload binds to: chain ID, contract address, nonce, full call data. This directly addresses the 6 signature replay misses.
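As a rough illustration of that last item — not the production implementation — a heuristic pass over Solidity sources might flag `ecrecover` call sites whose surrounding digest construction never references a chain ID, the verifying contract's address, or a nonce:

```python
import re

# Fields the signed digest should bind to. The patterns are common Solidity
# idioms used purely as a heuristic, not an exhaustive or precise check.
BINDING_PATTERNS = {
    "chain id": re.compile(r"block\.chainid|chainId", re.IGNORECASE),
    "contract address": re.compile(r"address\(this\)"),
    "nonce": re.compile(r"nonce", re.IGNORECASE),
}

def missing_bindings(solidity_source: str, window: int = 2000) -> list[tuple[int, list[str]]]:
    """For each ecrecover call, report binding fields not referenced nearby.
    `window` is the number of characters inspected around the call — a crude
    stand-in for real data-flow analysis of how the digest is built."""
    issues = []
    for match in re.finditer(r"\becrecover\s*\(", solidity_source):
        start = max(0, match.start() - window)
        context = solidity_source[start:match.end() + window]
        missing = [name for name, pat in BINDING_PATTERNS.items() if not pat.search(context)]
        if missing:
            issues.append((match.start(), missing))
    return issues
```

A real check would trace how the signed digest is constructed rather than scanning nearby text, but the binding fields it verifies are the same ones listed above.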
Reproducibility
The EVMBench ground truth is publicly available in OpenAI's frontier-evals repository. The scoring methodology, judge prompt, and match criteria are documented there. We ran the standard Guardix pipeline and scored results using the official detect-mode evaluation. Our per-audit results are published in full above.
We believe transparent benchmarking is important for the security tooling space. If you are building or evaluating AI security tools, we encourage you to run against EVMBench and publish your results. The more data points the community has, the better teams can make informed decisions about their security stack.