
OpenAI EVMbench

In partnership with the cryptocurrency investment firm Paradigm, OpenAI has unveiled EVMbench, a novel benchmark intended to assess the capacity of AI agents to identify, remediate, and exploit critical vulnerabilities in smart contracts.

This release signifies a crucial advancement in quantifying AI abilities within economically impactful settings, given that smart contracts collectively safeguard more than $100 billion in open-source cryptocurrency assets.

EVMbench is based on 120 carefully selected vulnerabilities obtained from 40 security audits, with most originating from open-source code audit competitions on platforms like Code4rena.

The benchmark additionally includes vulnerability scenarios generated from the security assessment of the Tempo blockchain, a purpose-built Layer 1 aimed at high-throughput stablecoin transactions. This extends EVMbench into payment-related smart contract code, an area where agentic stablecoin exchanges are predicted to grow significantly.

Three Assessment Modes

EVMbench assesses AI agents in three distinct capability modes, each focused on a different stage of the smart contract security lifecycle.

Detect: Agents examine a smart contract repository and are evaluated on their recall of verified vulnerabilities and the related audit rewards.

Patch: Agents alter vulnerable contracts while maintaining their intended functionality, validated through automated testing and exploit assessments.

Exploit: Agents carry out end-to-end fund-draining attacks against deployed contracts in a controlled blockchain environment, graded through transaction replay and on-chain validation.
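The detect mode's recall-based scoring can be illustrated with a short sketch. This is not EVMbench's actual scoring code (the names, schema, and the reward-weighted variant are assumptions for illustration); it just shows how recall over verified vulnerabilities, optionally weighted by audit rewards, might be computed:

```python
def detect_recall(reported, verified_vulns):
    """Score a detect-mode run as recall over verified vulnerabilities.

    `reported` is the set of vulnerability IDs the agent flagged;
    `verified_vulns` maps each verified vulnerability ID to its audit
    reward, so recall can also be weighted by reward value.
    (Both names are illustrative, not EVMbench's actual schema.)
    """
    found = reported & verified_vulns.keys()
    plain_recall = len(found) / len(verified_vulns)
    reward_recall = (
        sum(verified_vulns[v] for v in found) / sum(verified_vulns.values())
    )
    return plain_recall, reward_recall


# Example: the agent finds 2 of 3 verified vulnerabilities, but misses
# the one carrying the largest audit reward.
plain, weighted = detect_recall(
    {"V1", "V3"},
    {"V1": 500.0, "V2": 1500.0, "V3": 1000.0},
)
```

Weighting by reward captures the intuition that missing a high-severity (highly rewarded) finding should cost more than missing a minor one.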

To facilitate reproducible assessments, OpenAI created a Rust-based harness that deploys contracts deterministically and limits unsafe RPC functions. All exploitation tasks take place in an isolated local Anvil environment rather than on active networks.
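The actual harness is written in Rust, but the idea of limiting unsafe RPC functions can be sketched in a few lines of Python. The denylist below is a guess at the kind of Anvil "cheatcode" methods such a harness would block (methods that would let an agent mint balances or impersonate accounts instead of finding a real exploit), not OpenAI's exact set:

```python
import json

# Illustrative denylist of Anvil cheatcode-style JSON-RPC methods.
# (Assumption: the real harness blocks methods like these; the exact
# set is not published in the source above.)
UNSAFE_METHODS = {
    "anvil_setBalance",
    "anvil_setCode",
    "anvil_setStorageAt",
    "anvil_impersonateAccount",
    "evm_setNextBlockTimestamp",
}


def filter_rpc(raw_request: str):
    """Return the JSON-RPC request unchanged if its method is allowed,
    or None if the method would trivialize the exploit task."""
    request = json.loads(raw_request)
    if request.get("method") in UNSAFE_METHODS:
        return None
    return raw_request
```

A gatekeeper like this would sit between the agent and the local Anvil node, so every exploit has to go through ordinary transactions rather than test-only state manipulation.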



Frontier model performance on EVMbench shows distinct behavioral differences across task types. In exploit mode, GPT‑5.3‑Codex achieved a score of 72.2%, a significant advancement over GPT‑5, which recorded 31.9% roughly six months earlier.

Agents consistently perform best on exploit tasks, where the goal is unambiguous: drain funds and iterate until successful. Detect and patch modes are more challenging, with agents occasionally stopping after pinpointing a single vulnerability instead of completing a comprehensive audit, and struggling to eliminate subtle flaws without breaking existing contract functionality.

OpenAI acknowledged that EVMbench does not fully capture the complexity of real-world smart contract security, and that its scoring system currently cannot distinguish genuine vulnerabilities from false positives when agents report issues beyond the human-auditor baseline.

Alongside the benchmark introduction, OpenAI pledged $10 million in API credits through its Cybersecurity Grant Program to enhance defensive security research, especially for open-source software and crucial infrastructure.

Additionally, the company announced the expansion of Aardvark, its security research agent, through a private beta program. The tasks, tools, and evaluation framework of EVMbench have been made available to the public to encourage ongoing research into AI-driven cybersecurity capabilities.
