Static analysis scores 0% on real bridge exploits. What does an LLM score?

The most expensive smart-contract bugs are not pattern matches. A linter is built to catch a missing nonReentrant modifier or an unchecked return value; the bugs that actually drain protocols look nothing like that, and they fail a static tool in two different ways. Some are compositional — a flash loan that moves a price, that unlocks an oracle path, that authorizes a withdrawal, three individually-legal steps that compose into theft (Mango Markets, ~$116M; Cream, ~$130M; the bZx attacks). Others are trust and verification failures: the Ronin, Wormhole, and Nomad bridge hacks — roughly $1.1B between them — were, respectively, stolen validator keys, a signature check bypassed with a forged account, and a trusted Merkle root left initialized to 0x00. Neither class is a pattern a linter matches; both demand reasoning about what the code means across calls and contracts. So I wanted a number for something I’d long assumed: that the tools we rely on are structurally blind to the bugs that actually drain bridges.
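To make "individually-legal steps that compose into theft" concrete, here is a minimal toy sketch of the price-manipulation shape. It is not a model of Mango, Cream, or bZx: the `Pool` class, the `borrow_limit` check, the 50% loan-to-value ratio, and every number are assumptions chosen for illustration. Each function is locally correct; the problem only appears when the calls are read in sequence.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """Hypothetical constant-product pool whose spot price is naively used as an oracle."""
    token: float  # reserve of the collateral token
    usd: float    # reserve of the stablecoin

    def spot_price(self) -> float:
        return self.usd / self.token

    def buy_token(self, usd_in: float) -> float:
        """Swap stablecoin for token along x * y = k (fees ignored for simplicity)."""
        k = self.token * self.usd
        self.usd += usd_in
        token_out = self.token - k / self.usd
        self.token -= token_out
        return token_out


def borrow_limit(collateral: float, oracle_price: float, ltv: float = 0.5) -> float:
    """The lender's local check: borrow up to `ltv` of the collateral's oracle value."""
    return collateral * oracle_price * ltv


def run_attack() -> tuple:
    pool = Pool(token=1_000.0, usd=1_000.0)  # fair price: $1 per token
    collateral = 100.0                       # attacker's tokens, fairly worth about $100
    fair_limit = borrow_limit(collateral, pool.spot_price())

    pool.buy_token(9_000.0)                  # step 1: a flash-loaned swap moves the price
    inflated_price = pool.spot_price()       # step 2: the oracle reads the manipulated price
    manipulated_limit = borrow_limit(collateral, inflated_price)  # step 3: the check passes
    return fair_limit, manipulated_limit


fair_limit, manipulated_limit = run_attack()
```

With these made-up reserves, the same 100 tokens justify a $50 loan at the fair price and a $5,000 loan after the swap. No single function contains a bug a pattern matcher could point at.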

I built a small benchmark — ten real bridge exploits, about $1.2B in historical losses — and ran two kinds of analysis against the actual exploit paths.

The result

Static analysis scored roughly 0% F1. (F1 is the harmonic mean of precision and recall: 100% means catching every real vulnerability and crying wolf on none.) An LLM doing multi-turn reasoning over the same contracts scored about 40%.

That 0% deserves an honest unpacking, because it’s easy to misread. The static tools aren’t broken; they report plenty of findings. They scored ~0 because the ground truth here is the real, exploitable, compositional vulnerability, and a tool that reasons one function at a time has no representation for “this becomes exploitable three calls later when the price is wrong.” It flags the reentrancy guard you already have and misses the economic path that empties the vault. Against that ground truth, precision and recall both collapse.
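The arithmetic behind "plenty of findings, ~0% F1" can be sketched in a few lines. The formula is the standard F1 definition; the counts passed in below are illustrative assumptions, not the benchmark's actual confusion matrix.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    # With no true positives, F1 is 0 no matter how many findings were reported.
    return 0.0 if denom == 0 else 2 * tp / denom


# Hypothetical tool A: many findings, none matching the ground truth.
noisy_tool = f1_score(tp=0, fp=56, fn=10)

# Hypothetical tool B: fewer findings, some of them real.
reasoning_tool = f1_score(tp=3, fp=2, fn=7)
```

The first call returns 0.0 and the second 0.4. Once true positives are zero, reporting more findings cannot raise the score.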

The LLM’s 40% isn’t “solved” either; it’s a long way from a tool you’d trust unattended. But the gap from 0 to 40 is the whole point: it’s the part of the problem that only yields to following an attack across calls, holding state in your head, and asking “what does this enable” rather than “does this line match a bad pattern.”

Where the 40% comes from — and what it costs

I measured a cost/accuracy frontier rather than a single point, because the interesting question is how much reasoning you have to pay for:

Approach	F1	Cost / contract
Static (multi-tool)	~0%	free
Multi-tool + light LLM	~20%	$0.01
Hybrid (static pre-filter → LLM)	~40%	$0.08
Full frontier-model reasoning	~45%	$0.44

The cost/accuracy frontier. The hybrid point buys ~40% F1 at $0.08 — nearly all the accuracy of full frontier reasoning at a fraction of the cost.
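One way to read the frontier is as the marginal price of each additional F1 point. The sketch below takes the four rows of the table as given (they are approximate figures, so the outputs are approximate too); the `marginal_costs` helper is a hypothetical name introduced here.

```python
# (approach, F1 in percentage points, cost per contract in USD), copied from the table.
FRONTIER = [
    ("Static (multi-tool)", 0.0, 0.00),
    ("Multi-tool + light LLM", 20.0, 0.01),
    ("Hybrid (static pre-filter -> LLM)", 40.0, 0.08),
    ("Full frontier-model reasoning", 45.0, 0.44),
]


def marginal_costs(points):
    """Dollars paid per extra F1 point when stepping up from the previous row."""
    steps = []
    for (_, f1_prev, cost_prev), (name, f1, cost) in zip(points, points[1:]):
        steps.append((name, (cost - cost_prev) / (f1 - f1_prev)))
    return steps


for name, dollars_per_point in marginal_costs(FRONTIER):
    print(f"{name}: ${dollars_per_point:.4f} per F1 point")
```

On these figures, the step up to the hybrid costs about $0.0035 per F1 point, while the last five points of full frontier reasoning cost about $0.072 each, roughly twenty times more.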

The hybrid point is the one I’d ship. It gets roughly 40 of the 45 F1 points of full frontier-model reasoning at under a fifth of the cost ($0.08 versus $0.44 per contract), and the reason is a pre-filter: run three static tools (Slither, Mythril, and a custom pass), keep only findings they agree on, and hand the LLM that shortlist instead of the raw contract. A single analyzer produced 56 false positives; three-tool consensus cut that to under 10; the LLM on the filtered set added none. Most of the LLM’s budget was being burned adjudicating noise the static tools could have filtered for almost nothing, so let them.

Three-tool consensus cuts 56 false positives to under 10; the LLM, handed only the agreed shortlist, adds none.
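The consensus pre-filter can be sketched as a vote over normalized findings. This is an assumption about the shape of the pipeline, not its actual code: it presumes each tool's output has already been normalized to a `(contract, function, issue_class)` key, it does not call any real Slither or Mythril API, and the sample findings are invented. "Agree" is read here as all tools reporting the same key, with `min_votes` as a hypothetical knob for looser agreement.

```python
from collections import Counter
from typing import Dict, Optional, Set, Tuple

# Hypothetical normalized finding key: (contract, function, issue_class).
Finding = Tuple[str, str, str]


def consensus(reports: Dict[str, Set[Finding]], min_votes: Optional[int] = None) -> Set[Finding]:
    """Keep findings reported by at least `min_votes` tools (default: all of them)."""
    if min_votes is None:
        min_votes = len(reports)
    # Each tool's findings are a set, so a tool votes at most once per finding.
    votes = Counter(f for findings in reports.values() for f in findings)
    return {f for f, n in votes.items() if n >= min_votes}


# Invented sample data, for illustration only.
reports = {
    "slither": {("Bridge", "process", "unchecked-root"), ("Bridge", "withdraw", "reentrancy")},
    "mythril": {("Bridge", "process", "unchecked-root"), ("Vault", "sweep", "tx-origin")},
    "custom": {("Bridge", "process", "unchecked-root"), ("Bridge", "withdraw", "reentrancy")},
}

shortlist = consensus(reports)  # only this shortlist would be handed to the LLM
```

In this sample, only the `unchecked-root` finding survives unanimous agreement; lowering `min_votes` to 2 would also keep the `reentrancy` finding. The strict setting trades recall for a shorter, cheaper shortlist.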

The honest caveats

Two things I want to state plainly, because a benchmark that hides them isn’t worth much:

The ground truth is expanded. Most benchmarks only score against vulnerabilities that were historically exploited. I scored against all detectable security issues in each contract, because “0% F1 against the single historical exploit” is a statistically meaningless number — too sparse to compare methods. Expanding the ground truth is what turns this into a measurable comparison, but it is a choice, and it shapes the absolute numbers. The relative story — static near zero, reasoning meaningfully above it — is what I’d defend.
It’s one domain, finished. The bridge slice is complete; DEX/AMM and lending are in progress. I’m not claiming a general result across DeFi yet. One thing that did hold across what I’ve run: the same prompt worked for bridges, AMMs, and lending with no domain-specific tuning, which is weak evidence that the reasoning is general rather than memorized.

Why I keep coming back to this

I audit bridges by writing the invariant the contracts must never violate and fuzzing until something breaks it. That works because I already know the shape of the bug I’m hunting. The open question this benchmark is really about is whether a model can find that shape first — surface the compositional path a human should go write an invariant for. Forty percent says: not reliably, not yet, but well past the tools we currently point at this problem. That gap, and not any single F1 number, is the thing worth working on.
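For readers who haven't seen the invariant-and-fuzz workflow, here is a minimal sketch of it on a toy model. Everything in it is hypothetical: `ToyBridge` is not any real bridge, the planted bug is a missing replay check, and the invariant ("never mint more than is locked") is one plausible choice among many. A real audit would run this against the actual contracts with a proper fuzzing framework.

```python
import random


class ToyBridge:
    """Hypothetical bridge: deposits lock funds, relays mint the wrapped asset."""

    def __init__(self, replay_protection: bool):
        self.replay_protection = replay_protection
        self.locked = 0
        self.minted = 0
        self.deposits = []      # deposits[i] is the amount locked by deposit i
        self.processed = set()  # deposit ids that have already been relayed

    def deposit(self, amount: int) -> None:
        self.locked += amount
        self.deposits.append(amount)

    def relay(self, deposit_id: int) -> None:
        if deposit_id >= len(self.deposits):
            return  # unknown message, ignored
        if self.replay_protection and deposit_id in self.processed:
            return  # the check the buggy variant is missing
        self.processed.add(deposit_id)
        self.minted += self.deposits[deposit_id]


def invariant_holds(bridge: ToyBridge) -> bool:
    """The property that must never break: minted supply is backed by locked funds."""
    return bridge.minted <= bridge.locked


def fuzz(replay_protection: bool, runs: int = 2000, steps: int = 20, seed: int = 0):
    """Try many short random call sequences; return the first one that breaks the invariant."""
    rng = random.Random(seed)
    for _ in range(runs):
        bridge = ToyBridge(replay_protection)
        trace = []
        for _ in range(steps):
            if rng.random() < 1 / 3:
                action = ("deposit", rng.randint(1, 100))
                bridge.deposit(action[1])
            else:
                action = ("relay", rng.randrange(len(bridge.deposits) + 1))
                bridge.relay(action[1])
            trace.append(action)
            if not invariant_holds(bridge):
                return trace
    return None


counterexample = fuzz(replay_protection=False)  # a call sequence that over-mints
clean = fuzz(replay_protection=True)            # None: no violation found
```

The fuzzer only finds the replay because someone wrote `invariant_holds` first. That is the step the benchmark asks whether a model can help with: proposing which invariant is worth writing.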