Greedy was enough: active learning on top of a pretrained potential

DeepMind’s GNoME found roughly 380,000 stable inorganic materials by pairing graph neural networks with active learning, which is a simple loop: train a cheap surrogate, let it pick which candidates are worth an expensive simulation, label those, repeat. The choice of which to pick is the whole game, and I wanted to know how much it matters when the surrogate is already very good.

So I built a small version: a pool of 2,000 candidate structures from the Materials Project, a pretrained CHGNet potential as the surrogate, and a labeling budget of 400 — 20% of the pool. The question: with that budget, how many of the 100 most stable structures can you actually find, and does a clever acquisition strategy beat a dumb one?
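As a rough illustration of the setup, here is a minimal sketch of a pool-based loop with a frozen surrogate and a top-k recall metric. It is not the experiment code: the function names, the batch size of 50, and the synthetic energies standing in for CHGNet predictions and simulation labels are all assumptions made for the example.

```python
import numpy as np

def top_k_recall(labeled_idx, true_energy, k=100):
    """Fraction of the k lowest-energy structures that ended up labeled."""
    top_k = np.argsort(true_energy)[:k]
    return float(np.isin(top_k, labeled_idx).mean())

def greedy(mu, sigma, n, rng):
    # Lower energy means more stable, so take the n lowest predicted means.
    return np.argsort(mu)[:n]

def random_pick(mu, sigma, n, rng):
    return rng.choice(len(mu), size=n, replace=False)

def run_loop(predict, oracle, n_pool, budget, batch_size, select, seed=0):
    """Spend `budget` labels in batches; `predict` is a frozen surrogate."""
    rng = np.random.default_rng(seed)
    unlabeled = np.arange(n_pool)
    labeled = []
    while len(labeled) < budget:
        mu, sigma = predict(unlabeled)
        n = min(batch_size, budget - len(labeled))
        picks = select(mu, sigma, n, rng)      # positions within `unlabeled`
        for i in unlabeled[picks]:
            oracle(int(i))                     # the expensive simulation goes here
            labeled.append(int(i))
        unlabeled = np.delete(unlabeled, picks)
    return np.array(labeled)

# Synthetic stand-ins (assumed): "true" energies plus a noisy surrogate.
data_rng = np.random.default_rng(0)
true_e = data_rng.normal(-3.0, 0.5, size=2000)
pred_e = true_e + data_rng.normal(0.0, 0.05, size=2000)
predict = lambda idx: (pred_e[idx], np.zeros(len(idx)))
oracle = lambda i: true_e[i]

labeled_greedy = run_loop(predict, oracle, 2000, 400, 50, greedy)
labeled_random = run_loop(predict, oracle, 2000, 400, 50, random_pick)
recall_greedy = top_k_recall(labeled_greedy, true_e)
recall_random = top_k_recall(labeled_random, true_e)
print(f"greedy recall: {recall_greedy:.2f}, random recall: {recall_random:.2f}")
```

With a frozen surrogate the greedy loop reduces to taking the 400 lowest predictions, but keeping the batch structure makes it easy to swap in a selector that uses sigma or a surrogate that is retrained between rounds.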

The result

Strategy	Top-100 recall	Best found (eV/atom)	Labeled
Random	25%	−4.375	400 / 2000
Greedy (mean)	95%	−4.403	400 / 2000
UCB (uncertainty-aware)	93%	−4.403	400 / 2000

Top-100 recall on a 400/2000 labeling budget. Both active strategies recover ~94% of the best structures; greedy and UCB essentially tie.

Both active strategies recovered 93–95% of the best structures while labeling a fifth of the pool; random sampling got 25%. That part is the expected GNoME-style win — active learning works, and it works hard.
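A quick sanity check on the random row, as back-of-envelope arithmetic rather than anything from the experiment itself: labeling 400 of 2,000 structures uniformly at random should catch about 20% of the top 100 on average, with the hit count following a hypergeometric distribution. The sketch below works out that expectation and how far the observed 25% sits from it.

```python
import math

pool, budget, k = 2000, 400, 100

# Hypergeometric: draw `budget` of `pool` without replacement,
# count how many of the top-k structures land in the draw.
expected_hits = budget * k / pool
var_hits = budget * (k / pool) * (1 - k / pool) * (pool - budget) / (pool - 1)
std_hits = math.sqrt(var_hits)

expected_recall = expected_hits / k
z_observed = (25 - expected_hits) / std_hits   # observed: 25 of the top 100

print(f"expected recall: {expected_recall:.0%}, std: {std_hits:.1f} hits")
print(f"observed 25 hits is {z_observed:.1f} std above expectation")
```

So the 25% is roughly 1.3 standard deviations above the 20% expectation, which is unremarkable for a single seed, and the gap to the active strategies is far outside that noise.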

The part I didn’t expect, and the reason I think this is worth writing down: greedy and UCB essentially tied. I went in assuming the uncertainty-aware strategy — pick where the surrogate is both promising and unsure, to balance exploration against exploitation — would pull ahead. It didn’t.

Why the clever method didn’t win

The tie isn’t a null result; it’s a measurement of the surrogate. UCB only beats greedy when the surrogate’s uncertainty carries information greedy is ignoring — when “promising but unsure” candidates turn out to be where the wins hide. CHGNet is a pretrained, physics-informed potential. Its mean prediction of stability is already accurate enough across this pool that there’s very little signal left in its uncertainty for exploration to exploit. The exploitation term alone is almost optimal, so adding an exploration term mostly reorders ties.
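To make "mostly reorders ties" concrete, here is a small sketch of the two acquisition rules under stated assumptions. Because lower energy is better, the "UCB" score is written as a lower confidence bound, mu - kappa * sigma; the value kappa = 1.0, the function names, and the synthetic mu and sigma arrays are all illustrative choices, not values from the experiment.

```python
import numpy as np

def greedy_select(mu, n):
    # Pure exploitation: the n lowest predicted energies.
    return np.argsort(mu)[:n]

def ucb_select(mu, sigma, n, kappa=1.0):
    # Optimistic score for a minimization problem: reward low mean AND high uncertainty.
    return np.argsort(mu - kappa * sigma)[:n]

def overlap(a, b):
    return len(set(a.tolist()) & set(b.tolist())) / len(a)

rng = np.random.default_rng(0)
mu = rng.normal(-3.0, 0.5, size=2000)            # synthetic predicted energies (assumed)

# Case 1: sigma is the same everywhere, so it shifts every score equally.
sigma_flat = np.full(2000, 0.02)
# Case 2: sigma varies, but is tiny compared with the spread in mu.
sigma_small = rng.uniform(0.0, 0.02, size=2000)

g = greedy_select(mu, 400)
overlap_flat = overlap(g, ucb_select(mu, sigma_flat, 400))
overlap_small = overlap(g, ucb_select(mu, sigma_small, 400))
print(f"overlap with flat sigma:  {overlap_flat:.3f}")
print(f"overlap with small sigma: {overlap_small:.3f}")
```

When sigma is flat or small relative to the spread in mu, the two rules pick nearly the same batch, differing only for candidates right at the selection boundary. UCB can only diverge from greedy when sigma varies on the same scale as the differences in mu.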

In other words: the better your prior, the less your uncertainty estimate buys you. You’d expect UCB to pull ahead in the regime where CHGNet is weak — a chemically unusual pool, far from its training distribution, where mean predictions are shaky and the model’s “I’m not sure” actually means something. On a pool this well-covered by the pretrained backbone, greedy is enough, and paying for Monte-Carlo-Dropout uncertainty estimates is paying for exploration you don’t need.

There’s a humbler reading I can’t fully rule out, and it’s the honest caveat on the result: MC-Dropout is a cheap way to estimate uncertainty and a famously miscalibrated one. Part of the tie might be that UCB never got a fair trial — its “I’m not sure” was noise rather than signal, so the exploration term had nothing real to act on. Distinguishing “uncertainty bought nothing because the mean is already good” from “uncertainty bought nothing because our σ is junk” takes two checks: does the tie survive at real scale, and does the same uncertainty ever win when the model is weak? I ran both.
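For readers who have not used it, MC-Dropout amounts to leaving dropout switched on at inference time, running several stochastic forward passes, and reading the mean as μ and the spread as σ. The sketch below shows the idea on a toy two-layer network in NumPy. It is not CHGNet and not the experiment code; the weights, the dropout rate, and the number of passes are assumptions for illustration.

```python
import numpy as np

def mc_dropout_predict(x, w1, w2, p_drop=0.1, n_passes=30, seed=0):
    """Return (mean, std) over stochastic forward passes with dropout kept on."""
    rng = np.random.default_rng(seed)
    hidden = np.maximum(x @ w1, 0.0)                 # ReLU layer, shape (n, h)
    preds = []
    for _ in range(n_passes):
        keep = rng.random(hidden.shape) >= p_drop    # keep each unit with prob 1 - p_drop
        dropped = hidden * keep / (1.0 - p_drop)     # inverted-dropout rescaling
        preds.append(dropped @ w2)
    preds = np.stack(preds)                          # shape (n_passes, n)
    return preds.mean(axis=0), preds.std(axis=0)

# Toy inputs and random weights (assumed, for illustration only).
rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))
w1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=16)

mu, sigma = mc_dropout_predict(x, w1, w2)
mu0, sigma0 = mc_dropout_predict(x, w1, w2, p_drop=0.0)   # no dropout: no spread
print("sigma with dropout:   ", np.round(sigma, 3))
print("sigma without dropout:", np.round(sigma0, 3))
```

The σ here reflects only how sensitive the output is to randomly dropped units, which is why it is cheap and also why there is no guarantee it tracks the model's actual error.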

Does the tie survive at scale?

The 2,000-structure run is a demo, and a single seed. The real test is WBM — the 256,963-structure pool from the Matbench Discovery benchmark, of which 42,825 (16.7%) are actually stable. I ran the same loop there on a 2,200-label budget — 0.9% of the pool — across five random seeds, so this time the spread is measured, not assumed.

Discovery Acceleration Factor on the full 256K-structure WBM pool (5-seed mean, whiskers ± std). Random sits at 1.0 by construction; a perfect oracle would reach ~6.0. Greedy (1.134 ± 0.017) and UCB (1.130 ± 0.026) overlap completely.

The metric is the Discovery Acceleration Factor: how much more often you turn up a stable material than blind screening would. Random is 1.0 by construction; a perfect oracle would hit ~6.0 (one over the prevalence). Greedy lands at 1.134 ± 0.017, UCB at 1.130 ± 0.026. The error bars sit right on top of each other. The tie held at more than a hundred times the scale, with the variance finally measured instead of hoped for.
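The arithmetic behind those numbers can be sketched as follows, assuming the usual Matbench Discovery definition of the metric: hit rate among the labeled structures divided by the prevalence of stable structures in the pool. The function name is a placeholder, and the implied hit counts are derived from the reported factors rather than taken from the run logs.

```python
def discovery_acceleration_factor(n_stable_found, n_labeled, n_stable_pool, n_pool):
    """Hit rate among labeled structures, relative to the pool's base rate."""
    hit_rate = n_stable_found / n_labeled
    prevalence = n_stable_pool / n_pool
    return hit_rate / prevalence

n_pool, n_stable_pool, budget = 256_963, 42_825, 2_200
prevalence = n_stable_pool / n_pool

# A perfect oracle labels only stable structures.
oracle_daf = discovery_acceleration_factor(budget, budget, n_stable_pool, n_pool)

# Random screening finds stable structures at the base rate, on average.
expected_random_hits = prevalence * budget
random_daf = discovery_acceleration_factor(expected_random_hits, budget, n_stable_pool, n_pool)

# Back out the hit count implied by the reported greedy factor of 1.134.
implied_hits_greedy = 1.134 * prevalence * budget

print(f"prevalence: {prevalence:.3f}, oracle DAF: {oracle_daf:.2f}")
print(f"random: ~{expected_random_hits:.0f} hits, greedy: ~{implied_hits_greedy:.0f} hits of {budget}")
```

Read this way, a factor of 1.134 corresponds to roughly 416 stable structures in 2,200 labels against about 367 for blind screening.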

That last part earned its keep. A single seed had put UCB ahead, 1.16 to 1.12 — exactly the kind of gap you’d happily write up as “uncertainty wins.” Five seeds dissolved it. The honest version of this result only exists because I stopped trusting one run. (And note the modesty of the win itself: 1.13×, not 6×. At 0.9% budget a strong-but-imperfect surrogate helps, but it isn’t magic — which is the right frame for asking whether the clever version earns the extra cost.)
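One way to put a number on "overlap completely", using only the reported means and standard deviations: check whether the ± 1 std intervals intersect, and express the gap in units of the standard error of the difference. This is a rough sketch, not a formal test; it assumes five independent seeds per strategy and treats the reported spreads as per-seed standard deviations.

```python
import math

def intervals_overlap(mean_a, std_a, mean_b, std_b):
    """True if [mean - std, mean + std] for a and b intersect."""
    return (mean_a - std_a) <= (mean_b + std_b) and (mean_b - std_b) <= (mean_a + std_a)

def gap_in_standard_errors(mean_a, std_a, mean_b, std_b, n_seeds):
    """Absolute difference of means divided by the standard error of that difference."""
    se = math.sqrt(std_a**2 / n_seeds + std_b**2 / n_seeds)
    return abs(mean_a - mean_b) / se

# Reported values: greedy 1.134 ± 0.017, UCB 1.130 ± 0.026, five seeds each.
overlap = intervals_overlap(1.134, 0.017, 1.130, 0.026)
gap = gap_in_standard_errors(1.134, 0.017, 1.130, 0.026, n_seeds=5)

print(f"intervals overlap: {overlap}, gap: {gap:.2f} standard errors")
```

A gap well under one standard error is what a tie looks like; the single-seed difference of 1.16 versus 1.12 is about the size of the seed-to-seed spread itself.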

Was the uncertainty just noise?

The deeper worry is the one above: maybe MC-Dropout’s σ is so miscalibrated that UCB never had a real exploration signal, and greedy won by default rather than on merit. The clean way to check is to run the same uncertainty machinery on a model that is actually weak — where exploration should pay — and see if it does.

So I did: a 50,000-structure synthetic pool with a graph network trained from scratch, no pretrained prior, genuinely unsure of itself. There, UCB beats greedy by 25 points in top-10 recall, 70% to 45%. Same dropout, same acquisition rule — it just finally had something to act on.

That’s the result that closes the loop. The uncertainty isn’t junk; it carries real signal when the model is weak. So the CHGNet tie isn’t “our σ is broken” — it’s “our prior is good enough that there’s nothing left for σ to find.” Greedy was enough because the surrogate was strong, exactly where you’d predict, and not an inch further.

A bonus lesson: use the foundation model frozen

One thing I expected to help consistently hurt instead: fine-tuning CHGNet on the freshly labeled structures each round. CHGNet was pretrained on roughly 1.5 million structures from the Materials Project trajectory dataset (MPtrj); nudging it with a few hundred biased labels is enough to trigger catastrophic forgetting. You trade a broad, well-calibrated prior for a sharp, overfit one, and μ accuracy drops. Frozen won every time. If you’re dropping a foundation-model potential into an active-learning loop, the boring move (leave it alone) is the right one.

The takeaway I keep

This is the same lesson the rest of my work keeps teaching from the other direction: know what your tool’s confidence is actually worth before you build on it. A clever acquisition function on top of a strong pretrained model can quietly reduce to “trust the model,” and the honest experiment is the one that measures whether the cleverness paid for itself. Here it didn’t — and knowing that, and why, is more useful than a win would have been.