
The model reads "not" — it just can't use it

Language models are famously bad at negation. Ask GPT-2 small to finish “Paris is the capital of” and it says France; ask it to finish “Paris is not the capital of” and it says France again, barely flinching. This is old news behaviorally: Ettinger (2020) showed BERT predicts “bird” for both “A robin is a ___” and “A robin is not a ___”, and Truong et al. (2023) confirmed it systematically across model families. I went looking for the mechanism: where in the forward pass does the “not” get lost?

The answer turned out to be more specific, and weirder, than “the model ignores negation.”

Weirder than ignoring it

First, the behavior — the logit of the target token with and without “not,” for four of the six prompts I ran on GPT-2 small:

| Prompt | Target | Affirmative logit | Negated logit | Change |
|---|---|---|---|---|
| “Paris is the capital of” | France | 16.93 | 16.42 | −3.0% |
| “The sun is a” | star | 12.38 | 13.18 | +6.5% |
| “Dogs are” | animals | 10.29 | 11.81 | +14.8% |
| “Two plus two equals” | four | 14.84 | 16.29 | +9.8% |
[Figure: Target-token logit, with and without “not”. Paired bars (affirmative vs. with “not”) for France, star, animals, and four; y-axis from 0 to 18.]
Adding “not” barely moves the target logit — and on three of four prompts it raises it. The negation is read, but it doesn’t suppress the answer.

Three of those four go the wrong way; across all six prompts, four do. Inserting “not” doesn’t just fail to suppress the answer: it raises the target’s logit. “Dogs are not animals” pushes the model further toward “animals.” This isn’t simple insensitivity. The negation is doing something, and the something points the wrong way.
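The change column is plain arithmetic on two logits, and it can be rechecked from the four rows shown. The sketch below does that. `target_logit` is a hypothetical helper, not the repo's code: it shows one way such a logit could be read out with Hugging Face `transformers`, assuming the `gpt2` checkpoint (GPT-2 small) and a target that is scored by its first token with a leading space.

```python
def target_logit(prompt: str, target: str) -> float:
    """Logit of `target` as the next token after `prompt` (hypothetical helper).

    Imports are local so the arithmetic below runs without torch installed.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    # Assumption: score the target by its first token, with GPT-2's leading space.
    target_id = tok.encode(" " + target)[0]
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (batch, seq_len, vocab)
    return logits[0, -1, target_id].item()


def relative_change(affirm: float, negated: float) -> float:
    """Percent change in the target logit when "not" is inserted."""
    return 100.0 * (negated - affirm) / affirm


# (target, affirmative logit, negated logit), copied from the table.
ROWS = [
    ("France", 16.93, 16.42),
    ("star", 12.38, 13.18),
    ("animals", 10.29, 11.81),
    ("four", 14.84, 16.29),
]

changes = {target: relative_change(a, n) for target, a, n in ROWS}
raised = [target for target, pct in changes.items() if pct > 0]

for target, pct in changes.items():
    print(f"{target:8s} {pct:+.1f}%")
print(f"{len(raised)} of {len(ROWS)} targets rise under negation")
```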

The model does read “not”

The easy hypothesis is that the token gets dropped. It doesn’t. At the prediction position, attention head L11H8 puts 37.7% of its weight on the “not” token. The model is looking right at it. So the failure isn’t perception — it’s that the signal can’t reach the place where the answer is decided.
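Reading that number off a model amounts to indexing one row of one head's attention pattern. Here is a minimal sketch, assuming the row for the prediction position has already been extracted (with Hugging Face `transformers`, for instance, by calling the model with `output_attentions=True`). The tokenization and weights below are made up for illustration and are not the measured values.

```python
def weight_on_token(attn_row, tokens, needle):
    """Total attention the query position puts on every occurrence of `needle`.

    attn_row[k] is the attention weight from the prediction position to key
    position k for a single head; tokens[k] is the token string at position k.
    """
    if len(attn_row) != len(tokens):
        raise ValueError("attention row and token list must have equal length")
    # Strip the leading space that GPT-2 style tokens carry before comparing.
    return sum(w for w, t in zip(attn_row, tokens) if t.strip() == needle)


# Toy example: illustrative tokenization and made-up weights that sum to 1.
tokens = ["Paris", " is", " not", " the", " capital", " of"]
last_row = [0.10, 0.05, 0.40, 0.15, 0.20, 0.10]

not_weight = weight_on_token(last_row, tokens, "not")
print(f"attention on 'not' from the final position: {not_weight:.1%}")
```

Summing over every match rather than taking the first one keeps the helper usable on prompts where the needle token appears more than once.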

To find that place, I patched: take the residual stream at the “not” position and overwrite it with its value from the affirmative prompt — erasing the negation surgically — and watch where the logit gap moves. The negation effect concentrates early, in layers 0–2; patching at L0 recovers about 130% of the logit gap. So by the end of layer 2, the network has computed “there’s a negation here.”
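A figure above 100% is possible because recovery is a ratio, not a probability. One common way to define it, assumed here rather than taken from the repo, is the share of the affirmative-versus-negated logit gap that the patched run closes. The numbers in the example are invented to show the overshoot case.

```python
def fraction_recovered(affirm_logit, negated_logit, patched_logit):
    """Share of the affirmative-vs-negated logit gap closed by a patched run.

    0.0 means patching changed nothing, 1.0 means the patched run matches the
    affirmative prompt, and values above 1.0 mean it overshoots.
    """
    gap = affirm_logit - negated_logit
    if gap == 0:
        raise ValueError("no logit gap to recover")
    return (patched_logit - negated_logit) / gap


# Invented numbers: the patched logit lands beyond the affirmative one.
example = fraction_recovered(affirm_logit=10.0, negated_logit=8.0, patched_logit=10.6)
print(f"{example:.0%} of the gap recovered")
```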

And then the other thing happens. From a separate set of factual-recall experiments, the capital→country association in this kind of prompt resolves late — around layers 9–10, roughly 83% of the way through the network. By the time the factual lookup fires, the negation signal computed back at L0–2 has been diluted across positions and heads. You can watch it lose: the “France” direction in the residual stream drops from 186 to 167 activation units at L10 — the negation is pushing against it — but that’s only about a 10% reduction, nowhere near enough to flip the prediction.
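The 10% figure is the relative drop in how far the residual stream points along the “France” direction. The sketch below assumes that “activation units” means a scalar projection onto a fixed direction (for example, the unembedding vector of the target token). That reading and the toy vectors are assumptions; 186 and 167 are the values quoted above.

```python
import math


def projection(residual, direction):
    """Scalar component of `residual` along `direction` (direction is normalised)."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return sum(r * d for r, d in zip(residual, direction)) / norm


# Toy vectors: only the first coordinate lies along the direction.
toy = projection([3.0, 4.0], [2.0, 0.0])

# Values quoted in the text for the "France" direction at L10.
before, after = 186.0, 167.0
drop = (before - after) / before

# Layer 10 of 12, which matches the "roughly 83%" depth in the text.
depth = 10 / 12

print(f"toy projection: {toy}")
print(f"relative drop: {drop:.1%}, depth: {depth:.0%}")
```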

[Figure: Two computations that never meet. Across GPT-2 small's 12 layers (L0 to L11), negation is detected early (L0–L2, where ~130% of the logit gap recovers) and factual recall (capital → country) resolves late (L9–L10). The negation signal, diluted across positions, reaches L10 as only a ~10% pull, never enough to flip the prediction.]
Negation is a side road processed early and quietly; factual recall is a highway that opens late and loud. The architecture gives the fact far more runway than the operator meant to override it.

That’s the whole story in one image. Negation is a side road processed early and quietly; factual recall is a highway that opens late and loud. The two never properly merge. The model knows there’s a “not,” and it knows the capital of France, and the architecture gives the fact far more computational runway than the operator that was supposed to override it.

Why “not” sometimes helps

That still leaves the boost — why “Dogs are not” increases “animals.” The most likely explanation is mundane and a little funny: training-data co-occurrence. Sentences like “two plus two does not equal four — wait, yes it does” exist; people discuss correct facts constantly inside negated constructions. So “does not equal” co-occurs with “four” all over the corpus. The model has learned that negation words, in a factual context, are associated with the correct answer, because that’s the distribution it saw. It’s optimizing predictive cues, not computing truth conditions — exactly Ettinger’s framing.

What this is and isn’t

I want to be precise about the contribution, because mech interp has an overclaiming problem. The behavioral result is not mine — negation failure is well-trodden. What I’m adding is incremental and concrete: the layer-level localization (negation at L0–2, recall at L9–10, and the mismatch between them), a quantified booster effect (7 of 12 cases across three models show no-effect-or-boost; mean +4.9% on GPT-2 small), and the attention evidence that the token is read, not dropped. This is me learning the toolkit by taking a known failure apart, not announcing a discovery.

But the shape of it is the same thing I chase everywhere else. The bug isn’t in the perception and it isn’t in the knowledge — it’s in the seam between them, where a signal that exists fails to gate a computation that happens somewhere else. That’s the same place a bridge exploit lives, and the same place a parser bug lives: not in either correct half, but in the join nobody tested.

Code and the full writeup are in the repo. References: Ettinger 2020, Truong et al. 2023, Meng et al. 2022 (causal tracing), Berglund et al. 2023 (the Reversal Curse).