Is the Measurement Real? How Causal Intervention Separates Steering Wheels from Badges
This is the fourth post in the Geometry of Trust series. Part 1 built the ruler. Part 2 measured live values with probes. Part 3 added drift detection and tamper-evident audit trails. This post asks the question in the title: is the measurement real?
The gap
In Parts 2 and 3, every prompt goes through the same pipeline: weight the activation with the ruler (Φ · h), then read all 26 probes from the weighted activation. Each probe returns a single number — how strongly that value direction is present in the activation, weighted by causal influence on the output. That’s 26 readings per prompt, every prompt, continuously.
But those readings are just dot products. A probe takes the weighted activation and asks: how much does this vector point in my direction? That tells you how “present” a value is in the model’s internal state. It does not tell you whether that direction actually drives what the model says next.
Consider an analogy. You’re looking at the dashboard of a car. The speedometer reads 60. That tells you the car is going 60. But it doesn’t tell you whether the speedometer is connected to the wheels or just displaying a random number that happens to be correct right now. To test that, you’d need to change the speed and see if the speedometer follows.
Causal intervention is the equivalent test. Instead of reading from the activation (which all 26 probes do every prompt), we modify the activation and observe whether the model’s output changes accordingly. This is a fundamentally different operation:
Probe reading (Part 2): a dot product on the activation that already exists from the model’s own forward pass. No additional forward passes. No model involvement. Pure arithmetic. One number out per probe. Done every prompt. Cheap — O(d) per probe.
Causal intervention (this part): modify the activation, run the model’s forward pass from scratch with the modified activation, and observe whether the output changes. Three additional forward passes per probe. Done only when governance requires it. Expensive — O(3 × forward pass) per probe.
Probe readings tell you what’s present. Causal intervention tells you what’s real.
In Part 2, our honesty probe read 1.290. That number came from a dot product — no forward passes, no output observation, just arithmetic on the activation vector. The question causal intervention answers is: if we gently push the activation in the honesty direction, does the model’s actual output become more honest? And if we push the other way, does it become less honest?
The difference matters. A steering wheel changes what the car does when you turn it. A badge is glued on. Both are visible. Only one matters.
What is a nudge?
The activation h is a vector — a list of numbers representing the model’s internal state after processing a prompt. The probe w is also a vector — it points in the direction that the probe associates with a particular value (say, honesty).
A nudge is a small, controlled change to the activation along the probe’s direction. We take the probe vector, normalise it to unit length (ŵ), scale it by a small amount δ (the perturbation magnitude), and add or subtract it from the activation:
nudge = δ × ŵ where ŵ = w / ‖w‖
nudge up: h + nudge (a little more honesty in the activation)
nudge down: h - nudge (a little less honesty in the activation)
The size is deliberately small. We’re not overwriting the model’s computation — we’re asking: if we gently push the activation toward more honesty, does the output reflect that? If we push toward less honesty, does the output reflect that too?
We then run the model’s forward pass with each nudged activation and observe what changes. We’re not asking the model a different question — we’re feeding it a slightly modified internal state and seeing whether the output moves in the expected direction.
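The nudge arithmetic above is small enough to check by hand. Here is a minimal sketch in Python using the illustrative 2D numbers from the worked example below (variable names are ours, not from the reference implementation); it follows the formula exactly, including the normalisation to ŵ:

```python
import math

# Illustrative 2D values; a real model would use thousands of dimensions.
h = [0.6, 0.3]    # activation from the prompt
w = [0.8, 0.2]    # honesty probe vector
delta = 0.1       # perturbation magnitude δ

# Normalise the probe to unit length: ŵ = w / ‖w‖
norm = math.sqrt(sum(x * x for x in w))
w_hat = [x / norm for x in w]

# nudge = δ × ŵ, then push the activation each way
nudge = [delta * x for x in w_hat]
h_up = [a + n for a, n in zip(h, nudge)]    # a little more honesty
h_down = [a - n for a, n in zip(h, nudge)]  # a little less honesty
```

Each nudged activation is then fed through the model's remaining layers in place of the original.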
Three forward passes
The test is direct. For each probe, run the model three times:
Original: the unmodified activation h. This is the baseline.
Nudge up: h + δŵ, where ŵ is the probe’s normalised weight vector and δ is a small perturbation magnitude. This adds a bit of the value to the activation.
Nudge down: h − δŵ. This subtracts a bit of the value.
Then compare: how much did each nudge change the output? If both directions produce comparable shifts, the probe found a genuine mechanism. If only one direction matters, it found a surface correlation.
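The whole three-pass test can be sketched as one function. This is illustrative, not the reference implementation: `forward_fn` is a hypothetical stand-in for the model's remaining layers, taking an activation and returning token probabilities.

```python
def causal_consistency(h, w, delta, forward_fn):
    """Three-forward-pass test for one probe.

    forward_fn: hypothetical interface — maps an activation to a dict of
    token probabilities (the model's remaining layers).
    """
    # nudge = δ × ŵ along the normalised probe direction
    norm = sum(x * x for x in w) ** 0.5
    nudge = [delta * x / norm for x in w]

    p_orig = forward_fn(h)                                  # baseline
    p_up = forward_fn([a + n for a, n in zip(h, nudge)])    # h + δŵ
    p_down = forward_fn([a - n for a, n in zip(h, nudge)])  # h − δŵ

    # Total shift: sum of absolute per-token probability changes
    def shift(p, q):
        return sum(abs(p[t] - q[t]) for t in q)

    up, down = shift(p_up, p_orig), shift(p_down, p_orig)
    return min(up, down) / max(up, down)  # consistency score c
```

A genuine mechanism returns a score near 1 (both nudges move the output comparably); a surface correlation returns a score near 0.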
Worked example: is honesty real?
What we’re working with
When a model processes a prompt like “Should I lie to my patient?”, its internal computation passes through many layers. At each layer, the model’s state is represented as an activation — a vector of numbers. In our 2D illustrative example, the activation is [0.6, 0.3]. In a real model like LLaMA-3-8B, it would be 4,096 numbers.
After the activation passes through the remaining layers, the model produces its output: a probability for every token in its vocabulary. In a real model, this is a probability distribution over tens of thousands of tokens — every word, word-piece, and punctuation mark gets a number. The probabilities sum to 1. The highest-probability token is what the model would say next.
For our illustrative example, we’ll show just three tokens and their probabilities. In reality, the model assigns probabilities to its entire vocabulary simultaneously.
A note: all vectors, activations, token probabilities, and numerical values in this example are illustrative. Real models operate in hundreds or thousands of dimensions with continuous probability distributions over tens of thousands of tokens. We use 2D vectors and three example tokens so you can follow every calculation on paper. The mechanism is identical at any scale.
Setup from Parts 1–3:
Honesty probe: [0.8, 0.2]
Activation: [0.6, 0.3] (from "Should I lie to my patient?")
δ = 0.1
Compute the nudge
For readability, this example applies δ to the probe vector directly rather than to ŵ (here ‖w‖ ≈ 0.82, so the difference is small):
nudge = δ × honesty = 0.1 × [0.8, 0.2] = [0.08, 0.02]
Three forward passes
We run the model three times, each with a slightly different activation, and record the output token probabilities:
Nudge up — add a little honesty:
activation + nudge = [0.6 + 0.08, 0.3 + 0.02] = [0.68, 0.32]
Run model forward → output token probabilities (illustrative):
"truth" = 0.60, "consider" = 0.10, "withhold" = 0.05
Nudge down — subtract a little honesty:
activation - nudge = [0.6 - 0.08, 0.3 - 0.02] = [0.52, 0.28]
Run model forward → output token probabilities (illustrative):
"truth" = 0.10, "consider" = 0.15, "withhold" = 0.40
Original — unmodified baseline:
Run model forward with [0.6, 0.3] → output token probabilities (illustrative):
"truth" = 0.30, "consider" = 0.20, "withhold" = 0.10
Measure the shifts
Now we ask: how much did each nudge change the output compared to the original? We compare the token probabilities one by one — for each token, take the absolute difference between the nudged output and the original, then sum them up. This gives us a single number measuring the total shift in the output distribution.
How different is the UP output from the original?
Shift UP = |"truth" change| + |"consider" change| + |"withhold" change|
= |0.60 - 0.30| + |0.10 - 0.20| + |0.05 - 0.10|
= 0.30 + 0.10 + 0.05
= 0.45
The nudge-up output is 0.45 away from the original. Adding honesty to the activation meaningfully changed what the model would say — “truth” jumped from 0.30 to 0.60.
How different is the DOWN output from the original?
Shift DOWN = |0.10 - 0.30| + |0.15 - 0.20| + |0.40 - 0.10|
= 0.20 + 0.05 + 0.30
= 0.55
The nudge-down output is 0.55 away from the original. Subtracting honesty also meaningfully changed the output — “withhold” jumped from 0.10 to 0.40.
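The shift calculation is easy to verify. A minimal sketch using the illustrative probabilities above (variable names are ours):

```python
# Illustrative token probabilities from the worked example
orig = {"truth": 0.30, "consider": 0.20, "withhold": 0.10}
up   = {"truth": 0.60, "consider": 0.10, "withhold": 0.05}
down = {"truth": 0.10, "consider": 0.15, "withhold": 0.40}

def total_shift(p, q):
    # Sum of absolute per-token probability changes versus the baseline
    return sum(abs(p[t] - q[t]) for t in q)

shift_up = total_shift(up, orig)      # 0.30 + 0.10 + 0.05 = 0.45
shift_down = total_shift(down, orig)  # 0.20 + 0.05 + 0.30 = 0.55
```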
Consistency score
We now have two numbers: how much the output changed when we added honesty (0.45) and how much it changed when we subtracted honesty (0.55). The consistency score asks: are these two shifts comparable in size?
If the probe direction is a genuine mechanism, both nudges should produce meaningful output changes. The model should become more honest when we add honesty, and less honest when we subtract it. The shifts don’t need to be identical — real mechanisms can be slightly asymmetric — but they should be in the same ballpark.
If the probe direction is a surface correlation, typically only one direction produces a shift. Adding the pattern might change the output, but subtracting it does nothing — because the pattern was never driving the output in the first place.
The formula is the ratio of the smaller shift to the larger shift:
c = min(shift_up, shift_down) / max(shift_up, shift_down)
c = min(0.45, 0.55) / max(0.45, 0.55)
c = 0.45 / 0.55
c = 0.82
A score of 1.0 means perfectly symmetric — both directions shifted the output by exactly the same amount. A score of 0.0 means completely asymmetric — one direction did nothing. Our score of 0.82 means the shifts are comparable: both directions matter, so honesty is genuinely wired into the output.
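The ratio is a one-liner; a sketch with the numbers from above:

```python
def consistency(shift_up, shift_down):
    # Ratio of smaller shift to larger: 1.0 = symmetric, 0.0 = one-sided
    return min(shift_up, shift_down) / max(shift_up, shift_down)

c = consistency(0.45, 0.55)  # 0.818…, which rounds to 0.82
```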
What a surface correlation looks like
Same activation, same nudges. But in a model where honesty is a surface pattern:
UP output: "truth" = 0.6, "consider" = 0.1, "withhold" = 0.05
DOWN output: "truth" = 0.28, "consider" = 0.19, "withhold" = 0.12
Original: "truth" = 0.3, "consider" = 0.2, "withhold" = 0.1
Shift UP: |0.6-0.3| + |0.1-0.2| + |0.05-0.1| = 0.30 + 0.10 + 0.05 = 0.45
Shift DOWN: |0.28-0.3| + |0.19-0.2| + |0.12-0.1| = 0.02 + 0.01 + 0.02 = 0.05
c = 0.05 / 0.45 = 0.11 — close to 0 → asymmetric → not a real mechanism.
Nudge up changed the output substantially. Nudge down barely moved anything. Honesty is present in the activations but it’s not wired into the output. This is a badge.
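Running the same arithmetic on the badge-model numbers makes the asymmetry concrete (again a sketch with the illustrative probabilities):

```python
# Badge model: honesty is present in the activation but not wired in
orig = {"truth": 0.30, "consider": 0.20, "withhold": 0.10}
up   = {"truth": 0.60, "consider": 0.10, "withhold": 0.05}
down = {"truth": 0.28, "consider": 0.19, "withhold": 0.12}

shift = lambda p, q: sum(abs(p[t] - q[t]) for t in q)
shift_up, shift_down = shift(up, orig), shift(down, orig)  # ≈ 0.45, ≈ 0.05
c = min(shift_up, shift_down) / max(shift_up, shift_down)  # ≈ 0.11
```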
What the attestation records
Both the probe reading and the causal score go into the signed attestation. Here’s the critical comparison:
Honesty (real model):
Probe reading: 1.290
Causal score: 0.82
Interpretation: Genuine value — steering wheel
Honesty (fake model):
Probe reading: 1.290
Causal score: 0.11
Interpretation: Surface correlation — badge
The probe readings are identical. Without causal intervention, you can’t tell them apart. The causal score is the discriminator.
Where this runs
Causal intervention runs inside the hardware enclave. The model can’t see the nudges. It can’t detect that its activations are being perturbed. It can’t adjust its behaviour in response. The enclave captures the activation, applies the perturbations, runs the forward passes, computes the consistency score, and signs the attestation.
What this costs — and why it’s different
Everything we’ve done so far — weighting the activation (Φ · h), reading all 26 probes, checking drift — is arithmetic on the activation vector that already exists. The model ran one forward pass to process the prompt. After that, the model is not involved. We take the activation it produced and do dot products, subtractions, and divisions on it. That’s why those steps cost microseconds or nanoseconds.
Causal intervention is fundamentally different. We’re not doing arithmetic on the activation — we’re feeding a modified activation back into the model and running it forward through all remaining layers to see what output it produces. Attention, feed-forward layers, layer norms, the full stack from the probed layer to the output distribution. That’s the model’s actual neural network computation, not a vector operation.
A probe reading is one dot product of length d. For LLaMA-3-8B, that’s 4,096 multiplies — microseconds. A causal intervention forward pass runs the entire model — billions of operations, seconds. The difference isn’t 10× or 100×. It’s the difference between vector arithmetic and running the neural network.
Three forward passes per probe. With the reference taxonomy’s 26 value terms (a sample — the number is configurable per deployment), that’s 78 forward passes:
Qwen 0.5B: 78 forward passes → seconds
LLaMA-3-8B: 78 forward passes → minutes
70B model: 78 forward passes → minutes (2–5 sec each)
That’s why causal intervention is Tier 3 — governance decides whether to require it. Healthcare might mandate it. Research might skip it. The protocol supports both. It’s not a per-prompt cost — it’s a periodic validation that confirms the probes are measuring real mechanisms.
When to run it in practice
For most industries, a practical compromise is to run causal intervention during testing and evaluation rather than in production. Think of it like load testing a bridge: you stress-test it before opening, and periodically after that, but you don’t put the test load on it during rush hour.
Initial deployment: run full causal intervention before the model goes live. Establish that all probes measure real mechanisms. This is your baseline proof.
After fine-tuning or updates: re-run to confirm the mechanisms still hold. A model update could rewire internal structure even if probe readings look similar.
Periodic audit: weekly, monthly, or quarterly depending on the domain. Healthcare might run it weekly. Finance quarterly. Agriculture annually.
Stress testing: run across a diverse set of challenging prompts — edge cases, adversarial inputs, domain-specific dilemmas — to confirm the mechanisms hold under pressure.
Drift-triggered: if drift detection (Part 3) flags a deviation, run causal intervention on the flagged value terms to check whether the mechanism broke or just the reading shifted.
Continuous production: rely on the cheap per-prompt pipeline (probe readings + drift detection) for ongoing monitoring. The probes have already been validated by causal intervention.
This gives you the best of both worlds: the confidence of causal validation during testing, and the efficiency of probe-only monitoring in production. The attestation chain records when causal intervention was last run, so an exchange partner can see how recently the mechanisms were verified.
What it proves, and what it doesn’t
Does prove: the probe direction is mechanistically real. Perturbing it changes the output symmetrically. The measurement is not an artefact.
Does not prove: that the label we put on the direction (“honesty”) is correctly operationalised. That the perturbation magnitude δ is ecologically valid. That there isn’t a second, un-probed mechanism that dominates in practice.
Causal intervention confirms that the measurement is real. The question of whether the label is right is a separate, harder problem.
The complete pipeline so far
Step                         Cost                     Frequency
Compute Φ = UᵀU              O(Vd²)                   Once per model version
Weight activation: Φ · h     O(d²)                    Every prompt
Probe readings               O(Pd)                    Every prompt
Drift check                  O(P)                     Every prompt
Causal intervention          O(3P × forward pass)     Tier 3 only (testing/audit)

The daily cost is O(d²) per prompt. Causal intervention is expensive but infrequent — triggered by governance policy, not every prompt.
The measurement is real. The audit trail is tamper-evident. The next question is what happens when two agents need to trust each other — how they exchange attestation chains and decide whether to cooperate.
That’s the exchange protocol. But first, the next post looks at what we actually mean by AI values.
Links:
📄 Geometry of Trust Paper
💻 Lecture Playlist
📄 Lecture Notes
💻 Open-source Rust implementation
🏢 Synoptic Group CIC, Hull, UK

