How to Measure What an AI Actually Values — In Real Time
This is the second post in the Geometry of Trust series. Part 1 covered the causal Gram matrix — the ruler that weights directions by their influence on output. This post puts the ruler to work.
The ruler exists. Now what?
Last time, we built a ruler. The causal Gram matrix Φ = UᵀU takes the model’s unembedding matrix and produces a metric that tells us which directions in the model’s internal space actually matter for output.
But a ruler on a shelf measures nothing.
The model processes thousands of prompts. Each one produces an activation — a vector representing the model’s internal state at that moment. The question is: how much of each value is active in that state?
That’s what probes do.
One prompt, two values
A prompt arrives: “Should I lie to my patient?”
The model thinks. Its internal state — the activation — is a vector. For our 2D example:
activation = [0.6, 0.3]

We want to know: how much courage and honesty are active right now?
Step 1: Apply the ruler
We take the dot product of each row of the Gram matrix with the activation. This is done once — every probe shares the result.
Φ · activation:
Row 1: (2.58 × 0.6) + (0.12 × 0.3) = 1.584
Row 2: (0.12 × 0.6) + (0.15 × 0.3) = 0.117
Weighted activation = [1.584, 0.117]

Notice what happened. Dimension 1 was amplified from 0.6 to 1.584. Dimension 2 was suppressed from 0.3 to 0.117. The ruler is doing its job — directions that matter more for output get more weight.
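In NumPy, the weighting step is a single matrix–vector product. A minimal sketch using the 2D numbers from this example:

```python
import numpy as np

# Causal Gram matrix and activation from the 2D worked example
Phi = np.array([[2.58, 0.12],
                [0.12, 0.15]])   # the ruler from Part 1
h = np.array([0.6, 0.3])         # activation for the current prompt

h_weighted = Phi @ h             # computed once, shared by every probe
print(np.round(h_weighted, 3))   # [1.584 0.117]
```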
Step 2: Read the probes
Each probe is a trained weight vector that reads one value. The courage probe:
courage = [0.9, 0.1]
(0.9 × 1.584) + (0.1 × 0.117) = 1.437

The honesty probe:
honesty = [0.8, 0.2]
(0.8 × 1.584) + (0.2 × 0.117) = 1.291

The Geometry of Trust reference taxonomy samples 26 value terms — virtues like courage, honesty, and compassion; principles like justice and responsibility; and anti-values like cruelty and deception. The number isn’t fixed: a different deployment could define 10 terms or 50. We use 26 as the working example throughout. Each term has its own probe. Twenty-six probes, twenty-six readings, all from the same weighted activation.
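Stacking the probe weight vectors as rows of a matrix lets every reading come from one multiply against the shared weighted activation. A sketch with the two probes above (the probe matrix W is illustrative):

```python
import numpy as np

Phi = np.array([[2.58, 0.12],
                [0.12, 0.15]])
h = np.array([0.6, 0.3])
h_weighted = Phi @ h             # the shared weighted activation

# One probe weight vector per row: courage, then honesty
W = np.array([[0.9, 0.1],
              [0.8, 0.2]])
readings = W @ h_weighted        # all probes read in one multiply
print(np.round(readings, 3))     # [1.437 1.291]
```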
Why the ruler changes everything
Here’s what happens without it — a plain dot product, treating all directions equally:
Courage: (0.9 × 0.6) + (0.1 × 0.3) = 0.57
Honesty: (0.8 × 0.6) + (0.2 × 0.3) = 0.54

          Regular   Causal
Courage   0.57      1.437
Honesty   0.54      1.291

Regular says: courage and honesty are almost equal (5% gap). Causal says: courage is noticeably stronger (11% gap).
Why? Courage lives more heavily on dimension 1 (weight 0.9 vs 0.8), and dimension 1 matters 17× more under Φ. That small directional difference gets amplified because the ruler knows which directions count.
This is the whole point of using the causal inner product instead of Euclidean distance. Standard probes treat all directions equally. Causal probes weight directions by their influence on what the model actually outputs. The difference isn’t academic — it’s the difference between measuring a surface pattern and measuring a computational mechanism.
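The two metrics can be computed side by side. A minimal sketch reproducing the comparison above:

```python
import numpy as np

Phi = np.array([[2.58, 0.12],
                [0.12, 0.15]])
h = np.array([0.6, 0.3])

for name, w in [("courage", np.array([0.9, 0.1])),
                ("honesty", np.array([0.8, 0.2]))]:
    regular = w @ h        # Euclidean probe: every direction counts equally
    causal = w @ Phi @ h   # causal probe: directions weighted by output influence
    print(f"{name}: regular={regular:.2f}  causal={causal:.3f}")
```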
What this costs
The honest question: does this add meaningful overhead?
Step 1 — the model’s forward pass — happens regardless. Billions of operations. That’s the model doing its job, not our overhead.
Step 2 — weighting the activation (Φ · h) — is a dot product of each row of Φ with h. O(d²). For LLaMA-3-8B (d = 4,096), that’s 16.8 million operations. For GPT-4 scale (d = 16,384), it’s 268 million. Both complete in milliseconds on a GPU. Done once per prompt, shared by all probes.
Step 3 — reading all probes — is one dot product per value term (26 in our reference taxonomy). O(Pd). For GPT-4 scale, that’s 426,000 operations. Microseconds.
The combined measurement overhead is less than 1% of the forward pass. You could run this on every prompt in production and nobody would notice.
The full probe equation
For completeness, the full mathematical form of a causal probe reading is:
reading = wᵀΦh + b

Where w is the probe’s weight vector, Φ is the causal Gram matrix from Part 1, h is the activation vector from the forward pass, and b is the probe’s bias (trained alongside w).
The trick is in how this decomposes computationally. You could evaluate it naively as a matrix-vector-vector product, but then every probe would repeat the expensive part (Φh). Instead, we compute the shared piece once:
h̃ = Φh (shared — one O(d²) operation per prompt)
reading_i = wᵢᵀh̃ + bᵢ (per probe — one O(d) operation each)

That’s the same arithmetic, reorganised so the 26 probes share the weighted activation h̃ instead of each recomputing it. It’s what makes measuring 26 values as cheap as measuring one.
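The decomposition can be checked numerically. A sketch with random stand-in values (dimensions shrunk for the demo; Φ is just a placeholder matrix here, not a real Gram matrix) confirming that the shared-piece evaluation matches the naive one:

```python
import numpy as np

rng = np.random.default_rng(0)
d, P = 512, 26                       # hidden size (illustrative), probe count
Phi = rng.standard_normal((d, d))    # stand-in for the causal Gram matrix
h = rng.standard_normal(d)           # stand-in activation
W = rng.standard_normal((P, d))      # one probe weight vector per row
b = rng.standard_normal(P)           # per-probe biases

# Naive: each probe recomputes Phi @ h — P separate O(d^2) products
naive = np.array([w @ Phi @ h + b_i for w, b_i in zip(W, b)])

# Shared: compute the weighted activation once, then P cheap O(d) reads
h_tilde = Phi @ h
readings = W @ h_tilde + b

assert np.allclose(naive, readings)  # same numbers, far fewer operations
```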
Why this design resists gaming
A standard probe (wᵀh) measures whether a concept correlates with the activation pattern. A model could shuffle its internal representations to make “safety” score high on a Euclidean probe without “safety” actually influencing the output.
A causal probe (wᵀΦh) measures whether a concept is active in directions that causally influence the output. Gaming it requires changing the model’s actual output pathway — not just rearranging internal furniture. You can’t make a model “look honest” under causal probes without making its honesty-relevant directions genuinely influence what it says.
This is why the Geometry of Trust Protocol uses causal probes for agent-to-agent attestation. When one AI agent sends its value measurements to another, the receiving agent needs assurance that those measurements reflect real computational structure, not performance. The causal metric provides that assurance.
What comes next
Probes give us a reading at a moment in time. That’s useful on its own, but it only tells you what the model looks like right now. It doesn’t tell you whether the model has changed, or is changing, or has drifted from the values it was certified with at deployment.
The next step is drift detection: tracking probe readings across many prompts over time, and spotting when the distribution of readings moves further from baseline than random variation alone can explain. That’s how a continuous measurement turns into a continuous audit — not “the model had honesty 1.29 this morning,” but “the model’s honesty readings over the last week have shifted in a way that’s statistically significant and worth investigating.”
That’s the subject of the next post.
There’s a further step beyond drift. A reading is a number, and a number alone isn’t proof — a probe might detect a surface correlation that vanishes under intervention, a pattern that looks causal but isn’t. Causal validation closes that gap: perturb the activation in both directions along the probe’s direction. If the output changes symmetrically, you’ve found a genuine mechanism. If only one direction matters, you’ve found decoration. That’s causal intervention, and it’s the fourth and final post in this mathematics series.
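As a preview, the symmetry check can be sketched with a toy linear readout standing in for the model. Everything here is illustrative — function names and the readout are assumptions for the sketch, not the protocol’s actual validation code:

```python
import numpy as np

def perturbation_effects(output_fn, h, w, eps=0.1):
    # Nudge the activation along the probe direction in both signs
    # and return the resulting changes in output.
    direction = w / np.linalg.norm(w)
    base = output_fn(h)
    up = output_fn(h + eps * direction) - base
    down = output_fn(h - eps * direction) - base
    return up, down

# Toy linear readout standing in for the model's output pathway.
# A direction that genuinely feeds the output produces changes of
# equal magnitude and opposite sign.
readout = np.array([2.0, 0.5])
up, down = perturbation_effects(lambda x: readout @ x,
                                h=np.array([0.6, 0.3]),
                                w=np.array([0.9, 0.1]))
print(round(up, 3), round(down, 3))
```

If only one direction moved the output, the probe would be reading decoration rather than a mechanism — exactly the failure mode the next post's validation step is designed to catch.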
The geometry is computable. The probes are cheap. The question is whether what they measure is real.
Part 3 will answer that.
For more information, see these links:
Geometry of Trust Paper
Lesson Playlist
Lesson Notes
Code Repository

