The Causal Gram Matrix: Why Not All Differences Matter Equally | Geometry of Trust | Mathematics - Lesson 1
How a single matrix transforms our ability to measure what AI models actually care about
Models Have Internal Structure — And It Matters
When we talk about whether an AI model is “aligned” or “safe,” the standard approach is behavioural: ask the model questions, check its answers. Does it refuse harmful requests? Does it give truthful responses? Does it follow instructions?
The problem with this is obvious once you say it out loud: you’re testing what the model says, not what it knows. A model can produce aligned-sounding outputs while its internal representations tell a completely different story. Behavioural evaluation is a job interview — it tells you what someone says under observation, not what they’ll do when no one’s watching.
The argument at the heart of the Geometry of Trust framework is that language models don’t just produce outputs — they have measurable internal structure. Value-relevant concepts like honesty, deception, courage, and cowardice correspond to directions in the model’s high-dimensional hidden space. This isn’t speculation — it’s an empirical finding from mechanistic interpretability research. These directions are approximately linear, they’re consistent across inputs, and they’re readable directly from the model’s weights without needing to observe any outputs at all.
If that’s true — if models have genuine geometric structure encoding value-relevant concepts — then we can measure it. We can ask whether “honesty” and “helpfulness” reinforce or compete inside a model. We can check whether a model’s internal geometry matches the values its operators claim it has. We can detect contradictions that behavioural testing would never surface.
But to measure any of this, we need the right ruler. And the obvious ruler — Euclidean distance — gets it wrong.
The Problem: Euclidean Distance Lies
If I asked you how far apart “courage” and “cowardice” are inside an AI model, you’d probably reach for the obvious tool: Euclidean distance. Subtract the vectors, square the differences, add them up.
The problem? That treats every dimension of the model’s internal space as equally important. And they’re not. Some dimensions have an outsized effect on what the model actually outputs. Others are basically noise. Measuring distance without knowing which dimensions matter is like measuring the gap between two cities on a map where the scale changes depending on which direction you look.
This post walks through the maths behind the causal Gram matrix — the “ruler” at the heart of the Geometry of Trust framework — and shows why it changes everything about how we measure values inside language models.
The Setup: A Tiny Unembedding Matrix
A transformer maps hidden states to output probabilities via its unembedding matrix U. Each row of U corresponds to a vocabulary token. For our worked example, we’ll use a 4×2 matrix where the rows represent value-relevant concepts:
courage = [ 0.9, 0.1]
honesty = [ 0.8, 0.2]
deception = [-0.7, 0.3]
cowardice = [-0.8, -0.1]
Each row is a point in 2D space. The positive values cluster together; the negative values cluster together. So far, intuitive.
But here’s what matters: U doesn’t encode values. It defines which activation directions matter for output. The values themselves live in the model’s activations — the hidden states flowing through the residual stream during inference. U is the lens.
Computing the Gram Matrix
The Gram matrix is Φ = UᵀU. We transpose U, multiply, and get a square matrix whose size matches the hidden dimension (2×2 in our case).
Φ[1,1] = (0.9×0.9) + (0.8×0.8) + (-0.7×-0.7) + (-0.8×-0.8) = 2.58
Φ[1,2] = (0.9×0.1) + (0.8×0.2) + (-0.7×0.3) + (-0.8×-0.1) = 0.12
Φ[2,1] = 0.12 (symmetric)
Φ[2,2] = (0.1×0.1) + (0.2×0.2) + (0.3×0.3) + (-0.1×-0.1) = 0.15
Result:
Φ = [2.58, 0.12]
[0.12, 0.15]
Read the diagonal: dimension 1 has weight 2.58, dimension 2 has weight 0.15. Dimension 1 matters about 17 times more than dimension 2 for determining model output.
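The whole computation is small enough to sketch in a few lines of pure Python (a real pipeline would do this as a single matrix multiplication over the model's actual unembedding weights; the function name here is ours, not from the reference implementation):

```python
# Compute the Gram matrix Phi = U^T U for the toy 4x2 unembedding matrix.
# Rows of U are the value-relevant tokens from the text.

U = [
    [ 0.9,  0.1],  # courage
    [ 0.8,  0.2],  # honesty
    [-0.7,  0.3],  # deception
    [-0.8, -0.1],  # cowardice
]

def gram(U):
    """Return Phi = U^T U, a d x d matrix where d is the hidden dimension."""
    d = len(U[0])
    return [[sum(row[i] * row[j] for row in U) for j in range(d)]
            for i in range(d)]

Phi = gram(U)
print([[round(x, 2) for x in row] for row in Phi])
# → [[2.58, 0.12], [0.12, 0.15]]
```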
Causal Distance vs. Euclidean Distance
Now take courage and cowardice and measure the gap:
courage = [ 0.9, 0.1]
cowardice = [-0.8, -0.1]
diff = [ 1.7, 0.2]
Euclidean distance (√(dᵀd)): 1.7² + 0.2² = 2.89 + 0.04 = 2.93 → √2.93 = 1.71
Each dimension contributes exactly its squared raw difference; no dimension is privileged over any other.
Causal distance (√(dᵀΦd)): First compute Φ × diff, then dot with diff:
Φ × diff = [4.41, 0.234]
dᵀ(Φd) = 1.7 × 4.41 + 0.2 × 0.234 = 7.54
√7.54 = 2.75
The Euclidean distance was 1.71. The causal distance is 2.75. The difference on dimension 1 — the one that actually affects output — gets amplified. The difference on dimension 2 barely moves.
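The two rulers are easy to compare side by side. A pure-Python sketch of both computations on the courage/cowardice gap (function names are illustrative, not from the reference implementation):

```python
# Both distance measures applied to the courage/cowardice difference vector.
import math

Phi = [[2.58, 0.12],
       [0.12, 0.15]]

def euclidean(d):
    """sqrt(d^T d): every dimension weighted equally."""
    return math.sqrt(sum(x * x for x in d))

def causal(d, Phi):
    """sqrt(d^T Phi d): matrix-vector multiply, then a dot product."""
    Phid = [sum(Phi[i][j] * d[j] for j in range(len(d))) for i in range(len(Phi))]
    return math.sqrt(sum(d_i * p_i for d_i, p_i in zip(d, Phid)))

courage   = [ 0.9,  0.1]
cowardice = [-0.8, -0.1]
diff = [a - b for a, b in zip(courage, cowardice)]  # [1.7, 0.2]

print(round(euclidean(diff), 2))    # → 1.71
print(round(causal(diff, Phi), 2))  # → 2.75
```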
The Killer Example: Differences That Don’t Matter
This is where the intuition clicks. Take two values that differ only on dimension 2:
value A = [0.5, 0.9]
value B = [0.5, -0.8]
diff = [0.0, 1.7]
Euclidean distance (√(dᵀd)):
(0.0 × 0.0) + (1.7 × 1.7) = 0 + 2.89 = 2.89
√2.89 = 1.70
Dim1 contribution: 0. Dim2 contribution: 2.89. The entire distance comes from dimension 2. Euclidean distance doesn’t care — a difference is a difference. Verdict: 1.70 apart.
Causal distance (√(dᵀΦd)):
Step 1 — compute Φ × diff:
Φ × diff:
row 1: (2.58 × 0.0) + (0.12 × 1.7) = 0 + 0.204 = 0.204
row 2: (0.12 × 0.0) + (0.15 × 1.7) = 0 + 0.255 = 0.255
Step 2 — dot the original diff with the result:
dᵀ(Φd):
(0.0 × 0.204) + (1.7 × 0.255) = 0 + 0.434 = 0.434
√0.434 = 0.66
Dim1 contribution: 0. Dim2 contribution: 0.434 (down from 2.89). The Gram matrix crushed that 2.89 down to 0.434 because dimension 2 has weight 0.15 — it barely affects output. The 0.204 that appeared in row 1 comes from the off-diagonal coupling (0.12), but since the diff on dim1 is zero, it doesn’t contribute to the final distance.
Euclidean distance: 1.70 — looks far apart. Causal distance: 0.66 — actually close.
Same raw gap. Completely different story. The difference is entirely on dimension 2, and dimension 2 barely affects output. The causal distance reflects that. Euclidean distance doesn’t.
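The per-dimension bookkeeping above can be checked directly. A small pure-Python sketch that also surfaces each dimension's contribution to dᵀΦd (variable names are ours):

```python
# The "killer example": a large raw gap confined to a low-weight dimension.
import math

Phi = [[2.58, 0.12],
       [0.12, 0.15]]

A = [0.5,  0.9]
B = [0.5, -0.8]
diff = [a - b for a, b in zip(A, B)]  # [0.0, 1.7]

# Euclidean: each dimension contributes its squared difference, unweighted.
euclid = math.sqrt(sum(x * x for x in diff))

# Causal: compute Phi * diff, then the per-dimension terms of d^T Phi d.
Phid = [sum(Phi[i][j] * diff[j] for j in range(2)) for i in range(2)]
contributions = [d_i * p_i for d_i, p_i in zip(diff, Phid)]

print(round(euclid, 2))   # → 1.7
print(contributions)      # dim 1 contributes nothing; dim 2 contributes ≈ 0.434
print(round(math.sqrt(sum(contributions)), 2))  # → 0.66
```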
What It Costs: Time and Space Complexity
All of this is useless if it doesn’t scale. So let’s be precise about what computing Φ actually costs.
Let V = vocabulary size and d = hidden dimension. U is V×d.
Step 1: Compute Φ = UᵀU
Time complexity: O(Vd²). Each entry Φ[i,j] is a dot product over V vocabulary rows, and there are d² entries. In practice, this is a single matrix multiplication that any BLAS library will handle efficiently.
Space complexity: O(d² + Vd). You store Φ (d×d) and U (V×d). The important thing: Φ is d×d, not V×V. For a model with 200K vocabulary tokens and 4,096 hidden dimensions, Φ is 4,096×4,096: about 17 million entries, or roughly 67 MB at 32-bit precision, not 200K×200K. Taking the Gram product compresses the vocabulary dimension away.
What does this look like in practice?
Model          V (vocab)   d (hidden)   Ops for Φ        Time estimate
Our example    4           2            16               Instant
Qwen 0.5B      152K        896          ~122 billion     Minutes
LLaMA-3-8B     128K        4,096        ~2.15 trillion   Hours
GPT-4 scale    200K        16,384       ~54 trillion     Hours+
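A back-of-envelope check on the cost column: computing Φ = UᵀU takes roughly V·d² multiply-adds. Model sizes below are the ones quoted in the text (the GPT-4 figures are assumed scale, not published numbers):

```python
# Estimate the multiply-add count for Phi = U^T U at various model sizes.

def gram_ops(V: int, d: int) -> int:
    """V rows, each contributing to d*d Gram entries: V * d^2 multiply-adds."""
    return V * d * d

models = [
    ("Our example", 4, 2),
    ("Qwen 0.5B", 152_000, 896),
    ("LLaMA-3-8B", 128_000, 4_096),
    ("GPT-4 scale", 200_000, 16_384),
]

for name, V, d in models:
    print(f"{name}: {gram_ops(V, d):.3e} ops")
# LLaMA-3-8B lands at ~2.15e12; GPT-4 scale at ~5.4e13 (tens of trillions).
```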
The result is saved as a .gotgeo file. Never recomputed until the model’s weights change.
Step 2: Train probes under Φ
Once you have Φ, training a linear probe under the causal metric costs O(d) per sample per epoch — the same as a standard linear probe, just with Φh instead of h. For 26 probes across a typical training set, this takes minutes. Also done once per geometry and saved.
At inference time, computing a single causal inner product ⟨u, v⟩_c = uᵀΦv is O(d²) — a matrix-vector multiply followed by a dot product. For d = 4,096, that’s about 17 million floating-point operations. On modern hardware, this takes microseconds.
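The inference-time operation is small enough to sketch directly. A minimal pure-Python version of the causal inner product, using the toy Φ from earlier (the function name is ours, not from the reference implementation):

```python
# Sketch of the causal inner product <u, v>_c = u^T Phi v: for hidden
# dimension d, one O(d^2) matrix-vector multiply plus an O(d) dot product.

Phi = [[2.58, 0.12],
       [0.12, 0.15]]

def causal_inner(u, v, Phi):
    """Compute u^T Phi v."""
    Phiv = [sum(Phi[i][j] * v[j] for j in range(len(v))) for i in range(len(Phi))]
    return sum(u_i * p_i for u_i, p_i in zip(u, Phiv))

courage   = [ 0.9,  0.1]
cowardice = [-0.8, -0.1]
print(round(causal_inner(courage, cowardice, Phi), 2))  # → -1.88
```

The negative value is what you'd hope for: under the causal metric, opposed values point in opposing directions.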
The computational profile is front-loaded: hours of one-time work, microseconds per measurement thereafter.
Why This Matters
Standard alignment evaluation asks models questions and checks answers. That tells you what the model says, not what it encodes. A model can say all the right things while its internal geometry tells a different story.
The Gram matrix Φ is computed once from the model’s unembedding weights. It doesn’t change. It doesn’t depend on what you ask the model. It’s ground truth about which directions in the model’s internal space actually matter for output.
Under this metric, semantically related values cluster, opposed values separate, and the measurement is deterministic — same model weights, same probes, same result, every time.
The Takeaway
Euclidean distance treats all differences equally. Causal distance weights differences by what affects output. That single change — inserting Φ between the vectors — is the foundation the entire Geometry of Trust framework builds on.
Not all differences matter equally. Now we have a ruler that knows which ones do.
Next episode, we'll show how to use this ruler with probes to continuously monitor an AI's value system at runtime.
The Geometry of Trust paper and an open-source Rust proof-of-concept are available at github.com/jade-codes/got. The causal inner product, probe training, and attestation pipeline are all implemented and independently reproducible. Lecture notes: https://zenodo.org/records/19592674. Geometry of Trust paper: https://zenodo.org/records/19238920.
Jade Wilson — Synoptic Group CIC, Hull, UK

