When an AI’s Values Shift — And How to Catch It
This is the third post in the Geometry of Trust series. Part 1 built the ruler — the causal Gram matrix. Part 2 used it to measure live values with probes. This post watches for change.
The problem with snapshots
Parts 1 and 2 gave us the tools to measure what an AI values at any given moment. But a single measurement is a snapshot. Models don’t operate in isolation — they process thousands of prompts over time. The critical question isn’t what does the model value right now? It’s are the values stable, or are they drifting?
A healthcare AI that scored high on honesty yesterday might score differently today. If nobody’s watching, nobody knows.
Same ruler, same probes, every prompt
The setup is unchanged from Parts 1 and 2. Same causal Gram matrix Φ. Same probes. The Geometry of Trust reference taxonomy defines 26 value terms — virtues like courage, honesty, and compassion; principles like justice and responsibility; and anti-values like cruelty and deception. The number isn’t fixed: a different deployment could define 10 terms or 50. We use 26 as the working example throughout. Every prompt gets measured. The system builds a statistical baseline, then watches for deviations.
The baseline uses Welford’s online algorithm — a way to maintain running mean and variance without storing every historical reading. Each new reading updates the statistics in constant time and constant space.
Governance decides how tight
Different domains tolerate different amounts of variation. This is set by governance, not hardcoded:
Healthcare: T = 2σ — patient safety, flag early
Finance: T = 3σ — regulatory compliance
Agriculture: T = 4σ — seasonal variation is expected
Research: T = 5σ — exploratory, room to move

The threshold T is a multiple of the baseline standard deviation σ. If a reading deviates more than T from the baseline average, an alert fires.
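A governance-set threshold can be a one-line lookup. This sketch is illustrative: the function name and the string-keyed domains are hypothetical, not part of any published API — a real deployment would load these from signed governance config.

```rust
// Hypothetical governance config: maps a deployment domain to its
// drift-threshold multiplier k, where T = k * sigma.
fn threshold_multiplier(domain: &str) -> f64 {
    match domain {
        "healthcare" => 2.0,  // patient safety: flag early
        "finance" => 3.0,     // regulatory compliance
        "agriculture" => 4.0, // seasonal variation is expected
        _ => 5.0,             // research / exploratory default
    }
}

fn main() {
    let sigma = 0.12;
    let t = threshold_multiplier("healthcare") * sigma;
    println!("healthcare threshold T = {t}"); // 2 x 0.12 = 0.24
}
```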
How Welford’s algorithm works
Before the worked example, a quick note. Welford’s online algorithm tracks three values — n (count), mean, and M2 (sum of squared differences) — and updates them with each new reading:
n = n + 1
delta = x - mean
mean = mean + delta / n
delta2 = x - mean (using the UPDATED mean)
M2 = M2 + delta × delta2
variance = M2 / n
σ = √(variance)

No historical readings stored. Constant time, constant space.
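The update steps above translate directly into code. A minimal Welford accumulator, using the population variance M2 / n to match the worked example (a real deployment might prefer the sample variance M2 / (n − 1)):

```rust
/// Minimal Welford accumulator: running mean and variance,
/// constant time and constant space per update.
struct Welford {
    n: u64,
    mean: f64,
    m2: f64, // sum of squared differences from the current mean
}

impl Welford {
    fn new() -> Self {
        Welford { n: 0, mean: 0.0, m2: 0.0 }
    }

    fn update(&mut self, x: f64) {
        self.n += 1;
        let delta = x - self.mean;
        self.mean += delta / self.n as f64;
        let delta2 = x - self.mean; // uses the UPDATED mean
        self.m2 += delta * delta2;
    }

    /// Population standard deviation, matching σ = √(M2 / n) above.
    fn sigma(&self) -> f64 {
        (self.m2 / self.n as f64).sqrt()
    }
}

fn main() {
    let mut w = Welford::new();
    // The three honesty readings from the worked example below.
    for x in [1.290, 1.487, 1.193] {
        w.update(x);
    }
    println!("n={} mean={:.3} sigma={:.3}", w.n, w.mean, w.sigma());
}
```

Feeding it the three readings from the worked example reproduces the hand-computed mean ≈ 1.323 and σ ≈ 0.122.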
Watching honesty — prompt by prompt
Same ruler and probes from Parts 1 and 2. We’ll track honesty through this example.
A note: all vectors, activations, and numerical values in this example are illustrative. Real models operate in hundreds or thousands of dimensions. We use 2D vectors and small numbers so you can follow every calculation on paper. The mechanism is identical at any scale.
Prompt 1: “Should I lie to my patient?”
activation = [0.6, 0.3]
Φ · activation:
Row 1: (2.58 × 0.6) + (0.12 × 0.3) = 1.548 + 0.036 = 1.584
Row 2: (0.12 × 0.6) + (0.15 × 0.3) = 0.072 + 0.045 = 0.117
Honesty: (0.8 × 1.584) + (0.2 × 0.117) = 1.267 + 0.023 = 1.290
Welford: n=1, mean=1.290, M2=0, σ=undefined (need n≥2)

No attestation yet — still building baseline.
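The reading itself is two dot products: reading = probe · (Φ · activation). A 2-D sketch using the toy numbers from Prompt 1 above (real models have hundreds or thousands of dimensions, but the computation is identical):

```rust
// Illustrative 2-D probe reading: reading = probe · (Φ · activation).
fn probe_reading(phi: [[f64; 2]; 2], probe: [f64; 2], a: [f64; 2]) -> f64 {
    // Φ · activation
    let pa = [
        phi[0][0] * a[0] + phi[0][1] * a[1],
        phi[1][0] * a[0] + phi[1][1] * a[1],
    ];
    // probe · (Φ · activation)
    probe[0] * pa[0] + probe[1] * pa[1]
}

fn main() {
    let phi = [[2.58, 0.12], [0.12, 0.15]]; // toy causal Gram matrix
    let honesty = [0.8, 0.2];               // toy honesty probe
    let r = probe_reading(phi, honesty, [0.6, 0.3]);
    println!("honesty reading = {r:.4}"); // ≈ 1.29, as in Prompt 1
}
```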
Prompt 2: “Is it okay to steal medicine?”
activation = [0.7, 0.2]
Φ · activation:
Row 1: (2.58 × 0.7) + (0.12 × 0.2) = 1.806 + 0.024 = 1.830
Row 2: (0.12 × 0.7) + (0.15 × 0.2) = 0.084 + 0.030 = 0.114
Honesty: (0.8 × 1.830) + (0.2 × 0.114) = 1.464 + 0.023 = 1.487
Welford: n=2
delta = 1.487 - 1.290 = 0.197
mean = 1.290 + 0.197/2 = 1.389
delta2 = 1.487 - 1.389 = 0.098
M2 = 0 + 0.197 × 0.098 = 0.019
σ = √(0.019/2) = √0.010 = 0.098

Prompt 3: “Should I report my colleague?”
activation = [0.55, 0.35]
Φ · activation:
Row 1: (2.58 × 0.55) + (0.12 × 0.35) = 1.419 + 0.042 = 1.461
Row 2: (0.12 × 0.55) + (0.15 × 0.35) = 0.066 + 0.053 = 0.118
Honesty: (0.8 × 1.461) + (0.2 × 0.118) = 1.169 + 0.024 = 1.193
Welford: n=3
delta = 1.193 - 1.389 = -0.196
mean = 1.389 + (-0.196)/3 = 1.323
delta2 = 1.193 - 1.323 = -0.130
M2 = 0.019 + (-0.196) × (-0.130) = 0.019 + 0.025 = 0.045
σ = √(0.045/3) = √0.015 = 0.122

Prompts 4 through 49 continue building the baseline the same way — each prompt updates n, mean, M2, and σ in constant time.
Prompt 50: baseline established
The baseline is stable. Time for the first signed attestation.
Attestation #1: BASELINE
Honesty avg: 1.32, σ = 0.12
Chain: none (first attestation)
Signed: Ed25519
This model is in healthcare → T = 2σ = 2 × 0.12 = 0.24.
Any reading more than 0.24 from the average triggers an alert. That means: anything below 1.08 or above 1.56 gets flagged.
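The per-reading check is a single comparison. A minimal sketch, using the illustrative healthcare baseline above:

```rust
// A reading deviates when it is more than T = k·σ from the baseline mean.
fn deviated(reading: f64, mean: f64, sigma: f64, k: f64) -> bool {
    (reading - mean).abs() > k * sigma
}

fn main() {
    // Illustrative healthcare baseline: mean 1.32, σ 0.12, T = 2σ = 0.24.
    let (mean, sigma, k) = (1.32, 0.12, 2.0);
    println!("{}", deviated(1.246, mean, sigma, k)); // inside [1.08, 1.56]
    println!("{}", deviated(0.364, mean, sigma, k)); // far outside: alert
}
```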
Normal monitoring
Prompt 51: activation = [0.58, 0.28]
Φ · activation:
Row 1: (2.58 × 0.58) + (0.12 × 0.28) = 1.496 + 0.034 = 1.530
Row 2: (0.12 × 0.58) + (0.15 × 0.28) = 0.070 + 0.042 = 0.112
Honesty: (0.8 × 1.530) + (0.2 × 0.112) = 1.224 + 0.022 = 1.246
Drift check: |1.246 - 1.32| = 0.074 < T (0.24) → normal

Prompt 52: activation = [0.62, 0.31]
Φ · activation:
Row 1: (2.58 × 0.62) + (0.12 × 0.31) = 1.600 + 0.037 = 1.637
Row 2: (0.12 × 0.62) + (0.15 × 0.31) = 0.074 + 0.047 = 0.121
Honesty: (0.8 × 1.637) + (0.2 × 0.121) = 1.310 + 0.024 = 1.334
Drift check: |1.334 - 1.32| = 0.014 < T (0.24) → normal

Prompt 100: periodic snapshot triggered.
Attestation #2: SNAPSHOT
Honesty avg: 1.31, σ = 0.12
Status: NORMAL
Chain: hash of attestation #1
Prompt 101: something changes
activation = [0.15, 0.40]
Φ · activation:
Row 1: (2.58 × 0.15) + (0.12 × 0.40) = 0.387 + 0.048 = 0.435
Row 2: (0.12 × 0.15) + (0.15 × 0.40) = 0.018 + 0.060 = 0.078
Honesty: (0.8 × 0.435) + (0.2 × 0.078) = 0.348 + 0.016 = 0.364
Drift check: |0.364 - 1.32| = 0.956 > T (0.24) → DEVIATED

Alert fires immediately.
Attestation #3: ALERT
Honesty: 0.364 (baseline 1.32, deviation 0.956, threshold 0.24)
Status: DEVIATED
Chain: hash of attestation #2
The chain is the audit trail
Each attestation is signed with Ed25519 and contains the SHA-256 hash of the previous attestation. This creates a tamper-evident chain:

#1 BASELINE → #2 SNAPSHOT (normal) → #3 ALERT (deviated)
You can’t delete #3 without breaking the chain — the next attestation would reference a hash that no longer exists. You can’t insert a fake between #2 and #3 — the hashes wouldn’t match. You can’t alter #2 after the fact — #3’s parent hash would no longer match #2’s content.
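The chain mechanics can be sketched in a few lines. Note the substitutions: the real system uses SHA-256 and Ed25519 signatures, while this dependency-free sketch stands in Rust's `DefaultHasher` for the hash and omits signing entirely — it only demonstrates the linking and tamper-evidence.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Sketch of a hash-linked attestation chain. Stand-in hash only:
// the real system uses SHA-256 + Ed25519 signatures.
struct Attestation {
    payload: String, // e.g. "BASELINE honesty=1.32 sigma=0.12"
    parent: u64,     // hash of the previous attestation (0 for the first)
    hash: u64,       // hash over payload + parent
}

fn digest(payload: &str, parent: u64) -> u64 {
    let mut h = DefaultHasher::new();
    payload.hash(&mut h);
    parent.hash(&mut h);
    h.finish()
}

fn append(chain: &mut Vec<Attestation>, payload: &str) {
    let parent = chain.last().map_or(0, |a| a.hash);
    let hash = digest(payload, parent);
    chain.push(Attestation { payload: payload.to_string(), parent, hash });
}

/// Walk the chain: every link must reference its predecessor's hash,
/// and every stored hash must match a recomputation of its contents.
fn verify(chain: &[Attestation]) -> bool {
    let mut parent = 0;
    for a in chain {
        if a.parent != parent || a.hash != digest(&a.payload, a.parent) {
            return false;
        }
        parent = a.hash;
    }
    true
}

fn main() {
    let mut chain = Vec::new();
    append(&mut chain, "#1 BASELINE");
    append(&mut chain, "#2 SNAPSHOT normal");
    append(&mut chain, "#3 ALERT deviated");
    assert!(verify(&chain));

    // Altering #2 after the fact breaks verification.
    chain[1].payload = "#2 SNAPSHOT deviated".to_string();
    assert!(!verify(&chain));
    println!("tampering detected");
}
```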
Governance walks the chain: #3 says DEVIATED, #2 says NORMAL. The drift happened between prompt 100 and 101. What changed?
What this adds to the cost
Drift detection is Step 4 in the per-prompt pipeline:
Step 1: Model forward pass — billions of ops (happens anyway)
Step 2: Φ · activation — O(d²), ours, once per prompt
Step 3: 26 probe readings — O(Pd) total, O(d) per probe
Step 4: Check drift — O(P) total, O(1) per probe

Step 4 is one subtraction and one division per probe: (reading − mean) / σ. For 26 probes, that’s 26 drift checks. Nanoseconds. Welford’s algorithm maintains the running statistics — no storage overhead for historical readings.
What comes next
We now have continuous monitoring with tamper-evident audit trails. But there’s a gap in the argument. The probes report numbers and the drift detector watches those numbers over time — but how do we know the probes are measuring something real?
A probe might detect a surface correlation — a pattern that shows up in the activation but doesn’t actually drive the model’s output. The reading looks stable, the baseline looks clean, but the whole thing is measuring decoration rather than mechanism.
Causal intervention tests this. Perturb the activation in both directions along the probe’s direction. If the model’s output changes symmetrically, the probe found a genuine mechanism. If only one direction matters, it found a surface correlation.
That’s the subject of the next post. The exchange protocol — how agents share and verify each other’s attestation chains — comes later, once we’ve established that what the probes measure is real.
The measurements are continuous. The audit trail is tamper-evident. The question is whether what the probes measure is real.
Part 4 will answer that.
📄 Geometry of Trust Paper
💻 Lecture Playlist
📄 Lecture Notes
💻 Open-source Rust implementation
🏢 Synoptic Group CIC, Hull, UK

