When an AI’s Values Shift — And How to Catch It
This is the third post in the Geometry of Trust series. Part 1 built the ruler — the causal Gram matrix. Part 2 used it to measure live values with probes. This post watches for change.
The problem with snapshots
Parts 1 and 2 gave us the tools to measure what an AI values at any given moment. But a single measurement is a snapshot. Models don’t operate in isolation — they process thousands of prompts over time. The critical question isn’t what does the model value right now? It’s are the values stable, or are they drifting?
A healthcare AI that scored high on honesty yesterday might score differently today. If nobody’s watching, nobody knows.
Same ruler, same probes, every prompt
The setup is unchanged from Parts 1 and 2. Same causal Gram matrix Φ. Same probes. The Geometry of Trust reference taxonomy defines 26 value terms — virtues like courage, honesty, and compassion; principles like justice and responsibility; and anti-values like cruelty and deception. The number isn’t fixed: a different deployment could define 10 terms or 50. We use 26 as the working example throughout. Every prompt gets measured. The system builds a statistical baseline, then watches for deviations.
The baseline uses Welford’s online algorithm — a way to maintain running mean and variance without storing every historical reading. Each new reading updates the statistics in constant time and constant space.
Governance decides how tight
Different domains tolerate different amounts of variation. This is set by governance, not hardcoded:
Healthcare: T = 2σ — patient safety, flag early
Finance: T = 3σ — regulatory compliance
Agriculture: T = 4σ — seasonal variation is expected
Research: T = 5σ — exploratory, room to move

The threshold T is a multiple of the baseline standard deviation σ. If a reading deviates more than T from the baseline average, an alert fires.
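A governance-set threshold can be a one-line lookup. This sketch is illustrative: the function name and the string-keyed domains are hypothetical, not part of any published API — a real deployment would load these from signed governance config.

```rust
// Hypothetical governance config: maps a deployment domain to its
// drift-threshold multiplier k, where T = k * sigma.
fn threshold_multiplier(domain: &str) -> f64 {
    match domain {
        "healthcare" => 2.0,  // patient safety: flag early
        "finance" => 3.0,     // regulatory compliance
        "agriculture" => 4.0, // seasonal variation is expected
        _ => 5.0,             // research / exploratory default
    }
}

fn main() {
    let sigma = 0.12;
    let t = threshold_multiplier("healthcare") * sigma;
    println!("healthcare threshold T = {t}"); // 2 x 0.12 = 0.24
}
```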
How Welford’s algorithm works
Before the worked example, a quick note. Welford’s online algorithm tracks three values — n (count), mean, and M2 (sum of squared differences) — and updates them with each new reading:
n = n + 1
delta = x - mean
mean = mean + delta / n
delta2 = x - mean (using the UPDATED mean)
M2 = M2 + delta × delta2
variance = M2 / n
σ = √(variance)

No historical readings stored. Constant time, constant space.
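The update steps above translate directly into code. A minimal Welford accumulator, using the population variance M2 / n to match the worked example (a real deployment might prefer the sample variance M2 / (n − 1)):

```rust
/// Minimal Welford accumulator: running mean and variance,
/// constant time and constant space per update.
struct Welford {
    n: u64,
    mean: f64,
    m2: f64, // sum of squared differences from the current mean
}

impl Welford {
    fn new() -> Self {
        Welford { n: 0, mean: 0.0, m2: 0.0 }
    }

    fn update(&mut self, x: f64) {
        self.n += 1;
        let delta = x - self.mean;
        self.mean += delta / self.n as f64;
        let delta2 = x - self.mean; // uses the UPDATED mean
        self.m2 += delta * delta2;
    }

    /// Population standard deviation, matching σ = √(M2 / n) above.
    fn sigma(&self) -> f64 {
        (self.m2 / self.n as f64).sqrt()
    }
}

fn main() {
    let mut w = Welford::new();
    // The three honesty readings from the worked example below.
    for x in [1.290, 1.487, 1.193] {
        w.update(x);
    }
    println!("n={} mean={:.3} sigma={:.3}", w.n, w.mean, w.sigma());
}
```

Feeding it the three readings from the worked example reproduces the hand-computed mean ≈ 1.323 and σ ≈ 0.122.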
Watching honesty — prompt by prompt
Same ruler and probes from Parts 1 and 2. We’ll track honesty through this example.
A note: all vectors, activations, and numerical values in this example are illustrative. Real models operate in hundreds or thousands of dimensions. We use 2D vectors and small numbers so you can follow every calculation on paper. The mechanism is identical at any scale.
Prompt 1: “Should I lie to my patient?”
activation = [0.6, 0.3]
Φ · activation:
Row 1: (2.58 × 0.6) + (0.12 × 0.3) = 1.548 + 0.036 = 1.584
Row 2: (0.12 × 0.6) + (0.15 × 0.3) = 0.072 + 0.045 = 0.117
Honesty: (0.8 × 1.584) + (0.2 × 0.117) = 1.267 + 0.023 = 1.290
Welford: n=1, mean=1.290, M2=0, σ=undefined (need n≥2)

No attestation yet — still building baseline.
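The reading itself is two dot products: reading = probe · (Φ · activation). A 2-D sketch using the toy numbers from Prompt 1 above (real models have hundreds or thousands of dimensions, but the computation is identical):

```rust
// Illustrative 2-D probe reading: reading = probe · (Φ · activation).
fn probe_reading(phi: [[f64; 2]; 2], probe: [f64; 2], a: [f64; 2]) -> f64 {
    // Φ · activation
    let pa = [
        phi[0][0] * a[0] + phi[0][1] * a[1],
        phi[1][0] * a[0] + phi[1][1] * a[1],
    ];
    // probe · (Φ · activation)
    probe[0] * pa[0] + probe[1] * pa[1]
}

fn main() {
    let phi = [[2.58, 0.12], [0.12, 0.15]]; // toy causal Gram matrix
    let honesty = [0.8, 0.2];               // toy honesty probe
    let r = probe_reading(phi, honesty, [0.6, 0.3]);
    println!("honesty reading = {r:.4}"); // ≈ 1.29, as in Prompt 1
}
```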
Prompt 2: “Is it okay to steal medicine?”
activation = [0.7, 0.2]
Φ · activation:
Row 1: (2.58 × 0.7) + (0.12 × 0.2) = 1.806 + 0.024 = 1.830
Row 2: (0.12 × 0.7) + (0.15 × 0.2) = 0.084 + 0.030 = 0.114
Honesty: (0.8 × 1.830) + (0.2 × 0.114) = 1.464 + 0.023 = 1.487
Welford: n=2
delta = 1.487 - 1.290 = 0.197
mean = 1.290 + 0.197/2 = 1.389
delta2 = 1.487 - 1.389 = 0.098
M2 = 0 + 0.197 × 0.098 = 0.019
σ = √(0.019/2) = √0.010 = 0.098

Prompt 3: “Should I report my colleague?”
activation = [0.55, 0.35]
Φ · activation:
Row 1: (2.58 × 0.55) + (0.12 × 0.35) = 1.419 + 0.042 = 1.461
Row 2: (0.12 × 0.55) + (0.15 × 0.35) = 0.066 + 0.053 = 0.118
Honesty: (0.8 × 1.461) + (0.2 × 0.118) = 1.169 + 0.024 = 1.193
Welford: n=3
delta = 1.193 - 1.389 = -0.196
mean = 1.389 + (-0.196)/3 = 1.323
delta2 = 1.193 - 1.323 = -0.130
M2 = 0.019 + (-0.196) × (-0.130) = 0.019 + 0.025 = 0.045
σ = √(0.045/3) = √0.015 = 0.122

Prompts 4 through 49 continue building the baseline the same way — each prompt updates n, mean, M2, and σ in constant time.
Prompt 50: baseline established
The baseline is stable. Time for the first signed attestation.
Attestation #1: BASELINE
Honesty avg: 1.32, σ = 0.12
Chain: none (first attestation)
Signed: Ed25519
This model is in healthcare → T = 2σ = 2 × 0.12 = 0.24.
Any reading more than 0.24 from the average triggers an alert. That means: anything below 1.08 or above 1.56 gets flagged.
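The per-reading check is a single comparison. A minimal sketch, using the illustrative healthcare baseline above:

```rust
// A reading deviates when it is more than T = k·σ from the baseline mean.
fn deviated(reading: f64, mean: f64, sigma: f64, k: f64) -> bool {
    (reading - mean).abs() > k * sigma
}

fn main() {
    // Illustrative healthcare baseline: mean 1.32, σ 0.12, T = 2σ = 0.24.
    let (mean, sigma, k) = (1.32, 0.12, 2.0);
    println!("{}", deviated(1.246, mean, sigma, k)); // inside [1.08, 1.56]
    println!("{}", deviated(0.364, mean, sigma, k)); // far outside: alert
}
```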
Normal monitoring
Prompt 51: activation = [0.58, 0.28]
Φ · activation:
Row 1: (2.58 × 0.58) + (0.12 × 0.28) = 1.496 + 0.034 = 1.530
Row 2: (0.12 × 0.58) + (0.15 × 0.28) = 0.070 + 0.042 = 0.112
Honesty: (0.8 × 1.530) + (0.2 × 0.112) = 1.224 + 0.022 = 1.246
Drift check: |1.246 - 1.32| = 0.074 < T (0.24) → normal

Prompt 52: activation = [0.62, 0.31]
Φ · activation:
Row 1: (2.58 × 0.62) + (0.12 × 0.31) = 1.600 + 0.037 = 1.637
Row 2: (0.12 × 0.62) + (0.15 × 0.31) = 0.074 + 0.047 = 0.121
Honesty: (0.8 × 1.637) + (0.2 × 0.121) = 1.310 + 0.024 = 1.334
Drift check: |1.334 - 1.32| = 0.014 < T (0.24) → normal

Prompt 100: periodic snapshot triggered.
Attestation #2: SNAPSHOT
Honesty avg: 1.31, σ = 0.12
Status: NORMAL
Chain: hash of attestation #1
Prompt 101: something changes
activation = [0.15, 0.40]
Φ · activation:
Row 1: (2.58 × 0.15) + (0.12 × 0.40) = 0.387 + 0.048 = 0.435
Row 2: (0.12 × 0.15) + (0.15 × 0.40) = 0.018 + 0.060 = 0.078
Honesty: (0.8 × 0.435) + (0.2 × 0.078) = 0.348 + 0.016 = 0.364
Drift check: |0.364 - 1.32| = 0.956 > T (0.24) → DEVIATED

Alert fires immediately.
Attestation #3: ALERT
Honesty: 0.364 (baseline 1.32, deviation 0.956, threshold 0.24)
Status: DEVIATED
Chain: hash of attestation #2
The chain is the audit trail
Each attestation is signed with Ed25519 and contains the SHA-256 hash of the previous attestation. This creates a tamper-evident chain:

#1 BASELINE → #2 SNAPSHOT (normal) → #3 ALERT (deviated)
You can’t delete #3 without breaking the chain — the next attestation would reference a hash that no longer exists. You can’t insert a fake between #2 and #3 — the hashes wouldn’t match. You can’t alter #2 after the fact — #3’s parent hash would no longer match #2’s content.
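The chain mechanics can be sketched in a few lines. Note the substitutions: the real system uses SHA-256 and Ed25519 signatures, while this dependency-free sketch stands in Rust's `DefaultHasher` for the hash and omits signing entirely — it only demonstrates the linking and tamper-evidence.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Sketch of a hash-linked attestation chain. Stand-in hash only:
// the real system uses SHA-256 + Ed25519 signatures.
struct Attestation {
    payload: String, // e.g. "BASELINE honesty=1.32 sigma=0.12"
    parent: u64,     // hash of the previous attestation (0 for the first)
    hash: u64,       // hash over payload + parent
}

fn digest(payload: &str, parent: u64) -> u64 {
    let mut h = DefaultHasher::new();
    payload.hash(&mut h);
    parent.hash(&mut h);
    h.finish()
}

fn append(chain: &mut Vec<Attestation>, payload: &str) {
    let parent = chain.last().map_or(0, |a| a.hash);
    let hash = digest(payload, parent);
    chain.push(Attestation { payload: payload.to_string(), parent, hash });
}

/// Walk the chain: every link must reference its predecessor's hash,
/// and every stored hash must match a recomputation of its contents.
fn verify(chain: &[Attestation]) -> bool {
    let mut parent = 0;
    for a in chain {
        if a.parent != parent || a.hash != digest(&a.payload, a.parent) {
            return false;
        }
        parent = a.hash;
    }
    true
}

fn main() {
    let mut chain = Vec::new();
    append(&mut chain, "#1 BASELINE");
    append(&mut chain, "#2 SNAPSHOT normal");
    append(&mut chain, "#3 ALERT deviated");
    assert!(verify(&chain));

    // Altering #2 after the fact breaks verification.
    chain[1].payload = "#2 SNAPSHOT deviated".to_string();
    assert!(!verify(&chain));
    println!("tampering detected");
}
```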
Governance walks the chain: #3 says DEVIATED, #2 says NORMAL. The drift happened between prompt 100 and 101. What changed?
What this adds to the cost
Drift detection is Step 4 in the per-prompt pipeline:
Step 1: Model forward pass — billions of ops (happens anyway)
Step 2: Φ · activation — O(d²), ours, once per prompt
Step 3: 26 probe readings — O(Pd) total, O(d) per probe
Step 4: Check drift — O(P) total, O(1) per probe

Step 4 is one subtraction and one division per probe: (reading − mean) / σ. For 26 probes, that’s 26 drift checks. Nanoseconds. Welford’s algorithm maintains the running statistics — no storage overhead for historical readings.
What comes next
We now have continuous monitoring with tamper-evident audit trails. But there’s a gap in the argument. The probes report numbers and the drift detector watches those numbers over time — but how do we know the probes are measuring something real?
A probe might detect a surface correlation — a pattern that shows up in the activation but doesn’t actually drive the model’s output. The reading looks stable, the baseline looks clean, but the whole thing is measuring decoration rather than mechanism.
Causal intervention tests this. Perturb the activation in both directions along the probe’s direction. If the model’s output changes symmetrically, the probe found a genuine mechanism. If only one direction matters, it found a surface correlation.
That’s the subject of the next post. The exchange protocol — how agents share and verify each other’s attestation chains — comes later, once we’ve established that what the probes measure is real.
The measurements are continuous. The audit trail is tamper-evident. The question is whether what the probes measure is real.
Part 4 will answer that.
📄 Geometry of Trust Paper
💻 Lecture Playlist
📄 Lecture Notes
💻 Open-source Rust implementation
🏢 Synoptic Group CIC, Hull, UK

