Shaped by Training: What Really Sets a Model's Values | Geometry of Trust | Philosophy - Lesson 3
This is the third post in the Geometry of Trust philosophy series. This post asks what actually shapes each AI value system.
How each system comes to its values
A forest’s value system is shaped by soil type, climate, altitude, and the species that happen to be present. Change the soil and you get a different forest with different relationships between its components. A wolf pack’s value system is shaped by territory size, prey availability, and pack size. Change the territory and the behaviour patterns change with it.
An AI’s value system is shaped by three things, which together determine where it lands in the value space:
Corpus — what it read
Architecture — how it processes what it read
Training objective — what it was rewarded for during training
Each of these is a decision. None of them is a purely technical one.
Corpus — what the model read
The corpus is the soil the model grows in. Everything the model knows about values came through this soil.
English internet text → English internet values
Medical journals → Clinical caution, patient safety
Chinese social media → A different cultural geometry
Legal documents → Procedural fairness, precedent
Religious texts → Duty, obedience, transcendence
Reddit → Whatever Reddit values
Different soil, different value geometry. You don’t get to choose after planting. Once the model has been trained, the corpus is baked in — the geometry it produced is the geometry you have.
This is why two models trained on different corpora can sit in different regions of the value space even when they share everything else. A medical-first model trained on clinical literature is not the same as a general-purpose model fine-tuned for medicine. The soil was different. The geometry is different. The measurements — from the mathematics series — will show it.
Architecture — how the model processes what it read
Two models can read the same corpus and end up with different value geometries because they process text differently. Architecture isn’t a neutral technical choice — it’s a decision about what kinds of value structures the model is even capable of representing.
Dense transformer (GPT, Claude). One shared representation space. Every concept relates to every other concept through the same attention mechanism. When the model processes “honesty,” it can attend to everything it knows about courage, integrity, fairness, and dishonesty all at once. Value relationships form in one coherent space. Structural consequence: value geometry tends to be coherent. Reinforcing and opposing relationships between value terms can form stable patterns across the whole space.
Mixture-of-Experts (Mixtral, DeepSeek). Routes different tokens through different subnetworks. When the model processes “honesty,” it may activate one expert; when it processes “fairness,” it may activate a different one. The experts share some information at the output, but the internal representations are at least partly separate. Structural consequence: value representations can fragment. Honesty might live largely in one expert, fairness in another, courage in a third. The relationship between them is weaker because they don’t share the same computational substrate.
Multimodal (Gemini, GPT-4o). Integrates text, image, and audio in a single representation space. Can see suffering in an image and read about it in text and process both through the same geometry. Cross-modal relationships become part of the value structure. Structural consequence: richer value geometry than text-only models. The look of distress and the words for distress anchor each other.
Architecture is a values decision, not just a technical one. Some architectures can’t hold coherent value geometry regardless of how good the data or alignment are. Choosing an architecture is choosing a ceiling on how well the model can represent relationships between values.
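The structural point can be illustrated with a deliberately toy sketch. Nothing below is a real model: the vectors, dimensions, and random maps are invented for illustration only. A single shared linear map (standing in for a dense architecture's shared representation space) roughly preserves the relationship between two nearby value vectors, while routing each value through an independent map (standing in for expert routing) does not:

```python
import numpy as np

# Toy illustration, not a real model: value concepts as vectors,
# an architecture as the map(s) those vectors pass through.
rng = np.random.default_rng(0)
dim = 64

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

honesty = rng.normal(size=dim)
fairness = honesty + 0.3 * rng.normal(size=dim)  # a related value: a nearby vector

# Dense transformer: every concept passes through the same shared map,
# so the relationship between the two values is roughly preserved.
shared = rng.normal(size=(dim, dim))
dense_sim = cosine(shared @ honesty, shared @ fairness)

# MoE-style routing: each value happens to be routed through a different
# expert, so the relationship is no longer carried by a shared substrate.
expert_a = rng.normal(size=(dim, dim))
expert_b = rng.normal(size=(dim, dim))
moe_sim = cosine(expert_a @ honesty, expert_b @ fairness)

print(f"shared map:  cos(honesty, fairness) = {dense_sim:.2f}")
print(f"routed maps: cos(honesty, fairness) = {moe_sim:.2f}")
```

The numbers are synthetic, but the mechanism is the one described above: relationships survive only where the computation is shared.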
Training objective — what the model was rewarded for
The third shaper is what the model was optimised against during training. Different objectives produce different value geometries even when corpus and architecture are held constant.
Next-token prediction. The foundational training objective: predict the next word given the previous words. This sounds like a purely linguistic task, but it isn’t. To predict the next word well, the model has to encode the structure of meaning — including value relationships — because those relationships help predict what comes next. The model learns values implicitly, as a side-effect of predicting language well. The geometry that emerges is whatever best supports next-token prediction across the corpus.
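A tiny sketch makes this concrete. The corpus below is invented, but even its bigram statistics, the crudest possible next-token model, already tie each value word to its consequences, because encoding those ties is exactly what the prediction objective rewards:

```python
from collections import Counter

# Toy corpus (invented for illustration): next-token statistics
# already encode relationships between value words.
corpus = ("honesty builds trust . deception erodes trust . "
          "honesty requires courage . deception requires concealment .").split()

bigrams = Counter(zip(corpus, corpus[1:]))

def next_token_dist(word):
    """Empirical next-token distribution: the target the objective optimises."""
    follows = {b: c for (a, b), c in bigrams.items() if a == word}
    total = sum(follows.values())
    return {w: c / total for w, c in follows.items()}

print(next_token_dist("honesty"))    # {'builds': 0.5, 'requires': 0.5}
print(next_token_dist("deception"))  # {'erodes': 0.5, 'requires': 0.5}
```

Predicting what follows "honesty" well already requires representing what honesty relates to; scaled up across a real corpus, that pressure is what carves out the value geometry.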
Reasoning chains (DeepSeek-R1, GRPO). Optimises for coherent multi-step logical chains rather than individual tokens. This can produce a different value geometry — sharper internal distinctions between values, because inconsistent value handling tends to break logical chains, whereas next-token prediction can tolerate more local fuzziness.
Constitutional AI (Claude). Claude is trained in part against a fixed set of written principles — the constitution. The model evaluates its own outputs against those principles and is trained to prefer outputs that comply. This optimises toward a coherent position on the value manifold — whichever position the constitution points to. The constitution acts like a gravity well in the value space.
Standard RLHF. The most widely used alignment technique. Human annotators are shown pairs of outputs and asked which is better. Their preferences are aggregated into a scalar reward model that the AI is then optimised against.
There’s a subtle problem here worth making explicit: the aggregation strips information. If annotators agreed nine-to-one that output A was better, the preference label the reward model trains on is the same as if they split six-to-four. The scalar score retains no record of whether annotators agreed, disagreed, or split bimodally across different value positions.
If annotators hold coherent shared values, the average is a coherent value position. If annotators hold divergent values — as they do on most genuinely contested questions — the average may match no coherent value position at all. The model is trained to output the centre of a distribution that doesn’t have a meaningful centre. The resulting geometry can be an artefact of aggregation rather than a reflection of any coherent set of values.
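A minimal sketch of the aggregation step (not any production RLHF pipeline; the vote encoding is an assumption for illustration) shows the information loss directly:

```python
def aggregate_label(votes):
    """Collapse annotator votes into the single preference label a reward
    model would train on: 'A' if the majority preferred output A."""
    return "A" if sum(votes) > 0 else "B"

strong_agreement = [+1] * 9 + [-1] * 1  # nine of ten annotators prefer A
narrow_split     = [+1] * 6 + [-1] * 4  # a six-to-four split across value positions

print(aggregate_label(strong_agreement))  # A
print(aggregate_label(narrow_split))      # A  (same label; the disagreement is erased)
```

Both distributions collapse to the same training signal, so the reward model cannot distinguish consensus from contestation.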
The finding that changes everything
Here’s the part of this post with the biggest implication for how we think about AI alignment.
A growing body of research shows that post-hoc alignment methods — RLHF, DPO, supervised fine-tuning — change far less than most people assume. Qi et al. (2025) demonstrated that the behavioural shift from safety alignment concentrates in the first few output tokens — the KL divergence between aligned and base models decays to near-zero beyond a shallow prefix. A subsequent gradient analysis showed this isn’t a training failure to be fixed — it’s a structural consequence of how RLHF and DPO objectives work. Alignment is shallow because the objective makes it shallow.
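The shape of that measurement can be sketched with synthetic numbers. A real measurement would compare next-token distributions from the base and aligned checkpoints on shared prefixes; the distributions below are simulated stand-ins:

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) between two next-token distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Synthetic per-position distributions standing in for the base model
# over 10 token positions and a 50-token vocabulary.
rng = np.random.default_rng(1)
base = rng.dirichlet(np.ones(50), size=10)

aligned = base.copy()
# Shallow alignment: perturb only the first few positions' distributions.
for t in range(3):
    shift = rng.dirichlet(np.ones(50))
    aligned[t] = 0.5 * base[t] + 0.5 * shift

per_position_kl = [kl(aligned[t], base[t]) for t in range(10)]
for t, d in enumerate(per_position_kl):
    print(f"position {t}: KL = {d:.3f}")
# KL is non-zero only over the shallow prefix; later positions are unchanged.
```

In the shallow-alignment picture, the measured curves look like this synthetic one: substantial divergence over the first few positions, near zero afterwards.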
In the Geometry of Trust protocol, this finding has a precise geometric interpretation. When we measure the causal Gram matrix Φ and run probes before and after alignment, across multiple alignment methods and model architectures, the value geometry — the pattern of reinforcing and opposing relationships between value-relevant directions — is essentially unchanged. What shifts is surface behaviour: which outputs the model prefers to produce. The underlying geometry that generated those outputs remains where training put it.
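As a sketch of that comparison (the probed directions here are simulated; in the actual protocol they would come from the model's activations), a Gram matrix of value directions can be computed before and after alignment and compared directly:

```python
import numpy as np

def gram(directions):
    """Gram matrix of unit-normalised value directions: entry (i, j) is
    the cosine between value i's direction and value j's direction."""
    V = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return V @ V.T

# Hypothetical probed directions for four values, before and after alignment.
rng = np.random.default_rng(2)
before = rng.normal(size=(4, 32))
after = before + 0.01 * rng.normal(size=(4, 32))  # alignment barely moves them

drift = np.linalg.norm(gram(after) - gram(before), ord="fro")
print(f"Frobenius drift in the Gram matrix: {drift:.4f}")
```

A drift that is small relative to the matrix norm is the geometric signature of the finding: the behaviour changed while the geometry stayed put.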
The value structure is set during training — by the corpus, the architecture, and the training objective. Alignment is a thin behavioural veneer layered on top. It shapes what the model says. It doesn’t much change what the model is.
Think of it as a landscape with a thin coat of paint labelled “alignment.” You can re-paint as many times as you like. The landscape underneath doesn’t change shape. The hills and valleys are where they were before you started painting. They’re where the training put them.
What this means
If alignment is a veneer and the real values are set by training, then the policies we build around AI have to change accordingly.
Certifying the alignment method is insufficient. It’s common today to evaluate AI safety by asking which alignment technique was used — RLHF, DPO, Constitutional AI. The finding above says this isn’t enough. Two models aligned with the same technique can have wildly different underlying value geometries, because their corpora, architectures, or objectives differed. The alignment technique is one variable among many, and not the most important one.
You need to inspect the training pipeline. To understand a model’s value geometry, you have to look at what shaped it: what corpus it trained on, what architecture it uses, what objective it was optimised against. These decisions set the landscape. Alignment can’t correct landscape-level decisions — it can only paint over them.
You need to monitor the geometry, not just outputs. Behavioural evaluation — what the model says in response to test prompts — can be misleading. It samples from the veneer. A model can produce aligned outputs in evaluation while carrying value geometry that drives different behaviour in production. To know what’s really there, you have to measure the geometry itself: the causal Gram matrix, the probes, the drift detection, the causal intervention.
This is what the mathematics series produces. It’s not a replacement for behavioural evaluation — it’s a complement. Behaviour tells you about the paint. Geometry tells you about the landscape.
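Continuous monitoring of the geometry can be sketched as a simple drift check. The function names and the threshold are placeholders for illustration, not part of the protocol as published:

```python
import numpy as np

def geometry_drift(phi_baseline, phi_current):
    """Relative Frobenius distance between two Gram matrices."""
    return float(np.linalg.norm(phi_current - phi_baseline, ord="fro")
                 / np.linalg.norm(phi_baseline, ord="fro"))

def check_deployment(phi_baseline, phi_current, threshold=0.05):
    """Flag the model for review if its value geometry has moved more
    than the (assumed, illustrative) threshold since certification."""
    drift = geometry_drift(phi_baseline, phi_current)
    return {"drift": drift, "flagged": drift > threshold}

# Usage with hypothetical 4x4 Gram matrices:
phi0 = np.eye(4)
phi1 = np.eye(4) + 0.01 * np.ones((4, 4))
result = check_deployment(phi0, phi1)
print(result)
```

The design choice mirrors the paint-and-landscape distinction: behavioural tests sample outputs, while a check like this watches the landscape itself for movement.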
We’ve defined what a value system is (Part 1), mapped where AI value systems sit in relation to human values (Part 2), and traced what actually sets a model’s values (Part 3). Next: if training sets the geometry, does model size change what can fit in it? Big models vs small models — what each can and can’t hold.
Links:
📄 Geometry of Trust Paper
💻 Lecture Playlist
📄 Lecture Notes
💻 Open-source Rust implementation
🏢 Synoptic Group CIC, Hull, UK

