Mechanistic interpretability
Looking inside the model
Once V13.2 hit its plateau, we ran a sister project: a battery of probing experiments asking what the network has actually learned. Five experiments, ~600–2000 board states each. Here are the ones that changed our mental model.
EXPERIMENT 1 · CHANNEL ABLATION
The model leans on Token 3 more than the others
Zero out one input channel at a time and measure how much the policy distribution shifts (KL divergence). On 600 stratified states with the multi-legal filter — i.e. only states where the network actually has a choice — Token 3's channel comes out at KL ≈ 0.85, more than 2× the next-highest token (T1 at 0.38).
This was a surprise. The four tokens are interchangeable under the rules, so a symmetric encoder shouldn't single one out. The asymmetry is real and points at the input encoding itself — directly motivating V13.5's token-symmetric collapse experiment in the queue above.
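The ablation loop itself is short. A minimal sketch in PyTorch, assuming a policy network `model` that maps a batch of (C, H, W) board encodings to move logits; the names are illustrative, not our exact code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def channel_ablation_kl(model, boards, channel):
    """KL(p_full || p_ablated) per state when one input channel is zeroed.

    boards: (N, C, H, W) tensor of encoded states; channel: index to zero.
    Assumes model(boards) returns policy logits over moves.
    """
    logp_full = F.log_softmax(model(boards), dim=-1)

    ablated = boards.clone()
    ablated[:, channel] = 0.0          # zero out the channel under test
    logp_abl = F.log_softmax(model(ablated), dim=-1)

    # pointwise p_full * (log p_full - log p_abl); sum over actions
    kl = F.kl_div(logp_abl, logp_full, log_target=True, reduction="none")
    return kl.sum(dim=-1)              # (N,) KL per board state

# Sweep every input channel and rank by mean policy shift:
# shifts = {c: channel_ablation_kl(model, boards, c).mean().item()
#           for c in range(boards.shape[1])}
```

Averaging this over the 600 filtered states gives the per-channel numbers quoted above.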
EXPERIMENT 3 · LINEAR PROBES
What concepts the 128-dim feature vector encodes
Train a logistic regression on the network's GAP features to decode hand-labelled concepts. Numbers are balanced accuracy on a held-out test set; the baseline is the chance level for that label distribution.
| Concept | Probe accuracy | Chance baseline | Verdict |
|---|---|---|---|
| Game phase (early / mid / late) | 79% | 33% | encoded |
| Number of tokens out of base | 75% | 33% | encoded |
| Will I win this game? | 73% | 51% | encoded |
| Closest token to home | 35% | 38% | not encoded |
| Home-stretch token count | 48% | 73% | anti-encoded |
Strategic context (phase, lead, who's winning) lives clearly in the features. Per-token spatial concepts (which token is closest, how many are nearly home) don't — the network appears to re-derive these from the input each forward pass rather than maintaining them in the residual stream.
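The probe setup is a few lines of scikit-learn. A sketch, assuming the 128-dim GAP features have already been dumped to an array; the post's baselines are distribution-aware, so the sketch estimates chance empirically with a stratified dummy classifier rather than hard-coding 1/K:

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

def probe_concept(features, labels, seed=0):
    """Linear probe on frozen 128-dim GAP features for one concept.

    features: (N, 128) array; labels: (N,) hand-labelled concept values.
    Returns (probe_acc, chance_acc), both balanced accuracy on the
    held-out split; chance comes from a label-distribution dummy.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.25, stratify=labels, random_state=seed)

    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    dummy = DummyClassifier(strategy="stratified", random_state=seed).fit(X_tr, y_tr)

    return (balanced_accuracy_score(y_te, probe.predict(X_te)),
            balanced_accuracy_score(y_te, dummy.predict(X_te)))

# e.g. acc, chance = probe_concept(gap_features, phase_labels)
```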
EXPERIMENT 2 · DICE SENSITIVITY
It's a reactive lookup, not a planner
Hold the board fixed, sweep the dice channel through 1–6, and watch how the chosen token shifts. In ~78% of states the preferred token flips when the dice value changes, and the Jensen-Shannon divergence between the roll-1 and roll-6 policy distributions is correspondingly large.
The network behaves as f(board, dice) → action, with dice values acting as broadcast modifiers rather than something integrated into a temporal plan. Same pattern across V6, V10, and now V13.2 — annealed PPO didn't change it. Tree-search-style planning would look very different.
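The sweep is a one-board loop. A sketch in PyTorch, where `set_dice` stands in for whatever writes the roll into the input encoding (ours is a dedicated channel); treat it as an assumption and swap in your own:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def dice_sweep(model, board, set_dice):
    """Policy over moves for each dice value 1-6, board held fixed.

    board: (C, H, W) encoded state. set_dice(board, v) writes roll v
    into the encoding (assumed: a dedicated dice channel).
    """
    policies = [F.softmax(model(set_dice(board.clone(), v).unsqueeze(0)),
                          dim=-1).squeeze(0)
                for v in range(1, 7)]
    return torch.stack(policies)       # (6, n_actions)

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two policy distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.add(eps).log() - b.add(eps).log())).sum()
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# pols = dice_sweep(model, board, set_dice)
# flipped = pols.argmax(-1).unique().numel() > 1   # preferred token moved?
# js16 = js_divergence(pols[0], pols[5])           # roll-1 vs roll-6
```

Counting `flipped` over the evaluation set gives the ~78% figure; `js16` is the roll-1 vs roll-6 divergence quoted above.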
EXPERIMENTS 4–5 · CAPACITY USE
Every channel is alive, but few do real work
Layer knockout (skipping individual ResBlocks) and channel activation (which of the 160 channels actually fire) ran as a single pass. Result: zero globally dead channels at any threshold; every channel produces some activation. But channel importance is heavily long-tailed: a handful of channels dominate the policy gradient, and the bulk are weakly redundant.
The model isn't wasting parameters in the obvious sense (no dead neurons), but it's also not packing them densely. There's likely room to compress 10× without losing strength — which is partly why V13.5's smaller-input variant matched V13.2 at one-third the parameter count.
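The activation half of that audit fits in one forward hook. A sketch, assuming standard PyTorch modules; the layer path in the usage comment is hypothetical:

```python
import torch

@torch.no_grad()
def channel_activity(model, layer, boards, threshold=1e-3):
    """Max and mean absolute activation per channel of one conv layer.

    A channel counts as 'dead' if its max activation over the whole
    dataset stays below `threshold`.
    """
    stats = {}

    def hook(_module, _inputs, out):   # out: (N, C, H, W)
        a = out.abs()
        stats["max"] = a.amax(dim=(0, 2, 3))
        stats["mean"] = a.mean(dim=(0, 2, 3))

    handle = layer.register_forward_hook(hook)
    model(boards)
    handle.remove()

    dead = (stats["max"] < threshold).sum().item()
    ranked = stats["mean"].sort(descending=True).values
    return dead, ranked

# Hypothetical layer path; point this at any conv in your trunk:
# dead, ranked = channel_activity(model, model.trunk[3].conv2, boards)
# dead == 0 with a steeply decaying `ranked` is the pattern described above.
```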