Mechanistic interpretability
Looking inside the model
A sister project — a battery of probing experiments asking: what has the network actually learned? Originally built around V13.2 (five experiments, ~600–2000 board states each); now re-run against V13.5 to verify the symmetric-encoder bet is mechanically active, not just numerically tied.
V13.5 · CHANNEL ABLATION RE-RUN
The rank-routing is mechanically real
Re-ran channel ablation on V13.5 with 600 stratified states. The four "Tok→Rank" planes (the constant channels that route the rank-indexed output back to the specific token to move) are the dominant channels globally, with Policy KL of 0.60–0.76, higher than any other channel. In the late game the pattern shifts: own-token-count and the leader-token rank mask take over (KL 0.68–1.01), and safe-zone reasoning becomes critical for landing.
The symmetric-encoder bet isn't just a numerical tie — it's an active mechanism the model uses on almost every decision. Phase-specialized policy: rank routing in opening, leader-token + safe-zone reasoning in endgame. The kind of structure V13.5 was designed to expose.
V13.2 · CHANNEL ABLATION (original finding)
The model leans on Token 3 more than the others
Zero out one input channel at a time and measure how much the policy distribution shifts (KL divergence). On 600 stratified states with the multi-legal filter — i.e. only states where the network actually has a choice — Token 3's channel comes out at KL ≈ 0.85, more than 2× the next-highest token (T1 at 0.38).
This was a surprise. The four tokens are interchangeable under the rules, so a symmetric encoder shouldn't single one out. The asymmetry was real and pointed at the input encoding itself — directly motivating V13.5's token-symmetric collapse, which is now deployed.
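The ablation loop described above can be sketched in a few lines. Everything here is a stand-in: `policy` is a toy linear-softmax head, not the real network, and `ablate_channel` just zeroes one input plane; only the KL-per-channel bookkeeping mirrors the experiment.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two policy distributions."""
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def ablate_channel(state, channel):
    """Zero out one input plane of a C x H x W board encoding."""
    s = state.copy()
    s[channel] = 0.0
    return s

# Hypothetical stand-in for the real policy network: a fixed linear head.
rng = np.random.default_rng(0)
C, H, W, A = 8, 4, 4, 6                     # channels, board dims, actions
weights = rng.normal(size=(A, C * H * W))

def policy(state):
    return softmax(weights @ state.ravel())

# Mean policy KL per ablated channel over a batch of stratified states.
states = rng.normal(size=(16, C, H, W))
mean_kl = [
    np.mean([kl_divergence(policy(s), policy(ablate_channel(s, c)))
             for s in states])
    for c in range(C)
]
ranking = np.argsort(mean_kl)[::-1]         # most policy-relevant channels first
```

In the real experiment the batch would be the 600 multi-legal-filtered states and `policy` the trained network's forward pass; the ranking of `mean_kl` is what singles out channels like Token 3's plane.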
EXPERIMENT 3 · LINEAR PROBES
What concepts the 128-dim feature vector encodes
Train a logistic regression on the network's GAP features to decode hand-labelled concepts. Numbers are balanced accuracy on a held-out test set; the baseline is chance level for each label distribution.
- Game phase (early / mid / late): 79% vs 33% baseline
- Number of tokens out of base: 75% vs 33%
- Will I win this game? 73% vs 51%
- Closest token to home: 35% vs 38% (not encoded)
- Home-stretch token count: 48% vs 73% (anti-encoded)
Strategic context (phase, lead, who's winning) lives clearly in the features. Per-token spatial concepts (which token is closest, how many are nearly home) don't — the network appears to re-derive these from the input each forward pass rather than maintaining them in the residual stream.
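A minimal sketch of the probing setup, assuming scikit-learn. The features and labels here are synthetic (signal is injected by hand so the probe has something to find); the real probes would use the network's 128-dim GAP activations and hand-labelled concepts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in for 128-dim GAP features and a 3-way phase label.
n, d = 600, 128
X = rng.normal(size=(n, d))
y = rng.integers(0, 3, size=n)      # e.g. early / mid / late
X[np.arange(n), y] += 2.0           # inject a linearly decodable signal

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
score = balanced_accuracy_score(y_te, probe.predict(X_te))
chance = 1.0 / 3.0                  # baseline for a balanced 3-way label
```

A concept is read as "encoded" when `score` clears `chance` by a wide margin, "not encoded" when it sits at chance, and "anti-encoded" when it falls below.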
EXPERIMENT 2 · DICE SENSITIVITY
It's a reactive lookup, not a planner
Hold the board fixed, sweep the dice channel through 1–6, see how the chosen token shifts. ~78% of states flip the preferred token when the dice value changes — and the JS divergence between roll-1 and roll-6 distributions is large.
The network behaves as f(board, dice) → action, with dice values acting as broadcast modifiers rather than something integrated into a temporal plan. Same pattern across V6, V10, and now V13.2 — annealed PPO didn't change it. Tree-search-style planning would look very different.
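The dice sweep can be sketched as follows, again with a toy linear policy standing in for the real network (`policy`, `W_board`, `w_dice` are all hypothetical); only the flip-rate and JS-divergence measurements mirror the experiment.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two distributions (<= ln 2)."""
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical stand-in for the policy: board features plus a dice input.
rng = np.random.default_rng(0)
A, D = 4, 16                        # four tokens, board feature dim
W_board = rng.normal(size=(A, D))
w_dice = rng.normal(size=A)

def policy(board, dice):
    return softmax(W_board @ board + w_dice * dice)

# Hold each board fixed, sweep the dice 1-6, count preference flips.
boards = rng.normal(size=(200, D))
flips = sum(
    len({int(np.argmax(policy(b, d))) for d in range(1, 7)}) > 1
    for b in boards)
flip_rate = flips / len(boards)
js_1_vs_6 = np.mean([js_divergence(policy(b, 1), policy(b, 6))
                     for b in boards])
```

A high `flip_rate` plus large `js_1_vs_6` is the signature of a reactive f(board, dice) lookup; a planner that commits to a token before the roll would show low values on both.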
EXPERIMENTS 4–5 · CAPACITY USE
Every channel is alive, but few do real work
Layer-knockout (skipping individual ResBlocks) and channel-activation analysis (checking which of the 160 channels actually fire) ran together. Result: zero globally dead channels at any threshold; every channel produces some activation. But channel importance is heavily long-tailed: a handful of channels dominate the policy gradient, and the bulk are weakly redundant.
The model isn't wasting parameters in the obvious sense (no dead neurons), but it's also not packing them densely. There's likely room to compress 10× without losing strength — which is partly why V13.5's smaller-input variant matched V13.2 at one-third the parameter count.
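The two measurements above reduce to simple statistics over an activation matrix. This sketch uses synthetic activations drawn with a heavy-tailed per-channel scale to illustrate the "no dead channels, long-tailed importance" pattern; the thresholds and the importance proxy (mean absolute activation) are assumptions, not the project's exact definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_channels = 600, 160

# Synthetic activations: every channel fires at least a little, but the
# per-channel scale is heavy-tailed (Pareto), so importance concentrates.
scale = np.sort(rng.pareto(1.5, size=n_channels) + 0.05)[::-1]
acts = np.abs(rng.normal(size=(n_states, n_channels))) * scale

# Dead-channel check: a channel is dead if it never exceeds the threshold.
dead = int(np.sum(acts.max(axis=0) < 1e-6))

# Long-tail check: cumulative share of total importance, largest first.
importance = acts.mean(axis=0)
share = np.sort(importance)[::-1].cumsum() / importance.sum()
top10_share = share[int(0.1 * n_channels) - 1]   # mass in the top 10%
```

Here `dead == 0` with `top10_share` well above 10% is exactly the regime described: no wasted neurons, but importance concentrated enough that aggressive compression should be cheap.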