An eight-month side project · Oct 2025 → May 2026

AlphaLudo

A neural network that learned to play Ludo from scratch.

3 million parameters trained over eight months and dozens of experiments — now running entirely inside your browser. No server, no API calls, no waiting. Pick up the dice and see if you can beat it.

Inspired by AlphaGo · TD-Gammon · AlphaZero

8mo of iteration
36 experiments tried
3M parameters in the model
~80% wins vs scripted opponents
51.7% wins vs the previous AlphaLudo (V13.5 vs V13.2, 3,000-game H2H)

AI's predicted winner: 50% · 50%

You vs AlphaLudo
Your Turn

    Click to roll

    or press Space

      Behind the model

      Eight months of iteration.

      From a naive baseline that couldn't tell its own four tokens apart, to a 3-million-parameter network that beats every earlier version of itself. A short tour of how we got there — and what didn't work along the way.

      From V1 to V13.2

      The architecture timeline

      1. Oct 2025 · The first try: The naive baseline

        The model saw the board as eight stacked black-and-white maps — but it couldn't tell its own four tokens apart. They all collapsed into one blob, so the AI was guessing which piece to move. It lost a lot.

      2. Dec 2025 · Hand-holding: Engineered features

        We started feeding the network "tactical hints" computed by hand — danger maps, capture opportunities, safe landing squares. It got better, but plateaued at 73–77% wins against scripted opponents.

      3. Mar 2026 · Attention: Breaking through the plateau

        Added a "token attention" layer — letting the network reason about its four tokens as separate entities, with awareness of how often each had been ignored. First model to consistently win >80% against scripted opponents.

      4. Apr 2026 · Previous: Less is more (V13.2)

        Stripped most of the hand-engineered features back out and gave the network mostly raw board positions plus 3 static board hints (safe cells, home stretches). Beat every earlier version of itself by a small but statistically real edge. Held this site for two weeks as the strongest version — until V13.5.

      5. May 2026 · Current: The four tokens are interchangeable (V13.5)

        Every encoder so far had given each of the four own tokens its own input channel — forcing the network to learn from scratch that the rules treat them identically. V13.5 collapses them into a single count-per-cell channel and re-routes the model's "which token to move" output through a rank-indexed gather. Same parameter budget, same training pipeline, same opponent pool — 51.7% wins over 3,000 games vs V13.2 (95% CI ±1.8pp), and 90.4% vs the competing V13.4 temporal experiment. The first version to clear the V13-class plateau. This is the model you play against.
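
        In code, the encoder change is small. A minimal NumPy sketch with an invented board size and helper names (the real layout, channel order, and network aren't shown); the gather is written here as its equivalent scatter of rank scores onto token IDs:

```python
import numpy as np

NUM_CELLS = 58     # hypothetical flattened board size, not the real layout
NUM_TOKENS = 4

def encode_own_tokens(token_cells):
    """Old encoders: one binary plane per token (four planes, order-sensitive).
    V13.5-style: a single count-per-cell plane, invariant to token relabelling."""
    plane = np.zeros(NUM_CELLS)
    for cell in token_cells:
        plane[cell] += 1.0                 # two tokens on one cell -> count 2
    return plane

def pick_token(rank_logits, token_progress, legal_mask):
    """Rank-indexed routing: the network scores ranks (most-advanced to
    least-advanced), and we translate the chosen rank back to a token ID."""
    order = np.argsort(-np.asarray(token_progress))   # token IDs sorted by rank
    token_logits = np.full(NUM_TOKENS, -np.inf)
    token_logits[order] = rank_logits                 # scatter rank scores onto tokens
    token_logits[~np.asarray(legal_mask)] = -np.inf   # mask illegal moves
    return int(np.argmax(token_logits))

# Toy usage: tokens at cells 3, 3, 17, 40; the most-advanced token is illegal.
plane = encode_own_tokens([3, 3, 17, 40])
tok = pick_token(rank_logits=np.array([2.0, 0.5, -1.0, -1.0]),
                 token_progress=[3, 3, 17, 40],
                 legal_mask=[True, True, True, False])
print(plane[3], tok)   # 2.0 2 -- falls back to the next-ranked legal token
```

        Because both the count plane and the rank ordering are invariant to relabelling the four tokens, the symmetry the rules guarantee is built into the input and output rather than learned.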

      End-to-end

      How AlphaLudo learns

      1. Bootstrap

        Generate millions of practice games between scripted bots — heuristic, aggressive, defensive, expert. The network learns by watching them play.

      2. Imitate the best teacher

        The new student network is trained to copy the previous best AlphaLudo's decisions. By the end of this stage it already plays as well as the teacher.

      3. Self-play reinforcement

        The student plays thousands of games against itself and various opponents, gradually adjusting its strategy to win more. Once it's consistently strong, we add the previous AlphaLudo versions back in as sparring partners.

      4. Fix the bad habits

        Watching it play, we noticed specific failure modes — leaving a laggard token at base, walking into capture range. Reward penalties were added to discourage these, then tuned by trial and error.

      5. The honest test

        Win rate against scripted bots saturates around 80%, so it stops being useful as a measure. We compare versions directly — 10,000 games each, head to head. That's the only test that distinguishes the strongest models.
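
        For a sense of what those head-to-head counts can resolve, the 95% interval on a win rate is one line of textbook statistics (this is generic, not the project's eval code; 1,551 wins is simply 51.7% of 3,000):

```python
import math

def winrate_ci(wins, games, z=1.96):
    """Normal-approximation 95% confidence interval for a head-to-head win rate."""
    p = wins / games
    half_width = z * math.sqrt(p * (1 - p) / games)
    return p, half_width

p, half = winrate_ci(wins=1551, games=3000)
print(f"{p:.1%} ± {half:.1%}")   # 51.7% ± 1.8% -- a thin but measurable edge
```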

      From the journal

      Three lessons we won't unlearn

      Failed

      "Mathematically clean" rewards can be poison

      An early reward-shaping scheme looked elegant on paper but quietly subtracted a tiny amount of reward every turn. Over a 150-move game it added up to about a fifth of a "loss" — the model became convinced every game was unwinnable. Took 155,000 games to figure out what was happening.

      In long games, even tiny systematic biases compound. Always check what the reward looks like end-to-end.
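
      The arithmetic is worth seeing once. The per-turn constant below is hypothetical (the run's exact value isn't quoted above), chosen only to reproduce the magnitudes in the story:

```python
# Hypothetical numbers matching the rough magnitudes described above.
per_turn_penalty = -0.0013   # a "tiny" shaping term paid on every move
game_length = 150            # moves in a typical long game
loss_reward = -1.0           # terminal signal for losing

accumulated = per_turn_penalty * game_length
print(round(accumulated, 3))                 # -0.195: about a fifth of a loss,
print(round(accumulated / loss_reward, 3))   # paid in every game, won or lost
```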

      Worked

      Loud rewards beat clean rewards

      We tried scaling intermediate rewards down by 5×, reasoning that the final win/loss signal should dominate. Win rate cratered from 67% to 33% over 125,000 games. In a dice game, the random variance is so loud that quiet signals simply get drowned out.

      In stochastic games, intermediate rewards must be loud enough to cut through the dice noise — not just mathematically pretty.

      Worked

      The encoder was the bottleneck — not the model

      For most of the project we believed the 80–83% plateau was about the opponent pool. Three architectural designs (CNN + attention, pure CNN, a temporal transformer over 8-turn history) all sat at the same ceiling. Then V13.5 broke it — by attacking the input instead of the model. Collapsing the four own-token channels into a single permutation-symmetric count view (and matching it with a rank-indexed output) was the unlock. Same parameters, same training pipeline.

      When three different architectures hit the same ceiling, the bottleneck is upstream of the architecture. We were giving the model an asymmetric view of a symmetric game.

      Mechanistic interpretability

      Looking inside the model

      A sister project — a battery of probing experiments asking: what has the network actually learned? Originally built around V13.2 (five experiments, ~600–2000 board states each); now re-run against V13.5 to verify the symmetric-encoder bet is mechanically active, not just numerically tied.

      V13.5 · CHANNEL ABLATION RE-RUN

      The rank-routing is mechanically real

      Re-ran channel-ablation on V13.5 with 600 stratified states. The four "Tok→Rank" planes (the constant channels that route the rank-indexed output back to which token to actually move) are the dominant channels globally — Policy KL 0.60–0.76, higher than any other channel. In late-game the pattern shifts: own-token-count and the leader-token rank mask take over (KL 0.68–1.01), and safe-zone reasoning becomes critical for landing.

      The symmetric-encoder bet isn't just a numerical tie — it's an active mechanism the model uses on almost every decision. Phase-specialized policy: rank routing in opening, leader-token + safe-zone reasoning in endgame. The kind of structure V13.5 was designed to expose.

      V13.2 · CHANNEL ABLATION (original finding)

      The model leans on Token 3 more than the others

      Zero out one input channel at a time and measure how much the policy distribution shifts (KL divergence). On 600 stratified states with the multi-legal filter — i.e. only states where the network actually has a choice — Token 3's channel comes out at KL ≈ 0.85, more than 2× the next-highest token (T1 at 0.38).

      This was a surprise. The four tokens are interchangeable under the rules, so a symmetric encoder shouldn't single one out. The asymmetry was real and pointed at the input encoding itself — directly motivating V13.5's token-symmetric collapse, which is now deployed.
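
      The ablation loop itself fits in a few lines. A sketch, with `policy_fn` and `states` standing in for the real network and the multi-legal-filtered state set:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two policy distributions over moves."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def channel_importance(policy_fn, states, num_channels):
    """Zero one input channel at a time; average the policy shift over states."""
    scores = np.zeros(num_channels)
    for s in states:                          # s: (channels, H, W) board encoding
        base = policy_fn(s)                   # baseline policy distribution
        for c in range(num_channels):
            ablated = s.copy()
            ablated[c] = 0.0                  # knock out channel c
            scores[c] += kl_divergence(base, policy_fn(ablated))
    return scores / len(states)               # one mean-KL score per channel
```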

      EXPERIMENT 3 · LINEAR PROBES

      What concepts the 128-dim feature vector encodes

      Train a logistic regression on the network's GAP features to decode hand-labelled concepts. Numbers are balanced accuracy on a held-out test set; the baseline is chance level for that label distribution.

      • Game phase (early / mid / late): 79% vs 33% baseline
      • Number of tokens out of base: 75% vs 33%
      • Will I win this game? 73% vs 51%
      • Closest token to home: 35% vs 38% — not encoded
      • Home-stretch token count: 48% vs 73% — anti-encoded

      Strategic context (phase, lead, who's winning) lives clearly in the features. Per-token spatial concepts (which token is closest, how many are nearly home) don't — the network appears to re-derive these from the input each forward pass rather than maintaining them in the residual stream.
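
      The probing recipe is standard scikit-learn. A sketch, where `gap_features` (n_states × 128) and `labels` stand in for the real extracted features and hand labels:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

def probe(gap_features, labels):
    """Can a linear readout decode a concept from the 128-dim GAP features?"""
    X_tr, X_te, y_tr, y_te = train_test_split(
        gap_features, labels, test_size=0.25, stratify=labels, random_state=0)
    clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    # Balanced accuracy, so skewed labels don't flatter the probe.
    return balanced_accuracy_score(y_te, clf.predict(X_te))
```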

      EXPERIMENT 2 · DICE SENSITIVITY

      It's a reactive lookup, not a planner

      Hold the board fixed, sweep the dice channel through 1–6, see how the chosen token shifts. ~78% of states flip the preferred token when the dice value changes — and the JS divergence between roll-1 and roll-6 distributions is large.

      The network behaves as f(board, dice) → action, with dice values acting as broadcast modifiers rather than something integrated into a temporal plan. Same pattern across V6, V10, and now V13.2 — annealed PPO didn't change it. Tree-search-style planning would look very different.
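
      A sketch of that sweep, with `policy_fn` and `set_dice` as placeholders for the real network and for however the encoder writes the dice value into the input:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def dice_sensitivity(policy_fn, set_dice, states):
    """Hold the board fixed, sweep dice 1-6: how often does the preferred
    token flip, and how far apart are the roll-1 and roll-6 policies?"""
    flips, js = 0, []
    for s in states:
        policies = [policy_fn(set_dice(s, d)) for d in range(1, 7)]
        flips += len({int(np.argmax(p)) for p in policies}) > 1
        js.append(jensenshannon(policies[0], policies[-1]) ** 2)  # distance^2 = divergence
    return flips / len(states), float(np.mean(js))
```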

      EXPERIMENTS 4–5 · CAPACITY USE

      Every channel is alive, but few do real work

      Layer-knockout (skip individual ResBlocks) and channel-activation (which of the 160 channels actually fire) ran together. Result: 0 globally dead channels at any threshold — every channel produces some activation. But channel-importance is heavily long-tailed: a handful of channels dominate the policy gradient, and the bulk are weakly redundant.

      The model isn't wasting parameters in the obvious sense (no dead neurons), but it's also not packing them densely. There's likely room to compress 10× without losing strength — which is partly why V13.5's smaller-input variant matched V13.2 at one-third the parameter count.
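
      The liveness half of that is a single reduction over recorded activations. A sketch, with `acts` as a placeholder array of post-ReLU trunk activations:

```python
import numpy as np

def channel_liveness(acts, thresholds=(1e-3, 1e-2, 1e-1)):
    """acts: (states, channels, H, W). A channel is 'dead' at a threshold
    if its activation never exceeds that threshold on any state."""
    peak = acts.max(axis=(0, 2, 3))          # peak activation per channel
    for t in thresholds:
        print(f"threshold {t:g}: {int((peak < t).sum())} dead channels")
    return np.argsort(-peak)                 # channels ranked by peak activation
```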

      As of May 11, 2026

      What's shipped, what didn't work, what's next

      Updated as each run lands a verdict.

      DEPLOYED · this site

      V13.5 · token-symmetric encoder

      The four own tokens are interchangeable under the rules, but every prior encoder gave each its own input channel. V13.5 collapses them to a single count-per-cell view; the output is re-routed via rank index to recover token IDs at decision time. Same parameter budget (~3M), same training pipeline as V13.2.

      Head-to-head results: 51.7% / 48.3% vs V13.2 (this site's previous version) over 3,000 games, 95% CI ±1.8pp · 53.5% vs V12.2 over 1,000 · 90.4% vs the V13.4 temporal experiment over 2,000. First model in the project to clear the V13-class plateau.

      DIDN'T WORK · resolved

      V13.4 · temporal transformer over 8-turn history

      4-layer transformer over the last 8 turns of game state, on top of the V13.2 CNN trunk. The hypothesis was that opponent-pattern signal in recent history would add value beyond a stateless single-frame view.

      SL plateaued at the same 80–82% as every other architecture; RL on top didn't move it. Final verdict from a clean 2,000-game head-to-head vs V13.5: V13.4 won 9.6%. Adding history added no signal — possibly because the dice randomness means the relevant context is already fully captured in the current state.

      NEXT · search teacher

      Depth-1 expectimax as auxiliary training target

      Across V13.2, V13.5_SL, and V13.5_RL the same 84–85% in-pool eval ceiling has held — a strong signal that no opponent in the current pool can teach the model anything new. The infrastructure for a search-based teacher signal (auxiliary KL between the policy and a depth-1 expectimax target) already exists in the trainer.

      Next experiment: turn it on for a fraction of moves, see if it provides the "stronger than self" gradient signal the pool can't.
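
      A sketch of what that target could look like: depth-1 expectimax over the six dice outcomes, softmaxed into a distribution for the policy to be pulled toward. `step_fn` and `value_fn` are placeholder engine/network hooks, not the trainer's real API:

```python
import numpy as np

def expectimax_target(state, legal_moves, step_fn, value_fn, tau=1.0):
    """For each legal move, average the network's value over all six dice
    outcomes that follow, then softmax the move values into a target policy."""
    q = np.array([
        np.mean([value_fn(step_fn(state, move, dice)) for dice in range(1, 7)])
        for move in legal_moves
    ]) / tau
    target = np.exp(q - q.max())             # numerically stable softmax
    return target / target.sum()             # KL the policy toward this
```

      The auxiliary term would then be a KL between this target and the policy on the chosen fraction of moves, added to the usual PPO loss.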

      PARKED · compute-bound

      Full MCTS / AlphaZero from random init

      Tried a stripped-down variant — shallow expectimax + the current network as leaf evaluator, distilled into a fresh student. Student lost 89/10 to its teacher. That's not a refutation; it's a sign that 2-ply search over the same network we're trying to beat doesn't generate strong-enough targets. Full AlphaZero (deep MCTS + millions of self-play from random init) is the right test, parked until there's a compute budget for it.

      Want to see the model in action?

      ▶ Play AlphaLudo

      Inspirations & dead ends

      The lineage.

      AlphaLudo borrowed liberally and parked one idea for compute reasons. Here's the full reading list, in chronological order.

      Inspiration · 2016 · DeepMind

      AlphaGo

      The whole project started here. AlphaLudo borrows the AlphaGo recipe almost wholesale — a network that predicts both the best move and how likely you are to win, trained first by imitating a strong teacher and then by playing millions of games against itself.

      If you've never seen the documentary, watch it. It's still the best one-hour explanation of why this whole field exists.

      1992 · IBM · Tesauro

      TD-Gammon

      The original "neural net plays a dice game at world-class level" result, written when most of the modern field didn't exist yet. More than thirty years later, AlphaLudo rediscovered Tesauro's central lesson the hard way: in dice games, the small rewards along the way matter more than the final win/loss signal. Scale them down too far and learning collapses.

      Read on Wikipedia →

      2017 · Tried, then rejected

      AlphaZero

      AlphaZero's full recipe — train from a random network, generate millions of self-play games with deep MCTS search at every move, distill the search-improved policy back into the network — is the obvious next step after AlphaGo. We didn't run that loop. The blocker isn't the idea; it's the compute. Generating millions of search-augmented games on a single GPU is months of wall-clock time we don't have.

      We did try a stripped-down variant: shallow expectimax search using our best existing network as the leaf evaluator, distilled into a fresh student. That student lost 89 / 10 to its teacher — not a refutation of AlphaZero, but a sign that 2-ply search over the same network we're trying to beat doesn't generate strong-enough targets. With a stronger leaf evaluator (or a real compute budget for full self-play search), it's still on the table.

      Original paper →

      2017 · Zaheer et al.

      DeepSets

      A 2017 idea that lets a neural network reason about a "set" of things — like the four tokens you control — without caring what order they're in. We used this for one experiment in AlphaLudo, building a much smaller network with no convolutional layers at all. It hit the same ceiling as the bigger models, which is what convinced us the model itself wasn't the limit.

      arXiv:1703.06114 →

      2017 · Schulman et al.

      PPO

      The reinforcement-learning algorithm doing the heavy lifting in every AlphaLudo run. PPO is the workhorse of modern RL — boring, reliable, well-understood. We didn't try to be clever with the optimiser; the interesting part of AlphaLudo is what we feed into it, not how we update the weights.

      arXiv:1707.06347 →

      Project meta

      About AlphaLudo.

      An eight-month side project on what it actually takes to learn Ludo from raw self-play. Built end-to-end — engine, training, mech-interp, and this site.

      Runtime

      How this page works

      • Game engine: hand-written C++ compiled to WebAssembly via Emscripten
      • Inference: ONNX Runtime Web (single-threaded WASM build)
      • Frontend: vanilla ES modules, no framework, no bundler in dev
      • Hosting: Cloudflare Pages (static), no server, no telemetry
      • Total payload: ~50 MB, dominated by the ONNX model + ORT runtime

      Model

      What the AI is

      • ~3 million parameters, all running locally on your machine
      • Token-symmetric ResNet — 10 layers × 128 channels, with a rank-indexed output head (sketched after this list)
      • Sees own and opp token counts per cell (not per-token), the dice value, a few static board markers, plus rank-routing planes
      • Four outputs: which token to move, an estimate of who's winning, how long the game has left, and per-token progress toward home
      • Trained by SL distillation from earlier V13.x models, then PPO self-play with a small progress-shaping reward
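
      For the curious, a PyTorch-flavoured sketch of that shape. Channel, depth, and head counts follow the list above; the input plane count is a placeholder, and the spatial size is free because the trunk is fully convolutional up to the global average pool:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        self.c1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.c2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return torch.relu(x + self.c2(torch.relu(self.c1(x))))

class AlphaLudoNet(nn.Module):
    """Token-symmetric trunk + four heads, per the list above.
    IN_PLANES is a placeholder, not the real input encoding."""
    IN_PLANES = 16

    def __init__(self, ch=128, blocks=10):
        super().__init__()
        self.stem = nn.Conv2d(self.IN_PLANES, ch, 3, padding=1)
        self.trunk = nn.Sequential(*[ResBlock(ch) for _ in range(blocks)])
        self.gap = nn.AdaptiveAvgPool2d(1)    # -> the 128-dim GAP features
        self.policy = nn.Linear(ch, 4)        # which rank to move
        self.value = nn.Linear(ch, 1)         # who's winning
        self.length = nn.Linear(ch, 1)        # how long the game has left
        self.progress = nn.Linear(ch, 4)      # per-token progress toward home

    def forward(self, x):
        h = self.gap(self.trunk(torch.relu(self.stem(x)))).flatten(1)
        return self.policy(h), self.value(h), self.length(h), self.progress(h)
```

      Ten blocks of two 3×3, 128-channel convolutions come to roughly 2.95M parameters on their own, which is where the ~3M figure lives.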

      By the numbers

      Eight months of training, summarised

      ~14M games of self-play across all versions
      ~10M teacher games used for imitation
      36 labelled experiments
      14 distinct architectures tried
      8 generations of input encoder
      3 major dead ends documented

      Privacy

      What we collect

      Nothing. There is no backend. Your moves never leave your browser. The page loads Google Fonts and (on the Lineage page) one YouTube embed via youtube-nocookie.com; that's the only third-party traffic. No analytics, no cookies, no telemetry.

      Ready?

      Play the model

      The network is already loaded. Pick up the dice.

      ▶ Play AlphaLudo