An eight-month side project · Oct 2025 → May 2026

AlphaLudo

A neural network that learned to play Ludo from scratch.

3 million parameters trained over eight months and dozens of experiments — now running entirely inside your browser. No server, no API calls, no waiting. Pick up the dice and see if you can beat it.

Inspired by AlphaGo · TD-Gammon · AlphaZero

8mo of iteration
36 experiments tried
3M parameters in the model
~80% wins vs scripted opponents
~52% wins vs the previous AlphaLudo (10,000-game test)

AI's predicted winner

You vs AlphaLudo
Your Turn

    Click to roll

    or press Space

      Behind the model

      Eight months of iteration.

      From a naive baseline that couldn't tell its own four tokens apart, to a 3-million-parameter network that beats every earlier version of itself. A short tour of how we got there — and what didn't work along the way.

      From V1 to V13.2

      The architecture timeline

      1. Oct 2025 · The first try: The naive baseline

        The model saw the board as eight stacked black-and-white maps — but it couldn't tell its own four tokens apart. They all collapsed into one blob, so the AI was guessing which piece to move. It lost a lot.

      2. Dec 2025 · Hand-holding: Engineered features

        We started feeding the network "tactical hints" we computed by hand — danger maps, capture opportunities, safe landing squares. It got better, but plateaued at 73–77% wins against scripted opponents and stopped improving.

      3. Mar 2026 · Attention: Breaking through the plateau

        Added a "token attention" layer — letting the network reason about its four tokens as separate entities, with awareness of how often each had been ignored. First model to consistently win >80% against scripted opponents.

      4. May 2026 · Current: Less is more

        We stripped most of the hand-engineered features back out and gave the network mostly raw board positions. It beats every earlier version of itself. Over 10,000 head-to-head games against the previous best AlphaLudo, this version wins about 52% of the time — a small but statistically real edge. This is the model you play against.
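The "tactical hints" of the Dec 2025 encoder (step 2) can be sketched in a few lines. This is an illustrative reconstruction, not the project's actual feature code: the 52-square main track and the example positions are assumptions.

```python
# Sketch of one hand-engineered feature: a "danger map" marking squares an
# opponent token could reach (and capture on) with its next dice roll.
# TRACK_LEN and the flat circular track are simplifying assumptions.

TRACK_LEN = 52  # standard Ludo main loop

def danger_map(opponent_positions):
    """Return a binary map: 1 where an opponent token could land next turn."""
    danger = [0] * TRACK_LEN
    for pos in opponent_positions:
        for roll in range(1, 7):                   # any dice outcome 1..6
            danger[(pos + roll) % TRACK_LEN] = 1
    return danger

dm = danger_map([10, 30])   # squares 11..16 and 31..36 become threatened
```

Features like this helped at first but capped out, which is exactly the plateau step 2 describes: the network could only be as good as the hints.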
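The Mar 2026 "token attention" idea (step 3), in miniature: treat the four own tokens as four embedding vectors and let them attend to one another, so the network reasons about them as separate entities. Dimensions and weights below are made up for demonstration; the real layer's shapes are not documented.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                  # per-token embedding size (assumed)
tokens = rng.normal(size=(4, d))        # one embedding per own token
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Scaled dot-product self-attention over the four tokens.
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
scores = Q @ K.T / np.sqrt(d)           # (4, 4) token-to-token affinities
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
attended = weights @ V                  # each token mixes info from all four
```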

      End-to-end

      How AlphaLudo learns

      1. Bootstrap

        Generate millions of practice games between scripted bots — heuristic, aggressive, defensive, expert. The network learns by watching them play.

      2. Imitate the best teacher

        The new student network is trained to copy the previous best AlphaLudo's decisions. By the end of this stage it already plays as well as the teacher.

      3. Self-play reinforcement

        The student plays thousands of games against itself and various opponents, gradually adjusting its strategy to win more. Once it's consistently strong, we add the previous AlphaLudo versions back in as sparring partners.

      4. Fix the bad habits

        Watching it play, we noticed specific failure modes — leaving a laggard token at base, walking into capture range. Reward penalties were added to discourage these, then tuned by trial and error.

      5. The honest test

        Win rate against scripted bots saturates around 80%, so it stops being useful as a measure. We compare versions directly — 10,000 games each, head to head. That's the only test that distinguishes the strongest models.
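The imitation stage (step 2) is a standard cross-entropy objective: the student is pushed to match the teacher's move distribution. A toy sketch with made-up logits; the real training loop (optimiser, batching) is omitted:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

teacher_logits = np.array([2.0, 0.1, -1.0, 0.5])   # over the 4 movable tokens
student_logits = np.array([0.3, 0.2, 0.1, 0.0])

p_teacher = softmax(teacher_logits)
p_student = softmax(student_logits)

# Cross-entropy H(teacher, student): minimised when the student copies the teacher.
loss = -np.sum(p_teacher * np.log(p_student))
```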
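The penalties of step 4 amount to extra terms bolted onto the reward. The values and trigger conditions below are placeholders chosen for illustration; the real trainer's numbers were tuned by trial and error and aren't published.

```python
# Hedged sketch of hand-tuned reward shaping. Both penalty magnitudes and the
# turn-60 threshold are assumptions, not the project's actual constants.

LAGGARD_PENALTY = -0.05    # token still sitting at base late in the game
EXPOSURE_PENALTY = -0.03   # moving into an opponent's capture range

def shaped_reward(base_reward, tokens_at_base, turn, entered_capture_range):
    r = base_reward
    if tokens_at_base > 0 and turn > 60:   # "laggard at base" failure mode
        r += LAGGARD_PENALTY * tokens_at_base
    if entered_capture_range:              # "walking into capture range"
        r += EXPOSURE_PENALTY
    return r

r = shaped_reward(0.0, tokens_at_base=1, turn=80, entered_capture_range=True)
# sums both penalties: ≈ -0.08
```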
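The 10,000-game figure in step 5 is about statistical power. Under a simple binomial model, that many games shrink the standard error on a win rate to about half a percentage point, so a 52% result sits roughly four standard errors above a coin flip:

```python
import math

n, wins = 10_000, 5_200
p = wins / n
se = math.sqrt(p * (1 - p) / n)   # standard error of the win rate, ≈ 0.005
z = (p - 0.5) / se                # ≈ 4 standard errors above a fair coin
```

At 1,000 games the same 52% would be well within noise, which is why the shorter evals can't separate the strongest models.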

      From the journal

      Three lessons we won't unlearn

      Failed

      "Mathematically clean" rewards can be poison

      An early reward-shaping scheme looked elegant on paper but quietly subtracted a tiny amount of reward every turn. Over a 150-move game it added up to about a fifth of a "loss" — the model became convinced every game was unwinnable. Took 155,000 games to figure out what was happening.

      In long games, even tiny systematic biases compound. Always check what the reward looks like end-to-end.
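The failure above in numbers. The exact leaked value isn't published; the -0.0013 below is chosen only to match the "about a fifth of a loss" figure:

```python
# A tiny per-turn reward leak, compounded over a 150-move game, approaches a
# fifth of the full -1.0 loss signal. per_turn_leak is an illustrative value.

per_turn_leak = -0.0013
game_length = 150
total_bias = per_turn_leak * game_length   # ≈ -0.195, ~20% of a full loss
```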

      Worked

      Loud rewards beat clean rewards

      We tried scaling intermediate rewards 5× smaller, reasoning that the final win/loss signal should dominate. Win rate cratered from 67% to 33% over 125,000 games. In a dice game the random variance is so loud that quiet signals just get drowned out.

      In stochastic games, intermediate rewards must be loud enough to cut through the dice noise — not just mathematically pretty.

      Insight

      The architecture isn't the bottleneck

      We tried three completely different network designs — one with attention, one pure convolutional, one with no spatial structure at all. All three plateaued at the same 80–83% win rate. Whatever's holding us back, it isn't the shape of the model.

      More parameters and fancier layers were never going to help. The ceiling lives somewhere else — probably in the training opponents we have access to.

      Mechanistic interpretability

      Looking inside the model

      Once V13.2 hit its plateau we ran a sister project — a battery of probing experiments to ask: what has the network actually learned? Five experiments, ~600–2000 board states each. Here are the ones that changed our mental model.

      EXPERIMENT 1 · CHANNEL ABLATION

      The model leans on Token 3 more than the others

      Zero out one input channel at a time and measure how much the policy distribution shifts (KL divergence). On 600 stratified states with the multi-legal filter — i.e. only states where the network actually has a choice — Token 3's channel comes out at KL ≈ 0.85, more than 2× the next-highest token (T1 at 0.38).

        This was a surprise. The four tokens are interchangeable under the rules, so a symmetric encoder shouldn't single one out. The asymmetry is real and points at the input encoding itself — directly motivating V13.5's token-symmetric collapse experiment in the queue below.
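The ablation metric can be sketched as follows. The two policy vectors below are stand-ins for real network outputs; only the KL computation itself matches the experiment's description:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete distributions."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

baseline_policy = [0.70, 0.15, 0.10, 0.05]   # over the 4 movable tokens
ablated_policy = [0.25, 0.35, 0.25, 0.15]    # after zeroing one input channel

shift = kl(baseline_policy, ablated_policy)  # large shift ⇒ important channel
```

Averaging this shift over the 600 stratified states, channel by channel, yields the per-channel importance ranking quoted above.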

      EXPERIMENT 3 · LINEAR PROBES

      What concepts the 128-dim feature vector encodes

      Train a logistic regression on the network's GAP features to decode hand-labelled concepts. Numbers are balanced accuracy on a held-out test set; baseline is the chance-level for that label distribution.

      • Game phase (early / mid / late): 79% vs 33% baseline
      • Number of tokens out of base: 75% vs 33% baseline
      • Will I win this game?: 73% vs 51% baseline
      • Closest token to home: 35% vs 38% baseline — not encoded
      • Home-stretch token count: 48% vs 73% baseline — anti-encoded

      Strategic context (phase, lead, who's winning) lives clearly in the features. Per-token spatial concepts (which token is closest, how many are nearly home) don't — the network appears to re-derive these from the input each forward pass rather than maintaining them in the residual stream.
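A toy version of the probing setup. The features and labels below are synthetic (a linearly decodable concept planted in random vectors), standing in for the real 128-dim GAP activations and hand-labelled concepts:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 128
X = rng.normal(size=(n, d))                # stand-in "GAP feature" vectors
true_w = rng.normal(size=d)
y = (X @ true_w > 0).astype(float)         # a concept the features encode

Xtr, Xte, ytr, yte = X[:1500], X[1500:], y[:1500], y[1500:]

# Fit a logistic-regression probe with plain gradient descent.
w = np.zeros(d)
for _ in range(300):
    p = 1 / (1 + np.exp(-(Xtr @ w)))
    w -= 0.1 * Xtr.T @ (p - ytr) / len(ytr)

acc = float(((Xte @ w > 0).astype(float) == yte).mean())
# Well above the ~50% chance baseline ⇒ the concept is linearly decodable.
```

When a concept is *not* present in the features (the "closest token" case above), the same probe lands at or below its chance baseline.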

      EXPERIMENT 2 · DICE SENSITIVITY

      It's a reactive lookup, not a planner

      Hold the board fixed, sweep the dice channel through 1–6, see how the chosen token shifts. ~78% of states flip the preferred token when the dice value changes — and the Jensen–Shannon divergence between roll-1 and roll-6 distributions is large.

      The network behaves as f(board, dice) → action, with dice values acting as broadcast modifiers rather than something integrated into a temporal plan. Same pattern across V6, V10, and now V13.2 — annealed PPO didn't change it. Tree-search-style planning would look very different.
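The sweep-and-count logic looks like this. `policy` is a toy stand-in for the real model (a random linear scorer), so the flip rate it produces is not the reported ~78%; only the measurement procedure mirrors the experiment:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(7, 4))            # fake "network": rows indexed by dice

def policy(board_seed, dice):
    """Toy f(board, dice) -> preferred token index."""
    board_bias = np.random.default_rng(board_seed).normal(size=4)
    return int(np.argmax(board_bias + W[dice]))

states = 100
flips = 0
for seed in range(states):
    choices = {policy(seed, dice) for dice in range(1, 7)}   # sweep dice 1..6
    flips += len(choices) > 1          # preferred token changed with the roll

flip_rate = flips / states             # high ⇒ reactive lookup, not a plan
```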

      EXPERIMENTS 4–5 · CAPACITY USE

      Every channel is alive, but few do real work

      Layer-knockout (skip individual ResBlocks) and channel-activation (which of the 160 channels actually fire) ran together. Result: 0 globally dead channels at any threshold — every channel produces some activation. But channel-importance is heavily long-tailed: a handful of channels dominate the policy gradient, and the bulk are weakly redundant.

      The model isn't wasting parameters in the obvious sense (no dead neurons), but it's also not packing them densely. There's likely room to compress 10× without losing strength — which is partly why V13.5's smaller-input variant matched V13.2 at one-third the parameter count.

      As of May 8, 2026

      Currently in flight

      What's training right now, what's queued, and what's parked. Updated when each run lands a verdict.

      RUNNING — cloud GPU

      V13.4 RL · temporal transformer extension

      Adds a 4-layer transformer over the last 8 turns of game context on top of the V13.2 CNN trunk. Hypothesis: opponent-pattern signal in recent history adds value beyond a stateless single-frame view.

      SL phase finished at the same 80–82% plateau as every other architecture. RL continuation in progress — at 25,000 self-play games, latest 3,000-game eval is 81.1%, right on the SL ceiling. Verdict gate is head-to-head vs the deployed V13.2; chain-1 (9,400 games) tied at 50.6% / 50.2%. Watching the next 50,000 games for any meaningful break.

      QUEUED — local GPU

      V13.5 · token-symmetric encoder

      Different architectural bet from V13.4 — instead of more layers, fewer assumptions. The four own tokens are interchangeable under the rules, but every encoder so far has given each token its own input channel, forcing the network to learn the symmetry from data. V13.5 collapses them to a single count-per-cell channel.
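The encoder change in miniature: four per-token binary channels collapse to one count-per-cell channel, making permutations of the tokens indistinguishable at the input. The 15×15 layout and example positions are illustrative:

```python
import numpy as np

board = np.zeros((4, 15, 15), dtype=np.int8)   # one channel per own token
board[0, 6, 1] = 1                             # tokens 0 and 1 stacked
board[1, 6, 1] = 1
board[2, 8, 13] = 1
board[3, 0, 7] = 1

count_channel = board.sum(axis=0)              # (15, 15): own tokens per cell
# Any permutation of the four tokens now produces the identical input, so the
# network no longer has to learn the symmetry from data.
```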

      Proof-of-concept at one-third the parameters of V13.2 already statistically tied V13.2 in head-to-head play. Matched-capacity full-size run is queued behind V13.4 — if it ties or beats V13.2 at the same parameter count, the symmetry hypothesis is confirmed and it becomes the next baseline.

      PARKED — compute-bound

      MCTS / AlphaZero distillation

      Tried a stripped-down AlphaZero variant — shallow expectimax search using the current best network as the leaf evaluator, distilled into a fresh student. Student lost 89 / 10 to its teacher. That's not a refutation of AlphaZero; it's a sign that 2-ply search over the same network we're trying to beat doesn't generate strong-enough targets. Full AlphaZero (deep MCTS + millions of self-play games from random init) is on the table whenever there's a real compute budget for it.

      Want to see the model in action?

      ▶ Play AlphaLudo

      Inspirations & dead ends

      The lineage.

      AlphaLudo borrowed liberally and parked one idea for compute reasons. Here's the full reading list, in chronological order.

      Inspiration · 2016 · DeepMind

      AlphaGo

      The whole project started here. AlphaLudo borrows the AlphaGo recipe almost wholesale — a network that predicts both the best move and how likely you are to win, trained first by imitating a strong teacher and then by playing millions of games against itself.

      If you've never seen the documentary, watch it. It's still the best one-hour explanation of why this whole field exists.

      1992 IBM · Tesauro

      TD-Gammon

      The original "neural net plays a dice game at world-class level" result, written when most of the modern field didn't exist yet. More than thirty years later, AlphaLudo rediscovered Tesauro's central lesson the hard way: in dice games, the small rewards along the way matter more than the final win/loss signal. Scale them down too far and learning collapses.

      Read on Wikipedia →
      2017 Tried, then rejected

      AlphaZero

      AlphaZero's full recipe — train from a random network, generate millions of self-play games with deep MCTS search at every move, distill the search-improved policy back into the network — is the obvious next step after AlphaGo. We didn't run that loop. The blocker isn't the idea; it's the compute. Generating millions of search-augmented games on a single GPU is months of wall-clock time we don't have.

      We did try a stripped-down variant: shallow expectimax search using our best existing network as the leaf evaluator, distilled into a fresh student. That student lost 89 / 10 to its teacher — not a refutation of AlphaZero, but a sign that 2-ply search over the same network we're trying to beat doesn't generate strong-enough targets. With a stronger leaf evaluator (or a real compute budget for full self-play search), it's still on the table.

      Original paper →
      2017 Zaheer et al.

      DeepSets

      A 2017 idea that lets a neural network reason about a "set" of things — like the four tokens you control — without caring what order they're in. We used this for one experiment in AlphaLudo, building a much smaller network with no convolutional layers at all. It hit the same ceiling as the bigger models, which is what convinced us the model itself wasn't the limit.
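The DeepSets recipe in a few lines: embed each token independently (phi), sum the embeddings (which erases order), then map the pooled vector to an output (rho). Weights and sizes are random placeholders; the point is the invariance:

```python
import numpy as np

rng = np.random.default_rng(3)
W_phi = rng.normal(size=(5, 8))        # per-token features -> embedding
W_rho = rng.normal(size=(8, 4))        # pooled embedding -> policy logits

def deepset(tokens):
    pooled = np.tanh(tokens @ W_phi).sum(axis=0)   # permutation-invariant pool
    return pooled @ W_rho

tokens = rng.normal(size=(4, 5))                   # four tokens, 5 features each
out = deepset(tokens)
out_shuffled = deepset(tokens[[2, 0, 3, 1]])       # reorder the tokens
# out equals out_shuffled: the output cannot depend on token order.
```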

      arXiv:1703.06114 →
      2017 Schulman et al.

      PPO

      The reinforcement-learning algorithm doing the heavy lifting in every AlphaLudo run. PPO is the workhorse of modern RL — boring, reliable, well-understood. We didn't try to be clever with the optimiser; the interesting part of AlphaLudo is what we feed into it, not how we update the weights.
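The heart of PPO is a single expression: the clipped surrogate objective, where `ratio` is pi_new(a|s) / pi_old(a|s) and clipping stops any one update from moving the policy too far. The numbers below are toy values:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective from the PPO paper (to be maximised)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped)   # take the pessimistic of the two

ratios = np.array([0.5, 1.0, 1.5])
advantages = np.array([1.0, 1.0, 1.0])
obj = ppo_clip_objective(ratios, advantages)
# With positive advantage, gains are capped at ratio 1+eps: [0.5, 1.0, 1.2]
```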

      arXiv:1707.06347 →

      Project meta

      About AlphaLudo.

      An eight-month side project on what it actually takes to learn Ludo from raw self-play. Built end-to-end — engine, training, mech-interp, and this site.

      Runtime

      How this page works

      • Game engine: hand-written C++ compiled to WebAssembly via Emscripten
      • Inference: ONNX Runtime Web (single-threaded WASM build)
      • Frontend: vanilla ES modules, no framework, no bundler in dev
      • Hosting: Cloudflare Pages (static), no server, no telemetry
      • Total payload: ~50 MB, dominated by the ONNX model + ORT runtime

      Model

      What the AI is

      • ~3 million parameters, all running locally on your machine
      • Convolutional network — looks at the board as a 15×15 image with extra channels for token positions and dice value
      • Three outputs: which token to move, an estimate of who's winning, and how long the game has left
      • Trained by imitating an earlier AlphaLudo and then sharpening through self-play
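The three-headed layout can be sketched structurally. The conv trunk is replaced by a random projection here, and the 10-channel input count is an assumption; only the head arrangement reflects the description above:

```python
import numpy as np

rng = np.random.default_rng(4)
board = rng.normal(size=(10, 15, 15))            # channels x 15 x 15 input

# Stand-in for the CNN trunk: flatten and project to a 128-dim feature vector.
trunk = rng.normal(size=(10 * 15 * 15, 128))
features = np.tanh(board.reshape(-1) @ trunk)

policy_logits = features @ rng.normal(size=(128, 4))   # which token to move
value = np.tanh(features @ rng.normal(size=(128, 1)))  # who's winning, in [-1, 1]
length = features @ rng.normal(size=(128, 1))          # estimated turns remaining
```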

      By the numbers

      Eight months of training, summarised

      ~14Mgames of self-play across all versions
      ~10Mteacher games used for imitation
      36labelled experiments
      14distinct architectures tried
      8generations of input encoder
      3major dead ends documented

      Privacy

      What we collect

      Nothing. There is no backend. Your moves never leave your browser. The page loads Google Fonts and (on the Lineage page) one YouTube embed via youtube-nocookie.com; that's the only third-party traffic. No analytics, no cookies, no telemetry.

      Ready?

      Play the model

      The network is already loaded. Pick up the dice.

      ▶ Play AlphaLudo