Mechanistic interpretability
Looking inside the model
Once V13.2 hit its plateau, we ran a sister project: a battery of probing experiments asking what the network has actually learned. Five experiments, ~600–2000 board states each. Here are the ones that changed our mental model.
EXPERIMENT 1 · CHANNEL ABLATION
The model leans on Token 3 more than the others
Zero out one input channel at a time and measure how much the policy distribution shifts (KL divergence). On 600 stratified states with the multi-legal filter — i.e. only states where the network actually has a choice — Token 3's channel comes out at KL ≈ 0.85, more than 2× the next-highest token (T1 at 0.38).
This was a surprise. The four tokens are interchangeable under the rules, so a symmetric encoder shouldn't single one out. The asymmetry is real and points at the input encoding itself — directly motivating V13.5's token-symmetric collapse experiment in the queue above.
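The ablation loop itself is short. A minimal sketch in PyTorch, assuming a policy network `model` that maps a batch of (C, H, W) board encodings to move logits; the names are illustrative, not our exact code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def channel_ablation_kl(model, boards, channel):
    """KL(p_full || p_ablated) per state when one input channel is zeroed.

    boards: (N, C, H, W) tensor of encoded states; channel: index to zero.
    Assumes model(boards) returns policy logits over moves.
    """
    logp_full = F.log_softmax(model(boards), dim=-1)

    ablated = boards.clone()
    ablated[:, channel] = 0.0          # zero out the channel under test
    logp_abl = F.log_softmax(model(ablated), dim=-1)

    # pointwise p_full * (log p_full - log p_abl); sum over actions
    kl = F.kl_div(logp_abl, logp_full, log_target=True, reduction="none")
    return kl.sum(dim=-1)              # (N,) KL per board state

# Sweep every input channel and rank by mean policy shift:
# shifts = {c: channel_ablation_kl(model, boards, c).mean().item()
#           for c in range(boards.shape[1])}
```

Averaging this over the 600 filtered states gives the per-channel numbers quoted above.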
EXPERIMENT 3 · LINEAR PROBES
What concepts the 128-dim feature vector encodes
Train a logistic regression on the network's GAP features to decode hand-labelled concepts. Numbers are balanced accuracy on a held-out test set; the baseline is the chance level for that label distribution.
| Concept | Probe accuracy | Chance baseline | Verdict |
|---|---|---|---|
| Game phase (early / mid / late) | 79% | 33% | encoded |
| Number of tokens out of base | 75% | 33% | encoded |
| Will I win this game? | 73% | 51% | encoded |
| Closest token to home | 35% | 38% | not encoded |
| Home-stretch token count | 48% | 73% | anti-encoded |
Strategic context (phase, lead, who's winning) lives clearly in the features. Per-token spatial concepts (which token is closest, how many are nearly home) don't — the network appears to re-derive these from the input each forward pass rather than maintaining them in the residual stream.
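The probe setup is a few lines of scikit-learn. A sketch, assuming the 128-dim GAP features have already been dumped to an array; the post's baselines are distribution-aware, so the sketch estimates chance empirically with a stratified dummy classifier rather than hard-coding 1/K:

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

def probe_concept(features, labels, seed=0):
    """Linear probe on frozen 128-dim GAP features for one concept.

    features: (N, 128) array; labels: (N,) hand-labelled concept values.
    Returns (probe_acc, chance_acc), both balanced accuracy on the
    held-out split; chance comes from a label-distribution dummy.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.25, stratify=labels, random_state=seed)

    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    dummy = DummyClassifier(strategy="stratified", random_state=seed).fit(X_tr, y_tr)

    return (balanced_accuracy_score(y_te, probe.predict(X_te)),
            balanced_accuracy_score(y_te, dummy.predict(X_te)))

# e.g. acc, chance = probe_concept(gap_features, phase_labels)
```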
EXPERIMENT 2 · DICE SENSITIVITY
It's a reactive lookup, not a planner
Hold the board fixed, sweep the dice channel through 1–6, and watch how the chosen token shifts. In ~78% of states the preferred token flips when the dice value changes, and the Jensen-Shannon divergence between the roll-1 and roll-6 policy distributions is correspondingly large.
The network behaves as f(board, dice) → action, with dice values acting as broadcast modifiers rather than something integrated into a temporal plan. Same pattern across V6, V10, and now V13.2 — annealed PPO didn't change it. Tree-search-style planning would look very different.
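The sweep is a one-board loop. A sketch in PyTorch, where `set_dice` stands in for whatever writes the roll into the input encoding (ours is a dedicated channel); treat it as an assumption and swap in your own:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def dice_sweep(model, board, set_dice):
    """Policy over moves for each dice value 1-6, board held fixed.

    board: (C, H, W) encoded state. set_dice(board, v) writes roll v
    into the encoding (assumed: a dedicated dice channel).
    """
    policies = [F.softmax(model(set_dice(board.clone(), v).unsqueeze(0)),
                          dim=-1).squeeze(0)
                for v in range(1, 7)]
    return torch.stack(policies)       # (6, n_actions)

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two policy distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.add(eps).log() - b.add(eps).log())).sum()
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# pols = dice_sweep(model, board, set_dice)
# flipped = pols.argmax(-1).unique().numel() > 1   # preferred token moved?
# js16 = js_divergence(pols[0], pols[5])           # roll-1 vs roll-6
```

Counting `flipped` over the evaluation set gives the ~78% figure; `js16` is the roll-1 vs roll-6 divergence quoted above.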
EXPERIMENTS 4–5 · CAPACITY USE
Every channel is alive, but few do real work
Layer knockout (skipping individual ResBlocks) and channel activation (which of the 160 channels actually fire) ran as a single pass. Result: zero globally dead channels at any threshold; every channel produces some activation. But channel importance is heavily long-tailed: a handful of channels dominate the policy gradient, and the bulk are weakly redundant.
The model isn't wasting parameters in the obvious sense (no dead neurons), but it's also not packing them densely. There's likely room to compress 10× without losing strength — which is partly why V13.5's smaller-input variant matched V13.2 at one-third the parameter count.
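The activation half of that audit fits in one forward hook. A sketch, assuming standard PyTorch modules; the layer path in the usage comment is hypothetical:

```python
import torch

@torch.no_grad()
def channel_activity(model, layer, boards, threshold=1e-3):
    """Max and mean absolute activation per channel of one conv layer.

    A channel counts as 'dead' if its max activation over the whole
    dataset stays below `threshold`.
    """
    stats = {}

    def hook(_module, _inputs, out):   # out: (N, C, H, W)
        a = out.abs()
        stats["max"] = a.amax(dim=(0, 2, 3))
        stats["mean"] = a.mean(dim=(0, 2, 3))

    handle = layer.register_forward_hook(hook)
    model(boards)
    handle.remove()

    dead = (stats["max"] < threshold).sum().item()
    ranked = stats["mean"].sort(descending=True).values
    return dead, ranked

# Hypothetical layer path; point this at any conv in your trunk:
# dead, ranked = channel_activity(model, model.trunk[3].conv2, boards)
# dead == 0 with a steeply decaying `ranked` is the pattern described above.
```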