"Mathematically clean" rewards can be poison
An early reward-shaping scheme looked elegant on paper but quietly subtracted a tiny amount of reward every turn. Over a 150-move game it added up to about a fifth of a "loss" — the model became convinced every game was unwinnable. Took 155,000 games to figure out what was happening.
In long games, even tiny systematic biases compound. Always check what the reward looks like end-to-end.