I've always had a sneaking suspicion that I suck at endgames. Certainly, I can remember many endgames that I've painfully screwed up. But if there's one thing I've learned from poker, it's that our memory of what happened over many games can differ wildly from reality when checked against hard data.

It shouldn't be too hard to check my intuition about being bad at the endgame against reality. Here's the plan:

  1. Download my latest 1000 blitz games from lichess.
  2. Use an engine to evaluate the position after each move.
  3. Plot the difference of the evaluations move-by-move. This should allow me to see whether, on average, my position improved or got worse at each stage of the game.
import chess
import chess.engine
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import requests

Download Games

I'll take advantage of the lichess API to grab my last 1000 blitz games. If you want to use this notebook to analyze your own games, simply fill in your own lichess username. The token is a lichess personal API token. This is optional, but recommended, because otherwise downloading 1000 games is pretty slow. You could also use fewer games, of course, but more games is better because the move-by-move data is pretty noisy.

username = 'CheckRaiseMate'
token = 'your-api-token' # replace with your own lichess personal API token (or set to None)
def get_pgn(username, params=None, token=None):
    headers = {} if token is None else {'Authorization': 'Bearer ' + token}
    r = requests.get(f'https://lichess.org/api/games/user/{username}', params=params, headers=headers)
    return r.text
params = dict(
    perfType='blitz',
    max=1000
)
pgn = get_pgn(username, params=params, token=token)
with open(f'{username}.pgn', 'w') as f:
    f.write(pgn)
games = pgn.split('\n\n\n')[:-1]
len(games)
1000

Analyze Games

To evaluate the games I'll use the latest version of Stockfish. Fortunately, the python-chess library has an engine module that makes it easy to work with the engine in Python.

First I define a helper function to check which color I was in each game. This is important because the analysis is from my point of view: in games where I was Black, I'll need to flip the engine evaluation.

def get_color(username, game):
    wp = re.compile('White "(.+?)"')
    result = wp.search(game)
    white = result.group(1)
    
    bp = re.compile('Black "(.+?)"')
    result = bp.search(game)
    black = result.group(1)
    
    if white == username:
        return 0
    elif black == username:
        return 1
    else:
        return None
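To see what those header patterns actually match, here's a quick check against a made-up two-line snippet in the style of lichess PGN headers (the opponent name is hypothetical):

```python
import re

# hypothetical header snippet in lichess PGN export style
sample = '[White "CheckRaiseMate"]\n[Black "SomeOpponent"]'

white = re.search('White "(.+?)"', sample).group(1)
black = re.search('Black "(.+?)"', sample).group(1)

print(white, black)  # CheckRaiseMate SomeOpponent
# With these headers, get_color('CheckRaiseMate', sample) returns 0 (White).
```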

Then I go through all the games and get the engine evaluation after each move. There are a few tricks here...

I decided to only run the evaluation after my own moves. It wasn't clear to me whether this would be better than evaluating after my opponent's moves as well, but it cuts the number of positions to evaluate in half, and it takes a while to evaluate every move in 1000 games.

Then you need to set some parameters for the engine. The first is how much time to give it to think about each position. 0.1 seconds seems like a decent balance between accuracy and speed. I don't think ultra-precise evaluations are that important when you're looking at broad trends across many games.

Then there's mate score and max score. The engine gives evaluations in centipawns (100 centipawns = 1 pawn) and you need to choose a numeric value to use for a checkmate position - presumably a very large value - which is the mate score. I also felt that differences in scores above a certain threshold aren't very meaningful. For example, a +10 and +20 position are both completely winning, but a difference of 10 in the evaluation is huge and really messes up summary statistics. For that reason, I clipped the evaluations at +/- 1000 centipawns (10 pawns). I ended up doing more to address this problem, but more on that later.

def evaluate_game(game, engine, username, time=0.1, max_score=1000):
    board = chess.Board()
    color = get_color(username, game)
    
    moves = game.split('\n')[-1]
    moves = re.sub(r"[0-9]*\.", "", moves).split()[:-1] # strip move numbers; the final token is the game result
    
    evals = []
    
    for i, move in enumerate(moves):
        board.push_san(move)
        if i % 2 == color: # evaluate only after my own moves
            info = engine.analyse(board, chess.engine.Limit(time=time))
            score = info['score'].white().score(mate_score=max_score)
            score = np.clip(score, -max_score, max_score)
            if color == 1: # flip sign so the eval is from my point of view
                score = -score
            evals.append(score / 100) # centipawns -> pawns
            
    return evals
engine_path = "/usr/local/Cellar/stockfish/12/bin/stockfish"
engine = chess.engine.SimpleEngine.popen_uci(engine_path)

all_evals = []

for i, game in enumerate(games):
    if i%100==0: print(f"Game {i}")
    evals = evaluate_game(game, engine, username)
    all_evals.append(evals)
    
engine.quit()
Game 0
Game 100
Game 200
Game 300
Game 400
Game 500
Game 600
Game 700
Game 800
Game 900

To make analysis easier, I put all the evaluations in a pandas dataframe. Each column is one game, the rows are moves, and each cell is an evaluation.

df = pd.concat([pd.Series(e) for e in all_evals], axis=1)
df.shape
(121, 1000)
df.head()
0 1 2 3 4 5 6 7 8 9 ... 990 991 992 993 994 995 996 997 998 999
0 -0.27 0.32 0.31 -0.30 0.16 0.24 0.22 -0.30 -0.25 0.04 ... 0.10 -0.10 -0.24 0.08 -0.33 0.05 0.17 -0.50 0.18 -0.30
1 -0.17 0.35 0.69 -0.10 0.23 0.25 0.26 -0.40 -0.15 0.27 ... 0.12 -0.18 -0.57 0.08 -0.22 -0.33 0.09 -0.29 0.13 -0.46
2 -0.21 0.63 0.68 -0.16 0.70 0.37 0.61 -0.44 0.01 0.23 ... 0.21 -0.37 -0.49 0.00 -0.31 -0.36 0.12 -0.44 0.34 -0.42
3 -0.39 0.90 1.13 -0.28 0.54 0.24 0.48 -0.18 -0.14 1.21 ... 0.25 0.01 -0.56 0.08 -0.05 -0.21 0.03 -0.19 0.69 -0.75
4 -0.19 0.72 0.80 -0.22 0.44 0.36 0.47 -0.34 0.21 1.18 ... 0.23 0.08 -0.65 0.21 0.47 0.04 0.08 -0.10 0.43 -0.90

5 rows × 1000 columns

df.to_csv('evals.csv', index=False)

First let's look at the mean across all columns. This is the average evaluation by move number across all games.

df.mean(axis=1)[:60].plot();

It starts at 0.0 (equal position) and gradually goes up. This makes sense, because I win more games than I lose on lichess, so in general the evaluation should go up as the game goes on. Maybe I would improve at chess more quickly if I adjusted the seek parameters to force the server to pair me against higher rated opponents so I would lose more often, but that's a different story.

At any rate, what we're really interested in is the difference by move.

df.diff().mean(axis=1)[:50].plot();

It seems quite noisy (it's not really plausible that I'm great at playing move 20 but terrible at move 21, or whatever). This makes sense, as a single blunder that changes the evaluation by a large number like 10 could really impact the mean. I could try to address that by using the median.
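To see how much a single blunder can drag the mean around, here's a toy example (made-up numbers): ten typical small move-to-move differences plus one -10 blunder.

```python
import numpy as np

# ten typical small eval differences plus one big blunder (made-up data)
diffs = np.array([0.1, -0.2, 0.3, 0.0, 0.1, -0.1, 0.2, 0.0, 0.1, -0.1, -10.0])

print(round(diffs.mean(), 2))  # -0.87 -- the single -10 dominates the mean
print(np.median(diffs))        # 0.0   -- the median barely notices it
```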

df.diff().median(axis=1)[:50].plot();

Hmm, still quite jagged. Since I'm really interested in trends across phases of the game, not individual moves, it would make sense to apply some smoothing. One way to do that would be to use a rolling mean.

df.diff().mean(axis=1)[:60].rolling(10).mean().plot();

The trend actually goes up as the game goes on! Maybe I'm not as bad at the endgame as I thought?

But thinking about it more, if you win more often than you lose, there should be an upward trend not only in the evaluation, but also in the difference. This is the expected development as you convert a large advantage: +3 becomes +5 becomes +8, etc. While converting a big advantage is certainly better than blowing it, it's not really what I had in mind when I set out to evaluate my endgame play. It also makes the data sensitive to factors I have little control over, like when my opponent resigns.

Additionally, I was still concerned about large but meaningless differences in evaluations. For example, there isn't much difference between a +5 and +10 position in practical terms, but a difference of 5 will impact the mean quite a lot. I could adjust the clip value from 10 to 5, but that's still not ideal. I'd like a +5 and +10 position to register as different, just not so different. I'd also like the scaling to continue all the way down to 0. I think the difference between a +0 and +2 position is bigger than the difference between +2 and +4. Ideally, I'd like something that:

  • Can take any number, positive or negative, as input.
  • Is more sensitive to differences closer to 0.
  • Still registers differences far from 0, but not as much.

Is that sigmoid's music I hear???

Why yes, sigmoid would seem to be perfect actually. It fits all our requirements and squishes the input into the range 0-1. Conveniently, this can be interpreted as expected points: 0 = certain loss, 1 = certain win. This is essentially how AlphaZero evaluates positions too: as an expected game outcome rather than a material count (its value head uses a -1 to 1 scale, but the idea is the same).

It will also handle the converting-a-win scenario very nicely. As long as we stay at a big plus score it won't really care, but if we blow it and get into a worse position, that will register as a big change.

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

Let's just plot the function as a quick visual reminder of what sigmoid does.

x = np.arange(-10, 10, 0.1)
plt.plot(x, sigmoid(x));

It's very sensitive to differences around 0, and very insensitive to differences far from 0, which is exactly what we want.
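Putting numbers on that: the same two-pawn swing means a lot near equality and almost nothing in an already-winning position.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# a 2-pawn swing near equality vs. the same swing when already winning
near_zero = sigmoid(2) - sigmoid(0)   # going from +0 to +2
far_out = sigmoid(10) - sigmoid(8)    # going from +8 to +10

print(round(near_zero, 3), round(far_out, 4))  # 0.381 0.0003
```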

df_sigmoid = sigmoid(df)

Now I'll do the same plots I did with the raw evaluations.

df_sigmoid.mean(axis=1)[:50].plot();
df_sigmoid.diff().mean(axis=1)[:50].plot();
df_sigmoid.diff().mean(axis=1)[:50].rolling(10).mean().plot();

Okay, this seems to show I do get weaker as the game goes on! But maybe I tortured the data until I got the result that I wanted?

Takeaways

  • The results were somewhat inconclusive: the raw data did not show me getting weaker as the game went on, but the sigmoid-transformed data did. I'm inclined to trust the sigmoid version for the reasons discussed above, but my fear that I'm bad at endgames may have been overblown: the difference, if it exists, is neither large nor obvious.
  • When using summary statistics to compare engine evaluations across games, big-but-meaningless evaluation differences in clearly winning/clearly losing positions are a big problem. Using a sigmoid transformation is a promising way to combat this.