Introduction

The hard work is done. The move generation module is running and ready to generate moves for our chess games. Yay!

But there’s no time to celebrate yet. The move generator might be functional but it’s also unreliable. Tons of bugs might have been introduced as a result of missing rules and edge cases, or code executing in unexpected ways.

Testing is a broad topic in software development and multiple techniques and approaches exist. However, one stands out of all of them. In this devlog, we’ll introduce perft testing, a popular method to test move generation for chess programs.

Why perft testing?

At this point, it’s no surprise for us that Chess holds tons of possibilities despite its simplicity.

A known fact about Chess is that there are more possible different positions on the board (including legal and illegal) than atoms in the observable universe. The number of possible positions is known as the Shannon Number[1] and it’s around 10^120. That number is so big that you could spend multiple lifetimes arranging every position on the board and you still wouldn’t be finished.

Don’t try it. Taken from [2]

The worst part is that this number grows exponentially after a few moves. This table shows the number of positions that can be achieved by performing X number of moves:

Number of movesNumber of possible positions
120
2400
38,902
4197,281
54,865,609
6119,060,324
73,195,901,860
884,998,978,956
92,439,530,234,167
1069,352,859,712,417
112,097,651,003,696,806

The point of testing is to ensure the move generation algorithm generates the correct amount of moves for every possible position. These absurd quantities make it impossible to test the algorithm manually for all these positions without going insane first.

But there’s no need to feel intimidated! The astonishing breakthroughs in AI and quantum computing have shown us that computers excel at processing mountains of information at lightning speed. We can play on this strength and hand that task to the computer. This way, we can quickly find which positions generate problems for our move generation algorithm so we can fix it.

What is perft testing?

A performance test (perft, for short) is an algorithm that counts the total number of positions achievable by making a certain number of moves for a given initial board configuration. An intuitive way to understand it is to visualize the game’s progression as a tree data structure.

--- title: Chess as a tree --- flowchart TB subgraph "root (depth 0)" pos_0((Initial Pos)) end subgraph "parent (depth 1)" pos1((Pos1)) end pos2((Pos2)) pos3((Pos3)) subgraph "children (depth 2)" pos4((Pos4)) pos5((Pos5)) pos6((Pos6)) end pos7((Pos7)) pos8((Pos8)) pos9((Pos9)) pos10((Pos10)) pos_0 ---|e4| pos1 pos_0 ---|Nf3| pos2 pos_0 ---|d4| pos3 pos1 ---|e5| pos4 pos1 ---|nc6| pos5 pos1 ---|b6| pos6 pos2 ---|Nc6| pos7 pos2 ---|a5| pos8 pos3 ---|d5| pos9 pos3 ---|nh6| pos10

The root of the tree represents the initial board configuration. Each node of the tree represents a certain chess position. The connection between nodes represents the transition between two positions by making a move. A child node is the result of performing a move in the parent node. Therefore, a parent position will have as many child positions as possible legal moves. The depth of the tree represents the number of moves performed by each side.

For example, from the initial position, we can reach pos4 by playing e4 and then e5, which results in a depth of 2 because we played two moves.

To count all positions, we’ll visit each node on the tree using a depth-first pre-order search [4]. Then, we’ll count the number of children each node has and add them up. This way, we’ll effectively count all possible positions after performing a certain number of moves.

Alt text
Depth-first pre-order tree traversal. Taken from [4]

Following the previous example, we start in the initial position and count 3 child positions. Then, we move to pos1 and count another 3 children. Then, we move to pos4, pos5, and pos6, which have no children. Then, we move to pos2, which has 2 children… You get the idea. The total number of positions for depth 2 is 10 for this example.

Unfortunately, we can’t go as deep as we want without suffering from performance issues. This is because the number of positions we have to traverse increases quickly.

You might ask, what do we do with the number that results from this process? Good question! The number alone doesn’t tell us much about the accuracy of our program. The point of the test is to compare our results against a consensus that a bunch of smart guys and developers have reached before. We can find these consensual values all over the internet and for many different starting positions.

Algorithm

The basic algorithm to perform a depth-first pre-order traversal of a tree looks like this:

function traverse(node){
    for(let child of node.children){
        traverse(child);
    }
}

The for-loop goes through the children of each node. Then, we make a recursive call of the method passing each child node, which takes the search one level deeper into the tree.

For our specific case, this would look like this:

    /**
     *
     * @param {Board} board
     * @param {number} depth
     */
    function perftTest(board) {

        let moves = board.generateMoves(playingColor);
        for (let move of moves) {

            board.makeMove(move);
            playingColor = OppositePieceColor(playingColor);

            perftTest(board);

            board.unmakeMove();
            playingColor = OppositePieceColor(playingColor);
        }

        return numberOfPositions;
    }

Every time we enter the function, which happens once for every node, we generate all the possible moves for this position. Then, we access the child positions by making each of these moves on the board. Once we finish on one level, we unmake the move to return to the parent position.

Now we just need extra code to count the positions:

    /**
     * @param {Board} board
     * @param {number} depth
     * @param {E_PieceColor} playingColor
     */
    function perftTest(board, depth, playingColor = E_PieceColor.White) {
        if (depth == 0) {
            return 1;
        }

        let moves = board.generateMoves(playingColor);
        let numberOfPositions = 0;

        for (let move of moves) {
            board.makeMove(move);
            playingColor = OppositePieceColor(playingColor);
            let positions = perftTest(board, depth - 1, playingColor);
            numberOfPositions += positions;
            board.unmakeMove();
            playingColor = OppositePieceColor(playingColor);
        }

        return numberOfPositions;
    }

The depth value decreases with each level. Once it hits the bottom, the depth equals 0 and we return immediately to avoid going further, adding one to the number of positions. Every node adds one incrementally until we count all nodes in the tree, which is equal to the total number of positions.

The routine also receives the color that plays first because the number of positions when white plays first might differ from when black plays first.

A practical guide to perft testing

All this theory looks nice and fancy. But how does it look in practice?

Here’s a step-by-step guide you can follow:

Step 1 - Collect a bunch of different test positions

The more positions we can test our algorithm against, the better. Testing against the standard initial position is a good starting point. However, we’ll run out of performance before we reach those rare positions that occur after several moves. For that reason, we need to collect a bunch of different positions first. Extra points if those positions helped other people find bugs.

Here are some resources where you can find test positions:

Something you need to consider is that some of these positions have initial conditions. For example, castling rights might be initially lost, or an en-passant capture might be legal in the first move. These conditions are usually described in the FEN notation [7] of that position.

As I’m writing this, our program receives a FEN notation without initial conditions when constructing a board. Therefore, it assumes that any chess position starts with no right to en-passant. In other words, none of the pawns made a front jump in the previous move. Also, it automatically calculates castling rights based on the current state of the board and dismisses any past moves.

For this reason, we’ll use positions that do not depend on any initial conditions but just on the initial board configuration.

In the future, we might add some code that receives a complete FEN representation and adjusts the initial conditions of the board accordingly. That way, we can test even more positions.

Step 2 - Find the results for each position

The next step is to ask the Internet what the correct number of moves is for each position and depth. We’ll use the tables in the Chess Programming Wiki[5] for various positions.

Stockfish

Unfortunately, there won’t be results for any position our mad mind can imagine.

However, the wonders of AI come to the rescue again. There are currently dozens of AI-powered chess programs whose move generation algorithms have been widely tested and are well-trusted.

One example of them is Stockfish, a powerful and fast chess engine. Apart from easily beating world-class chess champions, this program offers the possibility to generate perft test results for any position and depth.

You can download it here [8].

The commands we’ll be using are:

  • position: Sets the position we want to analyze
  • d: Draws a board with the position we are analyzing.
  • go perft: Performs a perft test at a certain depth.

You can find further documentation on the line commands here [9].

Let’s look at the output of a perft test done by Stockfish.

Alt text
Example of Stockfish output

First, we set the position to a standard one with the position standard command. Then, we run a perft test of depth 3 with go perft 3.

We observe that stockfish provides a series of apparently nonsensical letters and numbers. Letters represent moves that can be performed as the starting move from the initial position. The number of positions that ramificate from each move is indicated after it. If we look back at the tree example, this tells us basically how many children and grandchildren certain position has after playing the first move.

%%{init: {'themeVariables': { 'fontSize': '20px'}}}%% flowchart TB pos_0((Standard Position)) subgraph "420 child positions" pos3((Pos 3)) pos31((Pos 3.1)) pos311(("...")) pos32((Pos 3.2)) pos321(("Pos 3.2.1")) pos322(("...")) pos33(("...")) pos331(("...")) end subgraph "420 child positions" pos2((Pos 2)) pos21((Pos 2.1)) pos211(("...")) pos22((Pos 2.2)) pos221(("Pos 2.2.1")) pos222(("...")) pos23(("...")) pos231(("...")) end subgraph "380 child positions" pos1((Pos 1)) pos11((Pos 1.1)) pos111(("...")) pos12((Pos 1.2)) pos121(("Pos 1.2.1")) pos122(("...")) pos13(("...")) pos131(("...")) end pos_0 ---|a2a3| pos1 pos_0 ---|b2b3| pos2 pos_0 ---|c2c3| pos3 pos1 --- pos11 pos1 --- pos13 pos1 --- pos12 pos11 --- pos111 pos12 --- pos121 pos12 --- pos122 pos13 --- pos131 pos2 --- pos21 pos2 --- pos22 pos2 --- pos23 pos21 --- pos211 pos22 --- pos221 pos22 --- pos222 pos23 --- pos231 pos3 --- pos31 pos3 --- pos32 pos3 --- pos33 pos31 --- pos311 pos32 --- pos321 pos32 --- pos322 pos33 --- pos331

For example, from a standard position where white plays first, we can play as the first move a2a3, b2b3, c2c3, etc, which is basically moving the pawns one square forward. a2a3 generates 380 positions until depth 3. b2b3 generates 420 positions until depth 3. c2c3 generates 420 positions until depth 3, etc.

This feature is really nice because we can track down moves that are creating trouble.

Step 3 - Write a function that runs the test for a given position

We’ll create an object that stores the results for each position.

var testPositions = {
    standard: {
        fen: 'rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR',
        positions: [20, 400, 8902, 197281]
    },
    pos2_kiwipet: {
        fen: 'r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R',
        positions: [48, 2039, 97862, 4085603]
    },
    pos3: {
        fen: '8/2p5/3p4/KP5r/1R3p1k/8/4P1P1/8',
        positions: [14, 191, 2812, 43238, 674624]
    },
    pos4: {
        fen: 'r3k2r/Pppp1ppp/1b3nbN/nP6/BBP1P3/q4N2/Pp1P2PP/R2Q1RK1',
        positions: [6, 264, 9467, 422333]
    },
    pos5: {
        fen: 'rnbq1k1r/pp1Pbppp/2p5/8/2B5/8/PPP1NnPP/RNBQK2R',
        positions: [44, 1486, 62379, 2103487]
    },
    pos6: {
        fen: 'r4rk1/1pp1qppp/p1np1n2/2b1p1B1/2B1P1b1/P1NP1N2/1PP1QPPP/R4RK1',
        positions: [46, 2079, 89890, 3894594]
    }
}

The positions array holds the number of positions for a certain depth. The first element is the number of positions at depth 1, the second is the number of positions at depth 2, and so on. I limited my search between depth 4 and depth 5 because that’s what my crappy computer could handle before perishing.

F in the chat. Taken from [10]

Now, I’ll let you create the function that runs these tests;

function runAllPerftTests(testPosition) {
    let board = new Board(testPosition.fen);
    for (let i = 0; i < testPosition.positions.length; i++) {
        perftTest(board, i + 1);
    }
}

We create the board using its FEN notation and then run the test for each depth.

The reason why we are going depth by depth is to spot which one is creating trouble. Since game’s progression is essentially a tree structure, errors in a parent position will cascade to later depths of the tree. For example, if we observe that the numbers start to mismatch after depth 4, we know that depth 3 generates extra or missing moves.

Step 4 - Write code that outputs perft results

Now we have to actually print out the results of the test so we can compare them to the consensus.

The output format should be designed to facilitate the comparison with other sources. To compare against results in the wiki, we can follow this format:

Alt text
Console output format for perft results

Since we will be using Stockfish as another reliable source of results, we’ll also generate output with a similar format to the engine:

Alt text
Console output with Stockfish format

Where we want to write and store the output is another thing we have to consider. We could write the results into a simple text file, connect and transfer the data to another application, or upload it to an online database. That depends on our goal and the workflow we want to achieve.

To keep things simple, we’ll just print the results in the browser console.

Code

For the table-like output, we’ll add some code to our runAllPerftTests function.

function runAllPerftTests(testPosition) {
    let board = new Board(testPosition.fen);
    for (let i = 0; i < testPosition.positions.length; i++) {
        let result = perftTest(board, i + 1); 
        console.log("Depth: " + (i + 1) + "  Result: " + result + "  Expected: " + testPosition.positions[i]);
    }
}

Then, for the stockfish-like output, we add more code to our perftTest function.

/**
 * @param {Board} board 
 * @param {number} depth 
 * @param {boolean} debug 
 * @param {E_PieceColor} playingColor 
 * @returns total number of positions for given depth
 * 
 */
function perftTest(board, depth, debug = false, playingColor = E_PieceColor.White) {
    if (depth == 0) {
        return 1;
    }

    let moves = board.generateMoves(playingColor);
    let numberOfPositions = 0;

    for (let move of moves) {
        board.makeMove(move);
        playingColor = OppositePieceColor(playingColor);
        let positions = perftTest(board, depth - 1, false, playingColor);
        numberOfPositions += positions;
        board.unmakeMove();
        playingColor = OppositePieceColor(playingColor);

        if (debug) {
            console.log(MoveToString(move) + " " + positions + "\n");
        }
    }

    return numberOfPositions;
}

We include a boolean flag called debug that determines whether the results should be printed out. This flag will always be false after depth 1 (that is, after making the first move). This way, we only print results for the set of starting moves.

The MoveToString function converts the numerical representation of rank and files to their algebraic notation used by stockfish.

/**
 * @param {Move} move 
 * @returns move in algebraic notation
 */
function MoveToString(move) {
    let startFileLetter = FileToLetter(move.startFile);
    let endFileLetter = FileToLetter(move.endFile);
    return startFileLetter + move.startRank + endFileLetter + move.endRank;
}

Final notes

There you have it! Now you know the fundamentals to understand perft testing and you have a route to add it into your program.

I want to mention a couple of things before wrapping up this devlog.

Performance

As mentioned before, you’ll eventually run into a situation where your computer takes a lot of time (or totally freezes) to calculate results at a high depth. That is normal. Even computers have their limits.

To speed things up, your best bet is to optimize the code in the move generation module. Many optimization techniques are possible. Caching [11] is a common practice to increase performance by calculating something once and reusing it when needed.

However, I’d advise against optimizing your code until it’s completely necessary. And by completely necessary I mean that your program can’t go further than depth 3 or 4 without experiencing long delays. Most optimization techniques make code more complex and less readable, which is a no-no if you want to keep your program manageable.

How accurate is this test?

An inherent flaw of perft testing is that you can never be 100% sure your code is right because we’re not testing the code directly but just its outputs. There might be a case where a program generates the correct amount of moves despite having errors. Some of the moves generated won’t be legal even though the numbers match.

However, if we do things right, this scenario is unlikely. It’s just a matter of probabilities. The probability that a wrong program hits the exact number of moves multiple times for different positions is really small. That’s why it’s recommended to test against various positions to a decent depth.

References