Perft Testing P.1: Fundamentals

July 5, 2024

Introduction

The hard work is done. The move generation module is running and ready to generate moves for our chess games. Yay!

But there’s no time to celebrate yet. The move generator might be functional but it’s also unreliable. Tons of bugs might have been introduced as a result of missing rules, edge cases, or code executing in unexpected ways.

Testing is a broad topic in software development and there are multiple techniques and approaches. However, in the chess programming community, one stands out of all of them. In this devlog, we’ll introduce perft testing, a popular method to test move generation algorithms for chess programs.

Why perft testing?

At this point, it’s no surprise for us that Chess holds tons of possibilities despite its simplicity.

A known fact about Chess is that the number of possible positions on the board (including legal and illegal), called the Shannon Number [1], is around 10^120. That’s higher than the number of atoms in the observable universe. That number is so big that you could spend multiple lifetimes arranging every position on the board and you still wouldn’t be finished.

Jubei Chan Withering GIFfrom Jubei Chan GIFs
Don’t try it. Taken from [2]

The worst part is that this number grows exponentially after a few moves. This table shows the number of positions that can be achieved by performing X number of moves:

Number of moves	Number of possible positions
1	20
2	400
3	8,902
4	197,281
5	4,865,609
6	119,060,324
7	3,195,901,860
8	84,998,978,956
9	2,439,530,234,167
10	69,352,859,712,417
11	2,097,651,003,696,806

The point of testing is to ensure the program generates the correct amount of legal moves for every possible position. The astronomical quantity of positions make it impossible to test the algorithm manually without going insane first.

But there’s no need to feel intimidated! The astonishing breakthroughs in AI and quantum computing have shown us that computers excel at processing mountains of information at lightning speed. We can play on this strength and hand that task to the computer. This way, we can quickly find which positions generate problems for our move generation algorithm so we can fix it.

What is perft testing?

A performance test (perft, for short) is an algorithm that counts the total number of positions achievable by making a certain number of moves for a given initial board configuration. An intuitive way to understand it is to visualize the game’s progression as a tree data structure.

--- title: Chess as a tree --- flowchart TB subgraph "root (depth 0)" pos_0((Initial Pos)) end subgraph "parent (depth 1)" pos1((Pos1)) end pos2((Pos2)) pos3((Pos3)) subgraph "children (depth 2)" pos4((Pos4)) pos5((Pos5)) pos6((Pos6)) end pos7((Pos7)) pos8((Pos8)) pos9((Pos9)) pos10((Pos10)) pos_0 ---|e4| pos1 pos_0 ---|Nf3| pos2 pos_0 ---|d4| pos3 pos1 ---|e5| pos4 pos1 ---|nc6| pos5 pos1 ---|b6| pos6 pos2 ---|Nc6| pos7 pos2 ---|a5| pos8 pos3 ---|d5| pos9 pos3 ---|nh6| pos10

The root of the tree represents the initial board configuration. Each node of the tree represents a certain chess position. The connection between nodes represents the transition between two positions by making a move. A child node is the result of performing a move in the parent node. Therefore, a parent position will have as many child positions as possible legal moves. The depth of the tree represents the number of moves performed by each side.

For example, from the initial position, we can reach pos4 by playing e4 and then e5, which results in a depth of 2 because we played two moves.

To count all positions, we’ll visit each node on the tree using a depth-first pre-order search [4], and we’ll count every node we go through.


Depth-first pre-order tree traversal. Taken from [4]

Following the previous example, we start in the initial position and count 3 child positions. Then, we move to pos1 and count another 3 children. Then, we move to pos4, pos5, and pos6, which have no children. Then, we move to pos2, which has 2 children… You get the idea. The total number of positions for depth 2 is 10 for this example.

Unfortunately, we can’t go as deep as we want without suffering performance issues. This is because the number of positions we have to traverse increases quickly.

You might ask, what do we do with the number that results from this process? Good question! The number alone doesn’t tell us much about the accuracy of our program. The point of the test is to compare our results against a consensus that a bunch of smart guys and developers have reached before. We can find these consensual values all over the internet and for many different starting positions.

Algorithm

The basic algorithm to perform a depth-first pre-order traversal of a tree looks like this:

function traverse(node){
    for(let child of node.children){
        traverse(child);
    }
}

The for-loop goes through the children of each node. Then, we make a recursive call of the method passing each child node, which takes the search deeper into the tree by one level.

For our specific case, this would look like this:

    /**
     *
     * @param {Board} board
     * @param {number} depth
     */
    function perftTest(board) {

        let moves = board.generateMoves(playingColor);
        for (let move of moves) {

            board.makeMove(move);
            playingColor = OppositePieceColor(playingColor);

            perftTest(board);

            board.unmakeMove();
            playingColor = OppositePieceColor(playingColor);
        }

        return numberOfPositions;
    }

Every time we enter the function, which happens once for every node, we generate all the possible moves for this position. Then, we access the child positions by applying each of these moves on the board. Once we finish on one level, we unmake the move to return to the parent position.

Now we just need extra code to count the positions:

    /**
     * @param {Board} board
     * @param {number} depth
     * @param {E_PieceColor} playingColor
     */
    function perftTest(board, depth, playingColor = E_PieceColor.White) {
        if (depth == 0) {
            return 1;
        }

        let moves = board.generateMoves(playingColor);
        let numberOfPositions = 0;

        for (let move of moves) {
            board.makeMove(move);
            playingColor = OppositePieceColor(playingColor);
            let positions = perftTest(board, depth - 1, playingColor);
            numberOfPositions += positions;
            board.unmakeMove();
            playingColor = OppositePieceColor(playingColor);
        }

        return numberOfPositions;
    }

The depth value decreases with each level. Once it hits the bottom, the depth equals 0 and we return immediately to avoid going further, adding one to the number of positions. Every node adds one incrementally until we count all nodes in the tree, which is equal to the total number of positions.

The routine also receives the color that plays first because the number of positions when white plays first might differ from when black plays first.

A practical guide to perft testing

All this theory looks nice and fancy. But how does it look in practice?

Here’s a step-by-step guide you can follow:

Step 1 - Collect a bunch of test positions

The more positions we can test our algorithm against, the better. Testing against the standard initial position is a good starting point. However, we’ll run out of performance before we reach those rare positions that occur after several moves. For that reason, we need to collect different positions. Extra points if those positions helped other people find bugs.

Here are some resources where you can find test positions:

Something to keep in mind is that some positions have initial conditions. For example, castling rights might be lost from the start, or an en-passant capture might be legal in the first move. These conditions are usually described in the FEN notation [7] of that position.

As I’m writing this, our program receives a FEN notation without initial conditions when constructing a board. Therefore, it assumes that any chess position starts with no right to en-passant. In other words, none of the pawns made a front jump in the previous move. Also, it automatically calculates castling rights based on the current state of the board and dismisses any past moves.

For this reason, we’ll use positions that do not depend on any initial conditions and just on the initial board configuration.

In the future, we might add some code that receives a complete FEN representation and adjusts the initial conditions of the board accordingly. That way, we can test even more positions.

Step 2 - Find the results for each position

The next step is to ask the Internet what the correct number of moves is for each position and depth. We’ll use the tables in the Chess Programming Wiki[5] for various positions.

Stockfish

Unfortunately, there won’t be results on the Internet for any position our mind can imagine.

However, the wonders of AI come to the rescue again. There are currently dozens of AI-powered chess programs whose move generation algorithms have been widely tested and are well-trusted.

One example of them is Stockfish, a powerful and fast chess engine. Besides easily beating world-class chess champions, this program offers the possibility to generate perft test results for any position and depth.

You can download it here [8].

The commands we’ll be using are:

position: Sets the position we want to analyze
d: Draws a board with the position we are analyzing.
go perft: Performs a performance test at a certain depth.

You can find further documentation on the line commands here [9].

Let’s look at the output of a perft test done by Stockfish.


Example of Stockfish output

First, we set the position to a standard one with the position standard command. Then, we run a perft test of depth 3 with go perft 3.

Stockfish provides a series of apparently nonsensical letters and numbers. Letters represent moves that can be performed as the starting move from the initial position. The number of positions that ramificate from each move is indicated after it. If we look back at the tree example, this tells us basically how many children and grandchildren certain position has after playing the first move.

%%{init: {'themeVariables': { 'fontSize': '20px'}}}%% flowchart TB pos_0((Standard Position)) subgraph "420 child positions" pos3((Pos 3)) pos31((Pos 3.1)) pos311(("...")) pos32((Pos 3.2)) pos321(("Pos 3.2.1")) pos322(("...")) pos33(("...")) pos331(("...")) end subgraph "420 child positions" pos2((Pos 2)) pos21((Pos 2.1)) pos211(("...")) pos22((Pos 2.2)) pos221(("Pos 2.2.1")) pos222(("...")) pos23(("...")) pos231(("...")) end subgraph "380 child positions" pos1((Pos 1)) pos11((Pos 1.1)) pos111(("...")) pos12((Pos 1.2)) pos121(("Pos 1.2.1")) pos122(("...")) pos13(("...")) pos131(("...")) end pos_0 ---|a2a3| pos1 pos_0 ---|b2b3| pos2 pos_0 ---|c2c3| pos3 pos1 --- pos11 pos1 --- pos13 pos1 --- pos12 pos11 --- pos111 pos12 --- pos121 pos12 --- pos122 pos13 --- pos131 pos2 --- pos21 pos2 --- pos22 pos2 --- pos23 pos21 --- pos211 pos22 --- pos221 pos22 --- pos222 pos23 --- pos231 pos3 --- pos31 pos3 --- pos32 pos3 --- pos33 pos31 --- pos311 pos32 --- pos321 pos32 --- pos322 pos33 --- pos331

For example, from a standard position where white plays first, we can play a2a3, b2b3, or c2c3 (among more moves) as the first move, which is basically moving the pawns one square forward. a2a3 generates 380 positions until depth 3. b2b3 generates 420 positions until depth 3. c2c3 generates 420 positions until depth 3, and so on.

This feature is nice because we can track down moves that are creating trouble.

Step 3 - Write a function that runs the test for a given position

We’ll create an object that stores the results for each position.

var testPositions = {
    standard: {
        fen: 'rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR',
        positions: [20, 400, 8902, 197281]
    },
    pos2_kiwipet: {
        fen: 'r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R',
        positions: [48, 2039, 97862, 4085603]
    },
    pos3: {
        fen: '8/2p5/3p4/KP5r/1R3p1k/8/4P1P1/8',
        positions: [14, 191, 2812, 43238, 674624]
    },
    pos4: {
        fen: 'r3k2r/Pppp1ppp/1b3nbN/nP6/BBP1P3/q4N2/Pp1P2PP/R2Q1RK1',
        positions: [6, 264, 9467, 422333]
    },
    pos5: {
        fen: 'rnbq1k1r/pp1Pbppp/2p5/8/2B5/8/PPP1NnPP/RNBQK2R',
        positions: [44, 1486, 62379, 2103487]
    },
    pos6: {
        fen: 'r4rk1/1pp1qppp/p1np1n2/2b1p1B1/2B1P1b1/P1NP1N2/1PP1QPPP/R4RK1',
        positions: [46, 2079, 89890, 3894594]
    }
}

The positions variable holds an array with the number of positions for a each depth. The first element is the number of positions at depth 1, the second is the number of positions at depth 2, and so on. I limited my search between depth 4 and depth 5 because that’s what my crappy computer could handle before perishing.

Overheat Computer GIFfrom Overheat GIFs
F in the chat. Taken from [10]

The function to run the test looks ike this:

function runAllPerftTests(testPosition) {
    let board = new Board(testPosition.fen);
    for (let i = 0; i < testPosition.positions.length; i++) {
        perftTest(board, i + 1);
    }
}

We create the board using its FEN notation and then run the test for each depth.

The reason why we are going depth by depth is to spot which one is creating trouble. Since the game’s progression is essentially a tree structure, errors in a parent position will cascade to later depths of the tree. For example, if we observe that the numbers start to mismatch after depth 4, we know that depth 3 generates extra or missing moves.

Step 4 - Write code that outputs perft results

Now we have to actually print out the results of the test so we can compare them to the consensus.

The output format should be designed to facilitate the comparison with other sources. To compare against results in the wiki, we can follow this format:


Console output format for perft results

Since we will be using Stockfish as another reliable source of results, we’ll also generate output with a similar format to the engine:


Console output with Stockfish format

Another thing we have to consider is where to write and store the output . We could write the results into a simple text file, transfer the data to another application, or upload it to an online database. That depends on our goal and the workflow we want to achieve.

To keep things simple, we’ll just print the results in the browser console.

Code

For the table-like output, we’ll add some code to our runAllPerftTests function.

function runAllPerftTests(testPosition) {
    let board = new Board(testPosition.fen);
    for (let i = 0; i < testPosition.positions.length; i++) {
        let result = perftTest(board, i + 1); 
        console.log("Depth: " + (i + 1) + "  Result: " + result + "  Expected: " + testPosition.positions[i]);
    }
}

Then, for the stockfish-like output, we add more code to our perftTest function.

/**
 * @param {Board} board 
 * @param {number} depth 
 * @param {boolean} debug 
 * @param {E_PieceColor} playingColor 
 * @returns total number of positions for given depth
 * 
 */
function perftTest(board, depth, debug = false, playingColor = E_PieceColor.White) {
    if (depth == 0) {
        return 1;
    }

    let moves = board.generateMoves(playingColor);
    let numberOfPositions = 0;

    for (let move of moves) {
        board.makeMove(move);
        playingColor = OppositePieceColor(playingColor);
        let positions = perftTest(board, depth - 1, false, playingColor);
        numberOfPositions += positions;
        board.unmakeMove();
        playingColor = OppositePieceColor(playingColor);

        if (debug) {
            console.log(MoveToString(move) + " " + positions + "\n");
        }
    }

    return numberOfPositions;
}

We include a boolean flag called debug that determines whether the results should be printed out. This flag will always be false after depth 1 (that is, after making the first move). This way, we only print results for the set of starting moves.

The MoveToString function converts the numerical representation of rank and files to their algebraic notation used by stockfish.

/**
 * @param {Move} move 
 * @returns move in algebraic notation
 */
function MoveToString(move) {
    let startFileLetter = FileToLetter(move.startFile);
    let endFileLetter = FileToLetter(move.endFile);
    return startFileLetter + move.startRank + endFileLetter + move.endRank;
}

Final notes

There you have it! Now you know the fundamentals to understand perft testing and you have a route to add it into your program.

I want to mention a couple of things before wrapping up this devlog.

Performance

As mentioned before, you’ll eventually run into a situation where your computer takes a lot of time (or totally freezes) to calculate results at a high depth. That is normal. Even computers have their limits.

To speed things up, your best bet is to optimize the code in the move generation module. Many optimization techniques are possible. Caching [11] is a common practice to increase performance by calculating something once and reusing it when needed.

However, I’d advise against optimizing your code until it’s completely necessary. And by completely necessary I mean that your program can’t go further than depth 3 or 4 without experiencing long delays. Most optimization techniques make code more complex and less readable, which is a no-no if you want to keep your program manageable.

How accurate is this test?

An inherent flaw of perft testing is that you can never be 100% sure your code is right because we’re not testing the code directly but just its outputs. There might be a case where a program generates the correct amount of moves despite having errors. Some of the moves generated won’t be legal even though the numbers match.

However, if we do things right, this scenario is unlikely. It’s just a matter of probabilities. The probability that a wrong program hits the exact number of moves multiple times for different positions is really small. That’s why it’s recommended to test against various positions and to a decent depth.

References

[1] Wikimedia Foundation. (2024, May 19). Shannon number. Wikipedia. https://en.wikipedia.org/wiki/Shannon_number
[2] Baaaaaats. (2021, November 15). Jubei Chan Withering Gif. Tenor. https://tenor.com/es/view/jubei-chan-withering-jubei-chan-the-ninja-girl-wasting-away-aging-gif-23790527
[3] Introduction to tree data structure. GeeksforGeeks. (2024a, June 6). https://www.geeksforgeeks.org/introduction-to-tree-data-structure/
[4] Tree traversal techniques. GeeksforGeeks. (2024b, June 28). https://www.geeksforgeeks.org/tree-traversals-inorder-preorder-and-postorder
[5] Perft results. Perft Results - Chessprogramming wiki. (2024, March 15). https://www.chessprogramming.org/Perft_Results
[6] Jones, P. (2021, June 10). Test positions for debugging chess engines. formatted as JSON array of Fen Strings. Github Gist. https://gist.github.com/peterellisjones/8c46c28141c162d1d8a0f0badbc9cff9
[7] Fen (Forsyth-Edwards notation) - chess terms. Chess.com. (n.d.). https://www.chess.com/terms/fen-chess
[8] Download Stockfish 16.1. Stockfish. https://stockfishchess.org/download/
[9] UCI & Commands. Stockfish Wiki. (2023, December 14). https://official-stockfish.github.io/docs/stockfish-wiki/UCI-&-Commands.html
[10] Skeggus. (2020, May 26). Overheat computer GIF - overheat computer explosion . Tenor. https://tenor.com/view/overheat-computer-explosion-boom-spark-gif-17323916
[11] Wikimedia Foundation. (2022, July 24). Cache (computing). Wikipedia. https://simple.wikipedia.org/wiki/Cache_(computing)