Tic Tac Toe

In this tutorial, we will train an agent to play Tic Tac Toe using temporal-difference (TD) learning. The tutorial can serve as supplementary material for Chapter 1 of the book "Reinforcement Learning: An Introduction" (Sutton and Barto). Specifically, the agent is trained against an opponent that plays randomly; a self-play strategy can be implemented in a similar way.

Although Tic Tac Toe (3x3) is quite simple, its state space is reasonably large. A well-known count of the game's complexity goes as follows: "A simple upper bound for the size of the state space is 3^9 = 19,683. (There are three states for each cell and nine cells.) This count includes many illegal positions, such as a position with five crosses and no noughts, or a position in which both players have a row of three. A more careful count, removing these illegal positions, gives 5,478. And when rotations and reflections of positions are considered identical, there are only 765 essentially different positions. A simple upper bound for the size of the game tree is 9! = 362,880. (There are nine positions for the first move, eight for the second, and so on.) This includes illegal games that continue after one side has won. A more careful count gives 255,168 possible games. When rotations and reflections of positions are considered the same, there are only 26,830 possible games."

To make the program extensible, it is essential to use a dynamic state-value table. With a static table, all possible states of the environment must be enumerated in advance; with a dynamic table, a new state is added only when the agent encounters it during training. In Python, a dictionary can be used as a dynamic state-value table. Each square of the board can be in one of three states: empty (0), nought (1), or cross (2), denoted a(j), where a(j) ∈ {0, 1, 2} and j is the index of the j-th square on the board. A state of the board can therefore be encoded as a single integer by treating the nine squares as the digits of a base-3 number:

state = sum over j of a(j) * 3^j,  for j = 0, ..., 8
(Figure: A Tic-Tac-Toe board)
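For illustration, here is a minimal sketch of such a dynamic state-value table in Python, assuming the base-3 encoding above. The names encode_state, get_value, and value_table are illustrative and are not taken from the Fruit-API source.

```python
def encode_state(board):
    """Encode a board (list of 9 cells, each 0=empty, 1=nought, 2=cross)
    as a single integer by treating it as a base-3 number."""
    state = 0
    for j, cell in enumerate(board):
        state += cell * (3 ** j)
    return state

# The dynamic table: states are added only when they are first visited.
value_table = {}

def get_value(board, default=0.5):
    """Look up the value of a board state, inserting it with a default
    estimate if it has not been seen during training yet."""
    state = encode_state(board)
    if state not in value_table:
        value_table[state] = default
    return value_table[state]

# Example: an empty board maps to state 0 and gets the default estimate.
empty_board = [0] * 9
print(encode_state(empty_board))   # 0
print(get_value(empty_board))      # 0.5
```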

The agent is trained against a "random player". Each move of the agent is selected using an epsilon-greedy strategy, and the state-value table is updated after every step. As a reminder, the old state value is moved toward the new one by the following temporal-difference update rule:

V(s) ← V(s) + α [V(s') − V(s)]

where s is the earlier state, s' is the later state, and α is a small positive step-size parameter (the learning rate).
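The sketch below shows how epsilon-greedy move selection and the TD update might be written, reusing the encode_state/get_value helpers and value_table from the previous sketch. The constants EPSILON and ALPHA and the function names are again illustrative, not the actual Fruit-API implementation.

```python
import random

EPSILON = 0.1   # exploration rate (illustrative value)
ALPHA = 0.1     # step size / learning rate (illustrative value)

def choose_move(board, empty_squares, player=1):
    """Pick a move epsilon-greedily: a random empty square with probability
    EPSILON, otherwise the empty square whose resulting state has the
    highest estimated value."""
    if random.random() < EPSILON:
        return random.choice(empty_squares)
    best_move, best_value = None, -float('inf')
    for j in empty_squares:
        next_board = list(board)
        next_board[j] = player
        v = get_value(next_board)          # assumes the helper sketched above
        if v > best_value:
            best_move, best_value = j, v
    return best_move

def td_update(board_before, board_after):
    """Apply the TD update: V(s) <- V(s) + alpha * (V(s') - V(s))."""
    s = encode_state(board_before)
    s_next = encode_state(board_after)
    value_table.setdefault(s, 0.5)
    value_table.setdefault(s_next, 0.5)
    value_table[s] += ALPHA * (value_table[s_next] - value_table[s])
```

In a full training loop, choose_move would be called on the agent's turn, the random opponent would reply, and td_update would then back up the value of the new state to the previous one.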

The full source code is available at https://github.com/garlicdevs/Fruit-API/blob/master/fruit/samples/rl_basics/chapter_1.py

Contact us at hello@fruitlab.org or join our community at https://www.facebook.com/groups/fruitlab/ to post any questions.