Harsh Agnihotri
Reinforcement Learning / Q Learning Basics with Tic Tac Toe

Hi fam! On my journey of learning AI and ML, since I am too dumb to just make "AI Learns to Walk" out of nowhere, I had to start somewhere basic. So this is my first project on Reinforcement Learning, and in it I hope to teach the basics of Reinforcement Learning and the math behind the algorithm I used.

The only library I used is NumPy, with no abstracted implementation of the algorithm.
Code: GitHub

Reinforcement Learning (RL) Basics

We start with a table that holds a value for every possible action, give all entries the same starting value, and call it the ✨Q table✨.

Then we start training the agent. The agent learns by:

  1. Performing a (partly random) action
  2. Observing what that does to the environment (we decide a reward for whatever happened)
  3. Updating the Q table with the outcome
  4. Taking the next step

Over time, the agent keeps reducing the randomness and relies more on the Q table for the best course of action.

So basically, in each cycle the agent:

  • Takes an action
  • Receives a reward
  • Updates its knowledge to make better decisions next time

Goal: maximise the reward from each cycle over time.
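The cycle above can be sketched on something even simpler than Tic Tac Toe. Here is a hypothetical two-armed bandit (one state, two actions) where the same act → reward → update loop makes the table converge on the better action; all names and numbers here are illustrative, not from the repo:

```python
import random

# Minimal sketch of the act -> reward -> update cycle on a
# hypothetical two-armed bandit (not the Tic Tac Toe environment).
q = [0.0, 0.0]          # one Q-value per action, all start equal
alpha, epsilon = 0.1, 0.2

def reward_for(action):
    # action 1 is secretly the better arm
    return 1.0 if action == 1 else 0.0

random.seed(0)
for _ in range(500):
    # 1. take an action (random with probability epsilon)
    if random.random() < epsilon:
        action = random.randrange(2)
    else:
        action = max(range(2), key=lambda a: q[a])
    # 2. receive a reward
    r = reward_for(action)
    # 3. update the table toward the observed reward
    q[action] += alpha * (r - q[action])

print(q)  # q[1] climbs toward 1.0, q[0] stays at 0.0
```

After a few hundred cycles the table alone is enough to pick the best arm, which is exactly the "rely more on the Q table over time" idea.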


Core RL Terms

For Tic Tac Toe, we will be working with:

| Concept | Meaning |
| --- | --- |
| Agent | The AI |
| Environment | The game (board + rules) |
| State (s) | Current board state + move history |
| Action (a) | Placing a mark in a cell |
| Reward (r) | +1 win, -1 loss, 0 otherwise |
| Policy (π) | Strategy to pick actions |

Q-Learning is a value-based Reinforcement Learning algorithm.

We start with: Q(s, a)

which signifies:

"How good is it to take action 'a' in state 's'?"
or
the expected reward 'q' for taking action 'a' in state 's'.

✨ The Q-Learning Formula ✨

I promise it's not that hard:

Q(s,a) = Q(s,a) + α × ( r + γ × max Q(s',a') - Q(s,a) )
| Term | Meaning |
| --- | --- |
| Q(s,a) | Current estimate |
| α (alpha) | Learning rate |
| r | Reward received |
| γ (gamma) | Discount factor |
| max Q(s',a') | Best future reward |

Simpler Explanation

  • Q(s,a): the reward currently stored in the Q table for action 'a' at state 's'
  • α: the rate at which each new experience changes the Q table
  • r: the reward received for the action (obviously)
  • γ (gamma): how much you care about what the next action (after this one) earns

So we update Q table using:

New Value = Old Value + Learning Rate × (Target - Old Value)

Where: Target = reward + γ × best future value
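To make the update concrete, here is one hand-computed step with made-up numbers (α = 0.1, γ = 0.9, a current estimate of 0.5, and a best next-state value of 0.8):

```python
alpha, gamma = 0.1, 0.9

q_sa = 0.5         # current estimate Q(s, a)
r = 0.0            # reward for this step
max_next = 0.8     # best Q-value in the next state, max Q(s', a')

# Target = reward + discounted best future value
target = r + gamma * max_next              # 0.72
# Nudge the old value toward the target by the learning rate
q_sa = q_sa + alpha * (target - q_sa)
print(q_sa)  # 0.5 + 0.1 * (0.72 - 0.5) ≈ 0.522
```

Note that the estimate moves only a tenth of the way toward the target: a small α keeps one noisy game from overwriting everything learned so far.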


Exploration & Exploitation - Stages of learning

To learn effectively, the agent must balance:

  • Exploration → Try random moves
  • Exploitation → Use best known move

For this, we use the epsilon-greedy strategy:

if random.random() < epsilon:
    action = random.choice(moves)  # explore
else:
    action = best_known_action     # exploit

Over time:

epsilon = max(0.05, epsilon * decay)

Here, decay shrinks epsilon a little each episode, so the agent gradually becomes less random and relies more on what it has learned. The max keeps a 5% exploration floor so it never stops exploring entirely.
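To see what schedule this gives, here is the decay line run for 1000 episodes with an illustrative decay of 0.995 (the actual value in the repo may differ):

```python
epsilon, decay = 1.0, 0.995

floor_episode = None
for episode in range(1000):
    epsilon = max(0.05, epsilon * decay)
    if floor_episode is None and epsilon == 0.05:
        floor_episode = episode  # first episode pinned at the floor

print(epsilon)        # pinned at 0.05
print(floor_episode)  # roughly episode 600 for decay = 0.995
```

So with these numbers the agent spends the first few hundred episodes mostly exploring, then settles into mostly exploiting for the rest of training.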

State Representation

We encode the game state as a string:

def encode_state(self):
    board_state = "".join(self.board)
    p = ",".join(map(str, self.player_moves))
    a = ",".join(map(str, self.ai_moves))
    return f"{board_state}|{p}|{a}"

This ensures:

  • Board position is captured
  • Sliding move history is preserved
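A standalone version of the encoder shows what these dictionary keys actually look like. The board layout and move lists below are illustrative:

```python
def encode_state(board, player_moves, ai_moves):
    # board: list of 9 cells, each "X", "O", or " " for empty
    board_state = "".join(board)
    p = ",".join(map(str, player_moves))
    a = ",".join(map(str, ai_moves))
    return f"{board_state}|{p}|{a}"

board = ["X", " ", "O",
         " ", "X", " ",
         " ", " ", "O"]
key = encode_state(board, player_moves=[0, 4], ai_moves=[2, 8])
print(key)  # -> 'X O X   O|0,4|2,8'
```

Because the move history is part of the key, the same board reached through different move orders maps to different table entries, which is exactly what keeps the sliding history preserved.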

Q-Table Implementation - Code Part

We store learned values in a dictionary:

self.q = {}

def get_q(self, state):
    if state not in self.q:   # initialisation of Q table
        self.q[state] = np.zeros(9)
    return self.q[state]

Each state maps to: 9 possible actions → 9 Q-values

Action Selection

Choosing the best action:

def choose_action(self, state, moves, epsilon):

    if random.random() < epsilon:
        return random.choice(moves)

    q_values = self.get_q(state)

    return max(moves, key=lambda m: q_values[m])

Q-Value Update (Core Learning Step)

q_values = agent.get_q(state)
next_q = agent.get_q(next_state)

if done:
    target = reward
else:
    target = reward + gamma * np.max(next_q)

q_values[action] += alpha * (target - q_values[action])

This is the exact implementation of the Q-learning formula.
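As a sanity check that the update actually converges, here is the same rule run on a toy two-state chain (a hypothetical environment, not Tic Tac Toe): state 0 always leads to state 1, and state 1 ends the episode with reward 1.

```python
import numpy as np

alpha, gamma = 0.5, 0.9
# One action per state, Q-values start at zero
q = {0: np.zeros(1), 1: np.zeros(1)}

for _ in range(100):
    # state 0 -> state 1: reward 0, not done
    target = 0 + gamma * np.max(q[1])
    q[0][0] += alpha * (target - q[0][0])
    # state 1 -> terminal: reward 1, done, so target is just the reward
    q[1][0] += alpha * (1 - q[1][0])

print(q[1][0], q[0][0])  # converge to 1.0 and gamma * 1.0 = 0.9
```

The terminal state's value converges to the raw reward, and the earlier state's value converges to that reward discounted once by γ, which is the behaviour the formula promises.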

Training Loop

The agent learns through self-play:

for episode in range(episodes):

    env.reset()
    done = False

    while not done:

        state = env.encode_state()
        moves = env.available_moves()

        action = agent.choose_action(state, moves, epsilon)

        reward, done = env.step(action)

        next_state = env.encode_state()

        # Update Q-table (the core learning step from above)
        q_values = agent.get_q(state)
        next_q = agent.get_q(next_state)
        target = reward if done else reward + gamma * np.max(next_q)
        q_values[action] += alpha * (target - q_values[action])

Reward Design

Reward shaping is critical:

| Game State | Reward |
| --- | --- |
| Win | +1 |
| Loss | -1 |
| Invalid move | -0.2 |
| Step | 0 |

This guides the agent toward:

  • Winning
  • Avoiding illegal moves
  • Playing efficiently
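The reward table can be folded into one small function; this is a sketch, and the repo may structure it differently (the outcome labels here are made up):

```python
def reward_for(outcome):
    # outcome: one of "win", "loss", "invalid", "step"
    rewards = {
        "win": 1.0,      # reinforce winning lines of play
        "loss": -1.0,    # punish losing lines of play
        "invalid": -0.2, # discourage illegal moves without dominating
        "step": 0.0,     # neutral intermediate moves
    }
    return rewards[outcome]

print(reward_for("win"), reward_for("invalid"))  # 1.0 -0.2
```

Keeping the invalid-move penalty small relative to a loss matters: the agent should learn "don't play occupied cells" without that signal drowning out "don't lose the game".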

Summary

  • RL learns through trial and error
  • Q-learning builds a lookup table of experience
  • No dataset is required — learning happens via self-play
  • Even simple algorithms can produce strong game-playing agents

What's Next?

This project uses Tabular Q-Learning, which works because the state space is small.

To go further:

  • Deep Q Learning (Neural Networks)
  • Experience Replay
  • Policy Gradient Methods
  • Multi-agent training systems

This project helped me understand that AI is not magic or just abstract libraries — it's just math, iteration, and a lot of trial and error.

