<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Harsh Agnihotri</title>
    <description>The latest articles on DEV Community by Harsh Agnihotri (@harsh_agnihotri_b7c430636).</description>
    <link>https://dev.to/harsh_agnihotri_b7c430636</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3303784%2F0494374a-65ce-4599-beac-68e58c740aa2.jpg</url>
      <title>DEV Community: Harsh Agnihotri</title>
      <link>https://dev.to/harsh_agnihotri_b7c430636</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/harsh_agnihotri_b7c430636"/>
    <language>en</language>
    <item>
      <title>Reinforcement Learning / Q Learning Basics with Tic Tac Toe</title>
      <dc:creator>Harsh Agnihotri</dc:creator>
      <pubDate>Sat, 11 Apr 2026 08:26:35 +0000</pubDate>
      <link>https://dev.to/harsh_agnihotri_b7c430636/reinforcement-learning-q-learning-basics-with-tic-tac-toe-5f01</link>
      <guid>https://dev.to/harsh_agnihotri_b7c430636/reinforcement-learning-q-learning-basics-with-tic-tac-toe-5f01</guid>
      <description>&lt;p&gt;Hi Fam, on my journey of learning AI &amp;amp; ML, since I am too dumb to just make "AI Learns to walk" out of nowhere, I had to start somewhere basic, so this is my 1st project on Reinforcement Learning. I hope to teach Reinforcement Learning basics and the Math Algorithm that I used to achieve this&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The only library I used is &lt;strong&gt;Numpy&lt;/strong&gt;; no abstract, pre-built implementation of the algorithm.&lt;br&gt;
Code: &lt;a href="https://github.com/HarshAg90/Tic-Tac-Toe-RL" rel="noopener noreferrer"&gt;Github&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Reinforcement Learning (RL) Basics
&lt;/h2&gt;

&lt;p&gt;We then train the agent in a loop, where it learns by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;performing a random action&lt;/li&gt;
&lt;li&gt;observing what that action does to the environment (we assign a reward for whatever happened)&lt;/li&gt;
&lt;li&gt;updating the Q table with the outcome&lt;/li&gt;
&lt;li&gt;taking the next step&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Over time, the agent keeps reducing the randomness and relies more and more on the Q table for the best course of action.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We start with a table of all possible actions, give them all the same value, and call it the &lt;strong&gt;✨Q table✨&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So basically, in each cycle the agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Takes an &lt;strong&gt;action&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Receives a &lt;strong&gt;reward&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Updates its knowledge to make better decisions next time&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Goal: Maximise reward from each cycle over time&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Core RL Terms
&lt;/h3&gt;

&lt;p&gt;For Tic Tac Toe, we will be working with:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Environment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The game (board + rules)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State (s)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Current board state + move history&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Action (a)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Placing mark in a cell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Reward (r)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+1 win, -1 loss, 0 otherwise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Policy (π)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Strategy to pick actions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
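
&lt;p&gt;To make these concrete, here is one hypothetical move mapped onto those terms (assuming cells are indexed 0–8, left to right, top to bottom, and ignoring the move-history part of the state for a moment):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;state  = "X O      "   # board flattened to a 9-character string
action = 4             # place a mark in the centre cell
reward = 0             # game is not over yet, so no reward this step
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;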




&lt;h3&gt;
  
  
  Q-Learning: a value-based Reinforcement Learning algorithm
&lt;/h3&gt;

&lt;p&gt;We start with: Q(s, a)&lt;/p&gt;

&lt;p&gt;which signifies:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"How good is it to take action 'a ' in state 's'?"&lt;br&gt;
or &lt;br&gt;
Reward 'q' for action 'a' in state 's'&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  ✨ The Q-Learning Formula ✨
&lt;/h3&gt;

&lt;p&gt;I promise it's not that hard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Q(s,a) = Q(s,a) + \alpha \left( r + \gamma \max Q(s',a') - Q(s,a) \right)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Q(s,a)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Current estimate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;α (alpha)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Learning rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;r&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reward received&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;γ (gamma)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Discount factor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max Q(s',a')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Best future reward&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;Simpler Explanation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Q(s,a): the reward currently stored in the Q table for action 'a' at state 's'&lt;/li&gt;
&lt;li&gt;α: the rate at which each new experience changes the Q table&lt;/li&gt;
&lt;li&gt;r: the reward received for the action (obviously)&lt;/li&gt;
&lt;li&gt;γ (gamma): how much you care about the rewards that later actions (after this one) can bring&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;So we update Q table using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;New Value = Old Value + Learning Rate × (Target - Old Value)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where: Target = reward + discounted best future reward (γ × max Q(s',a'))&lt;/p&gt;
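
&lt;p&gt;A quick worked example with made-up numbers: say α = 0.1, γ = 0.9, and the current Q(s,a) = 0.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Case 1: the move wins the game (terminal), so r = +1
Target = 1
New Q(s,a) = 0 + 0.1 × (1 − 0) = 0.1

# Case 2: an ordinary move, r = 0, best next value max Q(s',a') = 0.5
Target = 0 + 0.9 × 0.5 = 0.45
New Q(s,a) = 0 + 0.1 × (0.45 − 0) = 0.045
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Repeated over thousands of games, these small nudges propagate win/loss signals backwards through the game tree.&lt;/p&gt;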




&lt;h3&gt;
  
  
  Exploration &amp;amp; Exploitation - Stages of learning
&lt;/h3&gt;

&lt;p&gt;To learn effectively, the agent must balance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exploration&lt;/strong&gt; → Try random moves&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exploitation&lt;/strong&gt; → Use best known move&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this, we use the &lt;strong&gt;epsilon-greedy strategy&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;epsilon&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;moves&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# explore
&lt;/span&gt;&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;best_known_action&lt;/span&gt;     &lt;span class="c1"&gt;# exploit
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Over time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;epsilon&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epsilon&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;decay&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Here, decay gradually shrinks epsilon over time, so the agent becomes less random and relies more on what it has learned.&lt;/p&gt;
&lt;/blockquote&gt;
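
&lt;p&gt;For instance, starting from epsilon = 1.0 with decay = 0.999 (illustrative values, not necessarily what the repo uses), exploration fades like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;episode    0: epsilon = 1.000   # fully random
episode 1000: epsilon ≈ 0.368   # random about a third of the time
episode 3000: epsilon ≈ 0.050   # floor reached: 5% exploration from here on
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;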

&lt;h3&gt;
  
  
  State Representation
&lt;/h3&gt;

&lt;p&gt;We encode the game state as a string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;encode_state&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;board_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;board&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;player_moves&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ai_moves&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;board_state&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Board position is captured&lt;/li&gt;
&lt;li&gt;Sliding move history is preserved&lt;/li&gt;
&lt;/ul&gt;
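
&lt;p&gt;For example, assuming the board is a list of nine single-character cells and the move lists hold cell indices (my reading of the code above), a mid-game state could encode as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;board        = ["X", " ", "O", " ", "X", " ", " ", " ", "O"]
player_moves = [0, 4]
ai_moves     = [2, 8]

encode_state()  →  "X O X   O|0,4|2,8"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;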

&lt;h3&gt;
  
  
  Q-Table Implementation - Code Part
&lt;/h3&gt;

&lt;p&gt;We store learned values in a dictionary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_q&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="c1"&gt;# initialisation of Q table
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each state maps to: 9 possible actions → 9 Q-values&lt;/p&gt;
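
&lt;p&gt;On the first visit to a state, &lt;code&gt;get_q&lt;/code&gt; lazily creates its row of nine zeros. Using the hypothetical state from earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;q = agent.get_q("X O X   O|0,4|2,8")
# first visit → array([0., 0., 0., 0., 0., 0., 0., 0., 0.])
# later visits return whatever values have been learned for the 9 cells
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;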

&lt;h3&gt;
  
  
  Action Selection
&lt;/h3&gt;

&lt;p&gt;Choosing the best action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;choose_action&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;moves&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epsilon&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;epsilon&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;moves&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;q_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_q&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;moves&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;q_values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;Q&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt; &lt;span class="nc"&gt;Update &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Core&lt;/span&gt; &lt;span class="n"&gt;Learning&lt;/span&gt; &lt;span class="n"&gt;Step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;q_values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_q&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;next_q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_q&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reward&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;gamma&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next_q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;q_values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;q_values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the exact implementation of the Q-learning formula.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training Loop
&lt;/h2&gt;

&lt;p&gt;The agent learns through self-play:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;episode&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;episodes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

        &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode_state&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;moves&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;available_moves&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choose_action&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;moves&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epsilon&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;reward&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;done&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;next_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode_state&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Update Q-table
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
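
&lt;p&gt;Putting it all together: here is a condensed sketch of a full episode loop, assembling the snippets above. The hyperparameter values are illustrative, and I'm assuming &lt;code&gt;env.step&lt;/code&gt; returns &lt;code&gt;(reward, done)&lt;/code&gt; as in the snippet:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

alpha, gamma = 0.1, 0.9        # learning rate and discount (illustrative)
epsilon, decay = 1.0, 0.999    # exploration schedule (illustrative)
episodes = 50_000              # illustrative episode count

for episode in range(episodes):
    env.reset()
    done = False

    while not done:
        state = env.encode_state()
        moves = env.available_moves()
        action = agent.choose_action(state, moves, epsilon)

        reward, done = env.step(action)
        next_state = env.encode_state()

        # the core learning step from above
        q_values = agent.get_q(state)
        next_q = agent.get_q(next_state)
        target = reward if done else reward + gamma * np.max(next_q)
        q_values[action] += alpha * (target - q_values[action])

    # explore less as training progresses
    epsilon = max(0.05, epsilon * decay)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;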



&lt;h2&gt;
  
  
  Reward Design
&lt;/h2&gt;

&lt;p&gt;Reward shaping is critical:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Game State&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Win&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;+1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Loss&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;-1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Invalid Move&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;-0.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Step&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This guides the agent toward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Winning&lt;/li&gt;
&lt;li&gt;Avoiding illegal moves&lt;/li&gt;
&lt;li&gt;Playing efficiently&lt;/li&gt;
&lt;/ul&gt;
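
&lt;p&gt;A minimal sketch of how &lt;code&gt;env.step&lt;/code&gt; could hand out these rewards (the helper &lt;code&gt;check_winner&lt;/code&gt; and the mark characters are my assumptions, not necessarily the repo's code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def step(self, action):
    # hypothetical sketch of reward assignment
    if self.board[action] != " ":
        return -0.2, False          # invalid move: small penalty
    self.board[action] = "O"        # the agent places its mark
    winner = self.check_winner()    # assumed helper: "X", "O", or None
    if winner == "O":
        return 1, True              # win
    if winner == "X":
        return -1, True             # loss
    if " " not in self.board:
        return 0, True              # draw ends the game with no reward
    return 0, False                 # ordinary step
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;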

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;RL learns through trial and error&lt;/li&gt;
&lt;li&gt;Q-learning builds a lookup table of experience&lt;/li&gt;
&lt;li&gt;No dataset is required — learning happens via self-play&lt;/li&gt;
&lt;li&gt;Even simple algorithms can produce strong game-playing agents&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  What's Next?
&lt;/h4&gt;

&lt;p&gt;This project uses Tabular Q-Learning, which works because the state space is small.&lt;/p&gt;
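
&lt;p&gt;For scale: each of the 9 cells is empty, X, or O, so there are at most 3^9 = 19,683 raw board configurations (and far fewer are actually reachable in legal play), so a lookup table fits comfortably in memory.&lt;/p&gt;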

&lt;p&gt;To go further:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deep Q Learning (Neural Networks)&lt;/li&gt;
&lt;li&gt;Experience Replay&lt;/li&gt;
&lt;li&gt;Policy Gradient Methods&lt;/li&gt;
&lt;li&gt;Multi-agent training systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project helped me understand that AI is not magic or just abstract libraries — it's just math, iteration, and a lot of trial and error.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>beginners</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
  </channel>
</rss>
