Hi fam! On my journey of learning AI & ML, since I'm too dumb to just make "AI learns to walk" out of nowhere, I had to start somewhere basic. So this is my first project on Reinforcement Learning, and I hope to teach the basics of RL and the math behind the algorithm I used to build it.
The only library I used is NumPy, with no abstracted implementation of the algorithm.
Code: Github
Reinforcement Learning (RL) Basics
We start with a table of all possible actions, give them all the same value, and call it the ✨Q table✨.
Then we start training the agent, which learns by:
- performing random actions
- seeing what that does to the environment (we decide a reward for whatever happened)
- updating the Q table with the outcome
- taking the next step
Over time, the agent keeps reducing the randomness and relies more on the Q table for the best course of action.
So basically, in each cycle the agent:
- takes an action
- receives a reward
- updates its knowledge to make better decisions next time
Goal: maximise the reward from each cycle over time.
Core RL Terms
For Tic Tac Toe, we will be working with:
| Concept | Meaning |
|---|---|
| Agent | The AI |
| Environment | The game (board + rules) |
| State (s) | Current board state + move history |
| Action (a) | Placing mark in a cell |
| Reward (r) | +1 win, -1 loss, 0 otherwise |
| Policy (π) | Strategy to pick actions |
Q-Learning: a value-based Reinforcement Learning algorithm.
We start with: Q(s, a)
which signifies:
"How good is it to take action 'a' in state 's'?"
or
the reward 'q' for action 'a' in state 's'.
✨ The Q-Learning Formula ✨
I promise it's not that hard:
Q(s,a) = Q(s,a) + \alpha \left( r + \gamma \max Q(s',a') - Q(s,a) \right)
| Term | Meaning |
|---|---|
| Q(s,a) | Current estimate |
| α (alpha) | Learning rate |
| r | Reward received |
| γ (gamma) | Discount factor |
| max Q(s',a') | Best future reward |
Simpler Explanation
- Q(s,a): the reward currently stored in the Q table for action 'a' at state 's'
- α: the rate at which each new experience changes the Q table
- r: the reward received for the action (obviously)
- γ (gamma): how much you care about what the actions after this one are worth
So we update the Q table using:
New Value = Old Value + Learning Rate × (Target − Old Value)
Where: Target = Reward + Discount × Best Future Q-Value
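To make the update concrete, here is one step of the formula with hypothetical numbers (the specific values of α, γ, the reward, and the Q-values are just made up for illustration):

```python
# One hypothetical update step of the Q-learning formula.
alpha = 0.1        # learning rate
gamma = 0.9        # discount factor
reward = 0         # non-terminal step, so no immediate reward
old_q = 0.5        # current Q(s, a)
best_next_q = 1.0  # max Q(s', a') over the next state's actions

target = reward + gamma * best_next_q     # 0 + 0.9 * 1.0 = 0.9
new_q = old_q + alpha * (target - old_q)  # 0.5 + 0.1 * 0.4 = 0.54
print(new_q)  # 0.54
```

Notice the estimate only moves 10% of the way toward the target: that is the learning rate smoothing out noisy single experiences.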
Exploration & Exploitation - Stages of learning
To learn effectively, the agent must balance:
- Exploration → Try random moves
- Exploitation → Use best known move
For this, we use the epsilon-greedy strategy:

```python
if random.random() < epsilon:
    action = random.choice(moves)  # explore
else:
    action = best_known_action     # exploit
```
Over time:

```python
epsilon = max(0.05, epsilon * decay)
```
Here, decay shrinks epsilon a little each episode, so the agent gradually becomes less random and more reliant on the Q table; the floor of 0.05 keeps a small amount of exploration alive forever.
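A quick sketch of how this plays out over training (the starting epsilon of 1.0, decay of 0.99, and 500 episodes are assumed values, not taken from the project):

```python
# Epsilon shrinking each episode, floored at 0.05.
epsilon, decay = 1.0, 0.99
history = []
for episode in range(500):
    history.append(epsilon)
    epsilon = max(0.05, epsilon * decay)

print(history[0])  # 1.0  -- fully random at the start
print(epsilon)     # 0.05 -- hit the floor, mostly exploiting now
```

With these numbers, epsilon reaches the 0.05 floor around episode 300 and stays there.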
State Representation
We encode the game state as a string:
```python
def encode_state(self):
    board_state = "".join(self.board)
    p = ",".join(map(str, self.player_moves))
    a = ",".join(map(str, self.ai_moves))
    return f"{board_state}|{p}|{a}"
```
This ensures:
- Board position is captured
- Sliding move history is preserved
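To see what such a key looks like, here is a standalone version of the same encoding. I'm assuming the board is a list of 9 single-character cells and that the move lists hold cell indices; the sample position is made up:

```python
# Standalone sketch of the state encoding (assumed board representation:
# a list of 9 cells holding "X", "O", or " ", plus lists of cell indices).
def encode_state(board, player_moves, ai_moves):
    board_state = "".join(board)
    p = ",".join(map(str, player_moves))
    a = ",".join(map(str, ai_moves))
    return f"{board_state}|{p}|{a}"

board = ["X", " ", "O",
         " ", "X", " ",
         " ", " ", "O"]
state = encode_state(board, player_moves=[0, 4], ai_moves=[2, 8])
print(state)  # "X O X   O|0,4|2,8"
```

Two positions with the same board but different move histories get different keys, which is exactly why the history is part of the state.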
Q-Table Implementation - Code Part
We store learned values in a dictionary:
```python
self.q = {}

def get_q(self, state):
    if state not in self.q:  # lazy initialisation of the Q table
        self.q[state] = np.zeros(9)
    return self.q[state]
```
Each state maps to: 9 possible actions → 9 Q-values
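A nice property of this lazy dictionary is that only states the agent actually visits ever get stored. A minimal standalone demo (module-level rather than a class method, just for illustration):

```python
import numpy as np

# Lazy Q-table: unseen states get 9 zeroed Q-values on first access.
q = {}

def get_q(state):
    if state not in q:
        q[state] = np.zeros(9)
    return q[state]

values = get_q("         ||")  # first visit creates the entry
print(values.shape)  # (9,)
print(len(q))        # 1 -- only visited states are stored
```

Subsequent calls with the same key return the same array, so updates written into it persist.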
Action Selection
Choosing the best action:
```python
def choose_action(self, state, moves, epsilon):
    if random.random() < epsilon:
        return random.choice(moves)  # explore
    q_values = self.get_q(state)
    return max(moves, key=lambda m: q_values[m])  # exploit
```
Q-Value Update (Core Learning Step)
```python
q_values = agent.get_q(state)
next_q = agent.get_q(next_state)

if done:
    target = reward
else:
    target = reward + gamma * np.max(next_q)

q_values[action] += alpha * (target - q_values[action])
```
This is the exact implementation of the Q-learning formula.
Training Loop
The agent learns through self-play:
```python
for episode in range(episodes):
    env.reset()
    done = False
    while not done:
        state = env.encode_state()
        moves = env.available_moves()
        action = agent.choose_action(state, moves, epsilon)
        reward, done = env.step(action)
        next_state = env.encode_state()
        # Update Q-table
```
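To show how all the pieces click together, here is a runnable end-to-end sketch on a deliberately tiny stand-in environment, not the real Tic Tac Toe one: a single state with 9 actions where only action 4 pays off. All the hyperparameters here are assumed values for the demo:

```python
import random
import numpy as np

# Toy stand-in environment (NOT the real Tic Tac Toe env): one state,
# 9 actions, the episode ends after one move, and only action 4 rewards.
class ToyEnv:
    def reset(self):
        pass
    def encode_state(self):
        return "only-state"
    def available_moves(self):
        return list(range(9))
    def step(self, action):
        return (1.0 if action == 4 else 0.0), True  # (reward, done)

q = {}
def get_q(state):
    if state not in q:
        q[state] = np.zeros(9)
    return q[state]

alpha, gamma, epsilon = 0.5, 0.9, 0.3
env = ToyEnv()
random.seed(0)

for episode in range(500):
    env.reset()
    done = False
    while not done:
        state = env.encode_state()
        moves = env.available_moves()
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.choice(moves)
        else:
            action = int(np.argmax(get_q(state)))
        reward, done = env.step(action)
        next_state = env.encode_state()
        # the Q-learning update from the previous section
        q_values = get_q(state)
        target = reward if done else reward + gamma * np.max(get_q(next_state))
        q_values[action] += alpha * (target - q_values[action])

print(int(np.argmax(q["only-state"])))  # 4 -- the rewarding move was learned
```

Exploration is what first stumbles onto action 4; exploitation then reinforces it until its Q-value dominates the row.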
Reward Design
Reward shaping is critical:
| Game State | Value |
|---|---|
| Win | +1 |
| Loss | -1 |
| Invalid Move | -0.2 |
| Step | 0 |
This guides the agent toward:
- Winning
- Avoiding illegal moves
- Playing efficiently
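The table above can be sketched as a simple lookup function; the numbers come straight from the table, but the outcome labels are my own naming, not identifiers from the project:

```python
# Reward table as a function (outcome labels are hypothetical names).
def reward_for(outcome):
    return {
        "win": 1.0,
        "loss": -1.0,
        "invalid_move": -0.2,
        "step": 0.0,
    }[outcome]

print(reward_for("invalid_move"))  # -0.2
```

The small negative penalty for invalid moves is enough for the agent to learn to avoid them without drowning out the win/loss signal.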
Summary
- RL learns through trial and error
- Q-learning builds a lookup table of experience
- No dataset is required — learning happens via self-play
- Even simple algorithms can produce strong game-playing agents
What's Next?
This project uses Tabular Q-Learning, which works because the state space is small.
To go further:
- Deep Q Learning (Neural Networks)
- Experience Replay
- Policy Gradient Methods
- Multi-agent training systems
This project helped me understand that AI is not magic or just abstract libraries — it's just math, iteration, and a lot of trial and error.