AI Driven Snake Game using Deep Q Learning

Ref: https://www.geeksforgeeks.org/ai-driven-snake-game-using-deep-q-learning/
Introduction: This project uses Reinforcement Learning to train a snake (in the Snake game) to eat the food that appears in the environment.

A sample GIF shows what we are going to build.

The prerequisites for this project are:

Reinforcement Learning
Deep Learning (Dense Neural Network)
Pygame

How to build a 2D Snake game using pygame
link : https://www.geeksforgeeks.org/snake-game-in-python-using-pygame-module/
With the Snake game built, the focus now shifts to how to apply Reinforcement Learning to it.

The 3 modules we need to create in this project

1. The Environment (the game itself)
2. The Model (the reinforcement model that predicts moves)
3. The Agent (the intermediary between the Environment and the Model)

Algorithm:

We have a snake and food that is placed at random.

The snake's state is computed from 11 values; each value is 1 if its condition is true and 0 otherwise.

 1. Danger straight ahead
 2. Danger to the right
 3. Danger to the left
 4. Moving left
 5. Moving right
 6. Moving up
 7. Moving down
 8. Food is to the left
 9. Food is to the right
 10. Food is above
 11. Food is below

An example of the state calculation; all values are computed relative to the snake's head.
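As a worked example of the 11-value state, here is a minimal, dependency-free sketch. The board size, the helper names (`encode_state`, `is_collision`, `STEP`, `CLOCKWISE`) and the sample positions are assumptions made for illustration; they are not taken from the project's code. Coordinates are in grid cells, with y growing downward as in pygame.

```python
from collections import namedtuple
from enum import Enum

Point = namedtuple("Point", "x y")


class Direction(Enum):
    RIGHT = 1
    LEFT = 2
    UP = 3
    DOWN = 4


GRID_W, GRID_H = 10, 10  # hypothetical board size, in cells

# Unit step for each heading; y grows downward, as in pygame.
STEP = {
    Direction.RIGHT: (1, 0),
    Direction.DOWN: (0, 1),
    Direction.LEFT: (-1, 0),
    Direction.UP: (0, -1),
}
# Headings in clockwise order, so "turn right" is the next entry.
CLOCKWISE = [Direction.RIGHT, Direction.DOWN, Direction.LEFT, Direction.UP]


def is_collision(pt, body):
    """True if pt lies outside the board or on the snake's body."""
    outside = pt.x < 0 or pt.x >= GRID_W or pt.y < 0 or pt.y >= GRID_H
    return outside or pt in body


def encode_state(head, body, direction, food):
    """Return the 11 flags (as 0/1) in the order listed above."""
    i = CLOCKWISE.index(direction)
    right = CLOCKWISE[(i + 1) % 4]
    left = CLOCKWISE[(i - 1) % 4]

    def danger(d):
        dx, dy = STEP[d]
        return is_collision(Point(head.x + dx, head.y + dy), body)

    flags = [
        danger(direction), danger(right), danger(left),
        direction == Direction.LEFT, direction == Direction.RIGHT,
        direction == Direction.UP, direction == Direction.DOWN,
        food.x < head.x, food.x > head.x,
        food.y < head.y, food.y > head.y,
    ]
    return [int(f) for f in flags]


# Snake heading right with its head against the right wall; food is up and to the left.
state = encode_state(
    head=Point(9, 5),
    body=[Point(8, 5), Point(7, 5)],
    direction=Direction.RIGHT,
    food=Point(3, 2),
)
print(state)  # [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0]
```

Only "danger straight", "moving right", "food left" and "food above" are set, which matches the picture of a snake about to run into the right wall.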

Once each state value has been computed, the agent passes them to the model to carry out the next step.
Next comes the reward calculation:
- Eat food : +10
- Game Over : -10
- Otherwise : 0
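The reward scheme above can be sketched as a small function. The name `compute_reward` and its two boolean parameters are hypothetical; in the actual game the reward would be produced inside the environment's step logic.

```python
def compute_reward(ate_food, game_over):
    """Map the outcome of one move to the reward values listed above."""
    if game_over:
        return -10  # collision with a wall or the snake's own body
    if ate_food:
        return 10
    return 0  # any other move


print(compute_reward(ate_food=True, game_over=False))   # 10
print(compute_reward(ate_food=False, game_over=True))   # -10
print(compute_reward(ate_food=False, game_over=False))  # 0
```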
Update the Q value (discussed later) and train the model.
Having analyzed the algorithm, we now need to work out how to implement it in code.

The Model:

How does the model work?

When the game starts, the Q values are initialized randomly.
The system receives the current state s.
Based on s, the system executes an action, chosen either at random or by the neural network. Early in training the system often picks random actions to maximize exploration; later on it relies more and more on the neural network.
Once the AI chooses and performs an action, the environment gives a reward. The agent then reaches the new state and updates the Q value according to the Bellman equation.

In addition, for each move it stores the original state, the action, the state reached after performing that action, the reward obtained, and whether the game ended. This data is later sampled to train the neural network. This operation is called Replay Memory.
These last two operations are repeated until a given condition is met (for example: the game ends).
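A replay memory of this kind can be sketched with a bounded deque. The class name `ReplayMemory`, the `Transition` tuple and the capacity used below are illustrative assumptions, not the project's actual implementation.

```python
import random
from collections import deque, namedtuple

# One stored move: everything needed to compute a Q-learning target later.
Transition = namedtuple("Transition", "state action reward next_state done")


class ReplayMemory:
    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once it is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sample without replacement; use everything if fewer items are stored.
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.push(state=step, action=[1, 0, 0], reward=0, next_state=step + 1, done=False)

print(len(memory))             # 3
print(len(memory.sample(10)))  # 3
```

Sampling at random rather than replaying moves in order reduces the correlation between consecutive training examples.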

The heart of this project is the model we are about to train, because how correctly the snake moves depends on the quality of the model we build.

Part-I

Create a class named Linear_QNet to initialize the linear neural network.
The forward function takes the input (the 11-value state vector), passes it through the neural network with the ReLU activation function, and returns the output, i.e. the next move as a vector of size 1 x 3. (This is the prediction function the agent calls.)
The save function saves the trained model for future use.

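import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Linear_QNet(nn.Module):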
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = F.relu(self.linear1(x))
        x = self.linear2(x)
        return x

    def save(self, file_name='model_name.pth'):
        model_folder_path = 'Path'
        file_name = os.path.join(model_folder_path, file_name)
        torch.save(self.state_dict(), file_name)
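
Written out as equations, the forward pass of this two-layer network is as follows. The symbols are introduced here for illustration: s is the 11-value state vector, H stands for hidden_size (whose value the text does not fix), and the W and b terms are the weights and biases of linear1 and linear2.

```latex
\begin{aligned}
h &= \operatorname{ReLU}(W_1 s + b_1), & W_1 &\in \mathbb{R}^{H \times 11},\; b_1 \in \mathbb{R}^{H} \\
Q(s, \cdot) &= W_2 h + b_2, & W_2 &\in \mathbb{R}^{3 \times H},\; b_2 \in \mathbb{R}^{3}
\end{aligned}
```

The output has one Q value per possible move, and no activation is applied to it, so the values can be negative as well as positive.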

Part-II

Create the QTrainer class
- Set the learning rate for the optimizer
- Gamma is the discount rate used in the Bellman equation
- Create the Adam optimizer to update the weights and biases
- criterion is the mean squared error loss function
The train_step function
- PyTorch works only with tensors, so all inputs must be converted to tensors
- When training short memory we pass only a single state, action, reward, and next state, so each must be turned into a batch of size 1; torch.unsqueeze is used for this
- Get the predicted Q values for the state from the model and compute the new Q value with the formula: Q_new = reward + gamma * max(predicted next Q values)
- Compute the mean squared error between the new Q value and the previous Q value, and backpropagate the loss to update the weights
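
In equation form, the update described in the last two bullets can be written as below. The notation is introduced here for illustration: r is the reward, \gamma the discount rate, s' the next state, N the batch size, and y_{i,a} the target, which equals the prediction Q(s_i, a) for every move except the one actually taken, where it equals Q_new.

```latex
Q_{\text{new}} =
\begin{cases}
r & \text{if the game is over} \\
r + \gamma \max_{a'} Q(s', a') & \text{otherwise}
\end{cases}
\qquad
\mathcal{L} = \frac{1}{3N} \sum_{i=1}^{N} \sum_{a=1}^{3} \bigl( y_{i,a} - Q(s_i, a) \bigr)^2
```

The factor 3N appears because the mean squared error averages over all three outputs of every sample, even though only one entry per sample differs from the prediction.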

class QTrainer:
    def __init__(self,model,lr,gamma):
        #Learning Rate for Optimizer
        self.lr = lr
        #Discount Rate
        self.gamma = gamma
        #Linear NN defined above.
        self.model = model
        #optimizer that updates the weights and biases
        self.optimer = optim.Adam(model.parameters(),lr = self.lr)
        #Mean Squared error loss function
        self.criterion = nn.MSELoss()


    def train_step(self,state,action,reward,next_state,done):
        state = torch.tensor(state,dtype=torch.float)
        next_state = torch.tensor(next_state,dtype=torch.float)
        action = torch.tensor(action,dtype=torch.long)
        reward = torch.tensor(reward,dtype=torch.float)

        # if only a single sample is given, add a batch dimension: shape (1, x)
        if(len(state.shape) == 1):
            #(1, x)
            state = torch.unsqueeze(state,0)
            next_state = torch.unsqueeze(next_state,0)
            action = torch.unsqueeze(action,0)
            reward = torch.unsqueeze(reward,0)
            done = (done, )

        # 1. Predicted Q value with current state
        pred = self.model(state)
        target = pred.clone()
        for idx in range(len(done)):
            Q_new = reward[idx]
            if not done[idx]:
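                # Bellman target: reward + gamma * max Q(next_state)
                Q_new = reward[idx] + self.gamma * torch.max(self.model(next_state[idx]))
            # only the Q value of the action taken in this sample is replaced
            target[idx][torch.argmax(action[idx]).item()] = Q_new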
        # 2. Q_new = reward + gamma * max(next_predicted Qvalue)
        #pred.clone()
        #preds[argmax(action)] = Q_new
        self.optimer.zero_grad()
        loss = self.criterion(target,pred)
        loss.backward() # backward propagation of loss

        self.optimer.step()
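
To make the target construction in train_step concrete, here is a small numeric sketch in plain Python (no PyTorch). The helper names `build_target` and `mse`, the sample Q values, and the discount rate of 0.9 are all assumptions chosen for illustration.

```python
def build_target(pred, action, reward, next_pred, done, gamma):
    """Copy pred and overwrite the entry of the action taken with Q_new."""
    q_new = reward if done else reward + gamma * max(next_pred)
    target = list(pred)
    taken = action.index(max(action))  # index of the 1 in the one-hot action
    target[taken] = q_new
    return target


def mse(a, b):
    """Mean squared error over all entries, like nn.MSELoss with default settings."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


pred = [1.0, 2.0, 0.5]       # Q values predicted for the current state
next_pred = [0.0, 4.0, 1.0]  # Q values predicted for the next state
action = [0, 1, 0]           # one-hot: the second move was taken

# Non-terminal move with reward 0: Q_new = 0 + 0.9 * 4.0 = 3.6
target = build_target(pred, action, reward=0, next_pred=next_pred, done=False, gamma=0.9)
loss = mse(target, pred)     # (3.6 - 2.0)^2 / 3

# Terminal move: the target is just the reward.
terminal = build_target(pred, action, reward=-10, next_pred=next_pred, done=True, gamma=0.9)

print(target, round(loss, 4), terminal)
```

Because the target is a copy of the prediction everywhere except at the action taken, only that one entry contributes to the loss.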

The Agent

Get the snake's current state from the Environment

def get_state(self, game):
    head = game.snake[0]
    point_l = Point(head.x - BLOCK_SIZE, head.y)
    point_r = Point(head.x + BLOCK_SIZE, head.y)
    point_u = Point(head.x, head.y - BLOCK_SIZE)
    point_d = Point(head.x, head.y + BLOCK_SIZE)

    dir_l = game.direction == Direction.LEFT
    dir_r = game.direction == Direction.RIGHT
    dir_u = game.direction == Direction.UP
    dir_d = game.direction == Direction.DOWN

    state = [
        # Danger Straight
        (dir_u and game.is_collision(point_u))or
        (dir_d and game.is_collision(point_d))or
        (dir_l and game.is_collision(point_l))or
        (dir_r and game.is_collision(point_r)),

        # Danger right (the cell clockwise from the current heading)
        (dir_u and game.is_collision(point_r))or
        (dir_d and game.is_collision(point_l))or
        (dir_l and game.is_collision(point_u))or
        (dir_r and game.is_collision(point_d)),

        # Danger Left (the cell counter-clockwise from the current heading)
        (dir_d and game.is_collision(point_r))or
        (dir_u and game.is_collision(point_l))or
        (dir_r and game.is_collision(point_u))or
        (dir_l and game.is_collision(point_d)),

        # Move Direction
        dir_l,
        dir_r,
        dir_u,
        dir_d,

        # Food Location
        game.food.x < game.head.x, # food is in left
        game.food.x > game.head.x, # food is in right
        game.food.y < game.head.y, # food is up
        game.food.y > game.head.y # food is down
    ]
    return np.array(state, dtype=int)

Call the model to get the snake's next move

def get_action(self, state):
    # random moves: tradeoff exploration / exploitation
    self.epsilon = 80 - self.n_game
    final_move = [0, 0, 0]
    if(random.randint(0, 200) < self.epsilon):
        move = random.randint(0, 2)
        final_move[move] = 1
    else:
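        # run on whichever device the model lives on (CPU or GPU)
        device = next(self.model.parameters()).device
        state0 = torch.tensor(state, dtype=torch.float, device=device)
        prediction = self.model(state0) # prediction by model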
        move = torch.argmax(prediction).item()
        final_move[move] = 1
    return final_move
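
The exploration schedule in get_action can be checked with a little arithmetic. The sketch below computes the chance of a random move after a given number of games; the function name `random_move_probability` is hypothetical, while the constants 80 and 200 come from the code above.

```python
def random_move_probability(n_games):
    """Chance that get_action picks a random move after n_games games."""
    epsilon = 80 - n_games
    # random.randint(0, 200) draws uniformly from the 201 integers 0..200,
    # and exactly `epsilon` of them are smaller than epsilon (clamped to 0..201).
    favourable = min(max(epsilon, 0), 201)
    return favourable / 201


for n in (0, 40, 80, 150):
    print(n, round(random_move_probability(n), 3))
```

Under this schedule roughly 40% of moves are random in the first game, the share shrinks by one part in 201 per game, and from game 80 onward every move comes from the model.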

Play the step predicted by the model in the environment
Store the current state, the action taken, and the reward
Train the model on the action taken and the reward received from the Environment (Training short memory)

def train_short_memory(self, state, action, reward, next_state, done):
    self.trainer.train_step(state, action, reward, next_state, done)

If the game ends because the snake hits a wall or its own body, train the model on the moves made so far and reset the environment (Training long memory). Training is done with a batch size of 1000.

def train_long_memory(self):
    if (len(self.memory) > BATCH_SIZE):
        mini_sample = random.sample(self.memory, BATCH_SIZE)
    else:
        mini_sample = self.memory
    states, actions, rewards, next_states, dones = zip(*mini_sample)
    self.trainer.train_step(states, actions, rewards, next_states, dones)
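
The `zip(*mini_sample)` call is what turns a list of per-move tuples into one tuple per field, which is the shape train_step expects. A minimal sketch with made-up, shortened transitions (real states have 11 values):

```python
# Each entry: (state, action, reward, next_state, done). Values are illustrative.
mini_sample = [
    ([0, 1], [1, 0, 0], 0, [1, 1], False),
    ([1, 1], [0, 1, 0], 10, [1, 0], False),
    ([1, 0], [0, 0, 1], -10, [0, 0], True),
]

# Unpacking with * passes each transition as a separate argument to zip,
# so zip groups the first fields together, then the second fields, and so on.
states, actions, rewards, next_states, dones = zip(*mini_sample)

print(rewards)  # (0, 10, -10)
print(dones)    # (False, False, True)
```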

Output:

To run this game, first create an environment in the Anaconda prompt (or on any platform), then install the required modules such as PyTorch (for the Deep Q-Learning model), Pygame (for the game's visuals), and other basic modules.
Then run agent.py in the environment you just created. Training will start, and you will see the following two GUIs: one showing the training progress and one showing the Snake game played by the AI.
Once a given score has been reached, you can quit the game, and the model you just trained will be stored at the path you defined in the save function of models.py.

After this, you can use the trained model by simply changing the code in agent.py as shown below:

self.model.load_state_dict(torch.load('PATH'))

Initial Epochs vs. After 100 Epochs

Conclusion
We can use Deep Q-Learning to build an AI that plays the Snake game on its own by training it in the game's environment. Training combines random moves with Deep Q-Learning: random moves give the agent a better chance of encountering varied situations, while Deep Q-Learning picks the most optimal move based on the data it has been trained on.

Yay!! 🎉
63120501033 พุฒิพงศ์ ป่าไม้ทอง
Source Code: https://github.com/vedantgoswami/SnakeGameAI