Natnicha24

Posted on Apr 10, 2023

Artificial Intelligence (AI) K-Means Clustering

#clustering #kmean #ai

Clustering is a way of grouping data in artificial intelligence: records that resemble one another are placed in the same group. It is an unsupervised technique, meaning the groups are formed without any labeled target output to learn from.

Clustering is useful in many situations, for example organizing data for image processing or searching for information across a large number of sources. Once clustered, the data is sorted into orderly categories.

There are many clustering algorithms to choose from, such as K-Means Clustering, Density-Based Clustering, Mean-Shift Clustering and Hierarchical Clustering. This article walks through an example of K-Means Clustering.

K-Means Clustering starts by fixing the number of groups (k) and placing one centroid for each group. Every data point is then plotted and assigned to the group whose centroid is nearest. That is not the end, though: after the points have been assigned, each centroid is moved to the mean of its group's points and the points are assigned again. Moving a centroid can cause some points to switch groups, so the two steps repeat until no point changes group.
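The assign-then-move loop described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the scikit-learn implementation used later in the article: the function name `kmeans_sketch`, the random choice of starting centroids, and the toy data are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Plain K-Means loop: assign points, move centroids, repeat until stable."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (one common initialisation).
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.full(len(points), -1)
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no point changed group, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a group is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Made-up (absences, G3)-like pairs forming two well-separated groups.
toy = np.array([[0, 15], [1, 14], [2, 16], [30, 5], [28, 6], [32, 4]], dtype=float)
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
print(centroids)
```

On this toy data the first three points end up in one group and the last three in the other, and each centroid sits at the mean of its group.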

Now that the principle is clear, let's try K-Means Clustering hands-on.

The data used here is a dataset for predicting the final grade of secondary-school students in Portugal. It comes from the UCI Machine Learning Repository, and anyone can download it from the link below.

->DATASET LINK

The dataset consists of:
1. 395 rows, one per student (the CSV file has 396 lines when the header row is counted)
2. 33 columns, described as follows:

school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
sex - student's sex (binary: 'F' - female or 'M' - male)
age - student's age (numeric: from 15 to 22)
address - student's home address type (binary: 'U' - urban or 'R' - rural)
famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
guardian - student's guardian (nominal: 'mother', 'father' or 'other')
traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
failures - number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup - extra educational support (binary: yes or no)
famsup - family educational support (binary: yes or no)
paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities - extra-curricular activities (binary: yes or no)
nursery - attended nursery school (binary: yes or no)
higher - wants to take higher education (binary: yes or no)
internet - Internet access at home (binary: yes or no)
romantic - with a romantic relationship (binary: yes or no)
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime - free time after school (numeric: from 1 - very low to 5 - very high)
goout - going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health - current health status (numeric: from 1 - very bad to 5 - very good)
absences - number of school absences (numeric: from 0 to 93)
G1 - first period grade (numeric: from 0 to 20)
G2 - second period grade (numeric: from 0 to 20)
G3 - final grade (numeric: from 0 to 20, output target)

If the data is clear to everyone, let's start clustering it :)

Step 1: Load the data
-Before clustering, the data has to be loaded. The code below does this; you can also change url to point to a copy of the data in your own GitHub repository.

import pandas as pd
url = 'https://raw.githubusercontent.com/Natnicha24/grade-ai/master/student-mat.csv'
studentData = pd.read_csv(url, sep = ',')
studentData.head()

-Running the code displays the first five rows of the dataset.

Step 2: Prepare the data
-As described above, the dataset has quite a lot of columns, so we can keep only the relevant ones to make the data easier to read.
-The code below selects 8 columns: school, sex, age, failures, absences, G1, G2 and G3.

studentData = studentData[['school', 'sex', 'age', 'failures', 'absences', 'G1', 'G2', 'G3']]

studentData.head()

-Running it displays the first five rows with only the 8 selected columns.

Step 3: Find k
-This is where k-means clustering really begins. The algorithm needs the number of groups (k) up front. We can either pick k arbitrarily ourselves or estimate a suitable number of groups with the code below.

First choose the two columns to use as the x and y axes. Here we use absences (number of school absences) and G3 (final grade).

studentData.columns
data = studentData[['absences','G3']]

Still working on the number of groups (k), the next step is to set the range of k values to try. In this article we try k from 1 to 19 (range(1,20) stops before 20), fit a model for each value, and record its sum of squared distances, as in the code below.

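from sklearn.cluster import KMeans
w = []            # inertia (sum of squared distances) for each k
K = range(1, 20)  # candidate values k = 1, 2, ..., 19

for k in K:
    # n_init='auto' requires scikit-learn 1.2 or later
    cls = KMeans(n_clusters=k, n_init='auto')
    cls.fit(data)
    w.append(cls.inertia_)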

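The sum of squared distances of a fitted model is stored in its inertia_ attribute. After the loop, cls refers to the last model fitted (k = 19), so the line below shows the value for that model only; the values for every k are in the list w.

cls.inertia_

Running it prints a single number, the inertia for k = 19.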
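To make "sum of squared distances" concrete, here is a small sketch that recomputes the inertia by hand and compares it with scikit-learn's value. The toy data, `random_state=0` and `n_init=10` are assumptions chosen only to keep the example small and repeatable; they are not the settings used in the article.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (absences, G3)-like pairs, only for illustration.
toy = np.array([[0, 15], [1, 14], [2, 16], [30, 5], [28, 6], [32, 4]], dtype=float)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# Look up each point's own centroid through its cluster label.
own_centroid = model.cluster_centers_[model.labels_]
# Inertia = sum over all points of the squared distance to that centroid.
manual_inertia = ((toy - own_centroid) ** 2).sum()

print(manual_inertia, model.inertia_)
```

The two printed numbers should agree up to floating-point rounding, which is all that inertia_ measures: how tightly the points sit around their own centroids.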

Now for the last part of finding k: plotting the values we collected. Where on the graph is k? The answer is the point where the curve bends (the "elbow"): to its left the inertia falls rapidly, and to its right it falls only slowly.
The graph can be plotted with the following code:

import matplotlib.pyplot as plt
plt.plot(K,w)
plt.ylabel('inertia')
plt.xlabel('number of cluster')

The result is a line graph of inertia against the number of clusters.

From the graph we can estimate that the bend is at x = 4: up to that point the inertia drops rapidly, and beyond it the curve flattens.
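Reading the bend by eye is subjective, so it can help to cross-check it with a simple rule. The sketch below uses one possible heuristic (not the method used in this article): pick the k whose point lies farthest from the straight line joining the first and last points of the curve. The function name `elbow_by_distance` and the inertia values are made up for illustration.

```python
import numpy as np

def elbow_by_distance(ks, inertias):
    """Return the k whose (k, inertia) point is farthest from the straight
    line joining the first and last points of the curve."""
    ks = np.asarray(ks, dtype=float)
    inertias = np.asarray(inertias, dtype=float)
    # Rescale both axes to [0, 1] so neither dominates the distance.
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (inertias - inertias.min()) / (inertias.max() - inertias.min())
    # Perpendicular distance from each point to the line through the end points.
    x0, y0, x1, y1 = x[0], y[0], x[-1], y[-1]
    dist = np.abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)
    dist /= np.hypot(x1 - x0, y1 - y0)
    return int(ks[dist.argmax()])

# Hypothetical inertia values for k = 1..8 (not taken from the student data).
example_ks = range(1, 9)
example_inertias = [1000, 600, 350, 150, 120, 100, 90, 85]
print(elbow_by_distance(example_ks, example_inertias))
```

For these made-up values the function returns 4. With the article's own data it could be called as elbow_by_distance(K, w), but the answer should still be checked against the plot.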

Step 4: Run the clustering
-Now we get to the clustering itself. At the end of this step we will have a graph with our data sorted into groups.

First, take the value of k (the number of groups) found in the previous step and use it as follows:

cls = KMeans(n_clusters=4,n_init='auto')
cls.fit(data)

Running this fits the model; in a notebook, the fitted KMeans estimator is displayed.

Another important step is finding the centroids, the points that are the mean of each group. By now you may see why this method is called K-means clustering: it relies on those means. The centroids can be obtained with the code below:

centroid = pd.DataFrame(cls.cluster_centers_,
                        columns=data.columns)
centroid

Running it displays a table with one row per centroid (4 rows) and the columns absences and G3.
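The statement that each centroid is the mean of its group can be checked directly. The sketch below is a minimal illustration on a made-up stand-in for the `data` frame (same column names, invented values); `random_state=0` and `n_init=10` are assumptions added only for repeatability.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up stand-in for the article's `data` frame (same column names).
toy = pd.DataFrame({
    "absences": [0, 1, 2, 30, 28, 32],
    "G3": [15, 14, 16, 5, 6, 4],
})

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# Average the rows of each cluster; groupby sorts by label, which matches
# the row order of cluster_centers_.
group_means = toy.groupby(model.labels_).mean()

print(group_means)
print(np.allclose(group_means.to_numpy(), model.cluster_centers_))
```

The per-group averages and cluster_centers_ should match, which is the "means" part of K-means.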

The next step illustrates very well how k-means clustering assigns groups. The code takes each data point and finds which centroid it is closest to; the point belongs to the group of that nearest centroid.
The code is as follows:

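import numpy as np
# Stack the data points and the 4 centroids into one DataFrame
# (ignore_index avoids duplicate row labels)
x = pd.concat([data, centroid], ignore_index=True)
# Data points get their predicted cluster (0-3); the 4 centroids get the placeholder label 20
x['cluster'] = np.concatenate((cls.predict(data), [20, 20, 20, 20]))
x.head()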

Running it displays the first five rows, now with a cluster column next to absences and G3.
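What predict does for each point can be reproduced by hand: measure the distance to every centroid and take the closest one. The sketch below is only an illustration on made-up values; the `toy` and `new_points` arrays, `random_state=0` and `n_init=10` are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (absences, G3)-like pairs used to fit a small model.
toy = np.array([[0, 15], [1, 14], [2, 16], [30, 5], [28, 6], [32, 4]], dtype=float)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# Hypothetical new points to assign to a group.
new_points = np.array([[3.0, 12.0], [25.0, 8.0], [10.0, 10.0]])

# Squared Euclidean distance from every point to every centroid: shape (3, 2).
diff = new_points[:, None, :] - model.cluster_centers_[None, :, :]
sq_dist = (diff ** 2).sum(axis=2)
# Each point joins the group whose centroid is closest.
manual_labels = sq_dist.argmin(axis=1)

print(manual_labels)
print(model.predict(new_points))
```

Both lines should print the same labels, with the first and third points sharing a group and the second point in the other one.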

Finally, we plot a graph to see how the data was grouped, using this code:

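import seaborn as sns

# Scatter plot of G3 against absences, colored by cluster label
s = sns.pairplot(x_vars='absences',y_vars='G3',hue='cluster',plot_kws={"s": 80},
                 data=x,palette='deep' )
# Rename the legend entries; the placeholder label 20 is shown as 'centroid'
new_labels = ['0','1','2','3','centroid']
for t, l in zip(s._legend.texts, new_labels): t.set_text(l)
# Move the legend to the right of the plot
s._legend.set_bbox_to_anchor((1.2, 0.5))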

The result is a scatter plot of G3 against absences, with each point colored by its cluster.

This plot summarizes our K-Means Clustering experiment: the data is split into 4 groups, each shown in its own color, and the purple points are the centroids.
To see a split into more than 4 groups, change the number of groups before running the clustering, and a new grouping of the data will appear.

Thank you for reading this article to the end. I hope it is useful to someone :)
I apologize for any mistakes.

References
Dataset courtesy of:
https://www.kaggle.com/datasets/dipam7/student-grade-prediction?resource=download
Original sample code, before adaptation:
https://colab.research.google.com/drive/12MEZXoT5nJrEjew3t6ULcdHGgz0fn_lb