Project Description
This is a my project for Data Engineering course during college to do some ETL processing on pokemon dataset which can be accessed on this link. On this project, I created some ETL processes to fix certain things in the dataset, like names, numbering, fixing missing value etc.
Dataset Description
This dataset is a dataset that discusses the data contained in the Pokemon video game. The data comes from Pokemon starting from Gen I (Pokemon Red, Green, Blue) to Gen 7 (Pokemon Sun & Moon, Lets Go). In this dataset there are 4 data or tables:
Pokedex
The Pokedex table is a sort of encyclopedia of all the pokemon in the pokemon universe. Pokedex contains information about the statistics of pokemon in general.
Moves
The moves table is a table that contains skills or attacks that can be trained on Pokemon. Usually known as TM and HM. Each move has its own statistics, and a maximum of 4 moves can exist in one Pokemon.
Types
The types table is a table that lists the types of pokemon or movesets in the Pokemon video game. There are 18 different types of Pokemon and existing movesets.
Items
The items table contains items that can be used directly by players or held by Pokemon during battle.
Table Relation
The relationship between each table in the dataset can be seen in the image above. From the picture, it can be seen that the pokemon has or is registered in the pokedex. Then, Pokemon has a maximum of up to two types. In terms of moves, each Pokemon has a maximum of four. And lastly, every Pokemon can have at least one item.
Extract
The first step is to carry out the extraction process. In this process, I also created a logging function to store information about what processes have been running. The logging function can be seen in the following program code.
!touch logfile.txt
def log(pesan):
timestamp_format = '%Y-%h-%d-%H:%M:%S'
now = datetime.now()
timestamp = now.strftime(timestamp_format)
with open("logfile.txt","a") as f:
f.write(timestamp+','+pesan+'\n')
The next step is to download the dataset from the GitHub link that has been attached above. The dataset from that page is a dataset of type NoSQL with key and value types. For that, the data is first converted to Pandas dataframe format. While converting to Pandas, we can carry out the logging process by calling the log function that was created earlier by calling log(). The program code for downloading, the conversion function to a Pandas dataframe, and the conversion can be seen in the following program code.
#Downloading dataset
!curl -LJO https://raw.githubusercontent.com/fanzeyi/pokemon.json/master/items.json
!curl -LJO https://raw.githubusercontent.com/fanzeyi/pokemon.json/master/moves.json
!curl -LJO https://raw.githubusercontent.com/fanzeyi/pokemon.json/master/pokedex.json
!curl -LJO https://raw.githubusercontent.com/fanzeyi/pokemon.json/master/types.json
#extract function
def extract_from_json(file_to_process):
dataframe = pd.read_json(file_to_process,orient=str)
return dataframe
#calling extract function
log("Proses ETL dimulai")
log("Proses Extract : Mengekstrak 4 tabel JSON menjadi dataframe")
df_1=extract_from_json('pokedex.json')
df_2=extract_from_json('moves.json')
df_3=extract_from_json('types.json')
df_4=extract_from_json('items.json')
log("Proses esktraksi tabel JSON selesai")
The picture above is the result of calling the df_1 dataframe to check the output. It can be seen that there are four features, namely id, name, type, and basic stats. If you pay attention, the name, type, and basic stats have parentheses, and each value is separated by a colon. This means that the feature stores data in the form of keys and values. Therefore, it is necessary to carry out a transformation process to retrieve only some data from this NoSQL feature.
Transform
In the transformation stage, the transformation steps are carried out differently for each table, where the explanation is as follows:
On the pokedex table, the transformation that is done is:
- Removing Japanese, Chinese, and French names
- Flattening the type attribute
- Flatten the basic stat attribute
- Deleting nested attributes that exist in the initial dataframe
- Merging previously fixed attributes into one
In the moves table, the transformations performed are
- Removing Japanese, Chinese, and tm and max_pp names because they are not on the original reference page (references come from Bulbapedia)
- Changing the value in the category column, which is still in Chinese.
- Fill in the NaN value to 100.0, because NaN here in the game is of infinite value, which move will be successfully used without the slightest failure
- Renaming the column ename to name
- Tidy up the column according to the reference from bulbapedia.
- Fixing flying moves that were recorded as fighting.
In the type table, the transformation performed is:
- Removing Japanese and Chinese names from the type, then renamed the English column to name.
- Creating an id column starting from 1
In the items table, the transformation that is done is
- Flattening the name attribute and only taking English item names
The function for carrying out the transformation process on data can be seen in the following program code snippet.
def transform_pokedex(df):
log("Memulai transformasi tabel pokedex")
#Menghapus nama bahasa Jepang, China dan Perancis
df_1_name = pd.DataFrame.from_dict(df)
df_1_name = df_1_name['name'].apply(pd.Series)
df_1_name = df_1_name.drop(['japanese','chinese','french'],axis=1)
df_1_name = df_1_name.rename(columns={'english': 'Name'})
#Melakukan flattening kepada atribut type
df_1_type = pd.DataFrame.from_dict(df)
df_1_type = df_1_type['type'].apply(pd.Series)
df_1_type = df_1_type.rename(columns={0: 'Type 1', 1: 'Type 2'})
df_1_type = df_1_type.fillna('')
#Melakukan flattening kepada atribut base
df_1_stats= pd.DataFrame.from_dict(df)
df_1_stats = df_1_stats['base'].apply(pd.Series)
#Menghapus nested atribut yang ada pada dataframe awal
df = df.drop(['name', 'base', 'type'], axis = 1)
#Menggabungkan atribut yang sudah diperbaiki sebelumnya menjadi 1
df = df.join(df_1_name)
df = df.join(df_1_type)
df = df.join(df_1_stats)
log("Transformasi tabel pokedex berhasil")
return df
def transform_moves(df):
log("Memulai transformasi tabel move")
#Menghapus nama bahasa Jepang, China, tm dan max_pp karena tidak terdapat pada
#link referensi aslinya (bulbapedia)
df_2_fix = pd.DataFrame.from_dict(df)
df_2_fix = df_2_fix.drop(['id','jname','cname','tm','max_pp'],axis=1)
#Mengganti value pada kolom 'category' yang masih berbahasa China
df_2_fix = df_2_fix.replace('物理','Physical')
df_2_fix = df_2_fix.replace('特殊','Special')
df_2_fix = df_2_fix.replace('变化','Status')
#Mengisi nilai NaN menjadi 100.0, karena NaN disini maksud di gamenya adalah infinity
#dimana move akan berhasil tanpa kegagalan sedikitpun
df_2_fix = df_2_fix.fillna(100.0)
#Mengganti nama kolom ename menjadi name
df_2_fix = df_2_fix.rename(columns={'ename': 'Move','type': 'Type','category':'Category'
,'pp': 'PP','power':'Power','accuracy':'Accuracy'})
#Merapikan kolom sesuai dengan referensi dari bulbapedia
df_2_fix = df_2_fix[['Move', 'Type', 'Category','PP','Power','Accuracy']]
#Memperbaiki moves flying yang tercatat sebagai fighting
df = df_2_fix
df = df.reset_index()
df = df.rename(columns={"index":"id move"})
df['id move'] = df.index + 1
#df.set_index('id', inplace=True)
log("Transformasi tabel move berhasil")
return df
def transform_moves_2(df):
log("Memulai perbaikan terhadap semua moveset bertipe Flying yang sebelumnya bertipe Fighting")
flying_move = 331,176,402,339,412,447,431,64,296,18,15,541,118,607,63,364,142,506,365,16
#Memperbaiki moveset Flying yang awalnya bernilai Fighting
for x in flying_move:
df.loc[x] = df.loc[x].replace(to_replace="Fighting", value = "Flying")
log("Transformasi untuk perbaikan moveset type Flying berhasil")
return df
def transform_type(df):
log("Memulai transformasi tabel type")
#Menghapus nama Jepang dan nama China dari tipe, kemudian mengganti nama kolom 'english'
#menjadi 'name'
df_3_fix = pd.DataFrame.from_dict(df)
df_3_fix = df_3_fix.drop(['chinese','japanese'],axis=1)
df_3_fix = df_3_fix.rename(columns={'english': 'Name'})
df = df_3_fix
#Membuat kolom id yang dimulai dari 1
df = df.reset_index()
df = df.rename(columns={"index":"id"})
df['id'] = df.index + 1
#df.set_index('id', inplace=True)
df = df.rename(columns={"id":"id type"})
log("Transformasi tabel type berhasil")
return df
def transform_items(df):
log("Memulai transformasi tabel items")
#Melakukan flattening ke atribut 'name' dan hanya mengambil nama item dalam
#bahasa inggris saja
df_4_fix = pd.DataFrame.from_dict(df)
df_4_name = pd.DataFrame.from_dict(df)
df_4_name = df_4_name['name'].apply(pd.Series)
df_4_name = df_4_name.drop(['japanese','chinese'],axis=1)
df_4_fix = df.join(df_4_name)
df = df_4_fix
df = df.drop(['id','name'],axis=1)
df = df.rename(columns={'english': 'Name'})
df['id']= range(0, 0+len(df))
df=df[['id','Name']]
df = df.rename(columns={"id":"id item"})
log("Transformasi tabel items berhasil")
return df
Then, the following function is to call the transform function on the dataset.
log("Memulai transformasi tabel pokedex dengan melakukan flattening data")
pokedex_fix = transform_pokedex(df_1)
moves_fix = transform_moves(df_2)
moves_fix = transform_moves_2(moves_fix)
types_fix = transform_type(df_3)
items_fix = transform_items(df_4)
log("Transformasi semua tabel telah berhasil")
The picture above is an example of the results of the transformation in the pokedex table, where now there are features id, name, type 1, type 2, HP, attack, defense, Sp.Attack, Sp.Defense, and speed. For Pokemon with a single type, type 2 contains an empty string.
Because in this project I plan to make each Pokemon have four moves and hold items, Then a transformation process is carried out again to give a random move to the pokemon, with the condition that the move must be of the same type as the pokemon. For this reason, the function to provide random moves can be seen in the following program code snippet.
def addmove4(final_pkmn,awal,akhir):
log("Memulai transaksi 4 move kepada semua pokemon dalam pokedex")
count = 0
for i in range(awal,akhir):
moves_added = 0
while(moves_added<4):
count_same = 0
rand=random.randint(0,611)
random_name= moves_fix.iloc[rand]['Move']
random_type = moves_fix.iloc[rand]['Type']
t_random=random_type
t1 = pokedex_fix["Type 1"][i]
t2 = pokedex_fix["Type 2"][i]
if (t1==t_random) or (t2 == t_random):
if(moves_added<4):
id_pkmn = pokedex_fix.iloc[i]['id']
nama_pkmn = pokedex_fix.iloc[i]['Name']
moveset=random_name
tambah = {"id" : id_pkmn,"Move" : moveset}
count += 1
moves_added += 1
final_pkmn=final_pkmn.append(tambah,ignore_index=True)
log("Menambahkan moveset "+moveset+"ke pokemon bernama "+nama_pkmn)
else:
for x in range(moves_added):
if random_name==final_pkmn.iloc[count-x]['Move']:
count_same=0
else :
id_pkmn = pokedex_fix.iloc[i]['id']
nama_pkmn = pokedex_fix.iloc[i]['Name']
moveset=random_name
tambah = {"id" : id_pkmn,"Move" : moveset}
count += 1
moves_added += 1
final_pkmn=final_pkmn.append(tambah,ignore_index=True)
log("Menambahkan moveset "+moveset+"ke pokemon bernama "+nama_pkmn)
else :
continue
logggg ="Proses transaksi moveset dari pokemon dengan index "+str(awal)+" hingga "+str(akhir)+" berhasil"
log(logggg)
print(logggg)
return final_pkmn
log("Mendeklarasikan tabel terakhir untuk menggabungkan atau JOIN beberapa tabel")
final_pkmn = pd.DataFrame()
final_pkmn["id"]=""
final_pkmn["Move"]=""
log("Menginisiasikan proses transaksi moveset kepada semua 809 pokemon")
final_pkmn = addmove4(final_pkmn,0,809)
The dataframe results obtained after the transformation are as shown above. It can be seen that a pokemon with ID 1 has four moves, and so on. Thus, the final step is to merge all tables.
def join_final_transform(final_pkmn):
log("Memulai proses JOIN pada beberapa tabel")
log("Memulai proses LEFT JOIN pada tabel pokedex dengan item -> pokemon memiliki hold item")
pokemon_w_item = pokedex_fix
item_join = items_fix
item_join = item_join.rename(columns={"id":"Item id","Name":"Item Name"})
pokemon_w_item = pokemon_w_item.join(item_join)
log("Proses LEFT JOIN berhasil dilakukan")
log("Memulai proses FULL OUTER JOIN antara tabel pokemon yang sudah memiliki hol item dengan transaksi move")
final_pkmn = pd.merge(pokemon_w_item,final_pkmn,on='id')
final_pkmn = pd.merge(final_pkmn,moves_fix,on="Move")
final_pkmn = final_pkmn.sort_values(by=['id'])
final_pkmn = final_pkmn.reset_index()
final_pkmn = final_pkmn.drop("index", axis=1)
log("Proses FULL OUTER JOIN antara tabel pokemon dengan transaksi move berhasil dilakukan")
return final_pkmn
log("Melakukan transformasi final yakni melakukan proses join")
final_pkmn = join_final_transform(final_pkmn)
log("Semua proses transformasi berhasil dilakukan")
The final appearance after the final transformation of the dataset is as shown above. For example, the bulbasaur has four moves that have different stats and don't store items.
Load
During the load process, I load two different databases, namely SQLite and MySQL. The process of loading data into a SQLite database can be seen in the following program code snippet.
log("Membuat connection ke database sqlite dengan API sqlite3 beserta cursornya")
connection = s3.connect('preprocessed_pokemon.db')
cursor = connection.cursor()
engine = create_engine('sqlite:///home/stndb01/Documents/Data Engineering/UTS/preprocessed_pokemon.db')
log("Meload semua tabel ke dalam datamart sqlite3")
final_pkmn.to_sql(name='Pokemon_Final', con=connection, if_exists='replace', index=False)
pokedex_fix.to_sql(name='Pokedex', con=connection, if_exists='replace', index=False)
moves_fix.to_sql(name='Moves', con=connection, if_exists='replace', index=False)
types_fix.to_sql(name='Type', con=connection, if_exists='replace', index=False)
items_fix.to_sql(name='Items', con=connection, if_exists='replace', index=False)
log("Proses load tabel ke datamart sqlite3 berhasil")
Meanwhile, the MySQL database has more or less the same function. The only significant difference lies in the data storage; previously, SQLite stored it in a local database, which is a file that had a .db format. While MySQL saves on localhost, which can be accessed through phpmyadmin, The process of loading data into the MySQL database can be seen in the following program code snippet.
log("Membuat connection ke database mysql dengan API mysql beserta cursornya")
engine_mysql = create_engine('mysql+pymysql://phpmyadmin:bruh@127.0.0.1/preprocessed_pokemon')
connection_mysql = engine_mysql.raw_connection()
my_cursor = connection_mysql.cursor()
log("Meload semua tabel ke dalam datamart mysql lokal")
final_pkmn.to_sql('Pokemon_Final', con = engine_mysql, if_exists='replace', index=False)
pokedex_fix.to_sql('Pokedex', con = engine_mysql, if_exists='replace', index=False)
moves_fix.to_sql('Moves', con = engine_mysql, if_exists='replace', index=False)
types_fix.to_sql('Type', con = engine_mysql, if_exists='replace', index=False)
items_fix.to_sql('Items', con = engine_mysql, if_exists='replace', index=False)
log("Proses load tabel ke datamart mysql lokal berhasil")
SQLite and MySQL have similar query functions, so we can call the same query line for the same process. The difference is the function of Python to call queries from each database. SQLite queries can be executed with read_sql from pandas. But in MySQL, queries are run using engine_mysql and the execute function. The following are examples of queries and calls for each database.
log("Menampilkan isi dari tabel dari datamart sqlite3")
sql_read_1 = ''' SELECT * FROM Pokemon_Final'''
sql_read_2 = ''' SELECT * FROM Pokedex'''
sql_read_3 = ''' SELECT * FROM Moves'''
sql_read_4 = ''' SELECT * FROM Type'''
sql_read_5 = ''' SELECT * FROM Items'''
def read_table_sqlite3(query):
return pd.read_sql(query,connection)
SQLite
MySQL
CRUD
The data that has been stored in the SQLite and MySQL databases can be changed using the Python environment. Following are examples of insert, update, and delete operations on SQLite and MySQL databases via Python.
SQLite
Insert
#Insert secara banyak ke tabel pokedex
log("Melakukan proses insert pada tiga tabel di sqlite3 dengan masing-masing 3 data")
new_pokemon = [(810,"Kadal Bersayap","Flying","Dragon",78,89,42,54,58,85),
(811,"Naga Indosiar","Flying","Dragon",68,78,67,25,52,35),
(810,"Kuda Bertanduk","Normal","Fairy",48,42,25,34,78,44)]
cursor.executemany('INSERT INTO Pokedex VALUES (?,?,?,?,?,?,?,?,?,?)', new_pokemon)
#Insert satu-satu
cursor.execute('INSERT INTO Moves VALUES (613,"Bruh","Normal","Special",5.0,100.0,90.0)')
cursor.execute('INSERT INTO Moves VALUES (614,"Fus Ro Dah","Dragon","Special",8.0,90.0,100.0)')
cursor.execute('INSERT INTO Moves VALUES (615,"Totsugeki","Flying","Physical",40.0, 100.0, 100.0)')
new_pokemon_w_move = [(810,"Kadal Bersayap","Flying","Dragon",78,89,42,54,58,85,1,"Master Ball","Totsugeki",615,"Flying","Physical",40.0, 100.0, 100.0),
(811,"Naga Indosiar","Flying","Dragon",68,78,67,25,52,35,1,"Master Ball","Fus Ro Dah",614,"Dragon","Special",8.0,90.0,100.0),
(810,"Kuda Bertanduk","Normal","Fairy",48,42,25,34,78,44,1,"Master Ball","Bruh",613,"Normal","Special",5.0,100.0,90.0)]
cursor.executemany('INSERT INTO Pokemon_Final VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)', new_pokemon_w_move)
connection.commit()
log("Proses insert pada sqlite3 berhasil dilakukan")
Update
#Update data yang sudah ada
log("Melakukan update data pada yang sudah ada pada db sqlite3")
update_pokemon_1 = ("Mega-Bulbasaur","Bulbasaur")
update_1 = '''UPDATE Pokemon_Final set Name = ? WHERE Name LIKE ?'''
cursor.execute(update_1,update_pokemon_1)
update_pokemon_2 = ("Poison","Melmetal")
update_2 = '''UPDATE Pokemon_Final set "Type 2" = ? WHERE Name LIKE ?'''
cursor.execute(update_2,update_pokemon_2)
update_pokemon_3 = ("Mega-Charizard","Dragon","Charizard")
update_3 = '''UPDATE Pokemon_Final set Name = ?, "Type 1" = ? WHERE Name LIKE ?'''
cursor.execute(update_3,update_pokemon_3)
connection.commit()
log("Update data pada db sqlite3 berhasil dilakukan")
Delete
#Delete data yang sudah ada
log("Melakukan delete pada data yang sudah ada pada db sqlite3")
cursor.execute('DELETE FROM Pokemon_Final WHERE Name LIKE "Mega-Bulbasaur"')
cursor.execute('DELETE FROM Pokemon_Final WHERE Name LIKE "Ivysaur"')
cursor.execute('DELETE FROM Pokemon_Final WHERE Name LIKE "Venusaur"')
cursor.execute('DELETE FROM Pokemon_Final WHERE Name LIKE "Melmetal"')
cursor.execute('DELETE FROM Pokemon_Final WHERE Name = "Kadal Bersayap"')
connection.commit()
log("Delete data pada db sqlite3 berhasil dilakukan")
MySQL
Insert
#Insert data secara banyak ke mysql
log("Melakukan insert data pada db mysql")
insert_my_1 = "INSERT INTO Pokedex VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
nilai_ins_1 = [(810,"Kadal Bersayap","Flying","Dragon",78,89,42,54,58,85),
(811,"Naga Indosiar","Flying","Dragon",68,78,67,25,52,35),
(810,"Kuda Bertanduk","Normal","Fairy",48,42,25,34,78,44)]
my_cursor.executemany(insert_my_1,nilai_ins_1)
#Insert data satu persatu
engine_mysql.execute('INSERT INTO Moves VALUES (613,"Bruh","Normal","Special",5.0,100.0,90.0)')
engine_mysql.execute('INSERT INTO Moves VALUES (614,"Fus Ro Dah","Dragon","Special",8.0,90.0,100.0)')
engine_mysql.execute('INSERT INTO Moves VALUES (615,"Totsugeki","Flying","Physical",40.0, 100.0, 100.0)')
insert_my_3 = "INSERT INTO Pokemon_Final VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
nilai_ins_3 = [(810,"Kadal Bersayap","Flying","Dragon",78,89,42,54,58,85,1,"Master Ball","Totsugeki",615,"Flying","Physical",40.0, 100.0, 100.0),
(811,"Naga Indosiar","Flying","Dragon",68,78,67,25,52,35,1,"Master Ball","Fus Ro Dah",614,"Dragon","Special",8.0,90.0,100.0),
(810,"Kuda Bertanduk","Normal","Fairy",48,42,25,34,78,44,1,"Master Ball","Bruh",613,"Normal","Special",5.0,100.0,90.0)]
my_cursor.executemany(insert_my_3,nilai_ins_3)
connection_mysql.commit()
log("Insert data pada db mysql berhasil dilakukan")
Update
#Update data yang sudah ada
log("Melakukan update data pada yang sudah ada pada db mysql")
update_my_1 = "UPDATE Pokemon_Final set Name = %s WHERE Name LIKE %s"
nilai_upd_1 = ("Mega-Bulbasaur","Bulbasaur")
my_cursor.execute(update_my_1,nilai_upd_1)
update_my_2 = "UPDATE Pokemon_Final set `Type 2` = %s WHERE Name LIKE %s"
nilai_upd_2 = ("Poison","Melmetal")
my_cursor.execute(update_my_2,nilai_upd_2)
update_my_3 = "UPDATE Pokemon_Final set Name = %s, `Type 1` = %s WHERE Name LIKE %s"
nilai_upd_3 = ("Mega-Charizard","Dragon","Charizard")
my_cursor.execute(update_my_3,nilai_upd_3)
connection_mysql.commit()
log("Update data pada db mysql berhasil dilakukan")
Delete
#Delete data yang sudah ada
log("Melakukan delete pada data yang sudah ada pada db mysql")
engine_mysql.execute('DELETE FROM Pokemon_Final WHERE Name LIKE "Mega-Bulbasaur"')
engine_mysql.execute('DELETE FROM Pokemon_Final WHERE Name LIKE "Ivysaur"')
engine_mysql.execute('DELETE FROM Pokemon_Final WHERE Name LIKE "Venusaur"')
my_cursor.execute('DELETE FROM Pokemon_Final WHERE Name LIKE "Melmetal"')
my_cursor.execute('DELETE FROM Pokemon_Final WHERE Name = "Kadal Bersayap"')
connection_mysql.commit()
log("Delete data pada db mysql berhasil dilakukan")
Top comments (0)