loading...
Cover image for Automating Data Extractions From Google Sheets With Python

Automating Data Extractions From Google Sheets With Python

seattledataguy profile image SeattleDataGuy ・7 min read

Photo by Daniele Levis Pelusi on Unsplash.

Much like in our recent article about automating emails with Python, today we want to discuss a common task that is easy to automate.

We'll outline how you can write a Python script that can export data from Google Sheets and insert it into MySQL.

When would you use this?

One way this could be useful is when you create some sort of form with Google Forms and you want to load new data automatically every night. This works best with Google Sheets connected to Google Forms because, in theory, there shouldn't be changes to the schema of the Google Sheet.

This is one important point. If the Google Sheet you want to extract data from is constantly changing, then you will need to develop a much more complex system.

As long as your data schema doesn't change too much, then automating this task is very easy. This is thanks to the gspread library. This library makes interacting with the various Google APIs very simple.

So let's start automating your Google Sheet extracts.


Extracting Data From Google Sheets

To start out, we will build a function that extracts data straight from our Google Sheet.

First, we need to set up a few things:

  • We need to set up our service account credentials. In order to do so, we will need to go to Google projects.
  • From there, we can create a service account and then create a JSON file.
  • We will also need to enable access to the Google Sheets API.

With all that set, now we can start extracting.

In order to do so, we will use the function called [open_by_url](https://gspread.readthedocs.io/en/latest/user-guide.html). This will give us access to the exact spreadsheet, as shown in the code below:

import gspread
import mysql.connector
from oauth2client.service_account import ServiceAccountCredentials

# This is the mysqlcredentials.py file containing your credentials.
import mysqlcredentials as mc

# The required variables for gspread:
scope = ['https://spreadsheets.google.com/feeds', \
       'https://www.googleapis.com/auth/drive']

# The credentials created for the service account in your Google project
# is stored in a .json file after you click 'Create Key'
# I renamed this file to sheetstodb.json.
creds = ServiceAccountCredentials.from_json_keyfile_name('creds.json', scope)
client = gspread.authorize(creds)

def GetSpreadsheetData(sheetName, worksheetIndex):
   sheet = client.open_by_url(sheetName).get_worksheet(worksheetIndex)
   return sheet.get_all_values()[1:]

With our GetSpreadsheetData function, we will return the data from our Google Sheet. Essentially, we are getting back an array of arrays.

So it looks like this:

#Array of arrays
x = [
["username1","2020-01-01","A","B","D","C"],
["username2","2020-01-01","A","B","D","C"],
["username3","2020-01-01","A","B","D","C"]]

This is important to note, as we are manipulating this dataset and working to insert it into MySQL.


Inserting Into MySQL and Preserving Nulls

Now that we have a function that returns the data from our Google Sheet with Python, we can now work on inserting it into MySQL.

Before we do that, we also created a function to replace all the empty values in the Google Sheets from '' to None.

This is an important distinction that can cause major issues down the line. We won't go into the exact details, but long story short, Null does not equal ''.

In this case, we created the function preserveNullValues. This uses a nested loop to find all the ''.

Technically, we could also write this tighter using something like:\
['None' if v is None else v for v in d]

But we find that can often be confusing, so we wanted to make sure there was a clear understanding of what the function did.

In addition, we created a function that inserts to MySQL called WriteToMysqlTable.

First, we have a separate file that contains the password, host, username, and database name. Using this, we will create a connection.

This connection will be used to run the insert statement.

In this case, the script has the SQL insert hard-coded. However, we should probably either have this set up with a config, passed as a variable, or just dynamically created.

It is also important to note that once we run the insert statement, we will need to call cursor.commit() with MySQL to make sure the data is actually inserted.

With that, we are almost done. Now we just need to put all these functions together.


# Replaces any empty cells with 'NULL'
def preserveNULLValues(listName):
   print('Preserving NULL values...')
   for x in range(len(listName)):
       for y in range(len(listName[x])):
           if listName[x][y] == '':
               listName[x][y] = None
   print('NULL values preserved.')

def WriteToMySQLTable(sql_data, tableName):
   try:
# Connection credentials for MySQL.
       connection = mysql.connector.connect(
       user = mc.user,
       password = mc.password,
       host = mc.host,
       database = mc.database
       )

       sql_insert_statement = """INSERT INTO {}(
           Username,
           Date_Taken,
           Time_Started,
           Time_Finished,
           Answer_Question_1,
           Answer_Question_2,
           Answer_Question_3,
           Answer_Question_4,
           Answer_Question_5,
           Answer_Question_6 )
           VALUES ( %s,%s,%s,%s,%s,%s,%s,%s,%s,%s )""".format(tableName)

       cursor = connection.cursor()
       for i in sql_data:
           print(i)
           #print(sql_insert_statement)
           cursor.execute(sql_insert_statement, i)
# Now we execute the commit statement, and print to the console
# that the table was updated successfully
       connection.commit()
       print("Table {} successfully updated.".format(tableName))
# Errors are handled in the except block, and we will get
# the information printed to the console if there is an error
   except mysql.connector.Error as error :
       print("Error: {}. Table {} not updated!".format(error, tableName))
       connection.rollback()
       print("Error: {}. Table {} not updated!".format(error, tableName))
# We need to close the cursor and the connection,
# and this needs to be done regardless of what happened above.

Putting Together Your Python Script

Alright, now that we have all these functions, we need to put everything together.

If you look at the code below, we provide the URL for the Google Sheet, so you can dynamically use this for any Google Sheet you are looking to extract data from.


import gspread
import mysql.connector
from oauth2client.service_account import ServiceAccountCredentials

# This is the mysqlcredentials.py file containing your credentials.
import mysqlcredentials as mc

# The required variables for gspread:
scope = ['https://spreadsheets.google.com/feeds', \
       'https://www.googleapis.com/auth/drive']

# The credentials created for the service account in your Google project
# is stored in a .json file after you click 'Create Key'
# I renamed this file to sheetstodb.json.
creds = ServiceAccountCredentials.from_json_keyfile_name('creds.json', scope)
client = gspread.authorize(creds)

# Now that that's done, pull data from the Google sheet.
# 'sheetName' describes the Google sheet's name,
# 'worksheetIndex' describes the index of the worksheet at the bottom.
def GetSpreadsheetData(sheetName, worksheetIndex):
   sheet = client.open_by_url(sheetName).get_worksheet(worksheetIndex)
   return sheet.get_all_values()[1:]

# Finally, write this data to MySQL:
def WriteToMySQLTable(sql_data, tableName):
   try:
# Connection credentials for MySQL.
       connection = mysql.connector.connect(
       user = mc.user,
       password = mc.password,
       host = mc.host,
       database = mc.database
       )
       sql_drop = " DROP TABLE IF EXISTS {} ".format(tableName)
               sql_create_table = """CREATE TABLE {}(
           Username VARCHAR(255),
           Date_Taken VARCHAR(16),
           Time_Started VARCHAR(16),
           Time_Finished VARCHAR(16),
           Answer_Question_1 VARCHAR(100),
           Answer_Question_2 VARCHAR(100),
           Answer_Question_3 VARCHAR(100),
           Answer_Question_4 VARCHAR(100),
           Answer_Question_5 VARCHAR(100),
           Answer_Question_6 VARCHAR(10),
           PRIMARY KEY (Username)
           )""".format(tableName)

       sql_insert_statement = """INSERT INTO {}(
           Username,
           Date_Taken,
           Time_Started,
           Time_Finished,
           Answer_Question_1,
           Answer_Question_2,
           Answer_Question_3,
           Answer_Question_4,
           Answer_Question_5,
           Answer_Question_6 )
           VALUES ( %s,%s,%s,%s,%s,%s,%s,%s,%s,%s )""".format(tableName)
# Here we create a cursor, which we will use to execute
# the MySQL statements above. After each statement is executed,
# a message will be printed to the console if the execution was successful.
       cursor = connection.cursor()
       cursor.execute(sql_drop)
       print('Table {} has been dropped'.format(tableName))
       cursor.execute(sql_create_table)


           print('Table {} has been created'.format(tableName))
# We need to write each row of data to the table, so we use a for loop
# that will insert each row of data one at a time
       print(sql_data)
       for i in sql_data:
           print(i)
           #print(sql_insert_statement)
           cursor.execute(sql_insert_statement, i)
# Now we execute the commit statement, and print to the console
# that the table was updated successfully
       connection.commit()
       print("Table {} successfully updated.".format(tableName))
# Errors are handled in the except block, and we will get
# the information printed to the console if there is an error
   except mysql.connector.Error as error :
       print("Error: {}. Table {} not updated!".format(error, tableName))
       connection.rollback()
       print("Error: {}. Table {} not updated!".format(error, tableName))
# We need to close the cursor and the connection,
# and this needs to be done regardless of what happened above.


# Replaces any empty cells with 'NULL'
def preserveNULLValues(listName):
   print('Preserving NULL values...')
   for x in range(len(listName)):
       for y in range(len(listName[x])):
           if listName[x][y] == '':
               listName[x][y] = None
   print('NULL values preserved.')

# Uses Google Drive's API.
# If you get an error regarding this, go to the link and enable it.
data = GetSpreadsheetData(mc.url, 0)

# Write to the table in the database.
preserveNULLValues(data)
WriteToMySQLTable(data, 'MyData')

Once the data is pulled, we can run preserveNullValues followed by WriteToMysqlTable.

That's it!

Really, a majority of the work is done with the gspread library. Once the data is extracted and in an array, it's pretty easy to move around the data.

That's what makes Python great. Much of what you need to do exists in some sort of library somewhere.


Time to Start Automating Your Google Sheet Extracts

We hope this script helps you automate your Google Sheet extracts with Python as well as look for other tasks you can automate.

In the examples above, we're manually scheduling this script. However, if you want to automate this more effectively in the future, you could use a framework like Airflow. Then instead of manually running the script, you could instead run this script on a daily basis.

Some people want to develop their own automated system. Personally, we don't recommend this. There are plenty of great frameworks that can help manage your automated tasks. It is often better to spend more time working on the process vs. building a system.

However you decide to automate, we wish you good luck and happy automating!

If you enjoyed this article, then consider reading some of these articles as well.

Advanced SQL For Data Analysts, Data Scientists And Engineers

Airbnb's Airflow Vs. Spotify's Luigi

How Algorithms Can Become Unethical and Biased

Top 10 Business Intelligence (BI) Implementation Tips​

5 Great Big Data Tools For The Future --- From Hadoop To Cassandra

Learning Data Science: Our Favorite Resources Free Or Not

Posted on by:

seattledataguy profile

SeattleDataGuy

@seattledataguy

Software Engineer | Consultant | Data Scientist

Discussion

pic
Editor guide
 
 

Great article, but without the automate itself part it's kinda not completed. Maybe this can be subject of an other article.

 

That's fair. It already was getting long from the stand-point of code. I can combine it with an airflow article in the future.