By Team RudderStack • Originally published at rudderstack.com

How To Access And Query Your Google BigQuery Data Using Python And R

Overview

In this post, we see how to load Google BigQuery data using Python and R, and then query that data to get useful insights. We leverage the Google Cloud BigQuery library to connect to BigQuery from Python, and the bigrquery library to do the same from R.

We also look into the two steps of manipulating the BigQuery data using Python/R:

  • Connecting to Google BigQuery and accessing the data
  • Querying the data using Python/R

In this post, we assume that you have all your customer data stored in Google BigQuery.

If you are interested in learning more about how to get data from your data sources into tools like Google BigQuery and other data warehouse solutions in real time, you should explore Customer Data Infrastructure tools such as RudderStack.

Python

Python is one of the most widely used general-purpose programming languages out there. It has gained a lot of attention and popularity because of its ease of use and flexibility.

Many engineers and data science teams also prefer Python because of the extensive libraries and tools at their disposal for connecting to other third-party systems and manipulating data.

Connecting Google BigQuery With Python

To query your Google BigQuery data using Python, we need to connect the Python client to our BigQuery instance. We do so using a cloud client library for the Google BigQuery API. You can also choose to use any other third-party option to connect BigQuery with Python; the BigQuery-Python library by tylertreat is also a great option.

We use the Google Cloud BigQuery library because it is stable and officially supported by Google.

For this post, we assume that you already have a Python development environment set up. If not, we highly recommend you refer to the Python Development Environment Setup Guide.

To install the library, run the following command from your terminal:

pip install --upgrade google-cloud-bigquery
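
To confirm the library is available, you can print its version from a Python shell (a quick sanity check):

from google.cloud import bigquery
print(bigquery.__version__)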

Next, we connect the client to the database. To do this, you will need to download a JSON file that contains the BigQuery service account credentials. If you don't have a service account, follow this guide to create one, and then proceed to download the JSON file to your local machine.

Now that we have everything set up, we proceed to initialize the connection. The following Python code is used to do so:

from google.cloud import bigquery
from google.oauth2 import service_account

# Load the service account credentials from the downloaded JSON key file
credentials = service_account.Credentials.from_service_account_file(
    'path/to/file.json')

# The project ID identifies the BigQuery project to connect to
project_id = 'my-bq'
client = bigquery.Client(credentials=credentials, project=project_id)

In the snippet above, you will need to specify the project_id and the location of your JSON key file by replacing 'path/to/file.json' with the actual path to the locally stored file.
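
As an aside, if you would rather not handle the key file in code, the client library can also pick up credentials from the GOOGLE_APPLICATION_CREDENTIALS environment variable. A minimal sketch, assuming the same JSON key file:

import os
from google.cloud import bigquery

# Point the library at the downloaded key file (same placeholder path as above)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/file.json"

# With the variable set, the client discovers the credentials automatically
client = bigquery.Client(project='my-bq')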

In Google BigQuery, the project is a top-level container and provides default access control across all the datasets.
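
To see this scoping in action, you can list the datasets the authorized account can access within the project, using the client created above:

# List every dataset visible to these credentials in the project
for dataset in client.list_datasets():
    print(dataset.dataset_id)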

Executing Queries On BigQuery Data With Python

Now that we have the BigQuery client set up and ready to use, we can execute queries on the BigQuery dataset.

For this, we use the query method, which inserts a query job into the BigQuery queue and returns a QueryJob instance immediately. The query itself executes asynchronously; calling the job's result() method then blocks until the job completes and returns the query results. Since we do not specify a timeout, the client waits as long as the job needs.

For more details on how this method works, please refer to the official documentation here.

The required Python code is as follows:

query_job = client.query("""
   SELECT *
   FROM dataset.my_table
   LIMIT 1000 """)

results = query_job.result() # Wait for the job to complete.
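Once result() returns, each row supports dictionary-style access by column name, and the job can also hand back a pandas DataFrame. A brief sketch; the column name is hypothetical, and to_dataframe() requires the pandas and db-dtypes packages to be installed:

# Rows behave like dictionaries keyed by column name
for row in results:
    print(row["user_id"])  # "user_id" is a hypothetical column

# Alternatively, load the full result set into a pandas DataFrame
df = query_job.to_dataframe()
print(df.head())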

Please note that the above query uses standard SQL syntax by default. If you wish to use legacy SQL instead, use the code below:

# Create a job configuration and switch the dialect to legacy SQL
job_config = bigquery.QueryJobConfig()
job_config.use_legacy_sql = True

query_job = client.query("""
   SELECT *
   FROM dataset.my_table
   LIMIT 1000""", job_config=job_config)

results = query_job.result() # Wait for the job to complete.
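On a related note, QueryJobConfig also lets you bind query parameters, so runtime values are not concatenated into the SQL string. A minimal sketch; the column name and value below are illustrative:

# Bind a value to the @event_name placeholder in the query
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("event_name", "STRING", "purchase"),
    ]
)

query_job = client.query("""
   SELECT *
   FROM dataset.my_table
   WHERE event_name = @event_name
   LIMIT 1000""", job_config=job_config)

results = query_job.result()  # Wait for the job to complete.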

R

R is a popular alternative to Python used by many data scientists and engineers. If detailed, methodical statistical data analysis is your goal, then very few languages are as good as R.

When it comes to working with Google BigQuery, R too offers a robust and easy-to-use library for data manipulation and querying. For this post, we use the bigrquery library, created and maintained by Hadley Wickham, Chief Scientist at RStudio.

For this section, we assume that you have set up the R development environment already. If not, you can always follow this guide to set up RStudio.

Connecting To Google BigQuery With R

To install bigrquery, we run the following command from within the R console:

install.packages("bigrquery")

And that's it! We are good to go.

As with Python, we will need to authorize our R client to access Google Cloud Services. As per the documentation for bigrquery, we will follow the prompt within the R console to open the authorization URL and copy the code into the console.

Note that this authorization needs to be done only once; subsequent requests will automatically refresh the access credentials.

Executing Queries On BigQuery Data With R

To execute queries on the BigQuery data with R, we will follow these steps:

  • Specify the project ID from the Google Cloud Console, as we did with Python.
  • Form your query string to query the data.
  • Call query_exec with your project ID and query string.

The code to implement this is as follows:

# Import the library
library(bigrquery)

# Your project ID from the Google Cloud Console
project_id <- "your-project-id"

# Sample query
sql_string <- "SELECT * FROM dataset.my_table LIMIT 1000"

# Execute the query and store the result
query_results <- query_exec(sql_string, project = project_id, useLegacySql = FALSE)


As with Python, if you wish to execute queries using legacy SQL, you can change useLegacySql to TRUE in your query_exec function call.

Conclusion

In this post, we saw how easy and straightforward it is to access and manipulate the data stored in Google BigQuery using Python and R.

These two languages make it quite easy to build a statistical model on top of this data, which can be used for various purposes; understanding customers' in-app behavior and predicting the churn rate are just some of the use cases.
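
As a taste of what that can look like in Python, here is a minimal sketch that fits a logistic regression on query results pulled into a pandas DataFrame; the feature and label columns are hypothetical, and scikit-learn is assumed to be installed:

from sklearn.linear_model import LogisticRegression

# df comes from query_job.to_dataframe(); all column names are hypothetical
X = df[["sessions_last_30d", "avg_order_value"]]
y = df["churned"]

model = LogisticRegression()
model.fit(X, y)
print(model.score(X, y))  # in-sample accuracy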

There is a significant advantage to using a database to store your data compared to using other mediums such as CSV files. Apart from the flexibility to store large volumes of data with varying data types, you can leverage SQL's power to generate complex queries that give you meaningful insights.

While this post focuses on Google BigQuery, using any other database tool with R and Python is equally easy. The only difference will be the choice of Python/R library used to connect to the database.

Sign Up For Free And Start Sending Data

Test out our event stream, ELT, and reverse-ETL pipelines. Use our HTTP source to send data in less than 5 minutes, or install one of our 12 SDKs in your website or app. Get started
