Cris Crawford

BigQuery Machine Learning

I'm wrapping up Week 3 of the Data Talks Club Data Engineering Zoomcamp, which was all about BigQuery. It turns out that you can do machine learning in BigQuery: you don't need to export the data to another location to build a model, and you don't need to write Python or Java, because BigQuery machine learning runs on SQL. In this post I'll talk about building a model to predict the tip amount of a taxi ride. The goal is to become familiar with the SQL commands that create a model, evaluate it, and use it to make predictions.

Normally, you would export the data from the data warehouse, build a model somewhere else, and then deploy it. But we can build the model right in BigQuery. The first 10 GB per month of storage is free, as is the first 1 TB of queries processed and the first 10 GB per month of processing in the ML CREATE MODEL step.

The steps of Machine Learning are:

  1. Determine the problem you want to solve.
  2. Collect the data - we've done this already.
  3. Evaluate the data - transform, normalize, etc.
  4. Split the data into training and testing sets.
  5. Train the model on the training set. Choose an algorithm and tune the parameters.
  6. Validate the model using the testing set.
  7. Deploy the optimized ML model.

BigQuery helps with most of these steps. Depending on the use case that you want to solve, BigQuery provides a number of ML algorithms. The use cases can be prediction, recommendation, finding anomalies, grouping data into clusters, and more. We will use a linear regression model to predict the tip amount.
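
The algorithm is selected with the model_type option when you create the model. Just as a sketch (this isn't part of the exercise, and the model name here is made up), a clustering use case on the week 2 taxi table would swap in kmeans:

-- Sketch only: k-means clustering instead of linear regression.
CREATE OR REPLACE MODEL `ny_taxi.trip_clusters`
OPTIONS (model_type='kmeans', num_clusters=4) AS
SELECT trip_distance, fare_amount, tip_amount
FROM `ny_taxi.yellow_taxi_data`;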

BigQuery provides some feature pre-processing steps automatically. For more advanced users, it provides manual pre-processing functions such as bucketizing or polynomial expansion.
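
For example (this isn't part of the homework, and the model name is made up), those functions can go in the TRANSFORM clause of CREATE MODEL, so the same transformations are re-applied automatically at prediction time:

-- Sketch only: bucketize trip_distance into ranges and expand two
-- numeric features into second-degree polynomial terms.
CREATE OR REPLACE MODEL `ny_taxi.tip_model_manual_prep`
TRANSFORM (
  ML.BUCKETIZE(trip_distance, [1.0, 5.0, 10.0]) AS distance_bucket,
  ML.POLYNOMIAL_EXPAND(STRUCT(trip_distance, fare_amount), 2) AS poly,
  tip_amount
)
OPTIONS (model_type='linear_reg', input_label_cols=['tip_amount']) AS
SELECT trip_distance, fare_amount, tip_amount
FROM `ny_taxi.yellow_taxi_data`;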

Automatic pre-processing includes standardization and centering of numeric columns, one-hot encoding of categorical columns such as BOOL and STRING, and a mix of these transformations for other column types.

One-hot encoding converts a categorical column into a sparse vector. For instance, if there are 250 pickup locations, you end up with 250 columns, where every entry is 0 except for a 1 in the column that corresponds to that row's pickup location.

At first I used a table from the week 2 homework, ny_taxi.yellow_taxi_data. The location IDs and the payment type are really categorical variables, and we want them treated as such, so we cast them as STRING values so they will be one-hot encoded.

Here's the SQL that I used to create the table:

CREATE OR REPLACE TABLE `ny_taxi.yellow_tripdata_ml` (
  `passenger_count` INTEGER,
  `trip_distance` FLOAT64,
  `pu_location_id` STRING,
  `do_location_id` STRING,
  `payment_type` STRING,
  `fare_amount` FLOAT64,
  `tolls_amount` FLOAT64,
  `tip_amount` FLOAT64
) AS (
  SELECT
    passenger_count,
    trip_distance,
    CAST(pu_location_id AS STRING),
    CAST(do_location_id AS STRING),
    CAST(payment_type AS STRING),
    fare_amount,
    tolls_amount,
    tip_amount
  FROM `ny_taxi.yellow_taxi_data`
  WHERE fare_amount != 0
);

In the next step, I created a model.

CREATE OR REPLACE MODEL `ny_taxi.tip_model`
OPTIONS (
  model_type='linear_reg',
  input_label_cols=['tip_amount'],
  data_split_method='AUTO_SPLIT'
) AS
SELECT *
FROM `ny_taxi.yellow_tripdata_ml`
WHERE tip_amount IS NOT NULL;

Here, the "label" is the value we want the model to predict, tip_amount, and the model type is linear regression.
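
A linear regression learns one coefficient per input feature, plus an intercept. This next query wasn't part of the video, but if you're curious what the model actually learned, ML.WEIGHTS returns the coefficients (one-hot encoded columns come back as an array of per-category weights):

SELECT *
FROM ML.WEIGHTS(MODEL `ny_taxi.tip_model`);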

This model had the following statistics (under the EVALUATION tab of the model):

Mean absolute error: 0.8815
Mean squared error: 2.9538
Mean squared log error: 0.2968
Median absolute error: 0.4905
R squared: 0.4647

Training data loss was 3.9813 and evaluation data loss was 2.9538. R squared was 0.4647; it typically falls between 0 and 1, where 0 means the features explain none of the variance in the label and 1 means a perfect fit.
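
These metrics can also be retrieved with a query instead of reading them off the console. This wasn't in the video, but ML.EVALUATE computes them against whatever data you pass in:

SELECT *
FROM ML.EVALUATE(MODEL `ny_taxi.tip_model`,
    (
    SELECT *
    FROM `ny_taxi.yellow_tripdata_ml`
    WHERE tip_amount IS NOT NULL
    )
);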

In the context of machine learning, "loss" is a measure of how well a model's predictions match the true values of the data it's trained on. It quantifies the difference between the model's predicted output and the ground-truth labels or values.

The loss function is a crucial component of the training process in machine learning algorithms. It serves as a guide for the optimization algorithm to adjust the model's parameters (weights and biases) during the training process to minimize the error or loss between the predicted outputs and the actual targets.

The choice of loss function depends on the type of machine learning task being performed, such as classification, regression, or other types of learning problems.

For regression tasks, a common loss function is Mean Squared Error (MSE), which computes the average of the squared differences between the predicted values and the actual target values.
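
In symbols, for $n$ rows with actual values $y_i$ and predicted values $\hat{y}_i$:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$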

I ran SELECT * FROM ML.FEATURE_INFO(MODEL `ny_taxi.tip_model`); to get some idea of what BigQuery did with the features. I could see that the floating-point columns had been transformed, because their statistics no longer looked like raw data. For example, the tolls_amount minimum was -31.12 and the maximum was 811.75.

Finally I ran a predict query:

SELECT *
FROM ML.PREDICT(MODEL `ny_taxi.tip_model`,
   (
   SELECT *
   FROM `ny_taxi.yellow_tripdata_ml`
   WHERE tip_amount IS NOT NULL
   )
);
Enter fullscreen mode Exit fullscreen mode

Note that ML.PREDICT runs on whatever rows you pass it, so this was the whole table, not just the held-out testing split. The predicted values for tip_amount didn't seem to correlate very well with the actual tip_amount. I think the reason is that the instructor pared down the variables used in the linear regression. At the end of the video, he had a query with tuning parameters in it, but he didn't run it. I did, and I got worse results.
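
I haven't reproduced the instructor's exact query here, but for reference, a linear_reg model accepts manual tuning options such as l1_reg and l2_reg. A sketch, with placeholder values (the model name and option values are made up, not the instructor's settings):

-- Sketch of a manually-tuned linear regression; option values are placeholders.
CREATE OR REPLACE MODEL `ny_taxi.tip_model_tuned`
OPTIONS (
  model_type='linear_reg',
  input_label_cols=['tip_amount'],
  data_split_method='AUTO_SPLIT',
  l1_reg=0.1,
  l2_reg=0.1
) AS
SELECT *
FROM `ny_taxi.yellow_tripdata_ml`
WHERE tip_amount IS NOT NULL;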

Google Cloud has a similar tutorial that also uses the New York taxi data and also tries to predict the tip_amount. However, it uses all of the columns as variables. I didn't have to cast the location IDs or payment_type to strings, because in the public dataset they are already stored that way.

First I created a table from the public BigQuery dataset. This was only about 10% of the size of the table I used before. (BigQuery ML treats a column named label as the target by default, which is why this query renames tip_amount instead of using input_label_cols.)

CREATE TABLE `ny_taxi.taxi_tip` AS
SELECT
  * EXCEPT(tip_amount), tip_amount AS label
FROM
  `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
WHERE
  tip_amount IS NOT NULL
LIMIT 100000;

I created a hyperparameter-tuned model from this data, using NUM_TRIALS=20 as the tuning parameter. The query for that was:

CREATE MODEL `ny_taxi.hp_taxi_tip_model`
OPTIONS
  (MODEL_TYPE='LINEAR_REG',
   NUM_TRIALS=20,
   MAX_PARALLEL_TRIALS=2) AS
SELECT *
FROM `ny_taxi.taxi_tip`;

This model took about 18 minutes to complete. When I evaluated it, the r2_score for each of the 20 trials was about 0.65, better than the 0.4647 from the first model.
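
I read the per-trial results off the console, but they can also be queried. This wasn't part of the video, but ML.TRIAL_INFO lists each trial's hyperparameters and objective value:

SELECT *
FROM ML.TRIAL_INFO(MODEL `ny_taxi.hp_taxi_tip_model`);

Then I ran ML.PREDICT: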

SELECT *
FROM ML.PREDICT(MODEL `ny_taxi.hp_taxi_tip_model`,
    (
    SELECT *
    FROM `ny_taxi.taxi_tip`
    LIMIT 10
    )
);

This showed that the predicted values did indeed look like the actual values. However, I needed to compare this with a plain run that didn't use hyperparameter tuning, so I built another model:

CREATE MODEL `ny_taxi.taxi_tip_model`
OPTIONS
  (MODEL_TYPE='LINEAR_REG') AS
SELECT *
FROM `ny_taxi.taxi_tip`;

This model had no hyperparameter tuning (unless BigQuery somehow retained knowledge of what I had tried before), and when I evaluated it, it turned out to do slightly better: R squared was 0.6851.

BigQuery can also tell you which columns influenced each prediction the most. This is the ML.EXPLAIN_PREDICT query:

SELECT *
FROM ML.EXPLAIN_PREDICT(MODEL `ny_taxi.hp_taxi_tip_model`,
    (
    SELECT *
    FROM `ny_taxi.taxi_tip`
    ), STRUCT(3 AS top_k_features));

For the untuned model, the top three columns (which are listed with each prediction) were pickup location, dropoff location, and dropoff datetime, consistently across all predictions. For the tuned model they varied quite a bit, including payment_type, total_amount, fare_amount, dropoff_location, and others. I don't quite understand this, but since it was part of the video, I've included the command here.
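
There is also ML.GLOBAL_EXPLAIN, which reports feature importance for the model as a whole rather than per prediction. I didn't run this one; as far as I can tell, it requires the model to have been created with enable_global_explain=TRUE, so it wouldn't work on the models above as trained:

-- Requires a model trained with enable_global_explain=TRUE.
SELECT *
FROM ML.GLOBAL_EXPLAIN(MODEL `ny_taxi.taxi_tip_model`);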
