
Ahmad Faiyaz for Dataform

Originally published at dataform.co

Building an end-to-end Machine Learning Pipeline in BigQuery

Google BigQuery is one of the more advanced data warehouses on the market, and has out-of-the-box support for building and training ML models using SQL-like statements, without requiring any code. This is extremely powerful; however, managing end-to-end ML pipelines this way can be fragile and requires manual steps to update training and prediction.

In this article we walk through building a simple end-to-end BigQuery ML pipeline, using the open-source framework Dataform to manage the full process of data preparation, training and prediction.

Google BigQuery provides a number of machine learning algorithms, such as linear regression and binary logistic regression. To find out more about the models that BigQuery supports, check out the documentation.
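
To give a sense of what this looks like in practice, here is a minimal, illustrative CREATE MODEL statement. The dataset, table and column names below are placeholders, not part of this article's pipeline; the real model is built later on.

CREATE OR REPLACE MODEL `my_dataset.my_model`
OPTIONS (
  model_type = 'LINEAR_REG',
  input_label_cols = ['label']
) AS
SELECT
  feature_1,
  feature_2,
  label
FROM
  `my_dataset.training_data`

Training happens entirely inside BigQuery when this statement runs; there is no separate training infrastructure to manage.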

A typical workflow for building a machine learning model looks like:

  1. Data exploration
  2. Data pre-processing (data transformation)
  3. Model training
  4. Model evaluation on test dataset
  5. Prediction/Inference on real dataset

In this article I am going to follow the tutorial from the Google Cloud documentation to create a machine learning model with Google BigQuery; please read the official documentation if you want to understand the technical details.

For managing our end-to-end pipeline, we are going to use Dataform to version control our queries and manage pipeline execution order. Dataform makes it easy to version control our BigQuery code and execute complex pipelines with just a few commands.

All of the code for this example is available in the demo repository here for you to follow along with. To run this example in your own BigQuery project, perform the following steps.

Clone the repo:

git clone https://github.com/dataform-co/bigquery-ml-pipeline.git

Install the Dataform CLI using npm or yarn:

npm i -g @dataform/cli

Set up the Dataform project:

cd bigquery-ml-pipeline && dataform install

For instructions on authenticating with BigQuery so that you can run the queries, you can follow the Dataform documentation here.
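
If you're working locally with the CLI installed above, the Dataform CLI also provides an interactive command that walks you through creating BigQuery credentials (check dataform help if the command differs in your CLI version):

dataform init-creds bigquery

This stores credentials in a local .df-credentials.json file, which should be kept out of version control.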

Data exploration

Data exploration is usually done in Jupyter notebooks or in a dashboard solution such as Looker or Google Data Studio. In this step you need to work out which datasets are required and which columns should be used as features for model training. Following the Google Cloud tutorial, I am going to use the public dataset named Census Adult Income (bigquery-public-data.ml_datasets.census_adult_income), which contains columns such as age, workclass, education_num, occupation, hours_per_week and income_bracket.
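
As a quick, illustrative exploration step (not part of the pipeline itself), you can profile the label distribution directly in the BigQuery console; the column names come from the dataset above:

SELECT
  income_bracket,
  COUNT(*) AS respondents,
  ROUND(AVG(hours_per_week), 1) AS avg_hours_per_week
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`
GROUP BY
  income_bracket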

Data pre-processing

I need to split the dataset into three sections: training, evaluation and prediction. To do this with Dataform, I will create a new sqlx file containing the code block below. The query extracts data on census respondents, including education_num, which represents the respondent's level of education, and workclass, which represents the type of work the respondent performs. It excludes several categories that duplicate data: for example, the education and education_num columns in the census_adult_income table express the same data in different formats, so the query excludes the education column. The dataframe column uses the otherwise excluded functional_weight column to label 80% of the data for training, and reserves the remaining data for evaluation and prediction. The query creates a table containing these columns, so that I can use them for training and prediction later.

census_input.sqlx

config {
  type: "table"
}

SELECT
  age,
  workclass,
  native_country,
  marital_status,
  education_num,
  occupation,
  race,
  hours_per_week,
  income_bracket,
  CASE
    WHEN MOD(functional_weight, 10) < 8 THEN 'training'
    WHEN MOD(functional_weight, 10) = 8 THEN 'evaluation'
    WHEN MOD(functional_weight, 10) = 9 THEN 'prediction'
  END AS dataframe
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`
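
Once this table has been created, a quick sanity check (illustrative, run wherever you query BigQuery) is to confirm that the split is roughly 80/10/10. The table's location depends on the defaults in your dataform.json:

SELECT
  dataframe,
  COUNT(*) AS row_count
FROM
  `your-project.dataform.census_input` -- replace with your project and schema
GROUP BY
  dataframe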

Model training

Now it's time to create a model using the training dataset. I am creating a new file for this step. You can find out more about the available model options here.

model_train.sqlx

config {
  type: 'operations',
  hasOutput: true,
  tags: ["training"]
}

CREATE
OR REPLACE MODEL ${self()} OPTIONS (
  model_type = 'LOGISTIC_REG',
  auto_class_weights = TRUE,
  input_label_cols = ['income_bracket']
) AS
SELECT
  *
FROM
  ${ref('census_input')}
WHERE
  dataframe = 'training'

You’ll notice a few non-SQL constructs above; these are Dataform features. The configuration block at the top tells Dataform that this is an operation (a custom SQL statement) that generates an output relation. We can use the self() function in Dataform to get the fully qualified name of the relation that we should create, and the ref() function to select from the preprocessed dataset that we created in the previous step.
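
When Dataform compiles this file, self() and ref() are replaced with fully qualified BigQuery names. Assuming a project called my-gcp-project and the default dataform schema (both depend on your dataform.json settings), the compiled statement would look roughly like this:

CREATE
OR REPLACE MODEL `my-gcp-project.dataform.model_train` OPTIONS (
  model_type = 'LOGISTIC_REG',
  auto_class_weights = TRUE,
  input_label_cols = ['income_bracket']
) AS
SELECT
  *
FROM
  `my-gcp-project.dataform.census_input`
WHERE
  dataframe = 'training'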

I have also added a tag, which makes it possible to run only this step, with or without its dependencies, using the command dataform run --tags training --include-deps, or on Dataform web. Check the Dataform documentation on tags.

Model evaluation

Continuing with the standard ML workflow, I will evaluate my model with the evaluation dataset.
As I am building an ML pipeline, it is important not to run the prediction step if my model's accuracy is not good enough (for this model, I will treat an accuracy below 0.8 as unacceptable). To solve this in the pipeline, I will use Dataform's assertion feature.

BigQuery’s ML.EVALUATE function returns a row with an accuracy column, on which I can run the assertion.

model_evaluate.sqlx

config {
  type: 'assertion',
  tags: ["evaluate"]
}

SELECT
  *
FROM
  ML.EVALUATE (
    MODEL ${ref('model_train')},
    (
      SELECT
        *
      FROM
        ${ref('census_input')}
      WHERE
        dataframe = 'evaluation'
    )
  )
WHERE accuracy < .8
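
The assertion only returns rows that violate the accuracy threshold, so an empty result means the model passed. If you want to inspect the full set of evaluation metrics yourself (for a logistic regression model, ML.EVALUATE returns columns such as precision, recall, accuracy, f1_score, log_loss and roc_auc), you can run the same query without the WHERE filter, for example directly in the BigQuery console with the compiled table and model names:

SELECT
  *
FROM
  ML.EVALUATE (
    MODEL `your-project.dataform.model_train`, -- replace with your project and schema
    (
      SELECT
        *
      FROM
        `your-project.dataform.census_input`
      WHERE
        dataframe = 'evaluation'
    )
  )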

To train the model and run the accuracy test, use the following command:

dataform run --actions model_evaluate --include-deps

Run prediction

Now that I have created and evaluated the model, I want to run prediction on the real dataset. Let's do it.

predict.sqlx

config {
  type: 'table',
  dependencies: ['model_evaluate'],
  tags: ["predict"]
}

SELECT
  *
FROM
  ML.PREDICT (
    MODEL ${ref('model_train')},
    (
      SELECT
        *
      FROM
        ${ref('census_input')}
      WHERE
        dataframe = 'prediction'
    )
  )

This prediction step depends on the evaluation step, so I have added it as an explicit dependency; if the accuracy assertion fails, the prediction table will not be built.

So let's take a look at the dependency graph: census_input feeds model_train, model_train feeds the model_evaluate assertion, and the predict table is built last, once evaluation has passed.
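
With these dependencies in place, the whole pipeline can be executed with a single command, or you can run just the prediction branch (and everything it depends on) using the tag added earlier:

dataform run

dataform run --tags predict --include-deps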

Now I can run the whole ML pipeline on a schedule within Dataform, written entirely in SQL and executed by BigQuery.
