Evaluating Customers habits and Predicting their Purchases with a Classification Model with BigQuery ML

Hi Folks,

I decided to release a document on using a low code ML tool for Predictive Analytics entirely on GCP.

I'll be loading datasets into BigQuery, from Google Analytics Sample E-commerce dataset that has millions of Google Analytics records for the Google Merchandise Store.
Alongside,

Query and explore the ecommerce dataset
Create a training and evaluation dataset to be used for batch prediction
Create a classification (logistic regression) model in BigQuery ML
Evaluate the performance of your machine learning model
Predict and rank the probability that a visitor will make a purchase

Right on my datasets inside BigQuery, i run an SQL command to find the total visitors who visited our website, and what % made a purchase.

standardSQL

WITH visitors AS(
SELECT
COUNT(DISTINCT fullVisitorId) AS total_visitors
FROM data-to-insights.ecommerce.web_analytics
),
purchasers AS(
SELECT
COUNT(DISTINCT fullVisitorId) AS total_purchasers
FROM data-to-insights.ecommerce.web_analytics
WHERE totals.transactions IS NOT NULL
)
SELECT
total_visitors,
total_purchasers,
total_purchasers / total_visitors AS conversion_rate
FROM visitors, purchasers

But what are the top 5 selling products?

SELECT
p.v2ProductName,
p.v2ProductCategory,
SUM(p.productQuantity) AS units_sold,
ROUND(SUM(p.localProductRevenue/1000000),2) AS revenue
FROM data-to-insights.ecommerce.web_analytics,
UNNEST(hits) AS h,
UNNEST(h.product) AS p
GROUP BY 1, 2
ORDER BY revenue DESC
LIMIT 5;

Select features and create your training dataset:
SELECT

EXCEPT(fullVisitorId) FROM # features (SELECT fullVisitorId, IFNULL(totals.bounces, 0) AS bounces, IFNULL(totals.timeOnSite, 0) AS time_on_site FROM data-to-insights.ecommerce.web_analytics WHERE totals.newVisits = 1) JOIN (SELECT fullvisitorid, IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit FROM data-to-insights.ecommerce.web_analytics GROUP BY fullvisitorid) USING (fullVisitorId) ORDER BY time_on_site DESC LIMIT 10;

Creating a BigQuery dataset to store models:
I'll be creating a new dataset to store my models under my project name, then select a BigQuery ML model type and specify options.
CREATE OR REPLACE MODEL ecommerce.classification_model
OPTIONS
(
model_type='logistic_reg',
labels = ['will_buy_on_return_visit']
)
AS

standardSQL

SELECT

EXCEPT(fullVisitorId) FROM # features (SELECT fullVisitorId, IFNULL(totals.bounces, 0) AS bounces, IFNULL(totals.timeOnSite, 0) AS time_on_site FROM data-to-insights.ecommerce.web_analytics WHERE totals.newVisits = 1 AND date BETWEEN '20160801' AND '20170430') # train on first 9 months JOIN (SELECT fullvisitorid, IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit FROM data-to-insights.ecommerce.web_analytics GROUP BY fullvisitorid) USING (fullVisitorId) ;

Next thing is to evaluate classification model performance
For classification problems in ML, i want to minimize the False Positive Rate (predict that the user will return and purchase and they don't) and maximize the True Positive Rate (predict that the user will return and purchase and they do).

Improving model performance with feature engineering: How far the visitor got in the checkout process on their first visit Where the visitor came from (traffic source: organic search, referring site etc.) Device category (mobile, tablet, desktop) Geographic information (country)

Creating a second model:
CREATE OR REPLACE MODEL ecommerce.classification_model_2
OPTIONS
(model_type='logistic_reg', labels = ['will_buy_on_return_visit']) AS
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM data-to-insights.ecommerce.web_analytics
GROUP BY fullvisitorid
)

add in new features

SELECT * EXCEPT(unique_session_id) FROM (
SELECT
CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM data-to-insights.ecommerce.web_analytics,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430' # train 9 months
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
);

To check the ROC-AUC

standardSQL

SELECT
roc_auc,
CASE
WHEN roc_auc > .9 THEN 'good'
WHEN roc_auc > .8 THEN 'fair'
WHEN roc_auc > .7 THEN 'not great'
ELSE 'poor' END AS model_quality
FROM
ML.EVALUATE(MODEL ecommerce.classification_model_2, (
WITH all_visitor_stats AS (
SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM data-to-insights.ecommerce.web_analytics
GROUP BY fullvisitorid
)

add in new features

SELECT * EXCEPT(unique_session_id) FROM (
SELECT
CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,
# labels
will_buy_on_return_visit,
MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
# behavior on the site
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site,
totals.pageviews,
# where the visitor came from
trafficSource.source,
trafficSource.medium,
channelGrouping,
# mobile or desktop
device.deviceCategory,
# geographic
IFNULL(geoNetwork.country, "") AS country
FROM data-to-insights.ecommerce.web_analytics,
UNNEST(hits) AS h
JOIN all_visitor_stats USING(fullvisitorid)
WHERE 1=1
# only predict for new visits
AND totals.newVisits = 1
AND date BETWEEN '20170501' AND '20170630' # eval 2 months
GROUP BY
unique_session_id,
will_buy_on_return_visit,
bounces,
time_on_site,
totals.pageviews,
trafficSource.source,
trafficSource.medium,
channelGrouping,
device.deviceCategory,
country
)
));