<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nkwam Philip </title>
    <description>The latest articles on DEV Community by Nkwam Philip  (@nkwamphilip).</description>
    <link>https://dev.to/nkwamphilip</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F867403%2F83aaea85-c678-4d34-8c17-78c5fc2bc8dd.jpeg</url>
      <title>DEV Community: Nkwam Philip </title>
      <link>https://dev.to/nkwamphilip</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nkwamphilip"/>
    <language>en</language>
    <item>
      <title>Power BI Portfolio</title>
      <dc:creator>Nkwam Philip </dc:creator>
      <pubDate>Thu, 04 Aug 2022 11:41:34 +0000</pubDate>
      <link>https://dev.to/nkwamphilip/powerbi-portf-1eh8</link>
      <guid>https://dev.to/nkwamphilip/powerbi-portf-1eh8</guid>
      <description>&lt;p&gt;&lt;a href="https://app.powerbi.com/view?r=eyJrIjoiZTA0M2FiZGMtNDYwNC00ZmJiLWIwOTMtODgyNzc5ZDEwNmM3IiwidCI6Ijc0MzBjOGJlLWQ1ZTMtNDgxYi1hNTcwLTZjOGI0MzRkZGY4OCIsImMiOjZ9"&gt;https://app.powerbi.com/view?r=eyJrIjoiZTA0M2FiZGMtNDYwNC00ZmJiLWIwOTMtODgyNzc5ZDEwNmM3IiwidCI6Ijc0MzBjOGJlLWQ1ZTMtNDgxYi1hNTcwLTZjOGI0MzRkZGY4OCIsImMiOjZ9&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.powerbi.com/view?r=eyJrIjoiODVhYTBlZDktM2I4OC00OTM2LTg0NDYtMjlkNGU4ZWJmYzgxIiwidCI6Ijc0MzBjOGJlLWQ1ZTMtNDgxYi1hNTcwLTZjOGI0MzRkZGY4OCIsImMiOjZ9"&gt;https://app.powerbi.com/view?r=eyJrIjoiODVhYTBlZDktM2I4OC00OTM2LTg0NDYtMjlkNGU4ZWJmYzgxIiwidCI6Ijc0MzBjOGJlLWQ1ZTMtNDgxYi1hNTcwLTZjOGI0MzRkZGY4OCIsImMiOjZ9&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.powerbi.com/view?r=eyJrIjoiYjc5ZTlhZmMtYjQ5ZC00MWU4LWEyMzAtZWVjNTllODZlOTc5IiwidCI6Ijc0MzBjOGJlLWQ1ZTMtNDgxYi1hNTcwLTZjOGI0MzRkZGY4OCIsImMiOjZ9"&gt;https://app.powerbi.com/view?r=eyJrIjoiYjc5ZTlhZmMtYjQ5ZC00MWU4LWEyMzAtZWVjNTllODZlOTc5IiwidCI6Ijc0MzBjOGJlLWQ1ZTMtNDgxYi1hNTcwLTZjOGI0MzRkZGY4OCIsImMiOjZ9&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.powerbi.com/view?r=eyJrIjoiYWY3YmYxN2ItY2RhMi00NzQ2LWFhYTktNjlhYWY4MTcyY2E1IiwidCI6Ijc0MzBjOGJlLWQ1ZTMtNDgxYi1hNTcwLTZjOGI0MzRkZGY4OCIsImMiOjZ9"&gt;https://app.powerbi.com/view?r=eyJrIjoiYWY3YmYxN2ItY2RhMi00NzQ2LWFhYTktNjlhYWY4MTcyY2E1IiwidCI6Ijc0MzBjOGJlLWQ1ZTMtNDgxYi1hNTcwLTZjOGI0MzRkZGY4OCIsImMiOjZ9&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.powerbi.com/view?r=eyJrIjoiMjdjYzViNTUtYTdkNy00NzZlLWJlZWItODI5ODk1MWI0NzMyIiwidCI6Ijc0MzBjOGJlLWQ1ZTMtNDgxYi1hNTcwLTZjOGI0MzRkZGY4OCIsImMiOjZ9"&gt;https://app.powerbi.com/view?r=eyJrIjoiMjdjYzViNTUtYTdkNy00NzZlLWJlZWItODI5ODk1MWI0NzMyIiwidCI6Ijc0MzBjOGJlLWQ1ZTMtNDgxYi1hNTcwLTZjOGI0MzRkZGY4OCIsImMiOjZ9&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://app.powerbi.com/view?r=eyJrIjoiMjkyYjM1MDktNjgyMS00Mzc3LWJmZTEtOTIxM2RjMGEwMWVmIiwidCI6Ijc0MzBjOGJlLWQ1ZTMtNDgxYi1hNTcwLTZjOGI0MzRkZGY4OCIsImMiOjZ9"&gt;https://app.powerbi.com/view?r=eyJrIjoiMjkyYjM1MDktNjgyMS00Mzc3LWJmZTEtOTIxM2RjMGEwMWVmIiwidCI6Ijc0MzBjOGJlLWQ1ZTMtNDgxYi1hNTcwLTZjOGI0MzRkZGY4OCIsImMiOjZ9&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Explaining Precision and Recall</title>
      <dc:creator>Nkwam Philip </dc:creator>
      <pubDate>Tue, 02 Aug 2022 18:40:19 +0000</pubDate>
      <link>https://dev.to/nkwamphilip/explaining-precision-and-recall-5ej9</link>
      <guid>https://dev.to/nkwamphilip/explaining-precision-and-recall-5ej9</guid>
      <description>&lt;p&gt;In pattern recognition, information retrieval, object detection and classification (machine learning), precision and recall are performance metrics that apply to data retrieved from a collection, corpus or sample space.&lt;/p&gt;

&lt;p&gt;Predicted Positive Values - Values predicted to be positive&lt;br&gt;
Real Positive Values - Values that are actually positive&lt;/p&gt;

&lt;p&gt;Consider a computer program for recognising dogs (the relevant element) in a digital photograph. Upon processing a picture which contains ten cats and twelve dogs, the program identifies eight dogs. Of the eight elements identified as dogs, only five actually are dogs (true positives), while the other three are cats (false positives). Seven dogs were missed (false negatives), and seven cats were correctly excluded (true negatives). The program's precision is then 5/8 (true positives / Predicted Positive values) while its recall is 5/12 (true positives / Real Positive values). &lt;/p&gt;

&lt;p&gt;Precision and recall are also referred to as positive predictive value (PPV) and sensitivity (true positive rate, TPR) respectively. Both terms are explained below.&lt;/p&gt;

&lt;p&gt;Precision is the number of true positive predictions divided by the total number of values predicted to be positive. It answers the question: of all the values the model predicted to be positive, how many actually are?&lt;/p&gt;

&lt;p&gt;TP/(TP + FP)&lt;/p&gt;

&lt;p&gt;Precision describes how well your model performed at identifying true positives among everything it flagged as positive; in practice a model will have predicted some values to be positive when they clearly are not.&lt;/p&gt;

&lt;p&gt;That explains why it is called POSITIVE PREDICTIVE VALUE.&lt;/p&gt;

&lt;p&gt;Recall, on the other hand, describes how many true positives were predicted out of the total real positive values: the true positive predictions divided by the total number of real positives.&lt;/p&gt;

&lt;p&gt;TP/(TP + FN)&lt;/p&gt;

&lt;p&gt;Recall also reveals how many real positive values were predicted to be negative (false negatives), and it essentially measures how well a model recovers true positives from the full range of real positive values.&lt;/p&gt;

&lt;p&gt;Recall is therefore referred to as SENSITIVITY (True Positive Rate).&lt;/p&gt;
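
&lt;p&gt;To make the arithmetic concrete, here is a small Python sketch (my own illustration, not part of the original example) that computes both metrics for the dog-recognition scenario above:&lt;/p&gt;

&lt;p&gt;# Counts from the dog-recognition example above&lt;br&gt;
true_positives = 5   # dogs correctly identified as dogs&lt;br&gt;
false_positives = 3  # cats incorrectly identified as dogs&lt;br&gt;
false_negatives = 7  # dogs that were missed&lt;br&gt;
&lt;br&gt;
precision = true_positives / (true_positives + false_positives)  # TP / (TP + FP)&lt;br&gt;
recall = true_positives / (true_positives + false_negatives)     # TP / (TP + FN)&lt;br&gt;
&lt;br&gt;
print(f"Precision: {precision:.3f}")  # 5/8 = 0.625&lt;br&gt;
print(f"Recall: {recall:.3f}")        # 5/12, about 0.417&lt;/p&gt;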

</description>
    </item>
    <item>
      <title>Building an IoT Analytics Pipeline on Google Cloud</title>
      <dc:creator>Nkwam Philip </dc:creator>
      <pubDate>Thu, 28 Jul 2022 06:14:04 +0000</pubDate>
      <link>https://dev.to/nkwamphilip/building-an-iot-analytics-pipeline-on-google-cloud-4j7c</link>
      <guid>https://dev.to/nkwamphilip/building-an-iot-analytics-pipeline-on-google-cloud-4j7c</guid>
      <description>&lt;p&gt;Internet of Things (IoT) refers to the interconnection of physical devices with the global Internet. These devices are equipped with sensors and networking hardware, and each is globally identifiable.&lt;br&gt;
Cloud IoT Core is a fully managed service that allows you to easily and securely connect, manage, and ingest data from millions of globally dispersed devices. The service connects IoT devices that use the standard Message Queue Telemetry Transport (MQTT) protocol to other Google Cloud data services.&lt;/p&gt;

&lt;p&gt;Cloud IoT Core has two main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A device manager for registering devices with the service, so you can then monitor and configure them.&lt;/li&gt;
&lt;li&gt;A protocol bridge that supports MQTT, which devices can use to connect to Google Cloud.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;OBJECTIVES&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Connect and manage MQTT-based devices using Cloud IoT Core (using simulated devices)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ingest a stream of information from Cloud IoT Core using Cloud Pub/Sub.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Process the IoT data using Cloud Dataflow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Analyze the IoT data using BigQuery.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;After signing in to GCP, in the Cloud Console, click Navigation menu &amp;gt; APIs &amp;amp; Services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scroll down in the list of enabled APIs, and confirm that these APIs are enabled:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Google Cloud IoT API&lt;br&gt;
Cloud Pub/Sub API&lt;br&gt;
Dataflow API&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If one or more API is not enabled, click the ENABLE APIS AND SERVICES button at the top. Search for the APIs by name and enable each API for your current project.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Ensure that the Dataflow API is successfully enabled&lt;br&gt;
To ensure access to the necessary API, restart the connection to the Dataflow API.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the Cloud Console, enter Dataflow API in the top search bar. Click on the result for Dataflow API.&lt;/p&gt;

&lt;p&gt;Click Manage.&lt;/p&gt;

&lt;p&gt;Click Disable API.&lt;/p&gt;

&lt;p&gt;If asked to confirm, click Disable.&lt;/p&gt;

&lt;p&gt;Click Enable.&lt;br&gt;
When the API has been enabled again, the page will show the option to disable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a Cloud Pub/Sub topic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloud Pub/Sub is an asynchronous global messaging service. By decoupling senders and receivers, it allows for secure and highly available communication between independently written applications. Cloud Pub/Sub delivers low-latency, durable messaging.&lt;/p&gt;

&lt;p&gt;In Cloud Pub/Sub, publisher applications and subscriber applications connect with one another through the use of a shared string called a topic. A publisher application creates and sends messages to a topic. Subscriber applications create a subscription to a topic to receive messages from it.&lt;/p&gt;
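
&lt;p&gt;A publisher does not have to be an IoT device: any application with the google-cloud-pubsub client library can publish to a topic. A minimal Python sketch (my own illustration; it assumes the library is installed, application-default credentials, and the iotlab topic created below):&lt;/p&gt;

&lt;p&gt;from google.cloud import pubsub_v1&lt;br&gt;
&lt;br&gt;
# Assumes application-default credentials and an existing topic named "iotlab"&lt;br&gt;
publisher = pubsub_v1.PublisherClient()&lt;br&gt;
topic_path = publisher.topic_path("YOUR_PROJECT_ID", "iotlab")&lt;br&gt;
&lt;br&gt;
# Messages are raw bytes; extra keyword arguments become string attributes&lt;br&gt;
future = publisher.publish(topic_path, b'{"device": "test", "temperature": 21.5}', origin="manual-test")&lt;br&gt;
future.result()  # blocks until the service acknowledges the message&lt;/p&gt;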

&lt;p&gt;In an IoT solution built with Cloud IoT Core, device telemetry data is forwarded to a Cloud Pub/Sub topic.&lt;/p&gt;

&lt;p&gt;To define a new Cloud Pub/Sub topic:&lt;/p&gt;

&lt;p&gt;In the Cloud Console, enter Pub/Sub in the top search bar and open the result; you should then be on the Topics page.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click + CREATE TOPIC. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Create a topic dialog shows you a partial URL path.&lt;br&gt;
&lt;em&gt;Note: If you see qwiklabs-resources as your project name, cancel the dialog and return to the Cloud Console. Use the menu to the right of the Google Cloud logo to select the correct project. Then return to this step.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Add this string as your Topic ID:&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;iotlab&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;Click CREATE TOPIC.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;On the Topics page, find the new topic whose partial URL ends in iotlab. Click the three-dot icon at the right edge of its row to open the context menu.&lt;/em&gt;&lt;br&gt;
Choose View permissions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the Permissions dialogue, click ADD PRINCIPAL and copy the below principal as New principals:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href="mailto:cloud-iot@system.gserviceaccount.com"&gt;cloud-iot@system.gserviceaccount.com&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;From the Select a role menu, give the new member the Pub/Sub &amp;gt; Pub/Sub Publisher role.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click Save.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Create a BigQuery dataset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;BigQuery is a serverless data warehouse. Tables in BigQuery are organized into datasets. In this lab, messages published into Pub/Sub will be aggregated and stored in BigQuery.&lt;/p&gt;

&lt;p&gt;To create a new BigQuery dataset:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;In the Cloud Console, go to Navigation menu &amp;gt; BigQuery.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click Done.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To create a dataset, click on the View actions icon next to your Project ID and then select Create dataset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Name the dataset iotlabdataset, leave all the other fields the way they are, and click Create dataset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click on your project name to see the newly created dataset under your project&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To create a table, click on the View actions icon next to the iotlabdataset dataset and select Create table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ensure that the source field is set to Empty table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the Destination section's Table name field, enter sensordata.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the Schema section, click the + Add field button and add the following fields:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;timestamp, set the field's Type to TIMESTAMP.&lt;br&gt;
device, set the field's Type to STRING.&lt;br&gt;
temperature, set the field's Type to FLOAT.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Leave the other defaults unmodified. Click Create Table.&lt;/li&gt;
&lt;/ol&gt;
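
&lt;p&gt;If you would rather script this step than click through the console, a minimal sketch with the google-cloud-bigquery Python client (my own illustration; the lab itself uses the console, and YOUR_PROJECT_ID is a placeholder) creates the same table:&lt;/p&gt;

&lt;p&gt;from google.cloud import bigquery&lt;br&gt;
&lt;br&gt;
client = bigquery.Client(project="YOUR_PROJECT_ID")&lt;br&gt;
&lt;br&gt;
# The same three fields defined in the console&lt;br&gt;
schema = [&lt;br&gt;
    bigquery.SchemaField("timestamp", "TIMESTAMP"),&lt;br&gt;
    bigquery.SchemaField("device", "STRING"),&lt;br&gt;
    bigquery.SchemaField("temperature", "FLOAT"),&lt;br&gt;
]&lt;br&gt;
&lt;br&gt;
table = bigquery.Table("YOUR_PROJECT_ID.iotlabdataset.sensordata", schema=schema)&lt;br&gt;
client.create_table(table)  # the dataset must already exist&lt;/p&gt;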

&lt;p&gt;&lt;strong&gt;Create a cloud storage bucket&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloud Storage allows world-wide storage and retrieval of any amount of data at any time. You can use Cloud Storage for a range of scenarios including serving website content, storing data for archival and disaster recovery, or distributing large data objects to users via direct download.&lt;/p&gt;

&lt;p&gt;For this lab Cloud Storage will provide working space for your Cloud Dataflow pipeline.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;In the Cloud Console, go to Navigation menu &amp;gt; Cloud Storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click +CREATE BUCKET.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For Name, use your Project ID then add -bucket, then click Continue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For Location type, click Multi-region if it is not already selected.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For Location, choose the selection closest to you.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click CREATE.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Set up a Cloud Dataflow Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloud Dataflow is a serverless way to carry out data analysis. In this lab, you will set up a streaming data pipeline to read sensor data from Pub/Sub, compute the maximum temperature within a time window, and write this out to BigQuery.&lt;/p&gt;
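
&lt;p&gt;The template means you write no code, but conceptually the streaming pipeline resembles the following Apache Beam (Python) sketch. This is only an illustration of the read, window, aggregate, and write pattern, not the template's actual source; the topic and table names are the ones used in this lab and YOUR_PROJECT_ID is a placeholder.&lt;/p&gt;

&lt;p&gt;import json&lt;br&gt;
import apache_beam as beam&lt;br&gt;
from apache_beam.options.pipeline_options import PipelineOptions&lt;br&gt;
from apache_beam.transforms.window import FixedWindows&lt;br&gt;
&lt;br&gt;
options = PipelineOptions(streaming=True)&lt;br&gt;
&lt;br&gt;
with beam.Pipeline(options=options) as p:&lt;br&gt;
    (p&lt;br&gt;
     | "Read" &amp;gt;&amp;gt; beam.io.ReadFromPubSub(topic="projects/YOUR_PROJECT_ID/topics/iotlab")&lt;br&gt;
     | "Parse" &amp;gt;&amp;gt; beam.Map(lambda msg: json.loads(msg.decode("utf-8")))&lt;br&gt;
     | "KeyByDevice" &amp;gt;&amp;gt; beam.Map(lambda row: (row["device"], row["temperature"]))&lt;br&gt;
     | "OneMinuteWindows" &amp;gt;&amp;gt; beam.WindowInto(FixedWindows(60))&lt;br&gt;
     | "MaxTempPerDevice" &amp;gt;&amp;gt; beam.CombinePerKey(max)&lt;br&gt;
     | "ToTableRow" &amp;gt;&amp;gt; beam.Map(lambda kv: {"device": kv[0], "temperature": kv[1]})&lt;br&gt;
     | "Write" &amp;gt;&amp;gt; beam.io.WriteToBigQuery("YOUR_PROJECT_ID:iotlabdataset.sensordata"))&lt;/p&gt;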

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;In the Cloud Console, in the top search bar enter "Dataflow", then click on Dataflow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the top menu bar, click + CREATE JOB FROM TEMPLATE.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For Job name, enter iotlabflow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For Regional Endpoint, choose the region as us-west1.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For Dataflow template, choose Pub/Sub Topic to BigQuery. When you choose this template, the form updates to reveal new fields below.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For Input Pub/Sub topic, choose from the dropdown menu. The resulting string will look like this: &lt;em&gt;projects/PROJECT_ID/topics/iotlab&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For BigQuery output table, use the form Project ID:&lt;em&gt;dataset.table&lt;/em&gt;, in this case the &lt;em&gt;iotlabdataset.sensordata&lt;/em&gt; table created earlier. The resulting string will look like this: &lt;em&gt;PROJECT_ID:iotlabdataset.sensordata&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For Temporary location, enter your Cloud Storage bucket name then &lt;em&gt;/tmp/&lt;/em&gt;. The resulting string will look like this: &lt;em&gt;gs://PROJECT-bucket/tmp/&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click SHOW OPTIONAL PARAMETERS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For Max workers, enter 2.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click RUN JOB.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;A new streaming job is started. You can now see a visual representation of the data pipeline.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mJ7w70AF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/brpmtofml07ordczdpf5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mJ7w70AF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/brpmtofml07ordczdpf5.png" alt="Image description" width="880" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prepare your compute engine VM&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In your project, a pre-provisioned VM instance named iot-device-simulator will let you run instances of a Python script that emulate an MQTT-connected IoT device. Before you emulate the devices, you will also use this VM instance to populate your Cloud IoT Core device registry.&lt;/p&gt;

&lt;p&gt;To connect to the iot-device-simulator VM instance:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;In the Cloud Console, go to Navigation menu &amp;gt; Compute Engine &amp;gt; VM Instances. You'll see your VM instance listed as &lt;em&gt;iot-device-simulator&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click the SSH drop-down arrow and select Open in browser window.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In your SSH session, enter following commands to create a virtual environment.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;sudo pip3 install virtualenv&lt;br&gt;
virtualenv -p python3 venv&lt;br&gt;
source venv/bin/activate&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;Initialize the gcloud SDK.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;gcloud auth login --no-launch-browser&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;If you get the error message "Command not found," you might have forgotten to exit your previous SSH session and start a new one.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;When you are asked whether to authenticate with an @developer.gserviceaccount.com account or to log in with a new account, choose log in with a new account.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When you are asked "Are you sure you want to authenticate with your personal account? Do you want to continue (Y/n)?" enter Y.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Click on the URL shown to open a new browser window that displays a verification code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Copy the verification code and paste it in response to the "Enter verification code:" prompt, then press Enter.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Enter this command to update the system's information about Debian Linux package repositories:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;sudo apt-get update&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;Enter this command to make sure that various required software packages are installed:&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;sudo apt-get install python-pip openssl git -y&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;Use pip to add needed Python components:&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;pip install pyjwt paho-mqtt cryptography&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;Enter this command to add data to analyze during this lab:&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;git clone &lt;a href="http://github.com/GoogleCloudPlatform/training-data-analyst"&gt;http://github.com/GoogleCloudPlatform/training-data-analyst&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Create a registry for IoT devices&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To register devices, you must create a registry for the devices. The registry is a point of control for devices.&lt;/p&gt;

&lt;p&gt;To create the registry:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In your SSH session on the iot-device-simulator VM instance, run the following, adding your Project ID as the value for PROJECT_ID:&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;export PROJECT_ID=PROJECT_ID&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;Your completed command will have your actual Project ID in place of the second &lt;em&gt;PROJECT_ID&lt;/em&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You must choose a region for your IoT registry. At this time, these regions are supported:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;us-west1&lt;br&gt;
europe-west1&lt;br&gt;
asia-east1&lt;/p&gt;

&lt;p&gt;Choose the region that is closest to you. To set an environment variable containing your preferred region, enter this command followed by the region name:&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;export MY_REGION=&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;Your completed command will look like this: &lt;em&gt;export MY_REGION=us-west1.&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enter this command to create the device registry:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;gcloud iot registries create iotlab-registry \&lt;br&gt;
   --project=$PROJECT_ID \&lt;br&gt;
   --region=$MY_REGION \&lt;br&gt;
   --event-notification-config=topic=projects/$PROJECT_ID/topics/iotlab&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a Cryptographic Keypair&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To allow IoT devices to connect securely to Cloud IoT Core, you must create a cryptographic keypair.&lt;/p&gt;

&lt;p&gt;In your SSH session on the iot-device-simulator VM instance, enter these commands to create the keypair in the appropriate directory:&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;cd $HOME/training-data-analyst/quests/iotlab/&lt;br&gt;
openssl req -x509 -newkey rsa:2048 -keyout rsa_private.pem \&lt;br&gt;
    -nodes -out rsa_cert.pem -subj "/CN=unused"&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;This &lt;em&gt;openssl&lt;/em&gt; command creates an RSA keypair, writing the private key to a file called &lt;em&gt;rsa_private.pem&lt;/em&gt; and a self-signed certificate containing the public key to &lt;em&gt;rsa_cert.pem&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add simulated devices to the registry&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For a device to be able to connect to Cloud IoT Core, it must first be added to the registry.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In your SSH session on the &lt;em&gt;iot-device-simulator&lt;/em&gt; VM instance, enter this command to create a device called &lt;em&gt;temp-sensor-buenos-aires&lt;/em&gt;:&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;gcloud iot devices create temp-sensor-buenos-aires \&lt;br&gt;
  --project=$PROJECT_ID \&lt;br&gt;
  --region=$MY_REGION \&lt;br&gt;
  --registry=iotlab-registry \&lt;br&gt;
  --public-key path=rsa_cert.pem,type=rs256&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;Enter this command to create a device called &lt;em&gt;temp-sensor-istanbul&lt;/em&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;gcloud iot devices create temp-sensor-istanbul \&lt;br&gt;
  --project=$PROJECT_ID \&lt;br&gt;
  --region=$MY_REGION \&lt;br&gt;
  --registry=iotlab-registry \&lt;br&gt;
  --public-key path=rsa_cert.pem,type=rs256&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Run simulated devices&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In your SSH session on the &lt;em&gt;iot-device-simulator&lt;/em&gt; VM instance, enter these commands to download the CA root certificates from pki.google.com to the appropriate directory:&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;cd $HOME/training-data-analyst/quests/iotlab/&lt;br&gt;
curl -o roots.pem -s -m 10 --retry 0 "&lt;a href="https://pki.goog/roots.pem"&gt;https://pki.goog/roots.pem&lt;/a&gt;"&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;Enter this command to run the first simulated device:&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;python cloudiot_mqtt_example_json.py \&lt;br&gt;
   --project_id=$PROJECT_ID \&lt;br&gt;
   --cloud_region=$MY_REGION \&lt;br&gt;
   --registry_id=iotlab-registry \&lt;br&gt;
   --device_id=temp-sensor-buenos-aires \&lt;br&gt;
   --private_key_file=rsa_private.pem \&lt;br&gt;
   --message_type=event \&lt;br&gt;
   --algorithm=RS256 &amp;gt; buenos-aires-log.txt 2&amp;gt;&amp;amp;1 &amp;amp;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;It will continue to run in the background.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enter this command to run the second simulated device:&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;python cloudiot_mqtt_example_json.py \&lt;br&gt;
   --project_id=$PROJECT_ID \&lt;br&gt;
   --cloud_region=$MY_REGION \&lt;br&gt;
   --registry_id=iotlab-registry \&lt;br&gt;
   --device_id=temp-sensor-istanbul \&lt;br&gt;
   --private_key_file=rsa_private.pem \&lt;br&gt;
   --message_type=event \&lt;br&gt;
   --algorithm=RS256&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;Telemetry data will flow from the simulated devices through Cloud IoT Core to your Cloud Pub/Sub topic. In turn, your Dataflow job will read messages from your Pub/Sub topic and write their contents to your BigQuery table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyze the Sensor Data Using BigQuery&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To analyze the data as it is streaming:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;In the Cloud Console, open the Navigation menu and select BigQuery.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Enter the following query in the Query editor and click RUN:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;SELECT timestamp, device, temperature from iotlabdataset.sensordata&lt;br&gt;
ORDER BY timestamp DESC&lt;br&gt;
LIMIT 100&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;Browse the Results. What is the temperature trend at each of the locations?&lt;/li&gt;
&lt;/ol&gt;
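
&lt;p&gt;Optionally, the same data can be queried from a script. A small sketch using the google-cloud-bigquery Python client (my own illustration; the lab only requires the console) aggregates the readings per device:&lt;/p&gt;

&lt;p&gt;from google.cloud import bigquery&lt;br&gt;
&lt;br&gt;
client = bigquery.Client()&lt;br&gt;
&lt;br&gt;
# Maximum and average temperature per simulated device&lt;br&gt;
query = """&lt;br&gt;
    SELECT device, MAX(temperature) AS max_temp, AVG(temperature) AS avg_temp&lt;br&gt;
    FROM iotlabdataset.sensordata&lt;br&gt;
    GROUP BY device&lt;br&gt;
"""&lt;br&gt;
&lt;br&gt;
for row in client.query(query):  # iterating the job waits for the results&lt;br&gt;
    print(row.device, row.max_temp, row.avg_temp)&lt;/p&gt;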

</description>
    </item>
    <item>
      <title>Normalization in RDBMS</title>
      <dc:creator>Nkwam Philip </dc:creator>
      <pubDate>Fri, 15 Jul 2022 15:52:55 +0000</pubDate>
      <link>https://dev.to/nkwamphilip/normalization-in-rdbms-12n0</link>
      <guid>https://dev.to/nkwamphilip/normalization-in-rdbms-12n0</guid>
      <description>&lt;p&gt;Normalisation in DBMS&lt;/p&gt;

&lt;p&gt;In clear terms, normalisation is simply the process of organising data in a database to enhance its integrity by preventing data inconsistency and redundancy. It involves techniques and designs for breaking your complex database tables into smaller pieces that are connected to each other, a relational model that has proven to work well with an RDBMS.&lt;br&gt;
For data to be used or stored efficiently and understandably, it has to be properly linked and distributed. Chaos sets in when data is scattered and not linked together properly. Normalisation removes these anomalies and brings the database into a consistent state.&lt;br&gt;
Normalisation in any DBMS follows a set of normal forms. The rules are grouped into six forms, with the 6th form being the most recently developed.&lt;/p&gt;

&lt;p&gt;Before we proceed, Let’s understand a few things.&lt;br&gt;
&lt;strong&gt;SQL Key&lt;/strong&gt; - Primary Keys and Composite Keys&lt;/p&gt;

&lt;p&gt;  &lt;strong&gt;Primary Key&lt;/strong&gt; - a primary key is a key designated to uniquely identify each record in a database table. It is crucial to a consistent and efficient relational database: a table needs a primary key to reliably insert, update, restore, or delete data. A primary key can be assigned manually, generated serially, or generated as random values, depending on how you define your schema.&lt;br&gt;
It has the following attributes:&lt;br&gt;
    1. A Primary Key cannot be NULL&lt;br&gt;
    2. A Primary Key must be unique&lt;br&gt;
    3. The Primary Keys should rarely be changed &lt;br&gt;
    4. The Primary key must be given a specific value once a new record has been inserted&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Composite Key&lt;/strong&gt; - A composite key is a primary key composed of multiple columns used to identify a record uniquely. Imagine a table whose rows are closely similar, with little to distinguish them in any single column; a composite key combines multiple columns to identify each row. &lt;/p&gt;

&lt;p&gt;Now let’s move into the rules:&lt;br&gt;
 &lt;strong&gt;The 1NF -&lt;/strong&gt; 1. Each table cell should contain a single value&lt;br&gt;
          2. Each record must be unique&lt;br&gt;
&lt;strong&gt;The 2NF -&lt;/strong&gt; 1. Must be in 1NF&lt;br&gt;
          2. No non-key column depends on only part (a subset) of a candidate key&lt;br&gt;
In 2NF, tables are connected to each other with Foreign Keys. A Foreign Key references the primary key of another table.&lt;br&gt;
    - A foreign key can have a different name from the Primary Key it references&lt;br&gt;
    - It ensures rows in one table have corresponding rows in another&lt;br&gt;
    - Foreign keys do not have to be unique&lt;br&gt;
    - They can be NULL; Primary Keys cannot.&lt;/p&gt;
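
&lt;p&gt;To make primary and foreign keys concrete, here is a small sketch using Python's built-in sqlite3 module (an illustrative schema of my own): the customers table stores each customer once, and the orders table references it through a foreign key instead of repeating customer details on every row.&lt;/p&gt;

&lt;p&gt;import sqlite3&lt;br&gt;
&lt;br&gt;
conn = sqlite3.connect(":memory:")&lt;br&gt;
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled&lt;br&gt;
&lt;br&gt;
# Each customer is stored once, identified by a primary key&lt;br&gt;
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")&lt;br&gt;
&lt;br&gt;
# Orders reference customers instead of duplicating customer details (a 2NF-style split)&lt;br&gt;
conn.execute("""&lt;br&gt;
    CREATE TABLE orders (&lt;br&gt;
        order_id INTEGER PRIMARY KEY,&lt;br&gt;
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),&lt;br&gt;
        amount REAL&lt;br&gt;
    )&lt;br&gt;
""")&lt;br&gt;
&lt;br&gt;
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")&lt;br&gt;
conn.execute("INSERT INTO orders VALUES (10, 1, 25.0)")&lt;br&gt;
conn.execute("INSERT INTO orders VALUES (11, 1, 40.0)")  # same customer, no repeated name&lt;/p&gt;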

&lt;p&gt;&lt;strong&gt;Transitive Functional Dependency&lt;/strong&gt; - when changing a non-key column might cause another non-key column to change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 3NF -&lt;/strong&gt; 1. Be in 2NF&lt;br&gt;
          2. Have no transitive functional dependency&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Boyce-Codd Normal Form&lt;/strong&gt;&lt;br&gt;
        1. Even when a database is in 3rd Normal Form, anomalies can still result if it has more than one Candidate Key.&lt;br&gt;
BCNF is sometimes also referred to as 3.5 Normal Form.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 4NF&lt;/strong&gt; - 1. A table is in 4th Normal Form if no table instance contains two or more independent, multivalued facts describing the relevant entity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 5NF&lt;/strong&gt; - 1. A table is in 5th Normal Form only if it is in 4NF and it cannot be decomposed into any number of smaller tables without loss of data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 6NF (Proposed)&lt;/strong&gt; - 1. 6th Normal Form is not standardised yet; however, it has been discussed by database experts for some time. Hopefully, we will have a clear &amp;amp; standardised definition for 6th Normal Form in the near future.&lt;/p&gt;

&lt;p&gt;Database normalisation and design are critical to the successful implementation of a good DBMS. A database can be normalised beyond 3NF, which is the standard normal form for most databases.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Loading Data Into BigQuery Using The CLI/Console</title>
      <dc:creator>Nkwam Philip </dc:creator>
      <pubDate>Fri, 15 Jul 2022 15:34:27 +0000</pubDate>
      <link>https://dev.to/nkwamphilip/loading-data-into-bigquery-using-the-cliconsole-33pb</link>
      <guid>https://dev.to/nkwamphilip/loading-data-into-bigquery-using-the-cliconsole-33pb</guid>
      <description>&lt;p&gt;BigQuery is Google's fully managed, NoOps, low cost analytics database. &lt;br&gt;
With BigQuery you can query terabytes of data in seconds and petabytes in minutes, without having any infrastructure to manage or needing a database administrator. &lt;br&gt;
BigQuery uses SQL and can take advantage of the pay-as-you-go model. BigQuery allows you to focus on analysing data to find meaningful insights, and it scales resources (compute &amp;amp; storage) up based on need.&lt;br&gt;
BigQuery maximizes flexibility by separating the compute engine that analyzes your data from your storage choices. You can store and analyze your data within BigQuery or use BigQuery to assess your data where it lives.&lt;/p&gt;

&lt;p&gt;In this session, we will be loading data into BigQuery using the CLI/Console&lt;/p&gt;

&lt;p&gt;Start by creating a dataset under your Project ID. Click the View actions icon, click "Create Dataset", and name the dataset (this post uses nyctaxi, which the later commands reference).&lt;br&gt;
Then ingest a new dataset from a CSV file. You can use this &lt;a href="https://storage.googleapis.com/cloud-training/OCBL013/nyc_tlc_yellow_trips_2018_subset_1.csv"&gt;https://storage.googleapis.com/cloud-training/OCBL013/nyc_tlc_yellow_trips_2018_subset_1.csv&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the BigQuery console, click on the dataset created and create a table.&lt;/p&gt;

&lt;p&gt;Specify the below table options:&lt;/p&gt;

&lt;p&gt;Source:&lt;/p&gt;

&lt;p&gt;Create table from: Upload&lt;br&gt;
Choose File: select the file you downloaded locally earlier&lt;br&gt;
File format: CSV&lt;br&gt;
&lt;strong&gt;Destination:&lt;/strong&gt; the dataset you created earlier&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table name:&lt;/strong&gt; 2018trips. Leave all other settings at default.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema:&lt;/strong&gt;&lt;br&gt;
Check Auto Detect (tip: Not seeing the checkbox? Ensure the file format is CSV and not Avro)&lt;br&gt;
Advanced Options&lt;/p&gt;

&lt;p&gt;Leave at default values&lt;br&gt;
Click Create Table.&lt;/p&gt;

&lt;p&gt;Select Preview and confirm all columns have been loaded.&lt;br&gt;
You have successfully loaded a CSV file into a new BigQuery table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingest a new Dataset from Google Cloud Storage&lt;/strong&gt;&lt;br&gt;
Now, let's try loading another subset of the same 2018 trip data that is available on Cloud Storage, and this time, let's use the CLI tool to do it.&lt;/p&gt;

&lt;p&gt;In your Cloud Shell, run the following command:&lt;br&gt;
bq load \&lt;br&gt;
--source_format=CSV \&lt;br&gt;
--autodetect \&lt;br&gt;
--noreplace  \&lt;br&gt;
nyctaxi.2018trips \&lt;br&gt;
gs://cloud-training/OCBL013/nyc_tlc_yellow_trips_2018_subset_2.csv&lt;/p&gt;

&lt;p&gt;Note: With the above load job, you are specifying that this subset is to be appended to the existing 2018trips table that you created above.&lt;/p&gt;
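
&lt;p&gt;As an aside, the same append can be scripted with the google-cloud-bigquery Python client; a minimal sketch (my own illustration, assuming the library is installed and the dataset is named nyctaxi as in the CLI command above):&lt;/p&gt;

&lt;p&gt;from google.cloud import bigquery&lt;br&gt;
&lt;br&gt;
client = bigquery.Client()&lt;br&gt;
&lt;br&gt;
job_config = bigquery.LoadJobConfig(&lt;br&gt;
    source_format=bigquery.SourceFormat.CSV,&lt;br&gt;
    autodetect=True,&lt;br&gt;
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # same effect as --noreplace&lt;br&gt;
)&lt;br&gt;
&lt;br&gt;
load_job = client.load_table_from_uri(&lt;br&gt;
    "gs://cloud-training/OCBL013/nyc_tlc_yellow_trips_2018_subset_2.csv",&lt;br&gt;
    "nyctaxi.2018trips",&lt;br&gt;
    job_config=job_config,&lt;br&gt;
)&lt;br&gt;
load_job.result()  # wait for the load to finish&lt;/p&gt;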

&lt;p&gt;When the load job is complete, you will get a confirmation on the screen.&lt;/p&gt;

&lt;p&gt;Back on your BigQuery console, select the 2018trips table and view details. Confirm that the row count has now almost doubled.&lt;/p&gt;

&lt;p&gt;You may want to run the same query as earlier to see if the top 5 most expensive trips have changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create tables from other tables with DDL&lt;/strong&gt;&lt;br&gt;
The 2018trips table now has trips from throughout the year. What if you were only interested in January trips? For the purpose of this lab, we will keep it simple and focus only on pickup date and time. Let's use DDL to extract this data and store it in another table.&lt;/p&gt;

&lt;p&gt;In the Query Editor, run the following CREATE TABLE command:&lt;/p&gt;

&lt;p&gt;#standardSQL&lt;/p&gt;

&lt;p&gt;CREATE TABLE&lt;br&gt;
  nyctaxi.january_trips AS&lt;br&gt;
SELECT&lt;br&gt;
  *&lt;br&gt;
FROM&lt;br&gt;
  nyctaxi.2018trips&lt;br&gt;
WHERE&lt;br&gt;
  EXTRACT(Month&lt;br&gt;
  FROM&lt;br&gt;
    pickup_datetime)=1;&lt;/p&gt;

&lt;p&gt;Now run the below query in your Query Editor to find the longest distance traveled in the month of January:&lt;/p&gt;

&lt;p&gt;#standardSQL&lt;/p&gt;

&lt;p&gt;SELECT&lt;br&gt;
  *&lt;br&gt;
FROM&lt;br&gt;
  nyctaxi.january_trips&lt;br&gt;
ORDER BY&lt;br&gt;
  trip_distance DESC&lt;br&gt;
LIMIT&lt;br&gt;
  1&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Loading Data Into Google Cloud SQL</title>
      <dc:creator>Nkwam Philip </dc:creator>
      <pubDate>Thu, 14 Jul 2022 11:32:44 +0000</pubDate>
      <link>https://dev.to/nkwamphilip/loading-data-into-google-cloud-sql-1jka</link>
      <guid>https://dev.to/nkwamphilip/loading-data-into-google-cloud-sql-1jka</guid>
      <description>&lt;p&gt;Cloud SQL is Google Cloud's fully managed database service that lets you set up, maintain, manage, and administer relational databases such as MySQL, PostgreSQL, or Microsoft SQL Server.&lt;/p&gt;

&lt;p&gt;Objectives: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create Cloud SQL instance&lt;/li&gt;
&lt;li&gt;Create a Cloud SQL database&lt;/li&gt;
&lt;li&gt;Import text data into Cloud SQL&lt;/li&gt;
&lt;li&gt;Check the data for integrity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first thing required is to activate Cloud Shell at the top right corner of the GCP home page, then list the active account name with the command:&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;gcloud auth list&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;List the Project id with&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;gcloud config list project&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;Create environmental variables and the storage bucket that will contain the data&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;export PROJECT_ID=$(gcloud info --format='value(config.project)')&lt;br&gt;
export BUCKET=${PROJECT_ID}-ml&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;Create a Cloud SQL Instance named taxi&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;gcloud sql instances create taxi \&lt;br&gt;
    --tier=db-n1-standard-1 --activation-policy=ALWAYS&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;Set root Password for the Cloud SQL Instance&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;gcloud sql users set-password root --host % --instance taxi \&lt;br&gt;
 --password Passw0rd&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;Now create an environment variable with the IP address of the Cloud Shell&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;export ADDRESS=$(wget -qO - http://ipecho.net/plain)/32&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;Then authorize that address to access your Cloud SQL instance:&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;gcloud sql instances patch taxi --authorized-networks $ADDRESS&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;When prompted press Y to accept the change.&lt;/p&gt;

&lt;p&gt;Get the IP address of your Cloud SQL instance by running:&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;MYSQLIP=$(gcloud sql instances describe \&lt;br&gt;
taxi --format="value(ipAddresses.ipAddress)")&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;Check the variable MYSQLIP, you should get the IP Address as an output:&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;echo $MYSQLIP&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;Create the taxi trips table by logging into the mysql command-line interface; enter the password when prompted:&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;mysql --host=$MYSQLIP --user=root \&lt;br&gt;
      --password --verbose&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;Create a Schema for the trips&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;create database if not exists bts;&lt;br&gt;
use bts;&lt;br&gt;
drop table if exists trips;&lt;br&gt;
create table trips (&lt;br&gt;
  vendor_id VARCHAR(16),&lt;br&gt;&lt;br&gt;
  pickup_datetime DATETIME,&lt;br&gt;
  dropoff_datetime DATETIME,&lt;br&gt;
  passenger_count INT,&lt;br&gt;
  trip_distance FLOAT,&lt;br&gt;
  rate_code VARCHAR(16),&lt;br&gt;
  store_and_fwd_flag VARCHAR(16),&lt;br&gt;
  payment_type VARCHAR(16),&lt;br&gt;
  fare_amount FLOAT,&lt;br&gt;
  extra FLOAT,&lt;br&gt;
  mta_tax FLOAT,&lt;br&gt;
  tip_amount FLOAT,&lt;br&gt;
  tolls_amount FLOAT,&lt;br&gt;
  imp_surcharge FLOAT,&lt;br&gt;
  total_amount FLOAT,&lt;br&gt;
  pickup_location_id VARCHAR(16),&lt;br&gt;
  dropoff_location_id VARCHAR(16)&lt;br&gt;
);&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;In the mysql command-line interface, inspect the new table and query it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;describe trips;&lt;br&gt;
select distinct(pickup_location_id) from trips;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;This will return an empty set, as there is no data in the database yet. Then exit:&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;exit&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Add data to Cloud SQL instance
&lt;/h2&gt;


&lt;p&gt;Run the following; the data is pulled from Google public datasets:&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;gsutil cp gs://cloud-training/OCBL013/nyc_tlc_yellow_trips_2018_subset_1.csv trips.csv-1&lt;br&gt;
gsutil cp gs://cloud-training/OCBL013/nyc_tlc_yellow_trips_2018_subset_2.csv trips.csv-2&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;Connect to the mysql interactive console to load local infile data&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;mysql --host=$MYSQLIP --user=root  --password  --local-infile&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;In the mysql interactive console select the database&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;use bts;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;Load the local CSV file data using local-infile&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;LOAD DATA LOCAL INFILE 'trips.csv-1' INTO TABLE trips&lt;br&gt;
FIELDS TERMINATED BY ','&lt;br&gt;
LINES TERMINATED BY '\n'&lt;br&gt;
IGNORE 1 LINES&lt;br&gt;
(vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,rate_code,store_and_fwd_flag,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,imp_surcharge,total_amount,pickup_location_id,dropoff_location_id);&lt;/p&gt;

&lt;p&gt;LOAD DATA LOCAL INFILE 'trips.csv-2' INTO TABLE trips&lt;br&gt;
FIELDS TERMINATED BY ','&lt;br&gt;
LINES TERMINATED BY '\n'&lt;br&gt;
IGNORE 1 LINES&lt;br&gt;
(vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,rate_code,store_and_fwd_flag,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,imp_surcharge,total_amount,pickup_location_id,dropoff_location_id);&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

&lt;p&gt;Checking For Data Integrity &lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;select distinct(pickup_location_id) from trips;&lt;br&gt;
select&lt;br&gt;
  max(trip_distance),&lt;br&gt;
  min(trip_distance)&lt;br&gt;
from&lt;br&gt;
  trips;&lt;br&gt;
select count(*) from trips where trip_distance = 0;&lt;br&gt;
select count(*) from trips where fare_amount &amp;lt; 0;&lt;br&gt;
select&lt;br&gt;
  payment_type,&lt;br&gt;
  count(*)&lt;br&gt;
from&lt;br&gt;
  trips&lt;br&gt;
group by&lt;br&gt;
  payment_type;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;
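
&lt;p&gt;If you prefer to run these checks from a script rather than the interactive console, here is a small sketch using the pymysql library (my choice of client; any MySQL client library would do, and MYSQL_IP stands for the instance address from $MYSQLIP):&lt;/p&gt;

&lt;p&gt;import pymysql&lt;br&gt;
&lt;br&gt;
# Same host, user, and password used for the mysql CLI above&lt;br&gt;
conn = pymysql.connect(host="MYSQL_IP", user="root", password="Passw0rd", database="bts")&lt;br&gt;
&lt;br&gt;
checks = [&lt;br&gt;
    "select count(distinct pickup_location_id) from trips",&lt;br&gt;
    "select max(trip_distance), min(trip_distance) from trips",&lt;br&gt;
    "select count(*) from trips where trip_distance = 0",&lt;br&gt;
    "select count(*) from trips where fare_amount &amp;lt; 0",&lt;br&gt;
]&lt;br&gt;
&lt;br&gt;
with conn.cursor() as cur:&lt;br&gt;
    for sql in checks:&lt;br&gt;
        cur.execute(sql)&lt;br&gt;
        print(sql, cur.fetchall())&lt;br&gt;
&lt;br&gt;
conn.close()&lt;/p&gt;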

&lt;p&gt;If all these checks look right, your database is loaded. Then exit:&lt;/p&gt;

&lt;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;exit&lt;/p&gt;
&lt;/blockquote&gt;


&lt;/blockquote&gt;

</description>
    </item>
    <item>
      <title>Evaluating Customers' Habits and Predicting Their Purchases with a Classification Model in BigQuery ML</title>
      <dc:creator>Nkwam Philip </dc:creator>
      <pubDate>Mon, 20 Jun 2022 23:46:21 +0000</pubDate>
      <link>https://dev.to/nkwamphilip/evaluating-customers-habits-and-predicting-their-purchases-with-a-classification-model-with-bigquery-ml-3m8c</link>
      <guid>https://dev.to/nkwamphilip/evaluating-customers-habits-and-predicting-their-purchases-with-a-classification-model-with-bigquery-ml-3m8c</guid>
      <description>&lt;p&gt;Hi Folks,&lt;/p&gt;

&lt;p&gt;I decided to release a document on using a low code ML tool for Predictive Analytics entirely on GCP.&lt;/p&gt;

&lt;p&gt;I'll be loading data into BigQuery from the Google Analytics Sample E-commerce dataset, which has millions of Google Analytics records for the Google Merchandise Store.&lt;br&gt;
Alongside, I will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query and explore the ecommerce dataset&lt;/li&gt;
&lt;li&gt;Create a training and evaluation dataset to be used for batch prediction&lt;/li&gt;
&lt;li&gt;Create a classification (logistic regression) model in BigQuery ML&lt;/li&gt;
&lt;li&gt;Evaluate the performance of your machine learning model&lt;/li&gt;
&lt;li&gt;Predict and rank the probability that a visitor will make a purchase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Right on my datasets inside BigQuery, I run a SQL command to find the total number of visitors who visited our website, and what percentage made a purchase.&lt;/p&gt;

&lt;p&gt;#standardSQL&lt;/p&gt;

&lt;p&gt;WITH visitors AS(&lt;br&gt;
SELECT&lt;br&gt;
COUNT(DISTINCT fullVisitorId) AS total_visitors&lt;br&gt;
FROM &lt;code&gt;data-to-insights.ecommerce.web_analytics&lt;/code&gt;&lt;br&gt;
),&lt;br&gt;
purchasers AS(&lt;br&gt;
SELECT&lt;br&gt;
COUNT(DISTINCT fullVisitorId) AS total_purchasers&lt;br&gt;
FROM &lt;code&gt;data-to-insights.ecommerce.web_analytics&lt;/code&gt;&lt;br&gt;
WHERE totals.transactions IS NOT NULL&lt;br&gt;
)&lt;br&gt;
SELECT&lt;br&gt;
  total_visitors,&lt;br&gt;
  total_purchasers,&lt;br&gt;
  total_purchasers / total_visitors AS conversion_rate&lt;br&gt;
FROM visitors, purchasers&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eGexypYM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ssikgaq0agus0p1fd463.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eGexypYM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ssikgaq0agus0p1fd463.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But what are the top 5 selling products?&lt;/p&gt;

&lt;p&gt;SELECT&lt;br&gt;
  p.v2ProductName,&lt;br&gt;
  p.v2ProductCategory,&lt;br&gt;
  SUM(p.productQuantity) AS units_sold,&lt;br&gt;
  ROUND(SUM(p.localProductRevenue/1000000),2) AS revenue&lt;br&gt;
FROM &lt;code&gt;data-to-insights.ecommerce.web_analytics&lt;/code&gt;,&lt;br&gt;
UNNEST(hits) AS h,&lt;br&gt;
UNNEST(h.product) AS p&lt;br&gt;
GROUP BY 1, 2&lt;br&gt;
ORDER BY revenue DESC&lt;br&gt;
LIMIT 5;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xx-tapWX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ptw0xndh7j2t0sqyckk6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xx-tapWX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ptw0xndh7j2t0sqyckk6.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select features and create your training dataset:&lt;/p&gt;

&lt;p&gt;SELECT&lt;br&gt;
  * EXCEPT(fullVisitorId)&lt;br&gt;
FROM&lt;br&gt;
  # features&lt;br&gt;
  (SELECT&lt;br&gt;
    fullVisitorId,&lt;br&gt;
    IFNULL(totals.bounces, 0) AS bounces,&lt;br&gt;
    IFNULL(totals.timeOnSite, 0) AS time_on_site&lt;br&gt;
  FROM&lt;br&gt;
    &lt;code&gt;data-to-insights.ecommerce.web_analytics&lt;/code&gt;&lt;br&gt;
  WHERE&lt;br&gt;
    totals.newVisits = 1)&lt;br&gt;
JOIN&lt;br&gt;
  (SELECT&lt;br&gt;
    fullvisitorid,&lt;br&gt;
    IF(COUNTIF(totals.transactions &amp;gt; 0 AND totals.newVisits IS NULL) &amp;gt; 0, 1, 0) AS will_buy_on_return_visit&lt;br&gt;
  FROM&lt;br&gt;
    &lt;code&gt;data-to-insights.ecommerce.web_analytics&lt;/code&gt;&lt;br&gt;
  GROUP BY fullvisitorid)&lt;br&gt;
USING (fullVisitorId)&lt;br&gt;
ORDER BY time_on_site DESC&lt;br&gt;
LIMIT 10;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fJ2a-Fje--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s3qs7dc4frntcmgbavlg.png" alt="Image description" width="880" height="550"&gt;&lt;/p&gt;

&lt;p&gt;Creating a BigQuery dataset to store models:&lt;br&gt;
I'll be creating a new dataset to store my models under my project name, then select a BigQuery ML model type and specify options.&lt;br&gt;
CREATE OR REPLACE MODEL &lt;code&gt;ecommerce.classification_model&lt;/code&gt;&lt;br&gt;
OPTIONS&lt;br&gt;
(&lt;br&gt;
model_type='logistic_reg',&lt;br&gt;
labels = ['will_buy_on_return_visit']&lt;br&gt;
)&lt;br&gt;
AS&lt;/p&gt;

&lt;p&gt;#standardSQL&lt;/p&gt;

&lt;p&gt;SELECT&lt;br&gt;
  * EXCEPT(fullVisitorId)&lt;br&gt;
FROM&lt;br&gt;
  # features&lt;br&gt;
  (SELECT&lt;br&gt;
    fullVisitorId,&lt;br&gt;
    IFNULL(totals.bounces, 0) AS bounces,&lt;br&gt;
    IFNULL(totals.timeOnSite, 0) AS time_on_site&lt;br&gt;
  FROM&lt;br&gt;
    &lt;code&gt;data-to-insights.ecommerce.web_analytics&lt;/code&gt;&lt;br&gt;
  WHERE&lt;br&gt;
    totals.newVisits = 1&lt;br&gt;
    AND date BETWEEN '20160801' AND '20170430') # train on first 9 months&lt;br&gt;
JOIN&lt;br&gt;
  (SELECT&lt;br&gt;
    fullvisitorid,&lt;br&gt;
    IF(COUNTIF(totals.transactions &amp;gt; 0 AND totals.newVisits IS NULL) &amp;gt; 0, 1, 0) AS will_buy_on_return_visit&lt;br&gt;
  FROM&lt;br&gt;
    &lt;code&gt;data-to-insights.ecommerce.web_analytics&lt;/code&gt;&lt;br&gt;
  GROUP BY fullvisitorid)&lt;br&gt;
USING (fullVisitorId);&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S40bJjqI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jvg2e2inj5mx8q3ad68t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S40bJjqI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jvg2e2inj5mx8q3ad68t.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next thing is to evaluate the classification model's performance.&lt;br&gt;
For classification problems in ML, I want to minimize the False Positive Rate (predicting that the user will return and purchase when they don't) and maximize the True Positive Rate (predicting that the user will return and purchase when they do).&lt;/p&gt;
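
&lt;p&gt;BigQuery ML computes these metrics for you with ML.EVALUATE, but the same trade-off can be illustrated outside BigQuery with scikit-learn (a generic sketch on toy labels and scores, not the model trained in this article):&lt;/p&gt;

&lt;p&gt;from sklearn.metrics import precision_score, recall_score, roc_auc_score&lt;br&gt;
&lt;br&gt;
# Toy ground-truth labels (1 = returned and purchased) and predicted probabilities&lt;br&gt;
y_true = [0, 0, 1, 0, 1, 1, 0, 1]&lt;br&gt;
y_prob = [0.1, 0.4, 0.35, 0.2, 0.8, 0.65, 0.3, 0.9]&lt;br&gt;
&lt;br&gt;
print("ROC AUC:", roc_auc_score(y_true, y_prob))&lt;br&gt;
&lt;br&gt;
# Turn probabilities into decisions at a 0.5 threshold&lt;br&gt;
y_pred = [1 if p &amp;gt; 0.5 else 0 for p in y_prob]&lt;br&gt;
print("Precision:", precision_score(y_true, y_pred))  # penalised by false positives&lt;br&gt;
print("Recall:", recall_score(y_true, y_pred))  # penalised by false negatives&lt;/p&gt;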

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2Zv9IzZz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1wkdvn2fd1h1pk4zd38h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2Zv9IzZz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1wkdvn2fd1h1pk4zd38h.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improving model performance with feature engineering, using:&lt;br&gt;
How far the visitor got in the checkout process on their first visit&lt;br&gt;
Where the visitor came from (traffic source: organic search, referring site etc.)&lt;br&gt;
Device category (mobile, tablet, desktop)&lt;br&gt;
Geographic information (country)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Creating a second model:&lt;br&gt;
CREATE OR REPLACE MODEL &lt;code&gt;ecommerce.classification_model_2&lt;/code&gt;&lt;br&gt;
OPTIONS&lt;br&gt;
  (model_type='logistic_reg', labels = ['will_buy_on_return_visit']) AS&lt;br&gt;
WITH all_visitor_stats AS (&lt;br&gt;
SELECT&lt;br&gt;
  fullvisitorid,&lt;br&gt;
  IF(COUNTIF(totals.transactions &amp;gt; 0 AND totals.newVisits IS NULL) &amp;gt; 0, 1, 0) AS will_buy_on_return_visit&lt;br&gt;
  FROM &lt;code&gt;data-to-insights.ecommerce.web_analytics&lt;/code&gt;&lt;br&gt;
  GROUP BY fullvisitorid&lt;br&gt;
)&lt;/p&gt;

&lt;p&gt;# add in new features&lt;/p&gt;

&lt;p&gt;SELECT * EXCEPT(unique_session_id) FROM (&lt;br&gt;
  SELECT&lt;br&gt;
      CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,&lt;br&gt;
      # labels&lt;br&gt;
      will_buy_on_return_visit,&lt;br&gt;
      MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,&lt;br&gt;
      # behavior on the site&lt;br&gt;
      IFNULL(totals.bounces, 0) AS bounces,&lt;br&gt;
      IFNULL(totals.timeOnSite, 0) AS time_on_site,&lt;br&gt;
      totals.pageviews,&lt;br&gt;
      # where the visitor came from&lt;br&gt;
      trafficSource.source,&lt;br&gt;
      trafficSource.medium,&lt;br&gt;
      channelGrouping,&lt;br&gt;
      # mobile or desktop&lt;br&gt;
      device.deviceCategory,&lt;br&gt;
      # geographic&lt;br&gt;
      IFNULL(geoNetwork.country, "") AS country&lt;br&gt;
  FROM &lt;code&gt;data-to-insights.ecommerce.web_analytics&lt;/code&gt;,&lt;br&gt;
     UNNEST(hits) AS h&lt;br&gt;
    JOIN all_visitor_stats USING(fullvisitorid)&lt;br&gt;
  WHERE 1=1&lt;br&gt;
    # only predict for new visits&lt;br&gt;
    AND totals.newVisits = 1&lt;br&gt;
    AND date BETWEEN '20160801' AND '20170430' # train 9 months&lt;br&gt;
  GROUP BY&lt;br&gt;
  unique_session_id,&lt;br&gt;
  will_buy_on_return_visit,&lt;br&gt;
  bounces,&lt;br&gt;
  time_on_site,&lt;br&gt;
  totals.pageviews,&lt;br&gt;
  trafficSource.source,&lt;br&gt;
  trafficSource.medium,&lt;br&gt;
  channelGrouping,&lt;br&gt;
  device.deviceCategory,&lt;br&gt;
  country&lt;br&gt;
);&lt;/p&gt;

&lt;p&gt;To check the ROC-AUC&lt;/p&gt;

&lt;p&gt;#standardSQL&lt;/p&gt;

&lt;p&gt;SELECT&lt;br&gt;
  roc_auc,&lt;br&gt;
  CASE&lt;br&gt;
    WHEN roc_auc &amp;gt; .9 THEN 'good'&lt;br&gt;
    WHEN roc_auc &amp;gt; .8 THEN 'fair'&lt;br&gt;
    WHEN roc_auc &amp;gt; .7 THEN 'not great'&lt;br&gt;
  ELSE 'poor' END AS model_quality&lt;br&gt;
FROM&lt;br&gt;
  ML.EVALUATE(MODEL ecommerce.classification_model_2,  (&lt;br&gt;
WITH all_visitor_stats AS (&lt;br&gt;
SELECT&lt;br&gt;
  fullvisitorid,&lt;br&gt;
  IF(COUNTIF(totals.transactions &amp;gt; 0 AND totals.newVisits IS NULL) &amp;gt; 0, 1, 0) AS will_buy_on_return_visit&lt;br&gt;
  FROM &lt;code&gt;data-to-insights.ecommerce.web_analytics&lt;/code&gt;&lt;br&gt;
  GROUP BY fullvisitorid&lt;br&gt;
)&lt;/p&gt;

&lt;p&gt;# add in new features&lt;/p&gt;

&lt;p&gt;SELECT * EXCEPT(unique_session_id) FROM (&lt;br&gt;
  SELECT&lt;br&gt;
      CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,&lt;br&gt;
      # labels&lt;br&gt;
      will_buy_on_return_visit,&lt;br&gt;
      MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,&lt;br&gt;
      # behavior on the site&lt;br&gt;
      IFNULL(totals.bounces, 0) AS bounces,&lt;br&gt;
      IFNULL(totals.timeOnSite, 0) AS time_on_site,&lt;br&gt;
      totals.pageviews,&lt;br&gt;
      # where the visitor came from&lt;br&gt;
      trafficSource.source,&lt;br&gt;
      trafficSource.medium,&lt;br&gt;
      channelGrouping,&lt;br&gt;
      # mobile or desktop&lt;br&gt;
      device.deviceCategory,&lt;br&gt;
      # geographic&lt;br&gt;
      IFNULL(geoNetwork.country, "") AS country&lt;br&gt;
  FROM &lt;code&gt;data-to-insights.ecommerce.web_analytics&lt;/code&gt;,&lt;br&gt;
     UNNEST(hits) AS h&lt;br&gt;
    JOIN all_visitor_stats USING(fullvisitorid)&lt;br&gt;
  WHERE 1=1&lt;br&gt;
    # only predict for new visits&lt;br&gt;
    AND totals.newVisits = 1&lt;br&gt;
    AND date BETWEEN '20170501' AND '20170630' # eval 2 months&lt;br&gt;
  GROUP BY&lt;br&gt;
  unique_session_id,&lt;br&gt;
  will_buy_on_return_visit,&lt;br&gt;
  bounces,&lt;br&gt;
  time_on_site,&lt;br&gt;
  totals.pageviews,&lt;br&gt;
  trafficSource.source,&lt;br&gt;
  trafficSource.medium,&lt;br&gt;
  channelGrouping,&lt;br&gt;
  device.deviceCategory,&lt;br&gt;
  country&lt;br&gt;
)&lt;br&gt;
));&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XXHzFkkG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5ug1vixzboooiija1fyv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XXHzFkkG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5ug1vixzboooiija1fyv.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ERiAolDK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m1k1d735knjky3mykh6w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ERiAolDK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m1k1d735knjky3mykh6w.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Results&lt;br&gt;
Of the top 6% of first-time visitors (sorted in decreasing order of predicted probability), more than 6% make a purchase in a later visit.&lt;/p&gt;

&lt;p&gt;These users represent nearly 50% of all first-time visitors who make a purchase in a later visit.&lt;/p&gt;

&lt;p&gt;Overall, only 0.7% of first-time visitors make a purchase in a later visit.&lt;/p&gt;

&lt;p&gt;Targeting the top 6% of first-time visitors increases marketing ROI by 9x vs targeting them all!  &lt;/p&gt;

</description>
    </item>
    <item>
      <title>Data Engineering - Creating a Streaming Data Pipeline for a Real-Time Dashboard with Dataflow</title>
      <dc:creator>Nkwam Philip </dc:creator>
      <pubDate>Thu, 09 Jun 2022 12:19:04 +0000</pubDate>
      <link>https://dev.to/nkwamphilip/data-engineering-creating-a-streaming-data-pipeline-for-a-real-time-dashboard-with-dataflow-b2l</link>
      <guid>https://dev.to/nkwamphilip/data-engineering-creating-a-streaming-data-pipeline-for-a-real-time-dashboard-with-dataflow-b2l</guid>
<description>&lt;p&gt;Thinking of GCP as simplified infrastructure is a genuinely positive thing to say, and not a feigned compliment: we can all see it in how unified the platform is.&lt;br&gt;
I used GCP-provisioned resources to create a streaming data pipeline for a real-time dashboard with Dataflow.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I sourced a public streaming dataset (a Pub/Sub topic) of NYC taxi rides, collected from Google's public Pub/Sub data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NSH9jdTX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ymv0leyklf9w8l3529tv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NSH9jdTX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ymv0leyklf9w8l3529tv.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After signing in to my GCP Cloud Account, I navigated to "BigQuery" on the side nav, and would thereafter use the command line to create a dataset called taxirides.&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SfUh4Ohu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hydo4hn2p6vjz6nxv2gd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SfUh4Ohu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hydo4hn2p6vjz6nxv2gd.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the command line, I ran 'bq mk taxirides' to make a dataset under my project.&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5cwWTIeV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qh22yk90vjwfyazvf3ip.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5cwWTIeV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qh22yk90vjwfyazvf3ip.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I would thereafter create a table in the dataset with my specified schema - the blueprint of my table's data:&lt;br&gt;
bq mk \&lt;br&gt;
--time_partitioning_field timestamp \&lt;br&gt;
--schema ride_id:string,point_idx:integer,latitude:float,longitude:float,\&lt;br&gt;
timestamp:timestamp,meter_reading:float,meter_increment:float,ride_status:string,\&lt;br&gt;
passenger_count:integer -t taxirides.realtime&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g8MEHycw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c9ddv1epxvhbghn9ef6b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g8MEHycw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c9ddv1epxvhbghn9ef6b.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;br&gt;
My schema has been successfully created, and you can clearly see it in the dropdown on the left of the screen; a quick command-line check is sketched below.&lt;/p&gt;
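
&lt;p&gt;To confirm the dataset and table exist before moving on, a quick check from the same command line (just a sketch, using the names above):&lt;br&gt;
# list datasets in the current project, then print the new table's schema&lt;br&gt;
bq ls&lt;br&gt;
bq show --schema --format=prettyjson taxirides.realtime&lt;/p&gt;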

&lt;p&gt;Before creating my data pipeline, I'd create a bucket on Cloud Storage to serve as my data lake (a command-line equivalent is sketched below).&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dJEDqV5A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ws6b1ad2ivzrvw3j7tbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dJEDqV5A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ws6b1ad2ivzrvw3j7tbg.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;br&gt;
I named the bucket after my project ID and selected a multi-region.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8_bbXKRl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c2ecvk3pbyh4abpltj01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8_bbXKRl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c2ecvk3pbyh4abpltj01.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;br&gt;
And I have my bucket!&lt;/p&gt;
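
&lt;p&gt;For anyone who prefers the command line, roughly the same bucket can be created with gsutil (a sketch; the bucket name and multi-region are stand-ins for the values I picked in the console):&lt;br&gt;
# create a multi-region bucket named after the project, to act as the data lake&lt;br&gt;
PROJECT_ID=$(gcloud config get-value project)&lt;br&gt;
gsutil mb -l US gs://$PROJECT_ID/&lt;/p&gt;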

&lt;p&gt;The next thing to do is to make sure the Dataflow API is enabled, as this will serve our Dataflow pipeline.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LFRSHu-m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nvfh0gnmb6qgdzvry46d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LFRSHu-m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nvfh0gnmb6qgdzvry46d.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;br&gt;
I searched for it in the search bar and voila, it's there.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4l8Dz6xW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rhz8064n3pa786ehbg3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4l8Dz6xW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rhz8064n3pa786ehbg3m.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;br&gt;
I'd click on "Manage", then disable and re-enable the API.&lt;br&gt;
After enabling the API, I'll proceed to create my pipeline from an existing template on Dataflow.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6TF5hLNZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hik7ssw0aop0olhui0gu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6TF5hLNZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hik7ssw0aop0olhui0gu.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;br&gt;
Right Here!&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tgDtbwxn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8z3lwbelspa8uewldcub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tgDtbwxn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8z3lwbelspa8uewldcub.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;br&gt;
I'll start by creating a Job from a template,&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LpNsBBR0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oypkgdlt8uwgsdkbfnps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LpNsBBR0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oypkgdlt8uwgsdkbfnps.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;br&gt;
I insert my job name and select the "Pub/Sub Topic to BigQuery" template, since that is where my data is coming from.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Szk4F9m1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t8zyuq1sjv7p536orgi7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Szk4F9m1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/t8zyuq1sjv7p536orgi7.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;br&gt;
I fill in the required parameters, comprising the Pub/Sub topic, my BigQuery table as the output, and a temporary location in my bucket.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9SQxXN3Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/54kul8ryig243fut3wqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9SQxXN3Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/54kul8ryig243fut3wqd.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;br&gt;
After filling in the values, I'd make sure to specify the number of Compute Engine instances - max workers as 2, number of workers as 2.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G8f49kb7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nzvahlt9ccsxme515kfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G8f49kb7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nzvahlt9ccsxme515kfp.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;br&gt;
Then I run the job. A data pipeline is created after this, and I can refresh my Cloud Storage bucket now before I move on to check the data in BigQuery (the equivalent gcloud commands are sketched below).&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bzjexTR9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b1fqokztriuhtd913hvy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bzjexTR9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b1fqokztriuhtd913hvy.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oioW8Tl6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0mg89mx461p851zf3o75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oioW8Tl6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0mg89mx461p851zf3o75.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Keep in mind that BigQuery is the output of the Dataflow pipeline.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uvBnFJ7o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r3h0dsag3m5lvcl97npc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uvBnFJ7o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r3h0dsag3m5lvcl97npc.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now I successfully have my datasets in BigQuery.&lt;/p&gt;

&lt;p&gt;We can perform aggregations on the stream for reporting. I'll navigate to the query editor and input my query, as sketched after the screenshots below.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DeqYtwNa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/py213pc4iyl7q101l1ms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DeqYtwNa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/py213pc4iyl7q101l1ms.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jxCZ2bEN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9i4ydlx9ubh6hwkfrk8a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jxCZ2bEN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9i4ydlx9ubh6hwkfrk8a.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I can also choose to save this transformed result back to my data warehouse or anywhere else, or use a scheduled query that keeps returning the same transformed data from the streaming pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6PiwvbaA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/58ed4a3qhmggtbecfm9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6PiwvbaA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/58ed4a3qhmggtbecfm9q.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;I chose to explore with Data Studio.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GNoVH0Bi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tl7j504vj44ffh2q28j5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GNoVH0Bi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tl7j504vj44ffh2q28j5.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wIug_2HT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/74zkagm90w2jf7jlpl2h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wIug_2HT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/74zkagm90w2jf7jlpl2h.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the Reports page, in the Start with a Template section, click the [+] Blank Report template.&lt;br&gt;
If prompted with the Welcome to Google Data Studio window, click Get started. Check the checkbox to acknowledge the Google Data Studio Additional Terms, and click Continue.&lt;br&gt;
Select No to all the questions, then click Continue.  &lt;/p&gt;

&lt;p&gt;I switched back to BigQuery and explored with Data Studio again.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2Hlp82l7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g1yg56z7z7aygtteobn4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2Hlp82l7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g1yg56z7z7aygtteobn4.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I selected a Combo Chart and specified:&lt;br&gt;
Date range Dimension: dashboard_sort&lt;br&gt;
Dimension: dashboard_sort&lt;br&gt;
Drill Down: dashboard_sort (Make sure that Drill down option is turned ON)&lt;br&gt;
Metric: SUM() total_rides, SUM() total_passengers, SUM() total_revenue&lt;br&gt;
Sort: dashboard_sort, Ascending (latest rides first) &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XRfHDaGr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e5mvs2mmt1vf1c4ji2or.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XRfHDaGr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/e5mvs2mmt1vf1c4ji2or.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;NB: Visualizing data at a minute-level granularity is currently not supported in Data Studio as a timestamp. This is why we created our own dashboard_sort dimension.&lt;/p&gt;

</description>
      <category>database</category>
      <category>googlecloud</category>
      <category>dataengineering</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Hassles pushing a deployed heroku app to my github repository and re-deploying to heroku.</title>
      <dc:creator>Nkwam Philip </dc:creator>
      <pubDate>Tue, 24 May 2022 23:46:11 +0000</pubDate>
      <link>https://dev.to/nkwamphilip/hassles-pushing-a-deployed-heroku-app-to-my-github-repository-and-re-deploying-to-heroku-eh6</link>
      <guid>https://dev.to/nkwamphilip/hassles-pushing-a-deployed-heroku-app-to-my-github-repository-and-re-deploying-to-heroku-eh6</guid>
<description>&lt;p&gt;Just so you don’t spend over 3 hours trying to push a web app to Heroku after pushing it to your GitHub repo, here I’ll show you practically how to push your web app to Heroku, change the branch, and push the same code to your GitHub repository. I had just deployed a Flask CRUD app to Heroku, &lt;a href="https://flaskcrudmanagerapp.herokuapp.com/"&gt;https://flaskcrudmanagerapp.herokuapp.com/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T1oIeriG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1p1by51pnh90oe2csk1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T1oIeriG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1p1by51pnh90oe2csk1l.png" alt="Image description" width="880" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the master branch was the one deployed to heroku.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wk-_VZY4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w0d6usn79mtnul6m9sxf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wk-_VZY4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w0d6usn79mtnul6m9sxf.png" alt="Image description" width="880" height="91"&gt;&lt;/a&gt;&lt;br&gt;
I made several commit changes afterwards, pretty much to show it’s a working app I pushed into production and have modified multiple times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zPl7caX_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4cicrkgeqmprl0pbev5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zPl7caX_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4cicrkgeqmprl0pbev5e.png" alt="Image description" width="880" height="187"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;But I just thought I should at least have my web app in my GitHub repository. So I created a new GitHub repository, then added a new origin remote for the main branch, as sketched below. Meanwhile, I had checked the list of remote repo links; you can see clearly in the picture above that I have heroku and an origin alongside.&lt;/p&gt;
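
&lt;p&gt;Roughly, the remote setup looked like this (the GitHub URL is a placeholder, not my actual repo):&lt;br&gt;
# add the new GitHub repo as a second remote next to heroku&lt;br&gt;
git remote add origin https://github.com/YOUR_USERNAME/flask-crud-manager.git&lt;br&gt;
# list all remotes to confirm both heroku and origin are there&lt;br&gt;
git remote -v&lt;br&gt;
# push the code to GitHub on the main branch&lt;br&gt;
git push -u origin main&lt;/p&gt;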

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l5KKCj00--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/55e9li0g98mca3cgpixg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l5KKCj00--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/55e9li0g98mca3cgpixg.png" alt="Image description" width="880" height="109"&gt;&lt;/a&gt; &lt;br&gt;
Most importantly, origin was pushed as the MAIN branch. That's literally where the problem unknowingly started.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--unfzvWCY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6h07bolqtkqigon5agji.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--unfzvWCY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6h07bolqtkqigon5agji.png" alt="Image description" width="880" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then I tried pushing a change to my deployed Heroku app, but whoosh, it stopped working.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZbuZdURv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3rk4p1p67nmq0ll1oedx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZbuZdURv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3rk4p1p67nmq0ll1oedx.png" alt="Image description" width="880" height="109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I removed the origin remote and main branch pointing to the GitHub repo I made earlier, then tried pushing my Heroku app again, but it still didn't work. It got hilarious at a point; I had just removed the very thing that got me into the problem.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wsQ9uqj7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9c4incju7k5ocwsv6uce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wsQ9uqj7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9c4incju7k5ocwsv6uce.png" alt="Image description" width="880" height="125"&gt;&lt;/a&gt;&lt;br&gt;
It did tell me there was no match for master.&lt;/p&gt;

&lt;p&gt;I had to remove the heroku remote; sadly, I thought I was in a messier situation, but I luckily added it back. I tried pushing again, but nope, it didn't work.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ryC6Vmx7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ylrq5vbeotu4qgwsjik7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ryC6Vmx7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ylrq5vbeotu4qgwsjik7.png" alt="Image description" width="880" height="215"&gt;&lt;/a&gt;&lt;br&gt;
i used "Get-url", staged with force (--force, -f) but it still didnt work. Could have actually given up buh nope, i eventually figured it out hereeee, i pulled the branch and pushed with force, lol.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X9CpVUI---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zjw77ga1xvck3ul15rht.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X9CpVUI---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zjw77ga1xvck3ul15rht.png" alt="Image description" width="880" height="123"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I believe you've seen the solution already. It happened that there was no branch (since I deleted the remote main branch for the GitHub repo I created earlier), and I had to push to heroku as the main branch, the upstream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;One common cause of this behavior is attempting to deploy code from a different branch; in fact, the branch didn't exist at all&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sTw_8XAQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k7lfkyvgtnuuf21jr8n1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sTw_8XAQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k7lfkyvgtnuuf21jr8n1.png" alt="Image description" width="880" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, I set Heroku as the upstream for my main branch and voila, I can now push my commit changes to my live app, my dear Flask application. The commands are sketched below.&lt;/p&gt;
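
&lt;p&gt;For anyone hitting the same wall, this is roughly what the recovery looked like (the Heroku git URL follows the standard https://git.heroku.com/APP_NAME.git pattern, and the branch names are assumptions matching this post):&lt;br&gt;
# re-add the heroku remote that I removed earlier&lt;br&gt;
git remote add heroku https://git.heroku.com/flaskcrudmanagerapp.git&lt;br&gt;
# pull what Heroku already has, then push the local branch back up as the upstream&lt;br&gt;
git pull heroku main&lt;br&gt;
git push --force --set-upstream heroku main&lt;br&gt;
# if the app was originally deployed from master, use master instead of main&lt;/p&gt;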

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m2o-Syv0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3wcgl1r7rdoqhza28t0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m2o-Syv0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3wcgl1r7rdoqhza28t0y.png" alt="Image description" width="880" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yeah, that reminds me, I have to remove the staging app I force-created earlier.&lt;/p&gt;

&lt;p&gt;I trust you had a great read here.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>github</category>
      <category>heroku</category>
    </item>
  </channel>
</rss>
