<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Massy</title>
    <description>The latest articles on DEV Community by Massy (@alumassy).</description>
    <link>https://dev.to/alumassy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1055087%2F76f8c9b3-d5ee-44cd-85da-fa2b7d5a3998.jpg</url>
      <title>DEV Community: Massy</title>
      <link>https://dev.to/alumassy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alumassy"/>
    <language>en</language>
    <item>
      <title>The evolution of Data Engineering and the role of ELT tools</title>
      <dc:creator>Massy</dc:creator>
      <pubDate>Mon, 12 May 2025 10:10:32 +0000</pubDate>
      <link>https://dev.to/alumassy/the-evolution-of-data-engineering-and-the-role-of-elt-tools-1al2</link>
      <guid>https://dev.to/alumassy/the-evolution-of-data-engineering-and-the-role-of-elt-tools-1al2</guid>
      <description>&lt;p&gt;Data engineering has progressed rapidly in the past 3 decades. The warp speed changes in the field have created a significant knowledge gap for existing data Engineers, people interested in moving into a career in data engineering, data Scientists, machine learning engineers, BI &amp;amp; analytics teams, software &amp;amp; infrastructure teams as well as executives who want to better understand how data engineering fits into their companies.&lt;/p&gt;

&lt;p&gt;In the data engineering space, a good deal of ceremony occurs around data movement and processing to effectively support downstream use cases such as data science (AI/ML), business intelligence, and operational analytics in production. It therefore comes as no surprise that data movement and processing practices and tools are at the forefront of the data engineering evolution.&lt;/p&gt;

&lt;p&gt;This article begins by explaining the long-established data movement and processing pattern known as ETL and the shift to the newer pattern known as ELT. It then covers the ELT process in detail and its benefits. Subsequent sections highlight the top ELT tools, describe the Airbyte approach to ELT, and feature crucial data engineering predictions that might redefine the way we work with, process, and harness data throughout 2025 and beyond.&lt;/p&gt;

&lt;h2&gt;
  
  
  The evolution of Data Engineering
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Birth of Data Engineering
&lt;/h3&gt;

&lt;p&gt;Data engineering &lt;strong&gt;as a practice&lt;/strong&gt; has existed in some form since companies started doing things with data—such as predictive analytics, descriptive analytics, and reports. Before it came into sharp focus &lt;strong&gt;as a distinct field&lt;/strong&gt; alongside the rise of data science in the 2010s, the practice had been branded in a whole host of different ways, including as Database Administration, Data Analysis, Business Intelligence Engineering, Database Development, and more.&lt;/p&gt;

&lt;p&gt;The birth of Data Engineering can arguably be traced back to data warehousing, originating as early as the 1970s, with the business data warehouse taking shape in the 1980s, and Bill Inmon officially coining the term "data warehouse" in 1989. The advent of database technology in this period saw enterprises employ online transactional processing (OLTP) systems, which offered efficient methods for storing, querying, and updating transactional and operational data, typically managed by relational database management systems (RDBMS). OLTP systems were designed for application-oriented data collection and maintaining the most current state of the enterprise, optimised for multiple, concurrent, and fast reads and writes, ensuring ACID properties (atomicity, consistency, isolation, durability).&lt;/p&gt;

&lt;p&gt;With the ability to manage data logistics, the next logical step for enterprises was to leverage this data for insights and profitability. This led to the emergence of online analytical processing (OLAP) systems around the mid-1990s, which became the cornerstone of business intelligence and decision support. OLAP systems utilise a data warehouse (DW) or enterprise data warehouse (EDW) specifically designed for analytical processes, maintaining historical and cumulative data and performing heavy operations like user-defined functions, aggregates, and complex joins for business analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Emergence of ETL
&lt;/h3&gt;

&lt;p&gt;To move data from OLTP systems (the current state) to OLAP systems (the historical data), ETL (Extract, Transform, Load) processes emerged. In its original form, still relevant today, ETL involves identifying and extracting relevant data from various sources, transforming it for cleansing and customisation, and finally loading it into a data warehouse. This often involves a Data Staging Area (DSA) where transformations take place before loading into fact and dimension tables. Early ETL processes faced challenges such as schema mapping, data cleansing and quality, complex transformations, and a lack of standardisation.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Shift to ELT
&lt;/h3&gt;

&lt;p&gt;A significant shift in data engineering occurred with the advent of the modern data stack, catalysed by the release of Amazon Redshift, a &lt;strong&gt;cloud-native&lt;/strong&gt; massively parallel processing (MPP) / OLAP database, in October 2012. While ETL served as the primary method for data processing for decades, the evolution of cloud technologies led to the rise of Extract, Load, Transform (ELT) as a modern alternative. Traditionally, data was transformed before loading into the data warehouse because the warehouse was often too slow and constrained to handle heavy processing itself. Business intelligence (BI) tools also performed local data processing to circumvent warehouse bottlenecks, and data processing was centrally governed to avoid overwhelming the warehouse.&lt;/p&gt;

&lt;p&gt;The cloud, a significant 21st-century innovation, revolutionised how data is extracted, loaded, and transformed. The cloud flips the on-premises model by offering rented hardware and managed services, allowing for dynamic scaling of resources. This scalability and the pay-as-you-go model of cloud data warehouses have made them accessible even to smaller companies.&lt;/p&gt;

&lt;p&gt;In the ELT data warehouse architecture, data is moved more or less directly from production systems into a staging area within the data warehouse in its raw form. Transformations are then handled directly within the data warehouse, leveraging the massive computational power of cloud data warehouses and processing tools. This data is processed in batches, and the transformed output is written into tables and views for analytics. ELT is also popular today in streaming arrangements, where events are streamed and subsequently transformed within the data warehouse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the ELT Process
&lt;/h2&gt;

&lt;p&gt;The ELT process comprises three main stages: Extract, Load, and Transform.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extract: This involves retrieving data from various source systems, which can be pull-based or push-based and may require reading metadata and schema changes. Data is extracted from sources like relational databases, CRM systems, cloud applications, or APIs – essentially the data intended for eventual analytics use. Accessing these diverse sources can be simplified by managed data connector platforms and frameworks like Airbyte, reducing the need for custom development. These tools automate pipeline creation and management, extracting data and loading it into data warehouses via user interfaces.
&lt;/li&gt;
&lt;li&gt;Load: Once extracted, data is loaded into a target data platform (data warehouse, data lake). Unlike ETL, ELT loads raw data immediately after extraction, making data available for analysis much faster. This step is efficient as it requires no upfront transformation. This process is often referred to as data ingestion – the movement of data from a source to a destination. Data integration, in contrast, combines data from disparate sources into a new dataset. Ingestion processes can be batch, micro-batch, or real-time.
&lt;/li&gt;
&lt;li&gt;Transform: In the final step, the raw data loaded into the data warehouse is transformed for analytical use. This often involves light transformations like casting data types, standardising time zones, and renaming fields, as well as heavy transformations that incorporate business logic, create materialisations, and join data. Data quality checks (QA) are also crucial during this stage.&lt;/li&gt;
&lt;/ul&gt;
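&lt;p&gt;The three stages above can be sketched end-to-end. The following is a minimal, illustrative Python example — not any specific tool's implementation — that uses an in-memory SQLite database as a stand-in for a cloud warehouse; the source records, table names, and transformation are hypothetical:&lt;/p&gt;

```python
import sqlite3

# Extract: hypothetical raw records pulled from a source system (e.g. an API).
extracted = [
    {"order_id": 1, "customer": "acme", "amount_cents": 1250},
    {"order_id": 2, "customer": "acme", "amount_cents": 800},
    {"order_id": 3, "customer": "globex", "amount_cents": 4300},
]

warehouse = sqlite3.connect(":memory:")  # stand-in for the cloud warehouse

# Load: land the raw data as-is, with no upfront transformation.
warehouse.execute("CREATE TABLE raw_orders (order_id INT, customer TEXT, amount_cents INT)")
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (:order_id, :customer, :amount_cents)", extracted
)

# Transform: apply business logic inside the warehouse itself.
warehouse.execute(
    "CREATE TABLE customer_revenue AS "
    "SELECT customer, SUM(amount_cents) / 100.0 AS revenue "
    "FROM raw_orders GROUP BY customer"
)

for row in warehouse.execute("SELECT customer, revenue FROM customer_revenue ORDER BY customer"):
    print(row)
```

&lt;p&gt;Note how the transformation is expressed as SQL run against the destination, which is the defining trait of ELT: the raw table remains available for re-processing at any time.&lt;/p&gt;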

&lt;h2&gt;
  
  
  Benefits of ELT
&lt;/h2&gt;

&lt;p&gt;ELT offers several advantages, particularly in cloud-based environments where scalability, flexibility, and performance are paramount. Key benefits include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; ELT leverages the vast computational power of modern cloud data warehouses, enabling effortless scaling of data pipelines to handle growing data volumes without performance bottlenecks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Faster data availability:&lt;/strong&gt; By loading raw data immediately, ELT makes data available for analysis much more quickly than ETL, which is crucial for organisations needing near-real-time insights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost efficiency:&lt;/strong&gt; ELT reduces the need for expensive on-premise ETL tools and infrastructure by offloading processing to the cloud and utilising pay-as-you-go resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flexibility:&lt;/strong&gt; Because raw data is readily available in the warehouse, analysts can apply transformations iteratively and adapt to changing business requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simplified pipelines:&lt;/strong&gt; ELT eliminates the need for upfront data transformation before loading, reducing complexity and improving overall pipeline management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adoption of software development best practices:&lt;/strong&gt; Performing transformations last in the pipeline allows for code-based, version-controlled transformations, enabling easy recreation of historical transformations, code-based tests, CI/CD workflows, and documentation of data models like typical software code.&lt;/p&gt;
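&lt;p&gt;As a small illustration of that last point, transformation logic that lives in code can be unit-tested like any other software. The sketch below is purely illustrative — the transformation and field names are invented, not taken from any particular tool:&lt;/p&gt;

```python
def standardise_order(raw):
    """Light transformation: cast types and rename fields (illustrative only)."""
    return {
        "order_id": int(raw["id"]),
        "customer": raw["customer_name"].strip().lower(),
        "revenue": round(float(raw["amount"]), 2),
    }

# A code-based test that can run in CI on every change to the model.
def test_standardise_order():
    raw = {"id": "42", "customer_name": "  Acme Corp ", "amount": "19.999"}
    expected = {"order_id": 42, "customer": "acme corp", "revenue": 20.0}
    assert standardise_order(raw) == expected

test_standardise_order()
print("ok")
```

&lt;p&gt;Because the transformation is plain, version-controlled code, the same test suite guards every historical and future revision of the model.&lt;/p&gt;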

&lt;h2&gt;
  
  
  Top ELT Tools
&lt;/h2&gt;

&lt;p&gt;The modern data stack, which facilitates the ELT workflow, comprises various tools that have become reasonably consistent over time. These can be broadly categorised as:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingestion:&lt;/strong&gt; Tools like Airbyte, Fivetran and Stitch simplify the extraction and loading of data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warehousing/Lakehouse Platforms:&lt;/strong&gt; Cloud data warehouses such as BigQuery, Databricks, Redshift and Snowflake serve as the primary storage and transformation environment. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transformation:&lt;/strong&gt; dbt (data build tool) has emerged as a popular tool specifically for the transformation step within the data warehouse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BI&lt;/strong&gt;: Tools like Looker, Mode, Periscope, Chartio, Metabase, and Redash are used for data visualisation and analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Workflow Orchestration Tools:&lt;/strong&gt; While not strictly ELT tools, orchestrators like Apache Airflow are essential for scheduling and managing the entire ELT pipeline.&lt;/p&gt;

&lt;p&gt;Some tools focus primarily on data integration (the EL part of ELT), while others, like dbt, focus on transformation (the T part). Some tools can also perform both ETL and ELT. Cloud vendors also have proprietary services for storage and databases, often bundled to work well together within their ecosystem. Examples include AWS Glue with Redshift, Databricks Workflows, Microsoft Fabric Data Factory, and BigQuery's Data Transfer Service and integration with Dataform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Airbyte and the ELT Workflow
&lt;/h2&gt;

&lt;p&gt;Airbyte is an open-source data integration platform designed to consolidate data from various sources into data warehouses, lakes, and databases. It plays the “EL” role in the ELT workflow and is available in both self-managed and cloud versions. Airbyte simplifies self-serve data extraction from numerous API (550+), database, and file sources, offering predictable data loading into more than 25 destinations while managing typing and deduplication.&lt;/p&gt;

&lt;p&gt;Airbyte enables users to build connectors using a no-code builder for HTTP APIs or a low-code CDK for REST APIs, significantly reducing development effort. Its unified platform ensures reliability across all data synchronisations, allowing control over schema propagation and flexible sync frequencies. &lt;/p&gt;

&lt;p&gt;Airbyte also provides transformation capabilities as a critical part of the ELT process, allowing users to convert raw data into a more usable format after it has been loaded. This includes basic normalisation to convert JSON blobs into structured tables. Users can also implement custom transformations using SQL or integrate with dbt Cloud for more complex transformations.&lt;/p&gt;
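&lt;p&gt;To make that normalisation step concrete, here is a minimal sketch — not Airbyte's actual implementation — of flattening raw JSON blobs into a structured, typed table. SQLite stands in for the destination, and the record fields are hypothetical:&lt;/p&gt;

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")

# Raw zone: records land as opaque JSON blobs, exactly as extracted.
db.execute("CREATE TABLE _raw_users (data TEXT)")
blobs = [
    json.dumps({"id": 1, "name": "Ada", "email": "ada@example.com"}),
    json.dumps({"id": 2, "name": "Grace", "email": "grace@example.com"}),
]
db.executemany("INSERT INTO _raw_users VALUES (?)", [(b,) for b in blobs])

# Normalisation: parse each blob and write typed, named columns.
db.execute("CREATE TABLE users (id INT, name TEXT, email TEXT)")
rows = db.execute("SELECT data FROM _raw_users").fetchall()
for (blob,) in rows:
    rec = json.loads(blob)
    db.execute("INSERT INTO users VALUES (?, ?, ?)", (rec["id"], rec["name"], rec["email"]))

print(db.execute("SELECT id, name FROM users ORDER BY id").fetchall())
```

&lt;p&gt;Keeping the raw table alongside the normalised one means the structured output can always be rebuilt if the normalisation logic changes.&lt;/p&gt;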

&lt;p&gt;As highlighted above, Airbyte strongly favours ELT over ETL.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future of Data Engineering and ELT
&lt;/h2&gt;

&lt;p&gt;While nobody can predict the future, we have a good perspective on the past, the present, and current trends. Below are observations of ongoing developments and some speculation about the future.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;As more organisations shift towards cloud-based infrastructure, the modern data stack is, and will continue to be, the default choice of data architecture, and ELT will continue to play a crucial role in data integration processes.
&lt;/li&gt;
&lt;li&gt;ELT tools will continue to mature, extending their coverage to more use cases to become more reliable foundational technologies, sparking the next wave of innovation in the modern data stack.
&lt;/li&gt;
&lt;li&gt;The ELT workflow and the specific tools are changing and evolving rapidly, but the core aim will remain the same: to reduce complexity and increase modularisation. Plug-and-play modular tools with easy-to-understand pricing and implementation are the way of the future.
&lt;/li&gt;
&lt;li&gt;Batch transformations are overwhelmingly popular, but given the growing popularity of stream-processing solutions and the general increase in the amount of streaming data, the popularity of streaming transformations is expected to continue growing, perhaps entirely replacing batch processing in certain domains soon.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog post, we've explored the early days of data engineering, when the Extract, Transform, Load (ETL) data processing framework was popular, and the adoption of the ELT framework, driven mainly by cloud technology. We also looked at what the ELT framework consists of in detail, the benefits of ELT, and the top ELT tools. Lastly, we covered how Airbyte fits into the ELT process and future data engineering predictions.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://www.cs.uoi.gr/~pvassil/publications/TALKS/2023_03_dolap_tota/23DOLAP_TestOfTimeAward_CEUR-CR.pdf" rel="noopener noreferrer"&gt;The History, Present, and Future of ETL Technology&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.getdbt.com/blog/future-of-the-modern-data-stack" rel="noopener noreferrer"&gt;The Modern Data Stack: Past, Present, and Future | dbt Labs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.getdbt.com/blog/extract-load-transform" rel="noopener noreferrer"&gt;Understanding ELT: Extract, Load, Transform | dbt Labs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://docs.airbyte.com/using-airbyte/core-concepts/" rel="noopener noreferrer"&gt;Core Concepts | Airbyte Docs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/" rel="noopener noreferrer"&gt;Fundamentals of Data Engineering by Joe Reis &amp;amp; Matt Housley&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.getdbt.com/blog/2025-data-predictions" rel="noopener noreferrer"&gt;A decade of data evolution and 2025 predictions | dbt Labs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.restack.io/docs/airbyte-knowledge-airbyte-transformations-guide" rel="noopener noreferrer"&gt;Airbyte transformations guide — Restack&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.getdbt.com/blog/how-ai-will-disrupt-data-engineering" rel="noopener noreferrer"&gt;How AI will disrupt data engineering as we know it | dbt Labs&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.google.com/spreadsheets/d/1QKrtBpg6PliPMpcndpmkZpDVIz_o6_Y-LWTTvQ6CfHA/edit?gid=0#gid=0" rel="noopener noreferrer"&gt;ETL/ELT Benchmark_Public&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.researchgate.net/publication/375861478_Evolving_Paradigms_of_Data_Engineering_in_the_Modern_Era_Challenges_Innovations_and_Strategies" rel="noopener noreferrer"&gt;Evolving Paradigms of Data Engineering in the Modern Era: Challenges, Innovations, and Strategies&lt;/a&gt; &lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>dataengineering</category>
      <category>airbyte</category>
      <category>elt</category>
      <category>etl</category>
    </item>
    <item>
      <title>How to access the ERD (Entity-Relationship Diagram) of your database schema in MySQL Workbench</title>
      <dc:creator>Massy</dc:creator>
      <pubDate>Fri, 19 May 2023 04:12:21 +0000</pubDate>
      <link>https://dev.to/alumassy/how-to-access-the-erd-entity-relationship-diagram-of-your-database-schema-in-mysql-workbench-5813</link>
      <guid>https://dev.to/alumassy/how-to-access-the-erd-entity-relationship-diagram-of-your-database-schema-in-mysql-workbench-5813</guid>
      <description>&lt;p&gt;As a data analysis student, you may know what an Entity Relationship Diagram is and how it helps you quickly get familiar with a database schema and its main properties like tables, table relationships as well as table columns (and their respective data types)&lt;/p&gt;

&lt;p&gt;But because you've always been practising and answering the SQL questions using an online interactive SQL instance, you have only been able to view an ERD because it has always been provided to you as an image. This means you've never had to generate a physical entity relationship diagram in a Database Management System like MySQL.&lt;/p&gt;

&lt;p&gt;Yet, in the real workplace setting, or when you start answering SQL questions using a DBMS, you need to know how you can view or access the ERD of the database schema you intend to query from.&lt;/p&gt;

&lt;h2&gt;
  
  
  Steps to view the ERD (Entity-Relationship Diagram) of your database schema in MySQL Workbench
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Open MySQL Workbench and open your database connection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the top navigation bar, click on “Database” to expand the list of options.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8xdsx8biggrmq6g5iyh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8xdsx8biggrmq6g5iyh.png" alt=" " width="639" height="572"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select “Reverse Engineer” from the database context menu.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4h1mqfbqu1ga59lio91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4h1mqfbqu1ga59lio91.png" alt=" " width="649" height="606"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the “Reverse Engineer” dialogue box, click “Next” to set parameters for connecting to the Database Management System(DBMS).&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1i4wfu28g7u21lk2o066.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1i4wfu28g7u21lk2o066.png" alt=" " width="720" height="464"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Enter your password so that MySQL Workbench can connect to your DBMS, then click “Next”.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49hgnltzhd0frhli6c1q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F49hgnltzhd0frhli6c1q.png" alt=" " width="720" height="471"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the database schema for which you want to generate an ERD, then click “Next” again.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tt9aq8fjrtg8w6o1670.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5tt9aq8fjrtg8w6o1670.png" alt=" " width="720" height="559"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Enter your password again so that MySQL Workbench can retrieve and reverse engineer the schema objects for the schema you chose.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lg7xbfjwp5q48atzz1s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5lg7xbfjwp5q48atzz1s.png" alt=" " width="720" height="562"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click “Execute” to start the reverse engineering process. This may take a few moments depending on the size of your database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click “Finish” to see the ERD of your database schema displayed in the main window of MySQL Workbench.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the ERD tab, your ERD may look like this by default:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1l792m1vyesvu5byo2z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1l792m1vyesvu5byo2z.png" alt=" " width="720" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The default ERD above may require scrolling back and forth to make sense of its contents. But you can avoid this by customising the ERD layout by dragging the tables and resizing them so that you can see all of them at a glance without the need to scroll down.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frkoo9rwqqau6yc1r8xyo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frkoo9rwqqau6yc1r8xyo.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that’s a wrap!&lt;/p&gt;

</description>
      <category>mysql</category>
      <category>database</category>
      <category>dataanalysis</category>
      <category>sql</category>
    </item>
    <item>
      <title>Cohort Retention Analysis from A-Z in Tableau</title>
      <dc:creator>Massy</dc:creator>
      <pubDate>Sat, 13 May 2023 04:07:48 +0000</pubDate>
      <link>https://dev.to/alumassy/cohort-retention-analysis-from-a-z-in-tableau-4l0a</link>
      <guid>https://dev.to/alumassy/cohort-retention-analysis-from-a-z-in-tableau-4l0a</guid>
      <description>&lt;p&gt;I recently learnt how to carry out cohort retention analysis in Tableau.&lt;/p&gt;

&lt;p&gt;But most of the articles I came across used Tableau merely as a visualisation tool for the cohort retention results: they would, for example, calculate in SQL and then visualise in Tableau, or calculate in Excel and then visualise in Tableau.&lt;/p&gt;

&lt;p&gt;I found there’s a more efficient way to do all this: calculating the cohort retention rate and visualising the results all in one place — Tableau. Isn’t that awesome?&lt;/p&gt;

&lt;p&gt;In this article, you’ll learn :&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What cohort retention analysis is&lt;/li&gt;
&lt;li&gt;Why it’s important&lt;/li&gt;
&lt;li&gt;What data is required for cohort retention analysis &amp;amp; if it’s not available, how to calculate it (derived columns/calculated field)&lt;/li&gt;
&lt;li&gt;How to create the cohort retention table&lt;/li&gt;
&lt;li&gt;How to interpret the cohort retention rate&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;You should have a &lt;a href="https://id.tableau.com/register?clientId=wcS7HwY98qdfgBREHT7Xoln7ipc75U0a" rel="noopener noreferrer"&gt;Tableau public account&lt;/a&gt; to be able to create and publicly share your visualisation after you’re done.&lt;/li&gt;
&lt;li&gt;You should know the &lt;a href="https://www.tableau.com/blog/getting-ready-publish-your-first-data-visualization" rel="noopener noreferrer"&gt;basics&lt;/a&gt; of Tableau. In this article, I assume you’re already familiar with the Tableau environment. I’ll focus on teaching you how to use Tableau for cohort analysis specifically.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Understand cohort retention analysis
&lt;/h2&gt;

&lt;p&gt;Cohort analysis is used by businesses to understand the behaviour, patterns and trends of their customers so that they can subsequently tailor their products and services to the identified cohorts.&lt;/p&gt;

&lt;p&gt;You might ask yourself what a cohort is. A cohort is simply a group of people, in this case customers, who share common characteristics such as time or size. Cohort analysis, therefore, is the analysis of several different cohorts to gain a better understanding of their behaviour, patterns, and trends.&lt;/p&gt;

&lt;p&gt;There are different types of cohorts to analyze. They include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time-based cohorts&lt;/li&gt;
&lt;li&gt;Segment-based cohorts&lt;/li&gt;
&lt;li&gt;Size-based cohort&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this case, you are going to create time-based cohorts. Specifically, the cohort analysis you're going to do is retention-based: you will look at the quarter in which a group of customers made their first purchase and then track the percentage of them that made purchases in subsequent quarters.&lt;/p&gt;
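&lt;p&gt;Before building this in Tableau, it can help to see the underlying logic in plain code. The sketch below is illustrative only, with made-up orders rather than Superstore data: it assigns each customer a cohort from their first purchase quarter and computes what share of one cohort returned in a later quarter:&lt;/p&gt;

```python
from datetime import date

# Hypothetical orders: (customer_id, order_date)
orders = [
    ("c1", date(2014, 1, 15)), ("c2", date(2014, 2, 3)),
    ("c1", date(2014, 4, 20)), ("c3", date(2014, 5, 1)),
    ("c2", date(2014, 7, 9)),
]

def quarter(d):
    # Map a date to its (year, quarter) pair.
    return (d.year, (d.month - 1) // 3 + 1)

# Cohort = quarter of each customer's first (earliest) purchase.
first_purchase = {}
for cust, d in orders:
    first_purchase[cust] = min(first_purchase.get(cust, d), d)
cohort = {cust: quarter(d) for cust, d in first_purchase.items()}

# Retention: share of the 2014-Q1 cohort that purchased again in 2014-Q2.
q1_cohort = [c for c in cohort if cohort[c] == (2014, 1)]
active_q2 = {cust for cust, d in orders if quarter(d) == (2014, 2)}
retained = [c for c in q1_cohort if c in active_q2]
rate = 100 * len(retained) / len(q1_cohort)
print(f"2014-Q1 cohort: {len(q1_cohort)} customers, Q2 retention {rate:.0f}%")
```

&lt;p&gt;Tableau's FIXED level-of-detail expression, which you will meet shortly, plays the same role as the &lt;code&gt;first_purchase&lt;/code&gt; step here: pinning each customer to the date of their earliest order.&lt;/p&gt;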

&lt;h2&gt;
  
  
  Get Data
&lt;/h2&gt;

&lt;p&gt;The dataset that you are going to be using is the famous Superstore dataset. You can access it by downloading it from &lt;a href="https://www.kaggle.com/datasets/vivek468/superstore-dataset-final" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Open Tableau Public and connect to the Superstore data.&lt;/p&gt;

&lt;p&gt;Go ahead and analyse the rows and columns to know what kind of data about the superstore was documented.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddhkzvs1ltfhr7yqtrik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fddhkzvs1ltfhr7yqtrik.png" alt=" " width="720" height="303"&gt;&lt;/a&gt;&lt;br&gt;
Proceed to the Sheet 1 tab. This is going to be your workspace and where you’re going to subsequently create your visualisation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhc7owqgvqdsuwd7hd37u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhc7owqgvqdsuwd7hd37u.png" alt=" " width="720" height="433"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Data points of interest
&lt;/h2&gt;

&lt;p&gt;To carry out cohort analysis, you need the following data points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A unique identifier. In this case, your unique identifier is the Customer ID.&lt;/li&gt;
&lt;li&gt;First purchase date. This refers to the date the customer made their first purchase from the business and became a customer. This initial date is going to come in handy in creating a cohort group.&lt;/li&gt;
&lt;li&gt;Revenue data&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Create calculated fields
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Customers’ First purchase date(quarter)
&lt;/h3&gt;

&lt;p&gt;As mentioned earlier, the first purchase date is used in assigning a cohort to each customer.&lt;/p&gt;

&lt;p&gt;Since the first purchase date field is not readily available in the dataset, you are going to derive it yourself. This is known as a calculated field: a field that derives its values from calculations on other fields that are already present in the dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyt5y27i6dn5d4qq9qp5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyt5y27i6dn5d4qq9qp5.png" alt=" " width="406" height="580"&gt;&lt;/a&gt;&lt;br&gt;
Through the ‘first purchase’ calculated field, you’re going to create quarter-and-year cohorts, which will be assigned to customers depending on the date they made their first purchase.&lt;/p&gt;

&lt;p&gt;Earlier in this article, I defined a cohort as a group of people that share common characteristics like time. In this case, a cohort is going to be a group of customers that made their first purchase in the same quarter and same year.&lt;/p&gt;

&lt;p&gt;If you explored the Superstore dataset at the start, you might have realised that it spans four years (2014–2017). If you created month cohorts, the cohort table would become very big and hard to analyse at a glance, so I deemed it more practical to form cohorts based on the quarter and year in which a customer made their first purchase. That said, you can create time-based cohorts on other time parameters like day, week or month, and the method would work just the same.&lt;/p&gt;

&lt;p&gt;The calculation to establish the quarter in which a customer made their first purchase is as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DATE({ FIXED [Customer ID] : MIN(DATETRUNC('quarter', [Order Date])) })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wa2sv44t4mncpfg2r0a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wa2sv44t4mncpfg2r0a.png" alt=" " width="720" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Customers per first quarter
&lt;/h3&gt;

&lt;p&gt;This calculated field establishes the number of unique customers that made their first purchase in each quarter.&lt;/p&gt;

&lt;p&gt;This calculation builds on the first calculated field that you've just completed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ FIXED [Customers First Purchase Quarter] : COUNTD([Customer ID]) }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Retention Rate
&lt;/h3&gt;

&lt;p&gt;The retention rate is the share of a cohort that is still purchasing in a given quarter: the count of distinct customers active in that quarter divided by the total number of customers in the cohort.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;COUNTD([Customer ID])/SUM([Customers per First Quarter])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Assemble the cohort table
&lt;/h2&gt;

&lt;p&gt;This is about assembling the calculated fields you’ve created to form a cohort retention table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Click on the &lt;code&gt;Customers First Purchase&lt;/code&gt; field and drag it to the rows.&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/KeX1TrLM5tw"&gt;
&lt;/iframe&gt;
&lt;br&gt;
&lt;strong&gt;Step 2:&lt;/strong&gt; On the rows, click on the drop-down, then switch the specifications from year to quarter and from continuous to discrete.&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/sE7Mrv7Pc3c"&gt;
&lt;/iframe&gt;
&lt;br&gt;
&lt;strong&gt;Step 3:&lt;/strong&gt; Click on the &lt;code&gt;Order Date&lt;/code&gt; field and drag it to the columns. Click the drop-down in the columns to switch the specifications from year to quarter and from continuous to discrete.&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/xhqHdHADhx4"&gt;
&lt;/iframe&gt;
&lt;br&gt;
&lt;strong&gt;Step 4:&lt;/strong&gt; Drag the &lt;code&gt;Customers per first quarter&lt;/code&gt; field you created from the measures to the dimensions, and then drag it to the rows.&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/Mjx5RP4oXeQ"&gt;
&lt;/iframe&gt;
&lt;br&gt;
&lt;strong&gt;Step 5:&lt;/strong&gt; Click the &lt;code&gt;Retention Rate&lt;/code&gt; field you previously created and drag it to the text tile. Click the drop-down to format the number as a percentage with one decimal place.&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/ba6Orsc5qMY"&gt;
&lt;/iframe&gt;
&lt;br&gt;
&lt;strong&gt;Step 6:&lt;/strong&gt; Drag the &lt;code&gt;Retention Rate&lt;/code&gt; field from the text tile to the colour tile. Then drag the &lt;code&gt;Customer ID&lt;/code&gt; field to the tooltip tile. On the &lt;code&gt;Customer ID&lt;/code&gt; tooltip, click the drop-down to change from attribute to a measure of count distinct. Finally, make the values visible by clicking T on the toolbar.&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/sXfLa8paAwU"&gt;
&lt;/iframe&gt;
&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Bonus tip:&lt;/em&gt;&lt;/strong&gt; You can further customize the look of the table by going over to the &lt;a href="https://coolors.co/" rel="noopener noreferrer"&gt;coolors&lt;/a&gt; website to pick out some unique colours that appeal to you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Interpret the retention rate
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Going down the view
&lt;/h3&gt;

&lt;p&gt;Going down the view of the cohort table, the first and second columns show&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;different year and quarter groups (cohorts)&lt;/li&gt;
&lt;li&gt;the number of customers that made their first purchase in each of those periods.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Going across the view
&lt;/h3&gt;

&lt;p&gt;Going across the view (from the third column onwards), you see the percentage of customers that continued to make purchases at the superstore 0 through 15 quarters after making their first purchase.&lt;/p&gt;

&lt;p&gt;For example, 160 customers made their first purchase in 2014 Q2. Of those 160 customers, 24.4% and 36.3% of them came back to make purchases in 2014 Q3 and 2014 Q4 respectively. And so forth…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9h9ix5ydk3qs29es8cu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9h9ix5ydk3qs29es8cu.png" alt=" " width="720" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Creating calculated fields is unavoidable when carrying out cohort analysis in Tableau, and with calculated fields comes the need to use functions. If you are fairly new to Tableau functions, there may be some knowledge gaps for you to fill. I recommend you check out this Tableau article on &lt;a href="https://help.tableau.com/current/pro/desktop/en-us/functions.htm" rel="noopener noreferrer"&gt;functions&lt;/a&gt; to gain a deeper understanding of the subject.&lt;/p&gt;

</description>
      <category>dataanalysis</category>
      <category>tableau</category>
      <category>businessintelligence</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Problem solving with SQL: Case Study #1 — Danny’s Diner</title>
      <dc:creator>Massy</dc:creator>
      <pubDate>Thu, 30 Mar 2023 00:16:01 +0000</pubDate>
      <link>https://dev.to/alumassy/problem-solving-with-sql-case-study-1-dannys-diner-3oob</link>
      <guid>https://dev.to/alumassy/problem-solving-with-sql-case-study-1-dannys-diner-3oob</guid>
      <description>&lt;p&gt;Thank you Danny Ma for the excellent case study! You can find it &lt;a href="https://8weeksqlchallenge.com/case-study-1/" rel="noopener noreferrer"&gt;here&lt;/a&gt; and try it yourself. While at it, you should give Danny Ma a follow on &lt;a href="https://www.linkedin.com/in/datawithdanny/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; and support his posts if you aren’t already doing so!&lt;/p&gt;

&lt;p&gt;I’ve posted the solution to this case study as a raw SQL script file on &lt;a href="https://github.com/alumassy/Data-Analysis-Projects/blob/main/Case%20Study%20%231%20-%20Danny's%20Diner.sql" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Danny seriously loves Japanese food so at the beginning of 2021, he decides to embark upon a risky venture and opens up a cute little restaurant that sells his 3 favourite foods: sushi, curry and ramen.&lt;/p&gt;

&lt;p&gt;Danny’s Diner needs your assistance to help the restaurant stay afloat — the restaurant has captured some very basic data from its few months of operation but has no idea how to use its data to help them run the business.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem Statement
&lt;/h2&gt;

&lt;p&gt;Danny wants to use the data to answer a few simple questions about his customers, especially about their&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;visiting patterns,&lt;/li&gt;
&lt;li&gt;how much money they’ve spent, and&lt;/li&gt;
&lt;li&gt;which menu items are their favourite.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Having this deeper connection with his customers will help him deliver a better and more personalised experience for his loyal customers.&lt;/p&gt;

&lt;p&gt;He plans on using these insights to help him decide whether he should expand the existing customer loyalty program — additionally, he needs help to generate some basic datasets so his team can easily inspect the data without needing to use SQL.&lt;/p&gt;

&lt;p&gt;The data set contains the following 3 tables; you may refer to the relationship diagram below to understand how they are connected.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sales&lt;/code&gt;&lt;br&gt;
&lt;code&gt;members&lt;/code&gt;&lt;br&gt;
&lt;code&gt;menu&lt;/code&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Relational model
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpd0lhsz473taf5xwls5p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpd0lhsz473taf5xwls5p.png" alt=" " width="595" height="343"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Case Study Questions
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;What is the total amount each customer spent at the restaurant?&lt;/li&gt;
&lt;li&gt;How many days has each customer visited the restaurant?&lt;/li&gt;
&lt;li&gt;What was the first item from the menu purchased by each customer?&lt;/li&gt;
&lt;li&gt;What is the most purchased item on the menu and how many times was it purchased by all customers?&lt;/li&gt;
&lt;li&gt;Which item was the most popular for each customer?&lt;/li&gt;
&lt;li&gt;Which item was purchased first by the customer after they became a member?&lt;/li&gt;
&lt;li&gt;Which item was purchased just before the customer became a member?&lt;/li&gt;
&lt;li&gt;What are the total items and amount spent for each member before they became a member?&lt;/li&gt;
&lt;li&gt;If each $1 spent equates to 10 points and sushi has a 2x points multiplier — how many points would each customer have?&lt;/li&gt;
&lt;li&gt;In the first week after a customer joins the program (including their join date) they earn 2x points on all items, not just sushi — how many points do customers A and B have at the end of January?&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;I used &lt;strong&gt;MySQL Workbench&lt;/strong&gt; and these are the particular functions I employed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregate functions — SUM, COUNT&lt;/li&gt;
&lt;li&gt;Joins — Inner join, left join&lt;/li&gt;
&lt;li&gt;Common table expressions (CTEs)&lt;/li&gt;
&lt;li&gt;Window function&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before attempting the questions, I used the Entity Relationship Diagram as a guide to determine the logical structure of the database. I then created a schema and tables, and inserted the respective values into those tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE SCHEMA dannys_diner;
USE dannys_diner;

CREATE TABLE menu (
  product_id INT NOT NULL,
  product_name VARCHAR(5),
  price INT,
  PRIMARY KEY (product_id)
);

INSERT INTO menu
  (product_id, product_name, price)
VALUES
  ('1', 'sushi', '10'),
  ('2', 'curry', '15'),
  ('3', 'ramen', '12');

CREATE TABLE members (
  customer_id VARCHAR(1) NOT NULL,
  join_date DATE,
  PRIMARY KEY (customer_id)
);

INSERT INTO members
  (customer_id, join_date)
VALUES
  ('A', '2021-01-07'),
  ('B', '2021-01-09');

CREATE TABLE sales (
  customer_id VARCHAR(1) NOT NULL,
  order_date DATE,
  product_id INTEGER NOT NULL
);

INSERT INTO sales
  (customer_id, order_date, product_id)
VALUES
  ('A', '2021-01-01', '1'),
  ('A', '2021-01-01', '2'),
  ('A', '2021-01-07', '2'),
  ('A', '2021-01-10', '3'),
  ('A', '2021-01-11', '3'),
  ('A', '2021-01-11', '3'),
  ('B', '2021-01-01', '2'),
  ('B', '2021-01-02', '2'),
  ('B', '2021-01-04', '1'),
  ('B', '2021-01-11', '1'),
  ('B', '2021-01-16', '3'),
  ('B', '2021-02-01', '3'),
  ('C', '2021-01-01', '3'),
  ('C', '2021-01-01', '3'),
  ('C', '2021-01-07', '3');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhwag3g9xb1zhq7vpsxpm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhwag3g9xb1zhq7vpsxpm.png" alt=" " width="197" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Questions deep dive
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q1. What is the total amount each customer spent at the restaurant?&lt;/strong&gt;&lt;br&gt;
I used the SUM aggregate function with a GROUP BY clause to find the total amount spent by each customer. I added a JOIN because customer_id is from the sales table and price is from the menu table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT customer_id , SUM(price) amount_spent
FROM sales
LEFT JOIN menu 
  ON sales.product_id = menu.product_id
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0zvayywk1akhrph5wud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd0zvayywk1akhrph5wud.png" alt=" " width="195" height="101"&gt;&lt;/a&gt;&lt;br&gt;
Answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer A spent $76.&lt;/li&gt;
&lt;li&gt;Customer B spent $74.&lt;/li&gt;
&lt;li&gt;Customer C spent $36.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q2. How many days has each customer visited the restaurant?&lt;/strong&gt;&lt;br&gt;
I wrapped the &lt;code&gt;COUNT&lt;/code&gt; function around the &lt;code&gt;DISTINCT&lt;/code&gt; keyword to find the number of distinct days on which each customer visited the restaurant.&lt;/p&gt;

&lt;p&gt;If I did not use &lt;code&gt;DISTINCT&lt;/code&gt; on &lt;code&gt;order_date&lt;/code&gt;, days could be double-counted. For example, if customer A visited the restaurant twice on '2021-01-07', that would have counted as 2 days instead of 1.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT customer_id, COUNT(DISTINCT(order_date)) no_of_visits
FROM sales
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F55e241igfiynvb4kr117.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F55e241igfiynvb4kr117.png" alt=" " width="192" height="105"&gt;&lt;/a&gt;&lt;br&gt;
Answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer A visited 4 times.&lt;/li&gt;
&lt;li&gt;Customer B visited 6 times.&lt;/li&gt;
&lt;li&gt;Customer C visited 2 times.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Q3. What was the first item from the menu purchased by each customer?&lt;/strong&gt;&lt;br&gt;
I first ran a query to find the earliest order_date, then used that date ('2021-01-01') to filter for only the purchases made on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT customer_id , product_name, order_date
FROM sales
LEFT JOIN menu 
  ON sales.product_id = menu.product_id
WHERE order_date = '2021-01-01' 
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnv1dx2gugaiiz0uhj4vn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnv1dx2gugaiiz0uhj4vn.png" alt=" " width="299" height="113"&gt;&lt;/a&gt;&lt;br&gt;
Answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer A’s first order was sushi.&lt;/li&gt;
&lt;li&gt;Customer B’s first order was curry.&lt;/li&gt;
&lt;li&gt;Customer C’s first order was ramen.&lt;/li&gt;
&lt;/ul&gt;
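&lt;p&gt;Hard-coding '2021-01-01' works for this data set, but a more general sketch ranks each customer’s orders by date with a window function (assuming MySQL 8.0+):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Rank each customer's orders by date, then keep the earliest
WITH ranked AS (
  SELECT s.customer_id, menu.product_name, s.order_date,
    DENSE_RANK() OVER (PARTITION BY s.customer_id ORDER BY s.order_date) rnk
  FROM sales s
  LEFT JOIN menu
    ON s.product_id = menu.product_id
)
SELECT DISTINCT customer_id, product_name, order_date
FROM ranked
WHERE rnk = 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;DENSE_RANK keeps every item bought on a customer’s earliest date (for example, Customer A’s sushi and curry), and DISTINCT collapses repeat purchases of the same item on that date.&lt;/p&gt;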

&lt;p&gt;&lt;strong&gt;Q4. What is the most purchased item on the menu and how many times was it purchased by all customers?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT product_name, COUNT(product_name) times_purchased
FROM sales
LEFT JOIN menu 
  ON sales.product_id = menu.product_id
GROUP BY product_name
ORDER BY times_purchased DESC
LIMIT 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqoswx28rtnsizwx42gtj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqoswx28rtnsizwx42gtj.png" alt=" " width="262" height="77"&gt;&lt;/a&gt;&lt;br&gt;
Answer:&lt;br&gt;
The most purchased item on the menu is ramen, which was purchased 8 times by all customers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q5. Which item was the most popular for each customer?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT customer_id, product_name, COUNT(product_name) times_purchased
FROM sales
LEFT JOIN menu 
  ON sales.product_id = menu.product_id
GROUP BY customer_id, product_name
ORDER BY times_purchased DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8plkfg9ghmarips6amhj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8plkfg9ghmarips6amhj.png" alt=" " width="300" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customers A and C’s favourite item was ramen.&lt;/li&gt;
&lt;li&gt;Customer B equally enjoyed all items on the menu.&lt;/li&gt;
&lt;/ul&gt;
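&lt;p&gt;The query above returns every customer/item count and leaves you to read off the favourites. A sketch of a variant that isolates only the top item(s) per customer with a window function (assuming MySQL 8.0+):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Rank each customer's items by purchase count, then keep rank 1
WITH purchase_counts AS (
  SELECT s.customer_id, menu.product_name, COUNT(*) times_purchased
  FROM sales s
  LEFT JOIN menu
    ON s.product_id = menu.product_id
  GROUP BY s.customer_id, menu.product_name
)
SELECT customer_id, product_name, times_purchased
FROM (
  SELECT pc.*,
    RANK() OVER (PARTITION BY customer_id ORDER BY times_purchased DESC) rnk
  FROM purchase_counts pc
) ranked
WHERE rnk = 1;
-- ties, like Customer B's, all receive rank 1, so each favourite is kept
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;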

&lt;p&gt;&lt;strong&gt;Q6. Which item was purchased first by the customer after they became a member?&lt;/strong&gt;&lt;br&gt;
Only two customers were members. I ran independent queries to find out the first item they purchased.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Customer A
SELECT customer_id, order_date, product_name 
FROM sales
LEFT JOIN menu 
  ON sales.product_id = menu.product_id
WHERE customer_id = 'A' AND order_date &amp;gt; '2021-01-07' -- date after membership
ORDER BY order_date
LIMIT 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyryq3hgp4bhjj9udv0el.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyryq3hgp4bhjj9udv0el.png" alt=" " width="283" height="68"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Customer B
SELECT customer_id, order_date, product_name 
FROM sales
LEFT JOIN menu 
  ON sales.product_id = menu.product_id
WHERE customer_id = 'B' AND order_date &amp;gt; '2021-01-09' -- date after membership
ORDER BY order_date
LIMIT 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft58az4ozj4rwxaqr5qhm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft58az4ozj4rwxaqr5qhm.png" alt=" " width="274" height="73"&gt;&lt;/a&gt;&lt;br&gt;
Answer:&lt;br&gt;
After Customer A became a member, his/her first order was ramen, whereas it was sushi for Customer B.&lt;/p&gt;
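&lt;p&gt;The two member queries can also be combined into one by joining the members table and ranking each member’s orders after their own join date, a sketch assuming MySQL 8.0+:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- First order per member strictly after their join date
WITH post_join AS (
  SELECT s.customer_id, s.order_date, menu.product_name,
    ROW_NUMBER() OVER (PARTITION BY s.customer_id ORDER BY s.order_date) rn
  FROM sales s
  INNER JOIN members m
    ON s.customer_id = m.customer_id
  LEFT JOIN menu
    ON s.product_id = menu.product_id
  WHERE s.order_date &amp;gt; m.join_date
)
SELECT customer_id, order_date, product_name
FROM post_join
WHERE rn = 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;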

&lt;p&gt;&lt;strong&gt;Q7. Which item was purchased just before the customer became a member?&lt;/strong&gt;&lt;br&gt;
I took the same approach here: because only two customers were members, I ran independent queries to find the item each purchased just before becoming a member.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Customer A
SELECT customer_id, order_date, product_name 
FROM sales
LEFT JOIN menu 
  ON sales.product_id = menu.product_id
WHERE customer_id = 'A' AND order_date &amp;lt; '2021-01-07' -- dates before membership
ORDER BY order_date DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdk8od5zree3av4kbs8wr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdk8od5zree3av4kbs8wr.png" alt=" " width="292" height="90"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Customer B?
SELECT customer_id, order_date, product_name 
FROM sales
LEFT JOIN menu 
  ON sales.product_id = menu.product_id
WHERE customer_id = 'B' AND order_date &amp;lt; '2021-01-09' -- get dates before membership
ORDER BY order_date DESC -- to capture closest date before membership
LIMIT 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffajmurs4euhy4manxm9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffajmurs4euhy4manxm9t.png" alt=" " width="291" height="78"&gt;&lt;/a&gt;&lt;br&gt;
Answer:&lt;br&gt;
Customer A’s order before he/she became a member was sushi and curry whereas Customer B’s order was sushi.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q8. What are the total items and amount spent for each member before they became a member?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Customer A
SELECT customer_id, order_date, COUNT(product_name) total_items, SUM(price) amount_spent
FROM sales
LEFT JOIN menu 
  ON sales.product_id = menu.product_id
WHERE customer_id = 'A' AND order_date &amp;lt; '2021-01-07' -- get dates before membership
GROUP BY customer_id
ORDER BY order_date;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyq5lvrnao5njvvsjjrsr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyq5lvrnao5njvvsjjrsr.png" alt=" " width="345" height="67"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Customer B
SELECT customer_id, order_date, COUNT(product_name) total_items, SUM(price) amount_spent
FROM sales
LEFT JOIN menu 
  ON sales.product_id = menu.product_id
WHERE customer_id = 'B' AND order_date &amp;lt; '2021-01-09' -- dates before membership
GROUP BY customer_id
ORDER BY order_date;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vbqv7w58s8vuk855lwm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7vbqv7w58s8vuk855lwm.png" alt=" " width="353" height="48"&gt;&lt;/a&gt;&lt;br&gt;
Answer: Before becoming members,&lt;br&gt;
Customer A spent $25 on 2 items.&lt;br&gt;
Customer B spent $40 on 3 items.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q9. If each $1 spent equates to 10 points and sushi has a 2x points multiplier — how many points would each customer have?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s break down the question.&lt;/p&gt;

&lt;p&gt;Each $1 spent = 10 points.&lt;br&gt;
But sushi (product_id 1) gets 2x points, meaning each $1 spent on sushi = 20 points.&lt;br&gt;
So, we use CASE WHEN to create the conditional logic:&lt;/p&gt;

&lt;p&gt;If the product is sushi, multiply every $1 of its price by 20 points.&lt;br&gt;
For all other products, multiply every $1 of its price by 10 points.&lt;br&gt;
The result below shows the new column, total_points.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT customer_id,
SUM(CASE
    WHEN product_name = 'sushi' THEN 20 * price
    ELSE 10 * price
END) total_points
FROM sales
LEFT JOIN menu 
  ON sales.product_id = menu.product_id
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3fe2t5jd9ix1yen17vp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3fe2t5jd9ix1yen17vp.png" alt=" " width="192" height="107"&gt;&lt;/a&gt;&lt;br&gt;
Answer:&lt;br&gt;
Total points for Customer A, B and C are 860, 940 and 360 respectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q10. If the first week after a customer joins the program (including their &lt;code&gt;join date&lt;/code&gt;) they earn 2x points on all items, not just sushi — how many points do customers A and B have at the end of January?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The build up to my final query&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Found out each customer’s offer validity window (which ends 6 days after &lt;code&gt;join_date&lt;/code&gt;, inclusive of &lt;code&gt;join_date&lt;/code&gt;) and noted the last day of January 2021 ('2021-01-31'). I made the result of this query a CTE because I was going to query further from it in the following CASE WHEN statement.&lt;/li&gt;
&lt;li&gt;Used CASE WHEN to allocate points by dates and &lt;code&gt;product_name&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Filtered by the first day of February to get only &lt;code&gt;points&lt;/code&gt; that apply to January.&lt;/li&gt;
&lt;li&gt;Wrapped the CASE WHEN statement in the SUM function to add up the points for each customer. At this point I dropped all the columns originally present in my CTE except &lt;code&gt;customer_id&lt;/code&gt;, because retaining them would not display meaningful group values while grouping by &lt;code&gt;customer_id&lt;/code&gt; alone; I had only kept them earlier to check that my query results were right.
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH cte_OfferValidity AS 
    (SELECT s.customer_id, m.join_date, s.order_date,
        DATE_ADD(m.join_date, INTERVAL 6 DAY) firstweek_ends, menu.product_name, menu.price
    FROM sales s
    LEFT JOIN members m
      ON s.customer_id = m.customer_id
    LEFT JOIN menu
        ON s.product_id = menu.product_id)
SELECT customer_id,
    SUM(CASE
            WHEN order_date BETWEEN join_date AND firstweek_ends THEN 20 * price 
            WHEN (order_date NOT BETWEEN join_date AND firstweek_ends) AND product_name = 'sushi' THEN 20 * price
            ELSE 10 * price
        END) points
FROM cte_OfferValidity
WHERE order_date &amp;lt; '2021-02-01' -- filter jan points only
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1g43aihwe3ir6gphxfo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1g43aihwe3ir6gphxfo.png" alt=" " width="164" height="104"&gt;&lt;/a&gt;&lt;br&gt;
Answer:&lt;br&gt;
Customer A and Customer B have 1370 points and 820 points respectively by the end of January 2021.&lt;/p&gt;
&lt;h2&gt;
  
  
  Bonus Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Join All The Things&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Recreate the table with&lt;/strong&gt;: &lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;order_date&lt;/code&gt;, &lt;code&gt;product_name&lt;/code&gt;, &lt;code&gt;price&lt;/code&gt;, &lt;code&gt;member&lt;/code&gt; (Y/N)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT s.customer_id, order_date, menu.product_name, menu.price, 
CASE
  WHEN s.order_date &amp;gt;= m.join_date THEN 'Y'
  ELSE 'N'
END AS member
FROM sales s
LEFT JOIN menu 
  ON s.product_id = menu.product_id
LEFT JOIN members m
  ON s.customer_id = m.customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxqoiysihbvw53o89aor.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxqoiysihbvw53o89aor.png" alt=" " width="371" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rank All The Things&lt;/strong&gt;&lt;br&gt;
Danny also requires further information about the ranking of customer products, but he purposely does not need the ranking for non-member purchases so he expects null ranking values for the records when customers are not yet part of the loyalty program.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH cte AS
  (SELECT s.customer_id, order_date, menu.product_name, menu.price, 
    CASE
      WHEN s.order_date &amp;gt;= m.join_date THEN 'Y'
      ELSE 'N'
    END AS member
  FROM sales s
  LEFT JOIN menu 
    ON s.product_id = menu.product_id
  LEFT JOIN members m
    ON s.customer_id = m.customer_id)
SELECT *, 
  CASE
    WHEN member = 'N' THEN NULL 
    ELSE RANK() OVER w
  END AS ranking
FROM cte
WINDOW w AS (PARTITION BY customer_id, member ORDER BY order_date);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
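&lt;p&gt;As a side note, window functions and the &lt;code&gt;WINDOW&lt;/code&gt; clause are only available from MySQL 8.0 onward. The named window can equally be written inline, which keeps the definition next to its single use; this is just a sketch assuming the same &lt;code&gt;cte&lt;/code&gt; as above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT *,
  CASE
    WHEN member = 'N' THEN NULL
    ELSE RANK() OVER (PARTITION BY customer_id, member ORDER BY order_date)
  END AS ranking
FROM cte;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;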



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yf2ufoomlqf7pjhjw1b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7yf2ufoomlqf7pjhjw1b.png" alt=" " width="461" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary of insights
&lt;/h2&gt;

&lt;p&gt;From the analysis, I discovered a few interesting insights that would certainly be useful for Danny.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer B is the most frequent visitor with 6 visits in Jan 2021.&lt;/li&gt;
&lt;li&gt;Danny’s Diner’s most popular item is ramen.&lt;/li&gt;
&lt;li&gt;Customers A and C love ramen, whereas Customer B seems to enjoy sushi, curry and ramen equally.&lt;/li&gt;
&lt;li&gt;Customer A was the first member of Danny’s Diner, and their first order after joining was curry.&lt;/li&gt;
&lt;li&gt;Before they became members, Customer A and Customer B spent $25 and $40 respectively.&lt;/li&gt;
&lt;li&gt;Throughout Jan 2021, Customer A, Customer B and Customer C earned 860, 940 and 360 points respectively.&lt;/li&gt;
&lt;li&gt;Assuming that members earn 2x points on all items (not just sushi) during their first week of membership, Customer A has 1370 points and Customer B has 820 points by the end of Jan 2021.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s a wrap!&lt;/p&gt;




&lt;p&gt;Feel free to share your opinion about my analysis in the comments. Suggestions on how to optimize my SQL code for performance are also welcome.&lt;/p&gt;

&lt;p&gt;Happy querying folks 👋&lt;/p&gt;

</description>
      <category>sql</category>
      <category>mysql</category>
      <category>database</category>
      <category>8weeksqlchallenge</category>
    </item>
  </channel>
</rss>
