DEV Community

Abdelrahman Adnan
Abdelrahman Adnan

Posted on

Module 3 Summary - Data Warehousing & BigQuery

DataEngineeringZoomcamp #BigQuery #DataWarehouse #GCP


Part 1: Understanding Data Warehouses & OLAP vs OLTP 🏢

Why Do We Need Data Warehouses? 🤔

Imagine you run an online store. Your website has a database that handles:

  • Customer sign-ups
  • Product orders
  • Payment processing
  • Inventory updates

This database needs to be FAST because customers are waiting. Every millisecond counts!

Now, your boss asks: "What were our top-selling products last year by region, and how did that compare to the year before?"

Running that query on your production database would:

  1. Slow down your website (bad for customers!)
  2. Take forever because the database isn't designed for such complex analysis
  3. Potentially crash things if the query is too heavy

This is exactly why data warehouses exist! They're a separate place to store your data, specifically designed for answering complex analytical questions without affecting your live applications.

OLTP vs OLAP - The Two Worlds of Databases

These acronyms sound scary, but they're simple concepts:

OLTP = Online Transaction Processing (Your everyday app databases)
OLAP = Online Analytical Processing (Data warehouses for analysis)

Aspect OLTP (Transactional) OLAP (Analytical)
What it's for Running your app - orders, logins, updates Answering business questions - reports, dashboards
Type of queries Simple: "Get user #123's info" Complex: "Show sales trends by region for 5 years"
Speed Super fast for small operations Can take minutes for huge analyses
Data freshness Real-time, always up-to-date Usually updated daily/hourly (batch)
How data is organized Normalized (split into many tables, no duplicates) Denormalized (fewer tables, some duplication OK)
Data size Gigabytes (current data) Terabytes/Petabytes (years of history)
Who uses it Your application, customers Data analysts, managers, executives
Examples MySQL for your website, PostgreSQL for your app BigQuery, Snowflake, Amazon Redshift

Real-World Example 🛒

OLTP scenario (your app database):

-- A customer places an order - needs to be FAST
INSERT INTO orders (customer_id, product_id, quantity, price) 
VALUES (123, 456, 2, 29.99);
Enter fullscreen mode Exit fullscreen mode

OLAP scenario (data warehouse):

-- Your CEO wants to know Q4 performance - can take a minute, that's fine
SELECT 
    region,
    product_category,
    SUM(revenue) as total_revenue,
    COUNT(DISTINCT customer_id) as unique_customers
FROM sales_data
WHERE order_date BETWEEN '2023-10-01' AND '2023-12-31'
GROUP BY region, product_category
ORDER BY total_revenue DESC;
Enter fullscreen mode Exit fullscreen mode

💡 Key insight: OLTP is like a cashier at a store - fast, handles one customer at a time. OLAP is like the accounting department - takes time to analyze all the receipts and produce reports.

What Exactly is a Data Warehouse? 🏗️

A data warehouse is a centralized repository where you collect data from ALL your different systems and store it in a way that's optimized for analysis.

Think of it like this:

Imagine a company with multiple departments:

  • Sales team uses Salesforce
  • Marketing uses HubSpot
  • Website runs on PostgreSQL
  • Inventory managed in SAP

Each system has its own database. But your CEO wants a report combining data from ALL of them. This is where a data warehouse comes in!

┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│  Salesforce │   │   HubSpot   │   │  PostgreSQL │
│   (Sales)   │   │ (Marketing) │   │  (Website)  │
└──────┬──────┘   └──────┬──────┘   └──────┬──────┘
       │                 │                 │
       │    ETL/ELT      │                 │
       │   (Extract,     │                 │
       │   Transform,    │                 │
       │    Load)        │                 │
       ▼                 ▼                 ▼
┌──────────────────────────────────────────────────┐
│              DATA WAREHOUSE (BigQuery)            │
│                                                  │
│   All your data, cleaned, organized, ready       │
│   for analysis!                                  │
└──────────────────────────────────────────────────┘
                        │
                        ▼
         ┌──────────────────────────┐
         │  Reports, Dashboards,    │
         │  Machine Learning, etc.  │
         └──────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

Key characteristics of a data warehouse:

  • 📊 Subject-oriented - Organized by business topics (sales, customers, products)
  • 🔗 Integrated - Data from multiple sources combined together
  • 📅 Time-variant - Keeps historical data (years worth!)
  • 🔒 Non-volatile - Data doesn't change once loaded (it's a historical record)

Modern Cloud Data Warehouses ☁️

Traditional data warehouses (like Oracle, Teradata) required:

  • Buying expensive hardware
  • Hiring DBAs to manage servers
  • Months of setup time
  • Huge upfront costs

Modern cloud data warehouses (BigQuery, Snowflake, Redshift) changed everything:

  • ✅ No servers to manage (serverless)
  • ✅ Pay only for what you use
  • ✅ Scales automatically
  • ✅ Set up in minutes
  • ✅ Access from anywhere

Top comments (0)