DataEngineeringZoomcamp #BigQuery #DataWarehouse #GCP
Part 1: Understanding Data Warehouses & OLAP vs OLTP 🏢
Why Do We Need Data Warehouses? 🤔
Imagine you run an online store. Your website has a database that handles:
- Customer sign-ups
- Product orders
- Payment processing
- Inventory updates
This database needs to be FAST because customers are waiting. Every millisecond counts!
Now, your boss asks: "What were our top-selling products last year by region, and how did that compare to the year before?"
Running that query on your production database would:
- Slow down your website (bad for customers!)
- Take forever because the database isn't designed for such complex analysis
- Potentially crash things if the query is too heavy
This is exactly why data warehouses exist! They're a separate place to store your data, specifically designed for answering complex analytical questions without affecting your live applications.
OLTP vs OLAP - The Two Worlds of Databases
These acronyms sound scary, but they're simple concepts:
OLTP = Online Transaction Processing (Your everyday app databases)
OLAP = Online Analytical Processing (Data warehouses for analysis)
| Aspect | OLTP (Transactional) | OLAP (Analytical) |
|---|---|---|
| What it's for | Running your app - orders, logins, updates | Answering business questions - reports, dashboards |
| Type of queries | Simple: "Get user #123's info" | Complex: "Show sales trends by region for 5 years" |
| Speed | Super fast for small operations | Can take minutes for huge analyses |
| Data freshness | Real-time, always up-to-date | Usually updated daily/hourly (batch) |
| How data is organized | Normalized (split into many tables, no duplicates) | Denormalized (fewer tables, some duplication OK) |
| Data size | Gigabytes (current data) | Terabytes/Petabytes (years of history) |
| Who uses it | Your application, customers | Data analysts, managers, executives |
| Examples | MySQL for your website, PostgreSQL for your app | BigQuery, Snowflake, Amazon Redshift |
Real-World Example 🛒
OLTP scenario (your app database):
-- A customer places an order - needs to be FAST
INSERT INTO orders (customer_id, product_id, quantity, price)
VALUES (123, 456, 2, 29.99);
OLAP scenario (data warehouse):
-- Your CEO wants to know Q4 performance - can take a minute, that's fine
SELECT
region,
product_category,
SUM(revenue) as total_revenue,
COUNT(DISTINCT customer_id) as unique_customers
FROM sales_data
WHERE order_date BETWEEN '2023-10-01' AND '2023-12-31'
GROUP BY region, product_category
ORDER BY total_revenue DESC;
💡 Key insight: OLTP is like a cashier at a store - fast, handles one customer at a time. OLAP is like the accounting department - takes time to analyze all the receipts and produce reports.
What Exactly is a Data Warehouse? 🏗️
A data warehouse is a centralized repository where you collect data from ALL your different systems and store it in a way that's optimized for analysis.
Think of it like this:
Imagine a company with multiple departments:
- Sales team uses Salesforce
- Marketing uses HubSpot
- Website runs on PostgreSQL
- Inventory managed in SAP
Each system has its own database. But your CEO wants a report combining data from ALL of them. This is where a data warehouse comes in!
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Salesforce │ │ HubSpot │ │ PostgreSQL │
│ (Sales) │ │ (Marketing) │ │ (Website) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
│ ETL/ELT │ │
│ (Extract, │ │
│ Transform, │ │
│ Load) │ │
▼ ▼ ▼
┌──────────────────────────────────────────────────┐
│ DATA WAREHOUSE (BigQuery) │
│ │
│ All your data, cleaned, organized, ready │
│ for analysis! │
└──────────────────────────────────────────────────┘
│
▼
┌──────────────────────────┐
│ Reports, Dashboards, │
│ Machine Learning, etc. │
└──────────────────────────┘
Key characteristics of a data warehouse:
- 📊 Subject-oriented - Organized by business topics (sales, customers, products)
- 🔗 Integrated - Data from multiple sources combined together
- 📅 Time-variant - Keeps historical data (years worth!)
- 🔒 Non-volatile - Data doesn't change once loaded (it's a historical record)
Modern Cloud Data Warehouses ☁️
Traditional data warehouses (like Oracle, Teradata) required:
- Buying expensive hardware
- Hiring DBAs to manage servers
- Months of setup time
- Huge upfront costs
Modern cloud data warehouses (BigQuery, Snowflake, Redshift) changed everything:
- ✅ No servers to manage (serverless)
- ✅ Pay only for what you use
- ✅ Scales automatically
- ✅ Set up in minutes
- ✅ Access from anywhere
Top comments (0)