From Data Warehouses to Big Data Systems
In the 1980s, data warehouses evolved as a way to separate operational reporting, which requires read-heavy querying across a full dataset, from the application’s transactional database, which focuses on fast reads and writes over a much smaller set of records. Data warehouses are still relational databases, but they have been optimized for reporting and analytics.
How does reporting differ from transactions?
Workload characteristics
Access pattern: Reporting is read-heavy across large historical datasets while transactions are frequent, small, and write-heavy on a narrow slice of recent records.
Latency tolerance: For reporting, latency of seconds to minutes is often acceptable when the query is complex. For transactions, sub-second latency is required to keep user actions snappy.
Concurrency profile: Transactions involve intense concurrent writes with strict isolation and locking, while reporting involves high concurrent reads, often from many analysts and BI tools.
Online analytical processing (OLAP) databases are optimized for reporting, whereas online transaction processing (OLTP) databases are designed for transactions, such as creating an order or making an ATM withdrawal.
Data modeling differences
Normalization vs denormalization:
OLTP: Highly normalized (3NF) to avoid anomalies and reduce write amplification.
OLAP: Denormalized star/snowflake schemas reduce join depth and accelerate scans; dimensional modeling (facts + dimensions) enables efficient aggregations.
Redundancy:
OLTP: Minimizes redundancy to keep updates atomic and consistent.
OLAP: Intentional redundancy (materialized aggregates, precomputed rollups) trades storage efficiency for query speed; a short sketch of the dimensional pattern follows below.
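To make the dimensional modeling above concrete, here is a minimal PySpark sketch of a star-schema query: one shallow join from a fact table to a dimension, then a wide aggregation. The table paths and column names (fact_sales, dim_product, amount, category) are illustrative assumptions, not a prescribed schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("star-schema-demo").getOrCreate()

# Hypothetical star schema: a wide fact table plus a small product dimension.
fact_sales = spark.read.parquet("/warehouse/fact_sales")    # order_id, product_key, sale_date, amount
dim_product = spark.read.parquet("/warehouse/dim_product")  # product_key, category, brand

# The typical star-schema query shape: a shallow fact-to-dimension join
# followed by a heavy group-by over the fact rows.
revenue_by_category = (
    fact_sales
    .join(dim_product, "product_key")
    .groupBy("category")
    .agg(F.sum("amount").alias("total_revenue"))
)

revenue_by_category.show()
```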
Storage and execution
Row vs column orientation:
OLTP (usually row-store): Optimized for point lookups and small updates.
OLAP (usually column-store): Optimized for scanning a few columns across many rows, with compression (RLE, dictionary encoding) and vectorized execution.
Join behavior:
OLTP: Short, selective joins.
OLAP: Broad joins and group-bys over billions of rows; benefit from columnar pruning, predicate pushdown, and distributed execution.
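A minimal sketch of the column-store behavior described above, using Parquet (a columnar format) in PySpark. The table path and columns are assumptions; the point is that only the referenced columns are read and the filter is pushed down to the scan, which explain() makes visible.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("columnar-scan-demo").getOrCreate()

# Hypothetical wide fact table stored as Parquet.
events = spark.read.parquet("/warehouse/fact_events")

# Only event_date and event_type are read from disk, and the equality
# filter is pushed down to the Parquet scan rather than applied afterwards.
daily_clicks = (
    events
    .select("event_date", "event_type")
    .filter(F.col("event_type") == "click")
    .groupBy("event_date")
    .count()
)

# The physical plan shows the pruned column list and PushedFilters.
daily_clicks.explain()
```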
Business implications
OLTP example: Processing an ATM withdrawal requires ACID guarantees, low latency, and minimal locking.
OLAP example: Quarterly revenue by segment across 3 years requires scanning and aggregating billions of records, with complex dimensional filters and time-series logic.
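The OLAP example above maps naturally onto a wide scan plus group-by. A rough PySpark sketch, assuming a hypothetical fact_orders table with order_date, segment, and revenue columns:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("quarterly-revenue-demo").getOrCreate()

# Hypothetical fact table: order_date, segment, revenue.
orders = spark.read.parquet("/warehouse/fact_orders")

quarterly_revenue = (
    orders
    .filter(F.col("order_date") >= F.add_months(F.current_date(), -36))  # last 3 years
    .withColumn("quarter", F.date_trunc("quarter", F.col("order_date")))
    .groupBy("quarter", "segment")
    .agg(F.sum("revenue").alias("revenue"))
    .orderBy("quarter", "segment")
)

quarterly_revenue.show()
```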
The extract, transform, and load (ETL) process was introduced to extract data from OLTP databases, transform it, and then load it into the data warehouse.
Why do extract and transform (ET) matter before load?
Core reasons not to load OLTP data directly into OLAP
Schema mismatch: OLTP schemas are normalized and operationally focused; OLAP schemas need curated dimensions, conformed keys, time grains, and slowly changing dimension handling.
Data quality and semantics: Raw operational data contains duplicates, missing values, status codes, and late arrivals that must be reconciled, deduped, and standardized.
Business context alignment: Transformations encode business rules (FX conversion, tax logic, product hierarchies) so analytics reflect “one version of truth.”
ET use cases
Customer 360:
Extract: CRM, support tickets, web events.
Transform: Resolve identities, dedupe, conform dimensions, enrich with marketing segments.
Load: Star schema for churn prediction and lifetime value reporting.
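A hedged sketch of the transform step in this Customer 360 flow: standardize the join key, dedupe to the latest record per customer, and enrich with a segment before loading. The extracts (crm_contacts, web_profiles), paths, and column names are illustrative assumptions.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("customer-360-et-demo").getOrCreate()

# Hypothetical raw extracts from CRM and web analytics.
crm = spark.read.parquet("/raw/crm_contacts")   # email, name, updated_at
web = spark.read.parquet("/raw/web_profiles")   # email, segment

# Transform: standardize the key, keep only the latest CRM record per email.
crm_clean = (
    crm
    .withColumn("email", F.lower(F.trim(F.col("email"))))
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("email").orderBy(F.col("updated_at").desc())))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

# Enrich with the marketing segment, then load the conformed dimension.
dim_customer = crm_clean.join(
    web.withColumn("email", F.lower(F.trim(F.col("email")))),
    "email", "left")

dim_customer.write.mode("overwrite").parquet("/warehouse/dim_customer")
```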
Financial reporting:
Extract: Orders and payments from OLTP systems in multiple currencies.
Transform: Apply exchange rates at transaction time, recognize revenue, handle refunds/chargebacks.
Load: Fact tables with daily granularity for P&L and auditing.
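A minimal sketch of the currency-conversion transform for this flow, assuming hypothetical payments and fx_rates extracts keyed by currency and date; joining on the rate in effect at transaction time is the business rule being encoded.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fx-conversion-et-demo").getOrCreate()

# Hypothetical extracts: payments in local currency plus a daily FX rate table.
payments = spark.read.parquet("/raw/payments")   # payment_id, amount, currency, payment_date
fx_rates = spark.read.parquet("/raw/fx_rates")   # currency, rate_date, usd_rate

# Transform: join on the rate that applied on the transaction date so
# reported revenue reflects the exchange rate at transaction time.
fact_payments = (
    payments
    .join(fx_rates,
          (payments.currency == fx_rates.currency) &
          (payments.payment_date == fx_rates.rate_date))
    .withColumn("amount_usd", F.col("amount") * F.col("usd_rate"))
    .select("payment_id", "payment_date", "amount_usd")
)

fact_payments.write.mode("overwrite").parquet("/warehouse/fact_payments")
```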
Supply chain analytics:
Extract: Inventory, shipments, supplier SLAs.
Transform: Normalize units, compute lead times, flag exceptions.
Load: Aggregates by warehouse, SKU, and week for planning dashboards.
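A small sketch of the supply-chain transform and weekly rollup, assuming a hypothetical shipments extract with mixed weight units and order/delivery timestamps:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("supply-chain-et-demo").getOrCreate()

# Hypothetical extract: sku, warehouse, weight, weight_unit, order_ts, delivered_ts.
shipments = spark.read.parquet("/raw/shipments")

weekly = (
    shipments
    # Transform: normalize pounds to kilograms and compute lead time in days.
    .withColumn("weight_kg",
                F.when(F.col("weight_unit") == "lb", F.col("weight") * 0.4536)
                 .otherwise(F.col("weight")))
    .withColumn("lead_time_days",
                F.datediff(F.col("delivered_ts"), F.col("order_ts")))
    # Load: aggregate by warehouse, SKU, and week for planning dashboards.
    .groupBy("warehouse", "sku", F.date_trunc("week", "order_ts").alias("week"))
    .agg(F.avg("lead_time_days").alias("avg_lead_time_days"),
         F.sum("weight_kg").alias("total_weight_kg"))
)

weekly.write.mode("overwrite").parquet("/warehouse/supply_chain_weekly")
```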
The rise of the Internet brought new data to be collected and analyzed. Even with a data warehouse dedicated to analysis, keeping up with the volume and velocity of incoming data created database bottlenecks.
Bottlenecks in OLAP with rising velocity and variety
Ingestion pressure: Rapid event streams cause batch windows to overrun; nightly jobs fail to complete before business hours. Example: a retailer’s web clickstream and POS data doubling during festivals breaks the daily refresh SLA.
Schema drift and late-arriving dimensions: New fields, changing enums, or delayed customer updates invalidate downstream joins, producing incomplete dashboards. Example: marketing adds “campaign_subtype” mid-quarter; reports need backfill and reprocessing.
High-cardinality explosions: Dimensional attributes like user_id, session_id, or SKU variants produce enormous numbers of distinct values, slowing group-bys and increasing memory pressure. Example: ad-tech impression logs with billions of unique device IDs.
Concurrency and mixed workloads: BI tools launch many ad-hoc queries; without workload management, compute thrashes and queues grow. Example: end-of-month finance closes saturate warehouse resources, delaying operational analytics.
Small files and fragmentation: Streaming ingestion creates many tiny files, harming scan efficiency and metadata operations. Example: hourly micro-batches to object storage create millions of small Parquet files, degrading query planning.
Data freshness vs compute cost: Demanding near-real-time KPIs increases incremental build frequency, stressing storage IO and caches.
Administrators could scale vertically (i.e., increase the size and speed of the database), but there wasn't an easy way to scale horizontally (i.e., to distribute the load across multiple databases).
Vertical vs horizontal scaling in OLAP context
Vertical scaling (scale-up):
Definition: Add more CPU/RAM/storage to a single warehouse node or a monolithic appliance.
Pros: Simple operations, no data redistribution, stronger single-query performance up to hardware limits.
Cons: Diminishing returns, expensive hardware, single failure domain, limited concurrency headroom.
Horizontal scaling (scale-out):
Definition: Distribute data and query execution across many nodes; partition and replicate data; use parallel processing.
Pros: Near-linear gains in throughput and concurrency, fault tolerance via replication, elastic capacity, better data locality.
Cons: Requires partitioning strategy, shuffle-aware design, and coordination overhead.
Big data systems and frameworks addressed this shortcoming in the 2000s. These frameworks were designed to distribute data across multiple nodes and to handle node failures automatically.
Concrete benefits of horizontal scaling and partitioning (e.g., in Apache Spark)
Parallelism and throughput:
Benefit: Partitions enable tasks to run concurrently across executors, dramatically reducing wall-clock time for scans, joins, and aggregates.
Solves what scale-up cannot: When dataset size exceeds single-node memory/IO, parallel processing sustains performance at scale.
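As a rough illustration of partition-level parallelism: each partition becomes a task that can run on a separate executor core. The table path and the partition count of 200 are arbitrary assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallelism-demo").getOrCreate()

# Hypothetical large fact table.
events = spark.read.parquet("/warehouse/fact_events")

# Each partition maps to a task, so the scan-and-aggregate below runs
# concurrently across however many executor cores the cluster provides.
print("partitions:", events.rdd.getNumPartitions())

totals = events.repartition(200).groupBy("event_type").count()
totals.show()
```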
Pruning and locality:
Benefit: Partition pruning and predicate pushdown avoid scanning irrelevant data; co-locating compute with storage reduces network IO.
Solves: Scale-up still scans full datasets; scale-out skips partitions (e.g., date=2025-11-07).
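A minimal pruning sketch, assuming a hypothetical events table written with Hive-style date partitions; the filter on the partition column lets Spark list and scan only the matching directory.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition-pruning-demo").getOrCreate()

# Hypothetical layout: /warehouse/events/date=2025-11-07/part-*.parquet
events = spark.read.parquet("/warehouse/events")

# Filtering on the partition column prunes every other date directory.
one_day = events.filter(F.col("date") == "2025-11-07")

one_day.groupBy("event_type").count().explain()  # plan shows PartitionFilters
```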
Handling skew and hotspots:
Benefit: Techniques like salting, adaptive query execution, and repartitioning redistribute skewed keys to balance work.
Solves: A single powerful node can’t fix skew—one hot key still dominates; distributed strategies mitigate tail latency.
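A hand-rolled salting sketch (in practice, Spark's adaptive query execution can also split skewed join partitions automatically). The tables, the user_id key, and the bucket count of 16 are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salting-demo").getOrCreate()

clicks = spark.read.parquet("/warehouse/clicks")  # hypothetical; skewed on user_id
users = spark.read.parquet("/warehouse/users")    # smaller dimension keyed by user_id

SALT_BUCKETS = 16  # arbitrary; tune to the observed skew

# Salt the skewed side so one hot user_id is spread over many partitions,
# and replicate the other side so every salted key still finds its match.
clicks_salted = clicks.withColumn(
    "salt", (F.rand() * SALT_BUCKETS).cast("int"))
users_salted = users.crossJoin(
    spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt")))

joined = clicks_salted.join(users_salted, ["user_id", "salt"]).drop("salt")
```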
Resilience and elasticity:
Benefit: Task retries, speculative execution, and node replacement maintain progress despite failures; clusters can be right-sized on demand.
Solves: Single-node outages or resource saturation halt scale-up systems.
File and metadata management:
Benefit: Compaction, clustering, and partitioned layouts (e.g., Hive-style or Iceberg/Delta partition specs) reduce small-file overhead and speed planning.
Solves: Monolithic warehouses suffer from fragmented storage; distributed engines can rewrite and optimize layouts iteratively.
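A simple compaction sketch for plain Parquet (table formats such as Delta Lake and Apache Iceberg ship their own compaction utilities); the paths and the target partition count of 64 are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compaction-demo").getOrCreate()

# Hypothetical landing zone full of tiny Parquet files from micro-batches.
small_files = spark.read.parquet("/lake/events_landing")

# Rewrite into a modest number of larger files, partitioned by date,
# so later scans touch far fewer objects and plan much faster.
(small_files
 .repartition(64, "date")  # arbitrary; size partitions for large output files
 .write
 .mode("overwrite")
 .partitionBy("date")
 .parquet("/lake/events_compacted"))
```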
Streaming + batch unification:
Benefit: Micro-batch or continuous processing handles high-velocity data while preserving exactly-once semantics and incremental merges (upserts).
Solves: Traditional scale-up batch windows can’t meet freshness SLAs for near-real-time KPIs.
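A minimal Structured Streaming micro-batch sketch that appends to a partitioned Parquet table; the file sink plus checkpoint gives exactly-once output, while true upserts generally require a table format such as Delta, Iceberg, or Hudi. The schema, paths, and trigger interval are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Hypothetical JSON events landing continuously in object storage.
events = (spark.readStream
          .schema("user_id STRING, amount DOUBLE, event_ts TIMESTAMP")
          .json("/lake/events_raw"))

# Micro-batch the stream into a partitioned Parquet table; the checkpoint
# tracks progress so each input file is processed exactly once.
query = (events
         .withColumn("date", F.to_date("event_ts"))
         .writeStream
         .format("parquet")
         .option("path", "/lake/events_bronze")
         .option("checkpointLocation", "/lake/_checkpoints/events_bronze")
         .partitionBy("date")
         .trigger(processingTime="1 minute")
         .start())

# query.awaitTermination()  # block until the stream is stopped
```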
These frameworks also allowed big data systems to handle many ETL transformations, which increased the speed at which analysis could be done.
Why ETL speed matters to analytics and reporting
Operational decisioning:
Use case: Inventory restock recommendations every hour; faster ETL prevents stockouts and reduces carrying costs.
Revenue-critical personalization:
Use case: E-commerce updates recommendations and pricing with live signals; delays reduce conversion and average order value.
Risk and fraud mitigation:
Use case: Banking flags anomalous transactions within minutes; slow ETL increases exposure and chargebacks.
Ad bidding and campaign optimization:
Use case: Real-time attribution and budget pacing; stale aggregates waste spend and underdeliver on ROAS.
Executive dashboards and SLAs:
Use case: Daily sales close by 8 AM; if ETL overruns, leadership decisions are made on outdated numbers, eroding trust.
Customer support and SRE analytics:
Use case: Incident triage relies on fresh telemetry; delays prolong outages and degrade user experience.
Summary
Reporting demands wide scans, denormalized models, columnar storage, and tolerance for moderate latency, which conflicts with OLTP’s normalized, low-latency transactional updates. Extract and transform (not just load) ensure trustworthy, business-aligned analytics by reconciling semantics, quality, and model shape before loading. As data velocity and variety grow, OLAP faces ingestion, schema-drift, high-cardinality, concurrency, and small-file bottlenecks. Vertical scaling hits hard limits; horizontal scaling with partitioning and distributed execution (as in Spark) delivers parallelism, pruning, skew mitigation, resilience, and streaming-friendly pipelines. Faster ETL directly improves revenue, risk, operations, and trust in decision-making.