<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sahil Kakkar</title>
    <description>The latest articles on DEV Community by Sahil Kakkar (@denuggets).</description>
    <link>https://dev.to/denuggets</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1339371%2F60b1a74f-ddcb-40e9-9251-d3d04aabbbda.jpg</url>
      <title>DEV Community: Sahil Kakkar</title>
      <link>https://dev.to/denuggets</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/denuggets"/>
    <language>en</language>
    <item>
      <title>From DataWareHouses to BigData Systems: What and Why - Questions that nobody asks, but you should!</title>
      <dc:creator>Sahil Kakkar</dc:creator>
      <pubDate>Sat, 08 Nov 2025 06:44:29 +0000</pubDate>
      <link>https://dev.to/denuggets/from-datawarehouses-to-bigdata-systems-what-and-why-questions-that-nobody-asks-but-you-should-4556</link>
      <guid>https://dev.to/denuggets/from-datawarehouses-to-bigdata-systems-what-and-why-questions-that-nobody-asks-but-you-should-4556</guid>
      <description>&lt;h1&gt;
  
  
  From DataWareHouses to BigData Systems
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;In the 1980s, data warehouses evolved as a way to separate operational reporting, which requires read-heavy querying across a full dataset, from the application's transactional database, which is focused on fast reads and writes in a much smaller set of records. Data warehouses are still relational databases, but they have been optimized for reporting and analytics.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How does reporting differ from transactions?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Workload characteristics
&lt;/h3&gt;

&lt;p&gt;&lt;u&gt;Access pattern&lt;/u&gt;: Reporting is read-heavy across large historical datasets while transactions are frequent, small, and write-heavy on a narrow slice of recent records.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Latency tolerance&lt;/u&gt;: For reporting, latency of seconds to minutes is often acceptable when queries are complex. For transactions, sub-second latency is required to keep user actions snappy.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Concurrency profile&lt;/u&gt;: Transactions involve intense concurrent writes with strict isolation and locking. Reporting involves high concurrent reads, often from many analysts and BI tools.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Online analytical processing (OLAP) databases are optimized for reporting, whereas online transaction processing (OLTP) databases are designed for transactions, such as creating an order or making an ATM withdrawal.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Data modeling differences
&lt;/h3&gt;

&lt;p&gt;&lt;u&gt;Normalization vs denormalization&lt;/u&gt;:&lt;/p&gt;

&lt;p&gt;OLTP: Highly normalized (3NF) to avoid anomalies and reduce write amplification.&lt;/p&gt;

&lt;p&gt;OLAP: Denormalized star/snowflake schemas reduce join depth and accelerate scans; dimensional modeling (facts + dimensions) enables efficient aggregations.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Redundancy&lt;/u&gt;:&lt;/p&gt;

&lt;p&gt;OLTP: Minimizes redundancy to keep updates atomic and consistent.&lt;/p&gt;

&lt;p&gt;OLAP: Intentional redundancy (materialized aggregates, precomputed rollups) trades storage efficiency for query speed.&lt;/p&gt;
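
&lt;p&gt;A minimal plain-Python sketch of the dimensional-modeling idea above (table and column names are hypothetical): a fact table keyed to a small dimension, aggregated with a single group-by.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical dimension table: product_key maps to (category, brand).
dim_product = {
    1: ("Laptops", "Acme"),
    2: ("Phones", "Acme"),
    3: ("Laptops", "Globex"),
}

# Fact table: (product_key, revenue). In a warehouse this holds billions of rows.
fact_sales = [(1, 1200.0), (2, 650.0), (3, 999.0), (1, 1100.0)]

# Aggregate revenue by category: one join to a tiny dimension, then a group-by.
revenue_by_category = defaultdict(float)
for product_key, revenue in fact_sales:
    category = dim_product[product_key][0]
    revenue_by_category[category] += revenue

print(dict(revenue_by_category))  # Laptops: 3299.0, Phones: 650.0
```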

&lt;h3&gt;
  
  
  Storage and execution
&lt;/h3&gt;

&lt;p&gt;&lt;u&gt;Row vs column orientation&lt;/u&gt;:&lt;/p&gt;

&lt;p&gt;OLTP (usually row-store): Optimized for point lookups and small updates.&lt;/p&gt;

&lt;p&gt;OLAP (usually column-store): Optimized for scanning few columns across many rows, with compression (RLE, dictionary encoding) and vectorized execution.&lt;/p&gt;
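
&lt;p&gt;A toy illustration (plain Python, hypothetical data) of why column orientation helps analytics: an aggregate touches a single contiguous column, and run-length encoding compresses low-cardinality columns well.&lt;/p&gt;

```python
# Row store: each record stored together; good for point lookups and small updates.
rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "EU", "amount": 20.0},
    {"id": 3, "region": "US", "amount": 30.0},
]

# Column store: each column stored contiguously; an aggregate scans one array.
columns = {
    "id": [1, 2, 3],
    "region": ["EU", "EU", "US"],
    "amount": [10.0, 20.0, 30.0],
}

total = sum(columns["amount"])  # reads only the 'amount' column

# Run-length encoding (RLE) shrinks sorted, low-cardinality columns.
def rle(values):
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

print(total)                   # 60.0
print(rle(columns["region"]))  # [("EU", 2), ("US", 1)]
```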

&lt;p&gt;&lt;u&gt;Join behavior&lt;/u&gt;:&lt;/p&gt;

&lt;p&gt;OLTP: Short, selective joins.&lt;/p&gt;

&lt;p&gt;OLAP: Broad joins and group-bys over billions of rows; benefit from columnar pruning, predicate pushdown, and distributed execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Business implications
&lt;/h3&gt;

&lt;p&gt;OLTP example: Processing an ATM withdrawal requires ACID guarantees, low latency, and minimal locking.&lt;/p&gt;

&lt;p&gt;OLAP example: Quarterly revenue by segment across 3 years requires scanning and aggregating billions of records, with complex dimensional filters and time-series logic.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The extract, transform, and load (ETL) process was introduced to extract data from OLTP databases, transform it, and then load it into the data warehouse.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do extract and transform (ET) matter before load?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core reasons not to directly load OLTP into OLAP
&lt;/h3&gt;

&lt;p&gt;&lt;u&gt;Schema mismatch&lt;/u&gt;: OLTP schemas are normalized and operationally focused; OLAP schemas need curated dimensions, conformed keys, time grains, and slowly changing dimension handling.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Data quality and semantics&lt;/u&gt;: Raw operational data contains duplicates, missing values, status codes, and late arrivals that must be reconciled, deduped, and standardized.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Business context alignment&lt;/u&gt;: Transformations encode business rules (FX conversion, tax logic, product hierarchies) so analytics reflect “one version of truth.”&lt;/p&gt;

&lt;h3&gt;
  
  
  ET use cases
&lt;/h3&gt;

&lt;p&gt;&lt;u&gt;Customer 360&lt;/u&gt;:&lt;/p&gt;

&lt;p&gt;Extract: CRM, support tickets, web events.&lt;/p&gt;

&lt;p&gt;Transform: Resolve identities, dedupe, conform dimensions, enrich with marketing segments.&lt;/p&gt;

&lt;p&gt;Load: Star schema for churn prediction and lifetime value reporting.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Financial reporting&lt;/u&gt;:&lt;/p&gt;

&lt;p&gt;Extract: Orders and payments from OLTP systems in multiple currencies.&lt;/p&gt;

&lt;p&gt;Transform: Apply exchange rates at transaction time, recognize revenue, handle refunds/chargebacks.&lt;/p&gt;

&lt;p&gt;Load: Fact tables with daily granularity for P&amp;amp;L and auditing.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Supply chain analytics&lt;/u&gt;:&lt;/p&gt;

&lt;p&gt;Extract: Inventory, shipments, supplier SLAs.&lt;/p&gt;

&lt;p&gt;Transform: Normalize units, compute lead times, flag exceptions.&lt;/p&gt;

&lt;p&gt;Load: Aggregates by warehouse, SKU, and week for planning dashboards.&lt;/p&gt;
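
&lt;p&gt;The financial-reporting flow above can be sketched in plain Python (record shapes and exchange rates are illustrative assumptions): extract raw order events, transform by deduplicating and converting currencies, then load a daily-grain fact.&lt;/p&gt;

```python
from collections import defaultdict

# Extract: hypothetical order events from an OLTP export; note the duplicate
# row and the mixed currencies.
extracted = [
    {"order_id": "A1", "amount": 100.0, "currency": "EUR", "day": "2025-11-07"},
    {"order_id": "A1", "amount": 100.0, "currency": "EUR", "day": "2025-11-07"},  # duplicate
    {"order_id": "B2", "amount": 50.0,  "currency": "USD", "day": "2025-11-07"},
]

# Assumed exchange rates at transaction time.
fx_to_usd = {"USD": 1.0, "EUR": 1.1}

# Transform: dedupe on the business key, convert to a single reporting currency.
seen = set()
clean = []
for row in extracted:
    if row["order_id"] in seen:
        continue
    seen.add(row["order_id"])
    row = dict(row, amount_usd=round(row["amount"] * fx_to_usd[row["currency"]], 2))
    clean.append(row)

# Load: daily-grain revenue fact (here just an in-memory dict).
daily_revenue = defaultdict(float)
for row in clean:
    daily_revenue[row["day"]] += row["amount_usd"]

print(dict(daily_revenue))  # {"2025-11-07": 160.0}
```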

&lt;p&gt;&lt;em&gt;The rise of the Internet brought new data to be collected and analyzed. Even with a data warehouse dedicated to analysis, keeping up with the volume and velocity of incoming data created database bottlenecks.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottlenecks in OLAP with rising velocity and variety
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;u&gt;Ingestion pressure&lt;/u&gt;: Rapid event streams cause batch windows to overrun; nightly jobs fail to complete before business hours. Example: a retailer’s web clickstream and POS data double during festivals, breaking the daily refresh SLA.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;u&gt;Schema drift and late-arriving dimensions&lt;/u&gt;: New fields, changing enums, or delayed customer updates invalidate downstream joins, producing incomplete dashboards. Example: marketing adds “campaign_subtype” mid-quarter; reports need backfill and reprocessing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;u&gt;High cardinality explosions&lt;/u&gt;: Dimensional attributes like user_id, session_id, or SKU variants cause massive cardinality, slowing group-bys and increasing memory pressure. Example: ad-tech impression logs with billions of unique device IDs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;u&gt;Concurrency and mixed workloads&lt;/u&gt;: BI tools launch many ad-hoc queries; without workload management, compute thrashes and queues grow. Example: end-of-month finance closes saturate warehouse resources, delaying operational analytics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;u&gt;Small files and fragmentation&lt;/u&gt;: Streaming ingestion creates many tiny files, harming scan efficiency and metadata operations. Example: hourly micro-batches to object storage create millions of small Parquet files, degrading query planning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;u&gt;Data freshness vs compute cost&lt;/u&gt;: Demanding near-real-time KPIs increases incremental build frequency, stressing storage IO and caches.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Administrators could scale vertically (i.e., increase the size and speed of the database), but there wasn't an easy way to scale horizontally (i.e., to distribute the load across multiple databases).&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Vertical vs horizontal scaling in OLAP context
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Vertical scaling (scale-up):
&lt;/h3&gt;

&lt;p&gt;&lt;u&gt;Definition&lt;/u&gt;: Add more CPU/RAM/storage to a single warehouse node or a monolithic appliance.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Pros&lt;/u&gt;: Simple operations, no data redistribution, stronger single-query performance up to hardware limits.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Cons&lt;/u&gt;: Diminishing returns, expensive hardware, single failure domain, limited concurrency headroom.&lt;/p&gt;

&lt;h3&gt;
  
  
  Horizontal scaling (scale-out):
&lt;/h3&gt;

&lt;p&gt;&lt;u&gt;Definition&lt;/u&gt;: Distribute data and query execution across many nodes; partition and replicate data; use parallel processing.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Pros&lt;/u&gt;: Near-linear gains in throughput and concurrency, fault tolerance via replication, elastic capacity, and better data locality.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Cons&lt;/u&gt;: Requires partitioning strategy, shuffle-aware design, and coordination overhead.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Big data systems or frameworks addressed this shortcoming in the 2000s. Big data frameworks were designed to distribute data across multiple nodes and handle any failures automatically.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Concrete benefits of horizontal scaling and partitioning (e.g., in Apache Spark)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Parallelism and throughput:
&lt;/h3&gt;

&lt;p&gt;Benefit: Partitions enable tasks to run concurrently across executors, dramatically reducing wall-clock time for scans, joins, and aggregates.&lt;/p&gt;

&lt;p&gt;Solves what scale-up cannot: When dataset size exceeds single-node memory/IO, parallel processing sustains performance at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pruning and locality:
&lt;/h3&gt;

&lt;p&gt;Benefit: Partition pruning and predicate pushdown avoid scanning irrelevant data; co-locating compute with storage reduces network IO.&lt;/p&gt;

&lt;p&gt;Solves: Scale-up still scans full datasets; scale-out skips partitions (e.g., date=2025-11-07).&lt;/p&gt;
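
&lt;p&gt;A plain-Python sketch of partition pruning over a Hive-style layout (paths and data are hypothetical): the date filter is answered from partition metadata alone, so non-matching partitions are never read.&lt;/p&gt;

```python
# Hive-style partitioned layout: the path encodes the partition key, so a
# date filter can skip whole partitions without opening their files.
partitions = {
    "date=2025-11-05": [("u1", 3), ("u2", 5)],
    "date=2025-11-06": [("u1", 2)],
    "date=2025-11-07": [("u3", 7), ("u1", 1)],
}

def scan_with_pruning(partitions, wanted_date):
    scanned = []
    for path, rows in partitions.items():
        # Prune: decide from metadata (the path) alone, before touching data.
        if path != f"date={wanted_date}":
            continue
        scanned.extend(rows)
    return scanned

rows = scan_with_pruning(partitions, "2025-11-07")
print(rows)  # [("u3", 7), ("u1", 1)]
```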

&lt;h3&gt;
  
  
  Handling skew and hotspots:
&lt;/h3&gt;

&lt;p&gt;Benefit: Techniques like salting, adaptive query execution, and repartitioning redistribute skewed keys to balance work.&lt;/p&gt;

&lt;p&gt;Solves: A single powerful node can’t fix skew—one hot key still dominates; distributed strategies mitigate tail latency.&lt;/p&gt;
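
&lt;p&gt;A plain-Python sketch of salting (the hash function and workload are toy assumptions, not Spark internals): adding a random salt to the hot key spreads its records across partitions, at the cost of a second pass to merge the per-salt partial aggregates.&lt;/p&gt;

```python
import random
from collections import Counter

NUM_PARTITIONS = 8
SALT_BUCKETS = 8
random.seed(0)

# Skewed workload: one hot key dominates the shuffle.
events = ["hot_key"] * 9000 + [f"key_{i}" for i in range(1000)]

def bucket(key, salt=0):
    # Toy deterministic hash (stable across runs, unlike built-in hash()).
    h = salt
    for ch in key:
        h = (h * 31 + ord(ch)) % 1000003
    return h % NUM_PARTITIONS

# Without salting: every occurrence of the hot key lands in one partition.
plain = Counter(bucket(k) for k in events)

# With salting: a random salt spreads the hot key across partitions; a second
# pass then merges the per-salt partial aggregates for each original key.
salted = Counter(bucket(k, random.randrange(SALT_BUCKETS)) for k in events)

print("busiest partition, plain: ", max(plain.values()))
print("busiest partition, salted:", max(salted.values()))
```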

&lt;h3&gt;
  
  
  Resilience and elasticity:
&lt;/h3&gt;

&lt;p&gt;Benefit: Task retries, speculative execution, and node replacement maintain progress despite failures; clusters can be right-sized on demand.&lt;/p&gt;

&lt;p&gt;Solves: Single-node outages or resource saturation halt scale-up systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  File and metadata management:
&lt;/h3&gt;

&lt;p&gt;Benefit: Compaction, clustering, and partitioned layouts (e.g., Hive-style or Iceberg/Delta partition specs) reduce small-file overhead and speed planning.&lt;/p&gt;

&lt;p&gt;Solves: Monolithic warehouses suffer from fragmented storage; distributed engines can rewrite and optimize layouts iteratively.&lt;/p&gt;

&lt;h3&gt;
  
  
  Streaming + batch unification:
&lt;/h3&gt;

&lt;p&gt;Benefit: Micro-batch or continuous processing handles high-velocity data while preserving exactly-once semantics and incremental merges (upserts).&lt;/p&gt;

&lt;p&gt;Solves: Traditional scale-up batch windows can’t meet freshness SLAs for near-real-time KPIs.&lt;/p&gt;
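
&lt;p&gt;A minimal sketch of the incremental merge (upsert) idea in plain Python (the table shape is hypothetical): a micro-batch updates existing keys and inserts new ones, and replaying the same batch leaves the table unchanged, which keeps reprocessing safe.&lt;/p&gt;

```python
# Target table keyed by business key, as a streaming micro-batch would see it.
target = {
    "u1": {"user_id": "u1", "orders": 3},
    "u2": {"user_id": "u2", "orders": 1},
}

micro_batch = [
    {"user_id": "u2", "orders": 2},  # update existing key
    {"user_id": "u3", "orders": 1},  # insert new key
]

def merge_upsert(target, batch):
    # Last-writer-wins on the key: existing rows update, new rows insert.
    for row in batch:
        target[row["user_id"]] = row
    return target

merge_upsert(target, micro_batch)
# Applying the same batch again changes nothing (idempotent reprocessing).
merge_upsert(target, micro_batch)
print(sorted(target))  # ["u1", "u2", "u3"]
```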

&lt;p&gt;&lt;em&gt;These frameworks also allowed the big data systems to handle many ETL transformations, which helped to increase the speed with which analysis could be done.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why ETL speed matters to analytics and reporting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Operational decisioning:
&lt;/h3&gt;

&lt;p&gt;Use case: Inventory restock recommendations every hour; faster ETL prevents stockouts and reduces carrying costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Revenue-critical personalization:
&lt;/h3&gt;

&lt;p&gt;Use case: E-commerce updates recommendations and pricing with live signals; delays reduce conversion and average order value.&lt;/p&gt;

&lt;h3&gt;
  
  
  Risk and fraud mitigation:
&lt;/h3&gt;

&lt;p&gt;Use case: Banking flags anomalous transactions within minutes; slow ETL increases exposure and chargebacks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ad bidding and campaign optimization:
&lt;/h3&gt;

&lt;p&gt;Use case: Real-time attribution and budget pacing; stale aggregates waste spend and underdeliver on ROAS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Executive dashboards and SLAs:
&lt;/h3&gt;

&lt;p&gt;Use case: Daily sales close by 8 AM; if ETL overruns, leadership decisions are made on outdated numbers, eroding trust.&lt;/p&gt;

&lt;h3&gt;
  
  
  Customer support and SRE analytics:
&lt;/h3&gt;

&lt;p&gt;Use case: Incident triage relies on fresh telemetry; delays prolong outages and degrade user experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Reporting demands wide scans, denormalized models, columnar storage, and tolerance for moderate latency, which conflicts with OLTP’s normalized, low-latency transactional updates. ET (not just TL) ensures trustworthy, business-aligned analytics by reconciling semantics, quality, and model shape before loading. As data velocity and variety grow, OLAP faces ingestion, schema drift, high cardinality, concurrency, and small-file bottlenecks. Vertical scaling hits hard limits; horizontal scaling with partitioning and distributed execution (as in Spark) delivers parallelism, pruning, skew mitigation, resilience, and streaming-friendly pipelines. Faster ETL directly improves revenue, risk, operations, and trust in decision-making.&lt;/p&gt;

</description>
      <category>database</category>
      <category>olap</category>
      <category>bigdata</category>
      <category>apachespark</category>
    </item>
  </channel>
</rss>
