DEV Community

MJ-O

Foundational Concepts in Data Engineering Using an E-Commerce Platform Example

INTRODUCTION

Data engineering is the process of collecting, moving, transforming, storing, and managing data so that it can be used for reporting, analytics, machine learning, and decision-making. Almost every modern digital platform depends heavily on data, and behind these systems are data engineers who build pipelines and architectures that ensure data flows correctly and reliably.

To better understand these concepts, we will use a real-world example throughout this article: an e-commerce platform similar to Amazon, Jumia, Alibaba, or Shopify.

In an e-commerce platform:

  • Customers browse products
  • Orders are placed
  • Payments are processed
  • Products are shipped
  • Notifications are sent
  • Reports and dashboards are generated

Every click, payment, search, review, and order generates data. As the platform grows, the amount of data becomes extremely large and complex. Data engineering helps ensure this data can be processed efficiently and used effectively.

In this article, we will explore important foundational concepts in data engineering using this e-commerce platform example.


1. BATCH VS STREAMING INGESTION

Data ingestion is the process of collecting data from different sources and moving it into a storage or processing system.

There are two main ways data is ingested:

  • Batch ingestion
  • Streaming ingestion

Batch Ingestion

Batch ingestion processes data in groups after a certain period of time.

In an e-commerce platform, not all tasks require immediate processing. Some reports and operations are performed periodically.

Examples include:

  • Daily sales reports
  • Weekly inventory reports
  • Monthly revenue summaries
  • End-of-day analytics

Suppose the company wants to calculate the total sales made during the day. Instead of calculating every transaction individually in real time, the system may collect all transactions throughout the day and process them together at midnight.

This is batch ingestion because the data is processed in batches after a certain interval.
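
As a minimal sketch of the end-of-day job described above, the snippet below aggregates a whole day's transactions in one pass. The transaction list and the function name `daily_sales_totals` are assumptions made for illustration, not part of any specific platform.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions collected during the day: (order_date, amount).
transactions = [
    (date(2024, 1, 1), 250.0),
    (date(2024, 1, 1), 100.0),
    (date(2024, 1, 2), 75.5),
]

def daily_sales_totals(rows):
    """Aggregate a whole batch of transactions into one total per day."""
    totals = defaultdict(float)
    for order_date, amount in rows:
        totals[order_date] += amount
    return dict(totals)

# The scheduled job runs once, over everything collected so far.
totals = daily_sales_totals(transactions)
```

Nothing is computed until the job runs, which is why batch results are only as fresh as the last run.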

Advantages of Batch Ingestion

  • Easier to manage
  • Efficient for historical data processing
  • Lower infrastructure costs
  • Suitable for scheduled analytics

Limitations of Batch Ingestion

  • Delayed updates
  • Not suitable for real-time monitoring
  • Errors may only be detected later

Streaming Ingestion

Streaming ingestion processes data continuously as it is generated.

In an e-commerce system, many activities need immediate processing.

For example:

  • Customers should instantly receive order confirmations
  • Inventory should update immediately after purchases
  • Fraudulent transactions should be detected quickly
  • Delivery tracking should update in real time

Suppose a customer buys the last item in stock. The inventory system must update immediately so other customers do not purchase unavailable products.

This requires streaming ingestion.

Streaming systems continuously process incoming events as they happen.
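
The last-item-in-stock scenario can be sketched as a loop that handles each order event the moment it arrives. The generator `order_events` stands in for a real event source (which would never end), and all names here are hypothetical.

```python
def order_events():
    """Hypothetical event source; a real stream would be unbounded."""
    yield {"product_id": "P1", "quantity": 1}
    yield {"product_id": "P1", "quantity": 1}
    yield {"product_id": "P1", "quantity": 1}

def process_stream(events, stock):
    """Handle each order as it arrives and update stock immediately."""
    results = []
    for event in events:
        product = event["product_id"]
        if stock.get(product, 0) >= event["quantity"]:
            stock[product] -= event["quantity"]
            results.append("confirmed")
        else:
            results.append("rejected: out of stock")
    return results

stock = {"P1": 2}
results = process_stream(order_events(), stock)
```

Because stock is updated per event, the third order is rejected instead of selling an item that no longer exists.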

Common technologies used include:

  • Apache Kafka
  • Spark Streaming
  • Apache Flink

Advantages of Streaming Ingestion

  • Real-time updates
  • Faster decision-making
  • Immediate notifications
  • Better customer experience

Limitations of Streaming Ingestion

  • More complex architecture
  • Higher processing requirements
  • More difficult to maintain

2. CHANGE DATA CAPTURE (CDC)

Change Data Capture (CDC) is a method used to track changes made to data in a database.

Instead of repeatedly copying the entire database, CDC captures only data that has changed.

This includes:

  • New records
  • Updated records
  • Deleted records

Example in an E-Commerce Platform

Suppose:

  • A customer changes their shipping address
  • A product price is updated
  • An order status changes from “Processing” to “Shipped”

Instead of copying the entire customer or orders table again, CDC captures only those changed records.

CDC is important because it:

  • Reduces unnecessary processing
  • Saves storage and bandwidth
  • Improves synchronization between systems
  • Supports near real-time analytics

Without CDC, systems would waste resources transferring unchanged data repeatedly.
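
A rough sketch of how a downstream copy might consume change events is shown below. The event shape (`op`, `id`, `data`) is an assumption for illustration; real CDC tools define their own formats.

```python
def apply_changes(replica, change_events):
    """Apply insert/update/delete events to a copy keyed by primary key."""
    for event in change_events:
        op, key = event["op"], event["id"]
        if op in ("insert", "update"):
            # Merge the changed fields into the existing row, or create it.
            replica[key] = {**replica.get(key, {}), **event["data"]}
        elif op == "delete":
            replica.pop(key, None)
    return replica

# Hypothetical downstream copy of the orders table.
orders_replica = {1: {"status": "Processing"}, 2: {"status": "Processing"}}

changes = [
    {"op": "update", "id": 1, "data": {"status": "Shipped"}},
    {"op": "insert", "id": 3, "data": {"status": "Processing"}},
    {"op": "delete", "id": 2},
]
apply_changes(orders_replica, changes)
```

Only three small events are transferred, regardless of how many unchanged orders the table holds.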


3. IDEMPOTENCY

Idempotency means that performing the same operation multiple times produces the same final result as performing it once.

This concept is extremely important in distributed systems where failures and retries are common.


Example in an E-Commerce Platform

Suppose:

  1. A customer pays for an order
  2. The payment request times out
  3. The system retries the payment

Without idempotency:

  • The customer may be charged multiple times

With idempotency:

  • The system recognizes the repeated request as the same transaction
  • The payment is processed only once

This is usually achieved using:

  • Unique transaction IDs
  • Request tracking systems

Idempotency helps:

  • Prevent duplicate payments
  • Improve reliability
  • Make retries safe
  • Protect customer trust

Large platforms depend heavily on idempotent systems to avoid costly transaction errors.
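
The retry scenario above can be sketched with a unique key per payment attempt. The class `PaymentProcessor` and the key format are hypothetical; a production system would persist the keys in durable storage rather than in memory.

```python
class PaymentProcessor:
    """Toy processor that remembers which requests it has already handled."""

    def __init__(self):
        self.processed = {}  # idempotency key -> stored result
        self.charges = []    # every real charge made against a customer

    def charge(self, idempotency_key, customer_id, amount):
        # A repeated key means this is a retry: return the earlier result.
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]
        self.charges.append((customer_id, amount))
        result = {"status": "charged", "amount": amount}
        self.processed[idempotency_key] = result
        return result

processor = PaymentProcessor()
first = processor.charge("order-101-payment", "cust-7", 250)
retry = processor.charge("order-101-payment", "cust-7", 250)  # after a timeout
```

The retry returns the stored result, so the customer is charged once no matter how many times the request is repeated.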


4. OLTP VS OLAP

OLTP and OLAP are two different approaches used in database systems.


OLTP (Online Transaction Processing)

OLTP systems handle daily operational activities.

In an e-commerce platform, OLTP systems process:

  • Customer orders
  • Payments
  • Product updates
  • Cart additions
  • User logins

These systems require:

  • Fast response times
  • Real-time processing
  • High accuracy
  • Support for many users simultaneously

For example:

  • When a customer clicks “Buy Now,” the transaction must happen immediately

OLTP systems focus on:

  • Inserts
  • Updates
  • Short transactions

Examples of OLTP databases include:

  • MySQL
  • PostgreSQL
  • Oracle

OLAP (Online Analytical Processing)

OLAP systems are designed for analysis and reporting.

The company may use OLAP systems to analyze:

  • Best-selling products
  • Customer purchasing behavior
  • Revenue trends
  • Seasonal demand patterns
  • Delivery performance

OLAP systems focus on:

  • Complex analytical queries
  • Historical analysis
  • Large datasets
  • Aggregations

For example:

  • Management may want to compare monthly sales across different countries over several years

OLAP systems commonly use:

  • Data warehouses
  • Analytical databases
  • Columnar storage systems

Examples include:

  • Snowflake
  • BigQuery
  • Amazon Redshift

Simple Difference

  • OLTP runs the business
  • OLAP analyzes the business
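
To make the two workload shapes concrete, the sketch below uses Python's built-in `sqlite3` module. SQLite is only a convenient stand-in here (it is not an analytical warehouse), and the table layout and country codes are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, country TEXT, amount REAL)"
)

# OLTP-style work: short transactions that each write a single row.
for country, amount in [("KE", 250.0), ("KE", 100.0), ("NG", 80.0)]:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO orders (country, amount) VALUES (?, ?)",
            (country, amount),
        )

# OLAP-style work: one query that scans many rows and aggregates them.
revenue_by_country = dict(
    conn.execute("SELECT country, SUM(amount) FROM orders GROUP BY country")
)
```

The inserts touch one row each and must be fast; the aggregation reads every row but runs far less often.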

5. COLUMNAR VS ROW-BASED STORAGE

Databases can store data either by rows or by columns.

The storage format affects how efficiently queries run.


Row-Based Storage

In row-based storage, all information for one record is stored together.

Example:

Order ID | Customer | Amount
101      | James    | 250

This works well in transactional systems where entire records are frequently inserted or updated.

Advantages

  • Faster inserts and updates
  • Good for transactional workloads
  • Efficient for OLTP systems

Limitations

  • Slower analytical queries
  • Less efficient for reporting

Columnar Storage

In columnar storage:

  • All values from the same column are stored together

For example:

  • All product prices together
  • All order dates together
  • All customer IDs together

This works well for analytics because reports often require only a few columns.

For example:

  • Calculating total revenue may only require the “Amount” column

Advantages

  • Faster analytical queries
  • Better data compression
  • Efficient for reporting and aggregations

Limitations

  • Slower updates
  • Less suitable for transactional systems
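
The difference between the two layouts can be sketched with plain Python structures. This is only an in-memory analogy for how storage engines lay data out on disk, and every order except 101 is made up for the example.

```python
# Row-based layout: each record keeps all of its fields together.
rows = [
    {"order_id": 101, "customer": "James", "amount": 250},
    {"order_id": 102, "customer": "Amina", "amount": 100},
    {"order_id": 103, "customer": "Li", "amount": 75},
]

# Columnar layout: each column keeps all of its values together.
columns = {
    "order_id": [101, 102, 103],
    "customer": ["James", "Amina", "Li"],
    "amount": [250, 100, 75],
}

# Total revenue from rows: every record is visited, including unused fields.
total_from_rows = sum(row["amount"] for row in rows)

# Total revenue from columns: only the "amount" column is read.
total_from_columns = sum(columns["amount"])

# Adding an order is one append for rows, but one append per column here.
rows.append({"order_id": 104, "customer": "Sara", "amount": 60})
for name, value in [("order_id", 104), ("customer", "Sara"), ("amount", 60)]:
    columns[name].append(value)
```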

6. PARTITIONING

Partitioning means dividing large datasets into smaller sections.

As the e-commerce platform grows, storing all orders in one massive table becomes inefficient.

The platform may partition order data based on:

  • Date
  • Country
  • Product category
  • Customer region

For example:

  • January orders stored separately from February orders

If analysts only need January sales data, the system reads only the January partition instead of scanning the entire table. This is often called partition pruning.

Partitioning improves:

  • Query performance
  • Data management
  • Processing efficiency

Partitioning is especially important in systems handling millions or billions of records.
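
A minimal sketch of date-based partitioning follows. The order records and the `YYYY-MM` partition key are assumptions for illustration; real systems usually partition files or table segments rather than Python lists.

```python
from collections import defaultdict

def partition_by_month(orders):
    """Group orders into partitions keyed by 'YYYY-MM'."""
    partitions = defaultdict(list)
    for order in orders:
        month_key = order["order_date"][:7]  # e.g. "2024-01"
        partitions[month_key].append(order)
    return partitions

# Hypothetical orders with ISO-formatted dates.
orders = [
    {"order_id": 1, "order_date": "2024-01-15", "amount": 250},
    {"order_id": 2, "order_date": "2024-01-20", "amount": 100},
    {"order_id": 3, "order_date": "2024-02-03", "amount": 75},
]

partitions = partition_by_month(orders)

# A January report reads one partition and never touches February's rows.
january_sales = sum(order["amount"] for order in partitions["2024-01"])
```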


7. ETL VS ELT

ETL and ELT are approaches used to move and process data.


ETL (Extract, Transform, Load)

In ETL:

  1. Data is extracted
  2. Data is transformed
  3. Data is loaded

The transformation happens before the data is loaded into the target system, so only cleaned data is stored there.

For example:

  • Product data from different suppliers may have inconsistent formats
  • Data engineers clean and standardize the data before loading it into a warehouse

ETL ensures:

  • Clean data enters the system
  • Better data quality
  • Easier reporting

ETL is commonly used in traditional data systems.


ELT (Extract, Load, Transform)

In ELT:

  1. Data is extracted
  2. Data is loaded immediately
  3. Transformation happens later

For example:

  • Raw customer activity logs are stored first
  • Cleaning and transformation happen later, inside the warehouse, when the data is needed for analysis

ELT provides:

  • More flexibility
  • Faster loading
  • Access to raw data

ELT is common in modern cloud systems such as:

  • Snowflake
  • BigQuery
  • Databricks
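
The ordering difference between ETL and ELT can be sketched with the same transform applied at different points. The supplier records and the `standardize` function are hypothetical, and plain lists stand in for warehouse tables.

```python
def standardize(record):
    """Transform step: normalize a hypothetical supplier record."""
    return {
        "name": record["name"].strip().title(),
        "price": float(str(record["price"]).replace("$", "")),
    }

# Raw supplier data with inconsistent formats.
raw_records = [
    {"name": "  wireless mouse ", "price": "$19.99"},
    {"name": "USB CABLE", "price": 5},
]

# ETL: transform first, then load only the cleaned rows.
etl_warehouse = [standardize(record) for record in raw_records]

# ELT: load the raw rows as-is, then transform later inside the warehouse.
elt_raw_table = list(raw_records)
elt_clean_view = [standardize(record) for record in elt_raw_table]
```

Both routes end with the same cleaned rows; the ELT route also keeps the raw table, which is what makes later re-transformation possible.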

8. CAP THEOREM

CAP Theorem explains a limitation in distributed systems.

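It states that a distributed system cannot guarantee all three of the following properties at the same time. Because network failures cannot be ruled out in practice, the real choice during a failure is between consistency and availability: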

  • Consistency
  • Availability
  • Partition Tolerance

Consistency

Every read returns the most recent write, so all users see the same data at the same time.

Example:

  • If a product goes out of stock, all customers should immediately see the updated inventory.

Availability

The system always responds to requests.

Even if some servers fail, customers should still access the platform.


Partition Tolerance

The system continues operating even when network failures prevent some servers from communicating with each other.


Example in an E-Commerce Platform

Suppose there is a network issue between servers.

The platform may prioritize:

  • Keeping the website available
  • Continuing customer purchases

This favors availability over consistency: inventory counts may be briefly out of date until the servers can synchronize again.

Different systems choose different trade-offs depending on business priorities.
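
The trade-off can be illustrated with a toy model of two inventory replicas. This is a deliberately simplified sketch, not how any real database behaves; the `Replica` class, the `buy` function, and the mode labels "CP" (consistency first) and "AP" (availability first) are assumptions for the example.

```python
class Replica:
    """One server's copy of the stock count for a single product."""

    def __init__(self, stock):
        self.stock = stock

def buy(local, remote, partitioned, mode):
    """Try to sell one unit via `local`, with or without a network partition."""
    if partitioned:
        if mode == "CP":
            # Consistency first: refuse rather than diverge from the peer.
            return "unavailable"
        # Availability first: accept locally; the peer is now out of date.
        local.stock -= 1
        return "accepted"
    # No partition: both replicas are updated together.
    local.stock -= 1
    remote.stock -= 1
    return "accepted"

a, b = Replica(5), Replica(5)
ap_result = buy(a, b, partitioned=True, mode="AP")

c, d = Replica(5), Replica(5)
cp_result = buy(c, d, partitioned=True, mode="CP")
```

In the availability-first run the sale succeeds but the two replicas disagree until they resynchronize; in the consistency-first run they stay identical but the customer is turned away.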


9. WINDOWING IN STREAMING

Streaming data arrives continuously and never ends, so calculations such as totals or averages need a bounded range of events to work on.

Windowing divides streaming data into smaller time-based groups.

Examples:

  • Orders every 5 minutes
  • Active users every 10 minutes
  • Hourly revenue monitoring

Types of Windows

Tumbling Windows

Fixed non-overlapping windows.

Example:

  • Total sales calculated every 10 minutes

Sliding Windows

Fixed-size windows that overlap, because a new window starts at a regular interval shorter than the window length.

Example:

  • Average purchases over the last 30 minutes recalculated every 5 minutes

Session Windows

Windows defined by user activity rather than by fixed clock intervals. A window closes after a period of inactivity.

Example:

  • A shopping session ends after inactivity

Windowing helps organize continuous streams into manageable sections.
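
The three window types can be sketched over a small list of events. The events, window sizes, and function names below are assumptions for illustration; streaming frameworks provide their own windowing APIs and also deal with late or out-of-order events, which this sketch ignores.

```python
from collections import defaultdict

# Hypothetical events: (minute since start, order amount).
events = [(1, 10), (4, 20), (12, 5), (19, 15), (25, 30)]

def tumbling_totals(events, size):
    """Sum amounts per fixed, non-overlapping window [start, start + size)."""
    totals = defaultdict(int)
    for minute, amount in events:
        window_start = (minute // size) * size
        totals[window_start] += amount
    return dict(totals)

def sliding_totals(events, size, step, end):
    """Sum amounts per overlapping window, with a new window every `step`."""
    totals = {}
    for start in range(0, end, step):
        totals[start] = sum(a for m, a in events if start <= m < start + size)
    return totals

def session_windows(events, gap):
    """Group events into sessions; a pause longer than `gap` starts a new one."""
    sessions = []
    for minute, amount in sorted(events):
        if sessions and minute - sessions[-1][-1][0] <= gap:
            sessions[-1].append((minute, amount))
        else:
            sessions.append([(minute, amount)])
    return sessions

tumbling = tumbling_totals(events, size=10)
sliding = sliding_totals(events, size=20, step=10, end=30)
sessions = session_windows(events, gap=5)
```

With sliding windows, the events at minutes 12 and 19 are counted in two windows each, which is the overlap that tumbling windows avoid.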


10. DAGS AND WORKFLOW ORCHESTRATION

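A DAG (Directed Acyclic Graph) organizes tasks in a workflow. Each task is a node, each dependency is a one-way link between two tasks, and no chain of dependencies ever loops back on itself.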

An e-commerce reporting pipeline may:

  1. Extract order data
  2. Clean the data
  3. Store it in a warehouse
  4. Generate dashboards
  5. Send reports to management

Each task depends on previous tasks.

Tools like Apache Airflow automate these workflows.


Why DAGs Are Important

  • Automate workflows
  • Manage dependencies
  • Schedule tasks
  • Monitor failures
  • Improve reliability

Without orchestration tools, managing large pipelines manually becomes difficult.
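
The reporting pipeline above can be expressed as a dependency graph using Python's standard-library `graphlib` module (Python 3.9+). This only computes a valid run order; it is not how Airflow is configured, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "clean_data": {"extract_orders"},
    "load_warehouse": {"clean_data"},
    "build_dashboards": {"load_warehouse"},
    "send_reports": {"build_dashboards"},
}

# A topological sort yields an order in which every dependency runs first.
run_order = list(TopologicalSorter(pipeline).static_order())
```

If the graph contained a cycle (for example, two tasks depending on each other), `static_order()` would raise `graphlib.CycleError`, which is why workflows must be acyclic.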


11. RETRY LOGIC AND DEAD LETTER QUEUES

Distributed systems frequently experience failures.

Examples include:

  • Network timeouts
  • Payment API failures
  • Invalid records

Retry Logic

Retry logic allows systems to attempt failed operations again automatically.

Example:

  • A payment gateway temporarily fails
  • The system retries after a few seconds

This improves reliability and reduces manual intervention.


Dead Letter Queues (DLQs)

Sometimes records repeatedly fail processing.

Example:

  • An order record contains corrupted product information

Instead of crashing the entire pipeline:

  • The failed record is moved into a Dead Letter Queue

This helps:

  • Isolate problematic data
  • Prevent system failures
  • Improve debugging
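
Retries and a dead letter queue can be combined in a single processing loop, sketched below. The function names, the record format, and the linear backoff are assumptions; real message brokers usually provide DLQ support as a configuration option.

```python
import time

def process_with_retries(records, handler, max_attempts=3, delay_seconds=0.0):
    """Run `handler` on each record; park records that keep failing in a DLQ."""
    succeeded, dead_letter_queue = [], []
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                succeeded.append(handler(record))
                break
            except Exception as error:
                if attempt == max_attempts:
                    # Out of attempts: set the record aside and keep going.
                    dead_letter_queue.append(
                        {"record": record, "error": str(error)}
                    )
                else:
                    time.sleep(delay_seconds * attempt)  # simple linear backoff
    return succeeded, dead_letter_queue

attempts_seen = {}

def flaky_handler(record):
    """Hypothetical handler: order 2 fails once, order 3 is corrupted."""
    attempts_seen[record["id"]] = attempts_seen.get(record["id"], 0) + 1
    if record.get("product") is None:
        raise ValueError("missing product information")
    if record["id"] == 2 and attempts_seen[2] == 1:
        raise TimeoutError("payment gateway timed out")
    return record["id"]

records = [
    {"id": 1, "product": "mouse"},
    {"id": 2, "product": "cable"},
    {"id": 3, "product": None},
]
succeeded, dlq = process_with_retries(records, flaky_handler)
```

The temporary failure on order 2 is fixed by a retry, while the corrupted order 3 ends up in the queue for inspection without stopping the other records.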

12. BACKFILLING AND REPROCESSING

Data pipelines sometimes fail or produce incorrect outputs.


Backfilling

Backfilling means running a pipeline over a past time period to load data that is missing for that period.

Example:

  • Sales records failed to load for several hours
  • Missing records need recovery

Reprocessing

Reprocessing means running pipelines again using updated logic.

Example:

  • The company changes how discounts are calculated
  • Historical reports must be recalculated

These processes help maintain accurate historical data.
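
Both operations can be sketched over a small set of daily sales figures. The data, the function names, and the 10% discount rule are assumptions made for this example.

```python
from datetime import date, timedelta

def backfill(loaded, source, start, end):
    """Load every day in [start, end] that is missing from `loaded`."""
    filled = []
    day = start
    while day <= end:
        if day not in loaded and day in source:
            loaded[day] = source[day]
            filled.append(day)
        day += timedelta(days=1)
    return filled

def reprocess(source, transform):
    """Recompute every day from the source data using updated logic."""
    return {day: transform(value) for day, value in source.items()}

# Hypothetical source of truth: daily sales for 1-5 January.
source = {date(2024, 1, d): 100 * d for d in range(1, 6)}

# The warehouse is missing 3 and 4 January after a failed load.
loaded = {
    date(2024, 1, 1): 100,
    date(2024, 1, 2): 200,
    date(2024, 1, 5): 500,
}

filled = backfill(loaded, source, date(2024, 1, 1), date(2024, 1, 5))

# New rule (assumed): reported sales are now net of a 10% discount.
with_new_discount = reprocess(source, lambda sales: sales * 90 // 100)
```

Backfilling only touches the missing days and is safe to run twice, whereas reprocessing deliberately recomputes every day.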


13. DATA GOVERNANCE

Data governance refers to managing data properly and securely.

An e-commerce platform handles sensitive data such as:

  • Customer names
  • Addresses
  • Payment details
  • Purchase history

Good governance ensures:

  • Data quality
  • Security
  • Privacy protection
  • Controlled access

For example:

  • Only authorized employees should access customer payment information

Poor governance may lead to:

  • Data leaks
  • Compliance violations
  • Incorrect reporting
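
Controlled access can be sketched as field-level masking based on a role. The roles, field names, and sample record below are hypothetical; real platforms typically enforce this in the database or an access-control service rather than in application code.

```python
# Hypothetical mapping from role to the fields that role may see.
ROLE_PERMISSIONS = {
    "support_agent": {"name", "address"},
    "finance_analyst": {"name", "card_number", "purchase_total"},
}

def view_record(record, role):
    """Return a copy of `record` with non-permitted fields masked."""
    allowed = ROLE_PERMISSIONS.get(role, set())
    return {
        field: (value if field in allowed else "***")
        for field, value in record.items()
    }

customer = {
    "name": "James",
    "address": "12 Example Road",
    "card_number": "4111-XXXX-XXXX-1111",
    "purchase_total": 250,
}

support_view = view_record(customer, "support_agent")
unknown_view = view_record(customer, "intern")
```

An unknown role sees every field masked, which follows the principle of denying access by default.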

14. TIME TRAVEL AND DATA VERSIONING

Time travel allows systems to access older versions of data.

For example:

  • Recover deleted product information
  • Audit historical pricing changes
  • Compare previous inventory records

Data versioning tracks how datasets change over time.

These features are useful for:

  • Recovery
  • Auditing
  • Debugging
  • Historical analysis

Modern platforms such as Snowflake and Delta Lake support time travel functionality.
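
The idea behind time travel can be sketched with a table that keeps every committed snapshot. The `VersionedTable` class is a toy for illustration only; it is not how Snowflake or Delta Lake implement the feature, and real systems avoid storing full copies of the data for each version.

```python
class VersionedTable:
    """Keeps every committed snapshot so older versions stay readable."""

    def __init__(self):
        self._versions = [{}]  # version 0 is the empty table

    def commit(self, changes, deletes=()):
        """Apply changes on top of the latest snapshot as a new version."""
        snapshot = dict(self._versions[-1])
        snapshot.update(changes)
        for key in deletes:
            snapshot.pop(key, None)
        self._versions.append(snapshot)
        return len(self._versions) - 1  # the new version number

    def read(self, version=None):
        """Read the latest version, or an older one by number."""
        index = -1 if version is None else version
        return dict(self._versions[index])

prices = VersionedTable()
v1 = prices.commit({"mouse": 20, "cable": 5})
v2 = prices.commit({"mouse": 18}, deletes=["cable"])

current = prices.read()
before_change = prices.read(version=v1)
```

Reading version 1 recovers both the deleted product and the earlier price, which covers the recovery and auditing cases listed above.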


15. DISTRIBUTED PROCESSING CONCEPTS

Large e-commerce platforms generate huge amounts of data every second.

A single computer cannot efficiently process:

  • Millions of customer clicks
  • Product searches
  • Payment transactions
  • Delivery tracking data

Distributed processing solves this problem by spreading work across multiple machines.

Frameworks such as Apache Spark support distributed processing.


Important Distributed Processing Concepts

Parallel Processing

Multiple tasks run simultaneously.

Example:

  • Processing orders from different countries at the same time

Cluster

A group of computers working together.


Scalability

The ability to handle increasing workloads.

Example:

  • Adding more servers during Black Friday sales

Fault Tolerance

The system continues functioning even if one machine fails.

This is important because large online stores must remain available continuously.
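
Parallel processing can be sketched on a single machine with Python's `concurrent.futures`. Threads here stand in for the separate machines of a cluster, so this shows the split-then-combine pattern rather than real distributed execution; the order data is made up.

```python
from concurrent.futures import ThreadPoolExecutor

def total_for_country(order_amounts):
    """Work done independently by one worker."""
    return sum(order_amounts)

# Hypothetical order amounts, already split by country.
orders_by_country = {"KE": [250, 100], "NG": [80, 20], "ZA": [40]}

# Each country's orders are processed by a separate worker at the same time.
with ThreadPoolExecutor(max_workers=3) as pool:
    per_country = pool.map(total_for_country, orders_by_country.values())
    totals = dict(zip(orders_by_country, per_country))

# The partial results are then combined into one final answer.
grand_total = sum(totals.values())
```

Adding more workers for more countries is the single-machine analogue of adding servers to a cluster.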


CONCLUSION

Modern data engineering systems rely on many foundational concepts to ensure data pipelines are scalable, reliable, efficient, and secure.

Using an e-commerce platform example makes it easier to understand how these concepts work in real-world systems used by millions of people daily. Concepts such as streaming ingestion, ETL, partitioning, orchestration, distributed processing, and governance all work together to support large-scale digital platforms.

As organizations continue generating massive amounts of data, understanding these foundational concepts becomes increasingly important for anyone interested in data engineering, analytics, or modern data systems.
