MJ-O

Posted on Jun 1

Foundational Concepts in Data Engineering Using an E-Commerce Platform Example

#architecture #beginners #data #dataengineering

INTRODUCTION

Data engineering is the process of collecting, moving, transforming, storing, and managing data so that it can be used for reporting, analytics, machine learning, and decision-making. Almost every modern digital platform depends heavily on data, and behind these systems are data engineers who build pipelines and architectures that ensure data flows correctly and reliably.

To better understand these concepts, we will use a real-world example throughout this article: an e-commerce platform similar to Amazon, Jumia, Alibaba, or Shopify.

In an e-commerce platform:

Customers browse products
Orders are placed
Payments are processed
Products are shipped
Notifications are sent
Reports and dashboards are generated

Every click, payment, search, review, and order generates data. As the platform grows, the amount of data becomes extremely large and complex. Data engineering helps ensure this data can be processed efficiently and used effectively.

In this article, we will explore important foundational concepts in data engineering using this e-commerce platform example.

1. BATCH VS STREAMING INGESTION

Data ingestion is the process of collecting data from different sources and moving it into a storage or processing system.

There are two main ways data is ingested:

Batch ingestion
Streaming ingestion

Batch Ingestion

Batch ingestion collects data over a period of time and then processes it together as a group, usually on a schedule.

In an e-commerce platform, not all tasks require immediate processing. Some reports and operations are performed periodically.

Examples include:

Daily sales reports
Weekly inventory reports
Monthly revenue summaries
End-of-day analytics

Suppose the company wants to calculate the total sales made during the day. Instead of calculating every transaction individually in real time, the system may collect all transactions throughout the day and process them together at midnight.

This is batch ingestion because the data is processed in batches after a certain interval.
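A minimal sketch of the midnight job described above, assuming each transaction is a record with hypothetical `date` and `amount` fields. The function name and the sample data are illustrative only.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical transactions collected during the day (field names are assumptions).
transactions = [
    {"order_id": 101, "date": "2024-06-01", "amount": Decimal("250.00")},
    {"order_id": 102, "date": "2024-06-01", "amount": Decimal("99.50")},
    {"order_id": 103, "date": "2024-06-02", "amount": Decimal("40.00")},
]

def daily_sales_totals(records):
    """Process the whole batch at once and return total sales per day."""
    totals = defaultdict(Decimal)  # Decimal() starts each day's total at 0
    for record in records:
        totals[record["date"]] += record["amount"]
    return dict(totals)

totals = daily_sales_totals(transactions)
```

Nothing is computed until the job runs, which is why batch results are always at least one interval behind the live data.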

Advantages of Batch Ingestion

Easier to manage
Efficient for historical data processing
Lower infrastructure costs
Suitable for scheduled analytics

Limitations of Batch Ingestion

Delayed updates
Not suitable for real-time monitoring
Errors may only be detected later

Streaming Ingestion

Streaming ingestion processes data continuously as it is generated.

In an e-commerce system, many activities need immediate processing.

For example:

Customers should instantly receive order confirmations
Inventory should update immediately after purchases
Fraudulent transactions should be detected quickly
Delivery tracking should update in real time

Suppose a customer buys the last item in stock. The inventory system must update immediately so other customers do not purchase unavailable products.

This requires streaming ingestion.

Streaming systems continuously process incoming events as they happen.
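The last-item-in-stock scenario can be sketched as follows. This is a simplified, single-process illustration: the event fields and the `apply_purchases` function are hypothetical, and a plain Python iterator stands in for a real event stream.

```python
def apply_purchases(events, stock):
    """Handle each purchase event as it arrives and yield the outcome immediately."""
    for event in events:
        sku, quantity = event["sku"], event["quantity"]
        if stock.get(sku, 0) >= quantity:
            stock[sku] -= quantity
            yield (event["order_id"], "confirmed")
        else:
            yield (event["order_id"], "rejected: out of stock")

# Hypothetical event stream; an iterator stands in for a message broker topic.
stock = {"lamp": 1}
events = iter([
    {"order_id": 1, "sku": "lamp", "quantity": 1},
    {"order_id": 2, "sku": "lamp", "quantity": 1},
])

results = list(apply_purchases(events, stock))
```

Because the inventory changes as soon as the first event is handled, the second customer is turned away instead of buying an item that no longer exists.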

Common technologies used include:

Apache Kafka
Apache Spark Structured Streaming
Apache Flink

Advantages of Streaming Ingestion

Real-time updates
Faster decision-making
Immediate notifications
Better customer experience

Limitations of Streaming Ingestion

More complex architecture
Higher processing requirements
More difficult to maintain

2. CHANGE DATA CAPTURE (CDC)

Change Data Capture (CDC) is a method used to track changes made to data in a database.

Instead of repeatedly copying the entire database, CDC captures only data that has changed.

This includes:

New records
Updated records
Deleted records

Example in an E-Commerce Platform

Suppose:

A customer changes their shipping address
A product price is updated
An order status changes from “Processing” to “Shipped”

Instead of copying the entire customer or orders table again, CDC captures only those changed records.
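As a rough sketch, a downstream copy of the orders table can be kept in sync by applying only the change events. The event format below (`op`, `id`, `data`) is an assumption made for illustration; real CDC tools define their own formats.

```python
def apply_change(replica, change):
    """Apply one change event to a replica keyed by primary key."""
    op, key = change["op"], change["id"]
    if op in ("insert", "update"):
        # Merge the changed fields into the existing row (or create the row).
        replica[key] = {**replica.get(key, {}), **change["data"]}
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")

# Hypothetical replica of the orders table and a short stream of changes.
replica = {
    1: {"status": "Processing", "city": "Nairobi"},
    2: {"status": "Processing", "city": "Lagos"},
}
changes = [
    {"op": "update", "id": 1, "data": {"status": "Shipped"}},
    {"op": "insert", "id": 3, "data": {"status": "Processing", "city": "Accra"}},
    {"op": "delete", "id": 2},
]
for change in changes:
    apply_change(replica, change)
```

Only three small events cross the network here, regardless of how many unchanged rows the source table holds.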

CDC is important because it:

Reduces unnecessary processing
Saves storage and bandwidth
Improves synchronization between systems
Supports near real-time analytics

Without CDC, systems would waste resources transferring unchanged data repeatedly.

3. IDEMPOTENCY

An operation is idempotent if performing it multiple times produces the same final result as performing it once.

This concept is extremely important in distributed systems where failures and retries are common.

Example in an E-Commerce Platform

Suppose:

A customer pays for an order
The payment request times out
The system retries the payment

Without idempotency:

The customer may be charged multiple times

With idempotency:

The system recognizes the repeated request as the same transaction
The payment is processed only once

This is usually achieved using:

Unique transaction IDs
Request tracking systems
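One way to picture the unique-ID approach is a service that remembers the result for each request key. The `PaymentProcessor` class and its in-memory dictionary are hypothetical; a production system would typically persist the keys in a durable store.

```python
class PaymentProcessor:
    """Toy payment service that deduplicates requests by idempotency key."""

    def __init__(self):
        self._results = {}      # idempotency key -> stored result
        self.charges_made = 0   # counts real charges, for demonstration

    def charge(self, idempotency_key, customer_id, amount):
        # A repeated key means a retry: return the stored result, charge nothing.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"customer_id": customer_id, "amount": amount, "status": "charged"}
        self._results[idempotency_key] = result
        return result

processor = PaymentProcessor()
first = processor.charge("order-101-payment", customer_id=7, amount=250)
retry = processor.charge("order-101-payment", customer_id=7, amount=250)
```

The retry receives the same response as the original request, and `charges_made` stays at 1.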

Idempotency helps:

Prevent duplicate payments
Improve reliability
Make retries safe
Protect customer trust

Large platforms depend heavily on idempotent systems to avoid costly transaction errors.

4. OLTP VS OLAP

OLTP and OLAP describe two different kinds of database workload: one for running day-to-day operations and one for analyzing them.

OLTP (Online Transaction Processing)

OLTP systems handle daily operational activities.

In an e-commerce platform, OLTP systems process:

Customer orders
Payments
Product updates
Cart additions
User logins

These systems require:

Fast response times
Real-time processing
High accuracy
Support for many users simultaneously

For example:

When a customer clicks “Buy Now,” the transaction must happen immediately

OLTP systems focus on:

Inserts
Updates
Short transactions

Examples of OLTP databases include:

MySQL
PostgreSQL
Oracle

OLAP (Online Analytical Processing)

OLAP systems are designed for analysis and reporting.

The company may use OLAP systems to analyze:

Best-selling products
Customer purchasing behavior
Revenue trends
Seasonal demand patterns
Delivery performance

OLAP systems focus on:

Complex analytical queries
Historical analysis
Large datasets
Aggregations

For example:

Management may want to compare monthly sales across different countries over several years

OLAP systems commonly use:

Data warehouses
Analytical databases
Columnar storage systems

Examples include:

Snowflake
BigQuery
Amazon Redshift

Simple Difference

OLTP runs the business
OLAP analyzes the business
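The two workloads look quite different in code. The sketch below uses Python's built-in `sqlite3` module for both, purely for demonstration; the table and values are made up, and a real platform would usually run the analytical query on a separate warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, country TEXT, amount REAL)")

# OLTP-style work: short transactions that each touch a single row.
with conn:
    conn.execute("INSERT INTO orders VALUES (101, 'KE', 250.0)")
with conn:
    conn.execute("INSERT INTO orders VALUES (102, 'NG', 100.0)")
with conn:
    conn.execute("INSERT INTO orders VALUES (103, 'KE', 50.0)")
with conn:
    conn.execute("UPDATE orders SET amount = 120.0 WHERE id = 102")

# OLAP-style work: one query that scans and aggregates many rows.
revenue_by_country = dict(
    conn.execute("SELECT country, SUM(amount) FROM orders GROUP BY country")
)
```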

5. COLUMNAR VS ROW-BASED STORAGE

Databases can store data either by rows or by columns.

The storage format affects how efficiently queries run.

Row-Based Storage

In row-based storage, all information for one record is stored together.

Example:

Order ID	Customer	Amount
101	James	250

This works well in transactional systems where entire records are frequently inserted or updated.

Advantages

Faster inserts and updates
Good for transactional workloads
Efficient for OLTP systems

Limitations

Slower analytical queries
Less efficient for reporting

Columnar Storage

In columnar storage:

All values from the same column are stored together

For example:

All product prices together
All order dates together
All customer IDs together

This works well for analytics because reports often require only a few columns.

For example:

Calculating total revenue may only require the “Amount” column
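The difference between the two layouts can be illustrated with plain Python structures. This is only a conceptual model with made-up records: real storage engines work with disk pages and compressed column files, not lists and dictionaries.

```python
# Row-based layout: each record keeps all of its fields together.
rows = [
    {"order_id": 101, "customer": "James", "amount": 250},
    {"order_id": 102, "customer": "Amina", "amount": 100},
    {"order_id": 103, "customer": "Li", "amount": 75},
]

# Columnar layout: each column keeps all of its values together.
columns = {
    "order_id": [101, 102, 103],
    "customer": ["James", "Amina", "Li"],
    "amount": [250, 100, 75],
}

# Row layout: every record is visited even though only one field is needed.
total_from_rows = sum(row["amount"] for row in rows)

# Columnar layout: only the "amount" column is read.
total_from_columns = sum(columns["amount"])
```

Both totals are identical; what changes is how much unrelated data has to be read to compute them.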

Advantages

Faster analytical queries
Better data compression
Efficient for reporting and aggregations

Limitations

Slower updates
Less suitable for transactional systems

6. PARTITIONING

Partitioning means dividing large datasets into smaller sections.

As the e-commerce platform grows, storing all orders in one massive table becomes inefficient.

The platform may partition order data based on:

Date
Country
Product category
Customer region

For example:

January orders stored separately from February orders

If analysts only need January sales data, the system reads only the January partition instead of scanning the entire orders table.
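A small sketch of date-based partitioning, assuming orders carry a hypothetical `order_date` string in `YYYY-MM-DD` format. The partition key is simply the year and month.

```python
from collections import defaultdict

orders = [
    {"order_id": 1, "order_date": "2024-01-15", "amount": 40},
    {"order_id": 2, "order_date": "2024-01-28", "amount": 60},
    {"order_id": 3, "order_date": "2024-02-03", "amount": 25},
]

def partition_by_month(records):
    """Group orders into partitions keyed by 'YYYY-MM'."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record["order_date"][:7]].append(record)
    return partitions

partitions = partition_by_month(orders)

# A January query touches only the January partition and skips the rest.
january_sales = sum(order["amount"] for order in partitions["2024-01"])
```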

Partitioning improves:

Query performance
Data management
Processing efficiency

Partitioning is especially important in systems handling millions or billions of records.

7. ETL VS ELT

ETL and ELT are approaches used to move and process data.

ETL (Extract, Transform, Load)

In ETL:

Data is extracted
Data is transformed
Data is loaded

The transformation happens before the data is loaded into the target system.

For example:

Product data from different suppliers may have inconsistent formats
Data engineers clean and standardize the data before loading it into a warehouse

ETL ensures:

Clean data enters the system
Better data quality
Easier reporting

ETL is commonly used in traditional data systems.

ELT (Extract, Load, Transform)

In ELT:

Data is extracted
Data is loaded immediately
Transformation happens later

For example:

Raw customer activity logs are stored first
Cleaning happens later, inside the storage system, when the data is needed for analysis

ELT provides:

More flexibility
Faster loading
Access to raw data

ELT is common in modern cloud systems such as:

Snowflake
BigQuery
Databricks
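To contrast the two approaches, here is a sketch in which the same cleaning function is applied at different points. The record fields and the `clean` function are hypothetical, and plain lists stand in for warehouse tables.

```python
def clean(record):
    """Standardize a raw supplier record (the 'transform' step)."""
    return {
        "name": record["name"].strip().title(),
        "price": round(float(record["price"]), 2),
    }

raw_records = [
    {"name": "  wireless MOUSE ", "price": "19.990"},
    {"name": "usb cable", "price": "4.5"},
]

# ETL: transform first, so only cleaned records reach the warehouse.
etl_warehouse = [clean(r) for r in raw_records]

# ELT: load the raw records as-is, then transform later inside the warehouse.
elt_raw_table = list(raw_records)
elt_clean_view = [clean(r) for r in elt_raw_table]
```

Both paths end with the same cleaned data. The difference is that ELT also keeps `elt_raw_table`, so the transformation can be changed and rerun later without extracting from the suppliers again.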

8. CAP THEOREM

CAP Theorem explains a limitation in distributed systems.

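It states that a distributed system cannot guarantee all three of the following properties at the same time. In practice, network partitions cannot be ruled out, so when one occurs the system must choose between consistency and availability: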

Consistency
Availability
Partition Tolerance

Consistency

Every read returns the most recent write, so all users see the same data no matter which server they reach.

Example:

If a product goes out of stock, all customers should immediately see the updated inventory.

Availability

Every request to a working server receives a response rather than an error or a timeout.

Even if some servers fail, customers should still access the platform.

Partition Tolerance

The system continues operating even when network failures prevent some servers from communicating with each other.

Example in an E-Commerce Platform

Suppose there is a network issue between servers.

The platform may prioritize:

Keeping the website available
Continuing customer purchases

In that case, inventory updates may be briefly delayed on some servers, which means the platform is favoring availability over consistency.

Different systems choose different trade-offs depending on business priorities.
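The trade-off can be made concrete with a toy model of two inventory replicas. Everything here is a simplification made up for illustration: the `Cluster` class, the `CP` and `AP` mode labels, and the single `partitioned` flag are not taken from any real database.

```python
class Replica:
    def __init__(self, stock):
        self.stock = stock

class Cluster:
    """Two replicas of one inventory count; mode is 'CP' or 'AP'."""

    def __init__(self, mode, stock):
        self.mode = mode
        self.a, self.b = Replica(stock), Replica(stock)
        self.partitioned = False

    def purchase_on_a(self):
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse the write rather than let replicas diverge.
            raise RuntimeError("unavailable during partition")
        self.a.stock -= 1
        if not self.partitioned:
            self.b.stock = self.a.stock  # replicate while the network is healthy

ap = Cluster("AP", stock=5)
ap.partitioned = True
ap.purchase_on_a()          # accepted, but replica b is now stale

cp = Cluster("CP", stock=5)
cp.partitioned = True
try:
    cp.purchase_on_a()
    cp_accepted = True
except RuntimeError:
    cp_accepted = False     # rejected, so both replicas still agree
```

The availability-first cluster keeps selling but its replicas disagree until the partition heals, while the consistency-first cluster keeps its replicas in agreement by turning the customer away.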

9. WINDOWING IN STREAMING

Streaming data arrives continuously and never ends, so totals and averages must be computed over bounded slices of the stream.

Windowing divides streaming data into smaller time-based groups.

Examples:

Orders every 5 minutes
Active users every 10 minutes
Hourly revenue monitoring

Types of Windows

Tumbling Windows

Fixed non-overlapping windows.

Example:

Total sales calculated every 10 minutes
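A tumbling window can be sketched by rounding each event's timestamp down to the start of its window. Timestamps are assumed to be plain seconds since the stream started, and the sample sales are made up.

```python
from collections import defaultdict

def tumbling_totals(events, window_seconds):
    """Sum amounts per fixed, non-overlapping window.

    events: iterable of (timestamp_in_seconds, amount) pairs.
    Returns {window_start: total}.
    """
    totals = defaultdict(int)
    for timestamp, amount in events:
        window_start = (timestamp // window_seconds) * window_seconds
        totals[window_start] += amount
    return dict(totals)

# Hypothetical sales: (seconds since the stream started, amount).
sales = [(30, 20), (590, 5), (600, 10), (1250, 40)]
totals = tumbling_totals(sales, window_seconds=600)  # 10-minute windows
```

Each sale lands in exactly one window, so the window totals always add up to the overall total.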

Sliding Windows

Fixed-size windows that overlap, because each new window starts before the previous one ends.

Example:

Average purchases over the last 30 minutes recalculated every 5 minutes
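The 30-minute average recalculated every 5 minutes could look roughly like this. The function rescans the event list on every evaluation for clarity; it is a sketch of the idea, not of how a streaming engine maintains window state.

```python
def sliding_averages(events, window_seconds, slide_seconds, end_time):
    """Average amount over the trailing window, evaluated every slide interval.

    events: list of (timestamp_in_seconds, amount) pairs.
    Returns a list of (evaluation_time, average or None if the window is empty).
    """
    results = []
    for now in range(slide_seconds, end_time + 1, slide_seconds):
        in_window = [a for t, a in events if now - window_seconds < t <= now]
        average = sum(in_window) / len(in_window) if in_window else None
        results.append((now, average))
    return results

# Hypothetical purchases: (seconds since the stream started, amount).
purchases = [(100, 10), (400, 30), (2000, 50)]
averages = sliding_averages(
    purchases, window_seconds=1800, slide_seconds=300, end_time=2100
)
```

Unlike a tumbling window, the same purchase contributes to several consecutive results until it falls out of the trailing 30 minutes.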

Session Windows

Windows that group a user's events together and close after a period of inactivity.

Example:

A shopping session ends after inactivity
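Session windows have no fixed length, so they are built by looking at the gap between consecutive events. The 30-minute inactivity gap below is an assumed value chosen for the example.

```python
def session_windows(timestamps, gap_seconds):
    """Group activity timestamps into sessions separated by inactivity gaps."""
    sessions = []
    for timestamp in sorted(timestamps):
        if sessions and timestamp - sessions[-1][-1] <= gap_seconds:
            sessions[-1].append(timestamp)   # still within the current session
        else:
            sessions.append([timestamp])     # inactivity gap exceeded: new session
    return sessions

# Hypothetical click times in seconds; 1800 s of inactivity ends a session.
clicks = [0, 120, 300, 4000, 4100]
sessions = session_windows(clicks, gap_seconds=1800)
```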

Windowing helps organize continuous streams into manageable sections.

10. DAGS AND WORKFLOW ORCHESTRATION

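A DAG (Directed Acyclic Graph) represents a workflow as tasks connected by one-way dependencies, with no cycles, so there is always a valid order in which to run the tasks.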

An e-commerce reporting pipeline may:

Extract order data
Clean the data
Store it in a warehouse
Generate dashboards
Send reports to management

Each task depends on the one before it, so it can only start after the earlier task has succeeded.

Tools like Apache Airflow automate these workflows.
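The core idea behind an orchestrator, working out a valid run order from declared dependencies, can be sketched with Python's standard `graphlib` module. This is not Airflow code, and the task names are hypothetical labels for the reporting steps listed above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks that must finish before it starts.
pipeline = {
    "extract_orders": set(),
    "clean_data": {"extract_orders"},
    "load_warehouse": {"clean_data"},
    "build_dashboards": {"load_warehouse"},
    "send_reports": {"build_dashboards"},
}

run_order = list(TopologicalSorter(pipeline).static_order())
```

If a dependency loop were introduced, `TopologicalSorter` would raise a `CycleError`, which is why workflows are required to be acyclic.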

Why DAGs Are Important

Automate workflows
Manage dependencies
Schedule tasks
Monitor failures
Improve reliability

Without orchestration tools, managing large pipelines manually becomes difficult.

11. RETRY LOGIC AND DEAD LETTER QUEUES

Distributed systems frequently experience failures.

Examples include:

Network timeouts
Payment API failures
Invalid records

Retry Logic

Retry logic allows systems to attempt failed operations again automatically.

Example:

A payment gateway temporarily fails
The system retries after a few seconds

This improves reliability and reduces manual intervention.
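A minimal retry helper might look like the sketch below. The exponential backoff (doubling the wait after each failure) is a common choice but an assumption here, as are the `retry` function, its parameters, and the simulated gateway.

```python
import time

def retry(operation, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation until it succeeds or max_attempts is reached.

    Waits base_delay, then 2x, 4x, ... between attempts (exponential backoff).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))

# Hypothetical gateway that fails twice and then succeeds.
calls = {"count": 0}

def flaky_gateway():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("gateway temporarily unavailable")
    return "payment accepted"

delays = []
result = retry(flaky_gateway, sleep=delays.append)  # record delays instead of sleeping
```

Retries are only safe when the retried operation is idempotent; otherwise a request that actually succeeded before timing out could be applied twice.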

Dead Letter Queues (DLQs)

Sometimes records repeatedly fail processing.

Example:

An order record contains corrupted product information

Instead of crashing the entire pipeline:

The failed record is moved into a Dead Letter Queue

This helps:

Isolate problematic data
Prevent system failures
Improve debugging
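The following sketch shows the pattern with an in-memory list standing in for the queue. The `process_orders` and `price_order` functions and the record fields are hypothetical.

```python
def process_orders(orders, handler, max_attempts=3):
    """Process each order; move orders that keep failing to a dead letter queue."""
    processed, dead_letter_queue = [], []
    for order in orders:
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(order))
                break
            except ValueError as error:
                if attempt == max_attempts:
                    # Keep the record and the reason so it can be debugged later.
                    dead_letter_queue.append({"order": order, "error": str(error)})
    return processed, dead_letter_queue

def price_order(order):
    """Hypothetical handler that rejects corrupted product information."""
    if not isinstance(order.get("price"), (int, float)):
        raise ValueError("corrupted price")
    return order["id"], order["price"] * order["quantity"]

orders = [
    {"id": 1, "price": 10.0, "quantity": 2},
    {"id": 2, "price": "???", "quantity": 1},
    {"id": 3, "price": 5.0, "quantity": 4},
]
processed, dlq = process_orders(orders, price_order)
```

The corrupted record is set aside after three attempts, and the orders before and after it are still processed.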

12. BACKFILLING AND REPROCESSING

Data pipelines sometimes fail or produce incorrect outputs.

Backfilling

Backfilling means loading or processing historical data that is missing from the target system.

Example:

Sales records failed to load for several hours
Missing records need recovery
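A backfill usually starts by working out which time slots are missing. The sketch below assumes data is loaded in hourly slots; the `load_hour` callback is a hypothetical stand-in for whatever actually loads one hour of sales.

```python
from datetime import datetime, timedelta

def missing_hours(loaded_hours, start, end):
    """Return every hourly slot in [start, end) that has no loaded data."""
    missing = []
    current = start
    while current < end:
        if current not in loaded_hours:
            missing.append(current)
        current += timedelta(hours=1)
    return missing

def backfill(loaded_hours, start, end, load_hour):
    """Load only the missing hours, leaving existing data untouched."""
    for hour in missing_hours(loaded_hours, start, end):
        load_hour(hour)
        loaded_hours.add(hour)

# Hypothetical day in which the 09:00, 10:00, and 11:00 loads failed.
day = datetime(2024, 6, 1)
loaded = {day + timedelta(hours=h) for h in range(24) if h not in (9, 10, 11)}
recovered = []
backfill(loaded, day, day + timedelta(days=1), load_hour=recovered.append)
```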

Reprocessing

Reprocessing means running a pipeline again over data that was already processed, usually because the logic has changed.

Example:

The company changes how discounts are calculated
Historical reports must be recalculated
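Reprocessing is easier when the calculation is a function that can be swapped and rerun over the stored history. Both discount rules below are invented for the example.

```python
def net_v1(order):
    """Original logic (hypothetical): flat 10% discount."""
    return order["amount"] * 90 / 100

def net_v2(order):
    """Updated logic (hypothetical): 20% off orders of 200 or more, else 10%."""
    percent = 20 if order["amount"] >= 200 else 10
    return order["amount"] * (100 - percent) / 100

def build_report(orders, pricing):
    """Recompute net revenue for all historical orders using the given logic."""
    return sum(pricing(order) for order in orders)

historical_orders = [{"amount": 250}, {"amount": 100}]
old_report = build_report(historical_orders, net_v1)
new_report = build_report(historical_orders, net_v2)
```

This only works if the original order amounts were kept, which is one reason pipelines often retain raw data alongside the derived reports.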

These processes help maintain accurate historical data.

13. DATA GOVERNANCE

Data governance is the set of policies, roles, and controls that keep data accurate, secure, private, and accessible only to the right people.

An e-commerce platform handles sensitive data such as:

Customer names
Addresses
Payment details
Purchase history

Good governance ensures:

Data quality
Security
Privacy protection
Controlled access

For example:

Only authorized employees should access customer payment information
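Controlled access can be sketched as a mapping from roles to the fields they may see. The roles, field names, and masking behavior here are all hypothetical; real systems often rely on the access controls built into the database or warehouse.

```python
# Hypothetical role-to-permission mapping, for illustration only.
ROLE_PERMISSIONS = {
    "finance": {"name", "address", "payment_details", "purchase_history"},
    "support": {"name", "address", "purchase_history"},
    "analyst": {"purchase_history"},
}

def view_customer(record, role):
    """Return only the fields the role may see; mask everything else."""
    allowed = ROLE_PERMISSIONS.get(role, set())
    return {
        field: (value if field in allowed else "***")
        for field, value in record.items()
    }

customer = {
    "name": "James",
    "address": "12 Market Street",
    "payment_details": "card ending 4242",
    "purchase_history": ["order 101"],
}
support_view = view_customer(customer, "support")
```

An unknown role gets an empty permission set, so every field is masked by default.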

Poor governance may lead to:

Data leaks
Compliance violations
Incorrect reporting

14. TIME TRAVEL AND DATA VERSIONING

Time travel allows systems to access older versions of data.

For example:

Recover deleted product information
Audit historical pricing changes
Compare previous inventory records

Data versioning tracks how datasets change over time.

These features are useful for:

Recovery
Auditing
Debugging
Historical analysis

Modern platforms such as Snowflake and Delta Lake support time travel functionality.
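The idea can be illustrated with a toy table that keeps a snapshot after every commit. The `VersionedTable` class is hypothetical and stores full copies for simplicity, which real platforms generally avoid.

```python
import copy

class VersionedTable:
    """Toy table that keeps a snapshot of its rows after every commit."""

    def __init__(self):
        self._versions = [{}]  # version 0 is the empty table

    def commit(self, changes, deletes=()):
        """Apply upserts and deletes to the latest version; return the new version number."""
        rows = copy.deepcopy(self._versions[-1])
        rows.update(changes)
        for key in deletes:
            rows.pop(key, None)
        self._versions.append(rows)
        return len(self._versions) - 1

    def read(self, version=None):
        """Read the latest version, or an older one if a version number is given."""
        index = len(self._versions) - 1 if version is None else version
        return copy.deepcopy(self._versions[index])

products = VersionedTable()
v1 = products.commit({"lamp": {"price": 30}})
v2 = products.commit({"lamp": {"price": 25}})
v3 = products.commit({}, deletes=["lamp"])
```

After the delete, `products.read()` no longer contains the lamp, but `products.read(version=v1)` still returns it at its original price, which covers both the recovery and the pricing audit cases.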

15. DISTRIBUTED PROCESSING CONCEPTS

Large e-commerce platforms generate huge amounts of data every second.

A single computer cannot efficiently process:

Millions of customer clicks
Product searches
Payment transactions
Delivery tracking data

Distributed processing solves this problem by spreading work across multiple machines.

Frameworks such as Apache Spark support distributed processing.

Important Distributed Processing Concepts

Parallel Processing

Multiple tasks run simultaneously.

Example:

Processing orders from different countries at the same time
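On a single machine, the same pattern can be imitated with a pool of workers. In this sketch, threads stand in for the separate machines of a cluster, and the order amounts are made up.

```python
from concurrent.futures import ThreadPoolExecutor

orders_by_country = {
    "KE": [250, 100],
    "NG": [75, 25, 50],
    "GH": [40],
}

def total_for_country(item):
    """Work unit: total the orders for one country."""
    country, amounts = item
    return country, sum(amounts)

# Each country is handled by a worker; the results are combined at the end.
with ThreadPoolExecutor(max_workers=3) as pool:
    totals = dict(pool.map(total_for_country, orders_by_country.items()))
```

The work splits cleanly because no country's total depends on another's, which is the property that makes a job easy to distribute.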

Cluster

A group of computers working together.

Scalability

The ability to handle increasing workloads.

Example:

Adding more servers during Black Friday sales

Fault Tolerance

The system continues functioning even if one machine fails.

This is important because large online stores must remain available continuously.

CONCLUSION

Modern data engineering systems rely on many foundational concepts to ensure data pipelines are scalable, reliable, efficient, and secure.

Using an e-commerce platform example makes it easier to understand how these concepts work in real-world systems used by millions of people daily. Concepts such as streaming ingestion, ETL, partitioning, orchestration, distributed processing, and governance all work together to support large-scale digital platforms.

As organizations continue generating massive amounts of data, understanding these foundational concepts becomes increasingly important for anyone interested in data engineering, analytics, or modern data systems.
