John Otienoh

Posted on Jun 13

Data Engineering Described

#dataengineering #data #intoductiontodataengineering

Source: Fundamentals of Data Engineering by Joe Reis and Matt Housley, published by O'Reilly Media.

This article summarizes and interprets key concepts from Chapter 1: Data Engineering Described. It is not a reproduction of the original text but a study guide and learning resource based on the chapter.

Data engineering has become one of the most critical disciplines in modern technology organizations. Every dashboard, machine learning model, business report, and analytical insight depends on reliable data pipelines built and maintained by data engineers.

What Is Data Engineering?

Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning.

Data engineering sits at the intersection of several disciplines:

Data Architecture
Data Management
Security
DataOps
Orchestration
Software Engineering

A data engineer's responsibility extends from collecting data from source systems to making that data available for analytics, reporting, and machine learning applications.

The Data Engineering Lifecycle

Data engineering is a lifecycle rather than a collection of isolated tools.

The lifecycle consists of five major stages:

Generation → Storage → Ingestion → Transformation → Serving
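
As a rough illustration, the five stages can be sketched as a chain of small Python functions, each handing its output to the next. Everything here is hypothetical: the function names, the sample events, and the in-memory list that stands in for a storage system. Real platforms spread each stage across dedicated tools.

```python
import json


def generate():
    # Generation: a source system emits raw events (hard-coded sample data).
    return [
        '{"user": "a", "amount": "10.5"}',
        '{"user": "b", "amount": "4"}',
        '{"user": "a", "amount": "2.5"}',
    ]


def store(raw_events, storage):
    # Storage: persist the raw events; a plain list stands in for a real store.
    storage.extend(raw_events)
    return storage


def ingest(storage):
    # Ingestion: read the stored raw events and parse them into records.
    return [json.loads(line) for line in storage]


def transform(records):
    # Transformation: cast types and aggregate into an analysis-ready shape.
    totals = {}
    for rec in records:
        totals[rec["user"]] = totals.get(rec["user"], 0.0) + float(rec["amount"])
    return totals


def serve(totals):
    # Serving: hand the result to downstream consumers as sorted rows.
    return sorted(totals.items())


storage = []
result = serve(transform(ingest(store(generate(), storage))))
print(result)
```

With the sample events above, this prints [('a', 13.0), ('b', 4.0)]. The point of the sketch is only that each stage has one job and a clear hand-off to the next.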

The Undercurrents of Data Engineering

While the lifecycle represents the flow of data, several concepts, called undercurrents, influence every stage:

Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering

These disciplines are not separate activities; they support and shape the entire lifecycle.

For example:

Security protects data assets.
DataOps improves operational efficiency.
Architecture ensures scalability.
Software engineering principles improve reliability and maintainability.
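
Orchestration, one of the undercurrents listed above, largely comes down to running tasks in an order that respects their dependencies. The sketch below shows that idea with Python's standard-library graphlib module (Python 3.9+). The task names are hypothetical, and a production orchestrator would add scheduling, retries, and monitoring on top of this.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
# Task names are made up for illustration.
dependencies = {
    "join_orders_customers": {"extract_orders", "extract_customers"},
    "publish_report": {"join_orders_customers"},
}


def run_order(deps):
    # static_order() yields every task only after all of its dependencies.
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

The two extract tasks have no dependency on each other, so either may come first. The join always follows both, and the report is always last.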

The Evolution of the Data Engineer

The Data Warehousing Era (1980s–2000s)

The first major era of data engineering emerged with data warehousing.

Organizations began building centralized repositories for analytics using:

Enterprise Data Warehouses
Massively Parallel Processing (MPP) systems
Business Intelligence platforms

This period introduced large-scale analytical processing capabilities.

The Big Data Era (2000s–2010s)

As data volumes exploded, traditional systems struggled to scale.

New technologies emerged:

Hadoop
Spark
Distributed storage systems
Cloud-based data platforms

The role evolved into the "Big Data Engineer," focused on processing massive datasets efficiently.

The Modern Data Engineering Era

Today's data engineers work across:

Cloud platforms
Data lakes
Streaming systems
Machine learning infrastructure
Real-time analytics

The profession now emphasizes business value as much as technical implementation.

Data Engineering vs. Data Science

One of the most important distinctions made in Chapter 1 is that data engineering and data science are separate disciplines.

Data Engineers

Focus on:

Building pipelines
Managing infrastructure
Ensuring data quality
Scaling systems
Delivering reliable datasets

Data Scientists

Focus on:

Analysis
Experimentation
Statistical modeling
Machine learning
Business insights

In simple terms:

Data Engineers → Build the foundation

Data Scientists → Create value from that foundation

Data engineering sits upstream, providing the inputs necessary for data science and analytics.

Understanding Data Maturity

Data maturity reflects how effectively an organization leverages data as a strategic asset.

Importantly, maturity is not determined by company age or size.

A startup can be more data mature than a century-old enterprise if it uses data more effectively.

Stage	Primary Focus	Key Activities	Best Practices	Risks / Pitfalls
1. Starting with Data	Establish a data foundation aligned with business goals	Define data architecture; identify and audit relevant data sources; build foundational data systems; enable future analytics and ML use cases	Secure executive sponsorship; deliver quick wins to demonstrate value; engage business stakeholders frequently; use off-the-shelf solutions where possible; build custom solutions only for competitive advantage	Lack of visible business impact reduces support; technical debt from rapid delivery; working in silos without stakeholder feedback; overengineering and unnecessary complexity
2. Scaling with Data	Create scalable, repeatable, and operational data practices	Formalize data engineering processes; build scalable architectures; implement DevOps and DataOps practices; develop ML-ready infrastructure	Prioritize simplicity and maintainability; focus on team productivity and scalability; select technologies based on business value; educate the organization on data usage	Chasing trendy technologies without ROI; overcomplicating infrastructure; treating technology as the bottleneck instead of team capacity; focusing on technical prestige instead of business outcomes
3. Leading with Data	Use data as a strategic competitive advantage across the organization	Automate data onboarding and usage; build proprietary data products; implement governance, quality, and metadata management; deploy data catalogs and lineage tools; foster cross-functional collaboration	Invest in DataOps and governance; promote transparency and collaboration; share data broadly across teams; build custom systems only when they create measurable advantage	Organizational complacency; neglecting maintenance and continuous improvement; pursuing expensive technology projects with little business value; overengineering custom solutions without strategic benefit

The Background and Skills of a Data Engineer

Business Responsibilities of a Data Engineer

Technical ability alone is not sufficient.
Data engineers also carry several critical business responsibilities.

Communication

Data engineers must communicate effectively with:

Executives
Product managers
Analysts
Data scientists
Software engineers

Trust and collaboration are essential.

Requirements Gathering

Engineers must understand:

What stakeholders need
Why they need it
How data solutions support business goals

Understanding Agile, DevOps, and DataOps

These are cultural practices as much as technical methodologies.

Successful implementation requires organizational alignment, not just tooling.

Cost Management

Data engineers should optimize:

Time-to-value
Total cost of ownership
Opportunity cost

Monitoring cloud and infrastructure spending is a core responsibility.

Continuous Learning

The data ecosystem evolves rapidly.

Strong data engineers:

Learn continuously
Evaluate new technologies critically
Distinguish innovations from hype

Technical Responsibilities of a Data Engineer

SQL

SQL is the most common interface for databases and data lakes.

It is used for:

Querying data
Transformations
Analytics
Data warehousing
Data lakehouse operations

Despite the rise of big data technologies, SQL continues to be the dominant language of data.
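
To make the querying and transformation uses concrete, here is a small sketch that runs an aggregation query against an in-memory SQLite database through Python's built-in sqlite3 module. The orders table and its rows are invented for illustration, and SQLite stands in for whatever warehouse or lakehouse engine a team actually uses.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 20.0), ("bob", 5.0), ("alice", 7.5)],
)

# A typical transformation: roll raw orders up to one row per customer.
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total_spent, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer
    ORDER BY total_spent DESC
    """
).fetchall()
conn.close()
print(rows)
```

With the sample rows above, the query returns [('alice', 27.5, 2), ('bob', 5.0, 1)]. The same GROUP BY pattern carries over to most SQL engines with little or no change.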

Python

Python serves as a bridge between data engineering and data science.

Popular tools include:

Pandas
NumPy
Airflow
PySpark
TensorFlow
PyTorch
Scikit-learn

Python excels at automation, orchestration, and integration.
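
As one small example of the automation and integration role, the sketch below converts CSV text into JSON Lines using only the standard library. The column names (id, amount), the rule of skipping rows without an id, and the sample data are assumptions made for illustration.

```python
import csv
import io
import json


def csv_to_json_lines(csv_text):
    """Convert CSV text to JSON Lines, skipping rows that have no id."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        if not row["id"]:
            continue  # drop incomplete records instead of passing them on
        row["amount"] = float(row["amount"])  # cast text to a numeric type
        lines.append(json.dumps(row))
    return "\n".join(lines)


sample = "id,amount\n1,9.99\n,3.00\n2,5\n"
output = csv_to_json_lines(sample)
print(output)
```

The second sample row has an empty id and is dropped, so two JSON records come out. Glue code like this often sits between a source export and a downstream loader.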

Java and Scala

These languages are commonly used in large-scale distributed systems such as:

Apache Spark
Apache Hive
Apache Druid

They often provide greater performance and lower-level access than Python APIs.

Bash

Command-line skills remain valuable for:

Automation
File processing
System administration
Data pipeline operations

Tools like awk, sed, and shell scripting continue to play an important role in production environments.

Key Takeaways

Chapter 1 establishes several foundational ideas:

Data engineering transforms raw data into usable information.
The data engineering lifecycle includes generation, storage, ingestion, transformation, and serving.
The undercurrents (security, data management, DataOps, data architecture, orchestration, and software engineering) influence every stage of the lifecycle.
Data engineering and data science are complementary but distinct disciplines.
Data maturity reflects how effectively organizations use data as a strategic asset.
Successful data engineers combine technical expertise with business understanding.
SQL, Python, JVM languages, and Bash remain foundational technical skills.
The ultimate goal of data engineering is delivering business value through reliable and scalable data systems.

Reference

Reis, J., & Housley, M. (2022). Fundamentals of Data Engineering: Plan and Build Robust Data Systems. O'Reilly Media. Chapter 1: Data Engineering Described.
