Agustín José Mazzeo

Posted on May 29

What Does a Data Engineer Do in Production (No Hype)

#dataengineering #etl #analitycsengineering

If you learned Data Engineering with notebooks and clean datasets, this article is for you. In production there are no clean datasets: there are systems that change, pipelines that fail, and data that has to be correct every single day.

TL;DR

A Data Engineer in production:

Builds and maintains reliable pipelines
Ensures data quality (doesn’t “wait” for good data)
Designs for consumption (BI, ML, APIs)
Operates: monitors, debugs, and reprocesses
Makes real trade-off decisions: cost, performance, risk

Problem

In theory:

“Extract data, transform it, and load it into a data warehouse”

In production:

Data arrives incomplete or late
APIs fail or change schemas
Jobs that finish with an “OK” status can still produce incorrect data
Dashboards depend on you

👉 Result: the job is not just building, it’s operating living data systems

Explanation

A Data Engineer builds and operates systems that turn chaotic data into reliable data for the business.

It’s not just ETL.

It’s:

pipeline design
data quality enforcement
continuous operations
architectural decision-making

Once you understand the problem, the work breaks down into these layers:

1. Ingestion (unstable sources)

What it involves:

Integrating APIs, databases, events
Handling errors and retries
Detecting schema changes

Example:


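# Columns that downstream steps rely on; a list keeps the column order stable
expected = ["order_id", "user_id", "amount"]

# Add any missing column as nulls so a dropped field does not crash the pipeline
for col in expected:
    if col not in df.columns:
        df[col] = None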

👉 Defensive design, not perfect data
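
The retry and schema-detection bullets above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production client: `with_retries`, `schema_drift`, and the flaky source are hypothetical names, and the attempt count and delays are arbitrary.

```python
import time


def with_retries(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:  # a real client would catch only transient errors
            # Out of attempts: surface the error instead of hiding it
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def schema_drift(expected, actual):
    """Report columns that disappeared and columns that showed up unannounced."""
    expected, actual = set(expected), set(actual)
    return {"missing": sorted(expected - actual),
            "unexpected": sorted(actual - expected)}


# Hypothetical source that fails twice before answering
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary outage")
    return [{"order_id": 1, "user_id": 7, "total": 30.0}]


rows = with_retries(flaky_fetch, sleep=lambda seconds: None)  # no real waiting in the demo
drift = schema_drift({"order_id", "user_id", "amount"}, rows[0].keys())
print(drift)
```

Reporting the drift, instead of only filling the gap with nulls, is what lets you notice that the source renamed a field.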

2. Transformation

What it involves:

Cleaning and deduplication
Business logic
Performance

Example:


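-- Keep only the most recent row per user
SELECT *
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
  FROM raw_users
) AS ranked  -- some engines require an alias on a subquery in FROM
WHERE rn = 1;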

👉 Key decision: correctness vs performance
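
For a batch that fits in memory, the same keep-the-latest rule can be sketched in Python. This is only an illustration under assumptions: `latest_per_key` and the sample rows are hypothetical, and timestamps are ISO date strings so that string comparison matches date order. It makes the tie-breaking rule explicit, which the SQL above leaves to the engine when two rows share the same `updated_at`.

```python
def latest_per_key(rows, key="user_id", order_by="updated_at"):
    """Keep one row per key: the one with the highest order_by value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # ">=" means that on a tie the row seen last wins, which is deterministic
        if current is None or row[order_by] >= current[order_by]:
            latest[row[key]] = row
    return list(latest.values())


raw_users = [
    {"user_id": 1, "email": "old@example.com", "updated_at": "2024-01-01"},
    {"user_id": 2, "email": "b@example.com", "updated_at": "2024-01-03"},
    {"user_id": 1, "email": "new@example.com", "updated_at": "2024-02-01"},
]
deduped = latest_per_key(raw_users)
```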

3. Modeling

What it involves:

Designing for BI or ML

Example:

Dashboard → aggregated table
ML → detailed events

👉 The right model depends on how the data will be consumed
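
To make the difference concrete, here is a small sketch that shapes the same events two ways. The sample events and variable names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical order events, one row per order
events = [
    {"order_id": 1, "user_id": 7, "order_date": "2024-05-01", "total_amount": 30.0},
    {"order_id": 2, "user_id": 9, "order_date": "2024-05-01", "total_amount": 12.5},
    {"order_id": 3, "user_id": 7, "order_date": "2024-05-02", "total_amount": 8.0},
]

# BI shape: one row per day, small and quick to query
daily_revenue = defaultdict(float)
for event in events:
    daily_revenue[event["order_date"]] += event["total_amount"]

# ML shape: the detailed events are kept, so a feature such as
# "orders per user" can still be derived later
orders_per_user = defaultdict(int)
for event in events:
    orders_per_user[event["user_id"]] += 1
```

Once data is aggregated to the daily level, the per-user feature can no longer be rebuilt from it, which is why the two consumers need different tables.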

4. Consumption

Consumers:

BI
ML
APIs

👉 Changing a schema breaks downstream consumers → you need data contracts
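
One way to picture a contract is as a small schema that a proposed change gets checked against before it ships. The sketch below is an assumption about how that could look, not a standard tool: `CONTRACT_V1`, `breaking_changes`, and the column names are hypothetical.

```python
# Hypothetical contract: column name -> type promised to consumers
CONTRACT_V1 = {"order_id": "int", "order_date": "date", "revenue": "float"}


def breaking_changes(old, new):
    """List changes that would break a consumer built against `old`."""
    problems = []
    for column, col_type in old.items():
        if column not in new:
            problems.append(f"removed column: {column}")
        elif new[column] != col_type:
            problems.append(f"type changed: {column} {col_type} -> {new[column]}")
    # Columns that exist only in `new` are additive, so they are not reported
    return problems


proposed = {"order_id": "int", "order_date": "date",
            "revenue_usd": "float", "currency": "str"}
problems = breaking_changes(CONTRACT_V1, proposed)
print(problems)
```

Note that a rename shows up as a removed column: from the consumer's side, that is exactly what it is.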

5. Operation (the most important part)

What it involves:

Alerts
Debugging
Reprocessing

👉 This is where the real work happens
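
A simple alert usually starts with two questions: is the table fresh, and does it have rows? Here is a minimal sketch of that check. The function name, thresholds, and timestamps are assumptions for illustration; in a real setup the messages would go to your alerting channel instead of `print`.

```python
from datetime import datetime, timedelta, timezone


def check_table_health(last_loaded_at, row_count, now,
                       max_age=timedelta(hours=24), min_rows=1):
    """Return alert messages; an empty list means the table looks healthy."""
    alerts = []
    if now - last_loaded_at > max_age:
        alerts.append(f"stale: last load at {last_loaded_at.isoformat()}")
    if row_count < min_rows:
        alerts.append(f"low volume: {row_count} rows, expected at least {min_rows}")
    return alerts


now = datetime(2024, 5, 3, 9, 0, tzinfo=timezone.utc)
alerts = check_table_health(
    last_loaded_at=datetime(2024, 5, 1, 23, 0, tzinfo=timezone.utc),
    row_count=0,
    now=now,
)
for message in alerts:
    print("ALERT:", message)
```

This kind of check catches the job that finished with a success status but loaded nothing.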

Practical Example

E-commerce pipeline:

Source


# fetch_api and read_stream stand in for your own source connectors
orders = fetch_api("/orders")
events = read_stream("user_events")

Raw


-- Land the source rows unchanged so they can be reprocessed later
INSERT INTO raw_orders
SELECT *
FROM api_orders;

Curated


-- Keep the columns consumers need and drop rows without a key
SELECT
  order_id,
  user_id,
  order_date,
  total_amount
FROM raw_orders
WHERE order_id IS NOT NULL;

Serving


-- Daily revenue, ready for a dashboard
SELECT
  order_date,
  SUM(total_amount) AS revenue
FROM curated_orders
GROUP BY order_date;

Consumption

Dashboard
ML

👉 The pipeline ends when someone actually uses the data
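
To try these layers end to end on a laptop, here is a sketch that uses Python's built-in `sqlite3` in place of a warehouse. The sample rows are made up, and `api_orders` is filled by hand where a real pipeline would call the API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
columns = "(order_id INTEGER, user_id INTEGER, order_date TEXT, total_amount REAL)"
conn.execute(f"CREATE TABLE api_orders {columns}")
conn.execute(f"CREATE TABLE raw_orders {columns}")

# Made-up source rows, including one with a missing order_id
conn.executemany(
    "INSERT INTO api_orders VALUES (?, ?, ?, ?)",
    [(1, 7, "2024-05-01", 30.0),
     (2, 9, "2024-05-01", 12.5),
     (None, 9, "2024-05-02", 99.0),
     (3, 7, "2024-05-02", 8.0)],
)

# Raw: land the source as-is
conn.execute("INSERT INTO raw_orders SELECT * FROM api_orders")

# Curated: keep only rows that pass the basic rule
conn.execute("""
    CREATE TABLE curated_orders AS
    SELECT order_id, user_id, order_date, total_amount
    FROM raw_orders
    WHERE order_id IS NOT NULL
""")

# Serving: the aggregate a dashboard would read
revenue = conn.execute("""
    SELECT order_date, SUM(total_amount) AS revenue
    FROM curated_orders
    GROUP BY order_date
    ORDER BY order_date
""").fetchall()
print(revenue)
```

The row without an `order_id` stays in `raw_orders`, so it can be inspected or reprocessed later, but it never reaches the revenue number.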

Common Mistakes

Assuming data is correct
Not storing raw data
Not validating outputs
Breaking contracts
Not handling failures
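
Validating outputs does not need a framework to get started. Here is a minimal sketch with a few checks on an orders table; the function name, the rules, and the sample rows are assumptions you would replace with your own.

```python
def validate_orders(rows):
    """Return a list of problems found in the output; empty means it passed."""
    problems = []
    ids = [row["order_id"] for row in rows]
    if not rows:
        problems.append("output is empty")
    if any(order_id is None for order_id in ids):
        problems.append("null order_id")
    if len(ids) != len(set(ids)):
        problems.append("duplicate order_id")
    if any(row["total_amount"] < 0 for row in rows):
        problems.append("negative total_amount")
    return problems


output = [
    {"order_id": 1, "total_amount": 30.0},
    {"order_id": 1, "total_amount": 12.5},
    {"order_id": 2, "total_amount": -5.0},
]
problems = validate_orders(output)
if problems:
    # Failing loudly beats publishing a wrong number
    print("validation failed:", problems)
```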

Checklist

Can I reprocess data?
Do I store raw data?
Do I have validations?
Is it idempotent?
Do I know who consumes it?
Do I have alerts?
Are costs controlled?
Are logs useful?
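
The reprocessing and idempotency questions go together: a load is safe to rerun when running it twice leaves the same result as running it once. One common way to get there is to replace a whole date partition inside a transaction. Here is a sketch with Python's built-in `sqlite3`; the table and the `load_partition` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_revenue (order_date TEXT, revenue REAL)")


def load_partition(conn, order_date, revenue):
    """Replace the whole partition for one date, so reruns do not add rows."""
    with conn:  # one transaction: both statements commit, or neither does
        conn.execute("DELETE FROM daily_revenue WHERE order_date = ?", (order_date,))
        conn.execute("INSERT INTO daily_revenue VALUES (?, ?)", (order_date, revenue))


load_partition(conn, "2024-05-01", 42.5)
load_partition(conn, "2024-05-01", 42.5)  # rerun after a failure
load_partition(conn, "2024-05-01", 40.0)  # reprocess with corrected data

rows = conn.execute("SELECT * FROM daily_revenue").fetchall()
print(rows)
```

With a plain `INSERT` and no delete, the same three calls would have left three rows and a revenue figure counted three times.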

Conclusion

Being a Data Engineer in production is not just writing SQL.

It is:

building resilient systems
anticipating failures
balancing trade-offs

👉 Your real value is making sure data always works

CTA

If you’re learning Data Engineering: start with real pipelines, not theory.

👉 Next step: understand Batch vs Streaming in production

Top comments (1)

Mauricio Montoya • May 29

Great article! As an aspiring data engineer, this gives me a “no fluff” view into the job. The article does not act as a deterrent, but rather as additional motivation to continue to learn and develop.