Agustín José Mazzeo

What Does a Data Engineer Do in Production (No Hype)

If you learned Data Engineering with notebooks and clean datasets, this article is for you. In production there are no clean datasets: there are systems that change, pipelines that fail, and data that has to be correct every single day.

TL;DR

A Data Engineer in production:

  • Builds and maintains reliable pipelines 
  • Ensures data quality (doesn’t “wait” for good data) 
  • Designs for consumption (BI, ML, APIs) 
  • Operates: monitors, debugs, and reprocesses 
  • Makes real trade-off decisions: cost, performance, risk 

Problem

In theory:

“Extract data, transform it, and load it into a data warehouse”

In production:

  • Data arrives incomplete or late 
  • APIs fail or change schemas 
  • “OK” jobs can still produce incorrect data 
  • Dashboards depend on you 

👉 Result: the job is not just building, it’s operating living data systems

Explanation

A Data Engineer builds and operates systems that turn chaotic data into reliable data for the business.

It’s not just ETL. 

It’s:

  • pipeline design 
  • data quality enforcement 
  • continuous operations 
  • architectural decision-making 

Once you understand the problem, the work breaks down into these layers:

1. Ingestion (unstable sources)

What it involves:

  • Integrating APIs, databases, events 
  • Handling errors and retries 
  • Detecting schema changes 

Example:


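# df is the freshly ingested batch, assumed here to be a pandas DataFrame.
# A list (not a set) keeps the order of added columns deterministic.
expected = ["order_id", "user_id", "amount"]

# Add any column the source stopped sending, so downstream code sees a stable schema
for col in expected:
    if col not in df.columns:
        df[col] = None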

👉 Defensive design, not perfect data
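
The "errors and retries" bullet can be sketched the same way. This is a minimal example of retrying with exponential backoff, not a definitive implementation: `call_with_retries`, `flaky_fetch`, and the delay values are hypothetical names and numbers chosen for illustration.

```python
import time


def call_with_retries(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts: let the scheduler mark the run as failed
            sleep(base_delay * 2 ** (attempt - 1))


# Usage with a fake source (hypothetical) that fails twice before answering.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source unavailable")
    return [{"order_id": 1, "user_id": 7, "amount": 19.9}]


waits = []
# Record the waits instead of really sleeping, so the example runs instantly.
rows = call_with_retries(flaky_fetch, sleep=waits.append)
```

Only transient errors are retried here. Anything else (a bad credential, a malformed response) fails immediately, which is usually what you want: retrying won't fix it.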

2. Transformation

What it involves:

  • Cleaning and deduplication 
  • Business logic 
  • Performance 

Example:


-- Keep only the most recent row per user
SELECT *
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
  FROM raw_users
) AS ranked  -- some engines, such as MySQL, require an alias on a derived table
WHERE rn = 1;

👉 Key decision: correctness vs performance
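
The same keep-latest rule can be written in plain Python, which is handy for unit-testing the logic on a few rows before running it at scale. This is a sketch under assumptions: `latest_per_key` and the sample rows are made up for illustration, and timestamps are ISO strings so they sort correctly as text.

```python
def latest_per_key(rows, key, order_by):
    """Keep, for each key value, the row with the greatest order_by value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # Strict ">" means that on a tie the first row seen wins.
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())


# Hypothetical sample data: user 1 appears twice.
raw_users = [
    {"user_id": 1, "email": "old@example.com", "updated_at": "2024-01-01"},
    {"user_id": 1, "email": "new@example.com", "updated_at": "2024-03-01"},
    {"user_id": 2, "email": "b@example.com", "updated_at": "2024-02-01"},
]

deduped = latest_per_key(raw_users, key="user_id", order_by="updated_at")
```

Note the tie rule. In SQL, `ROW_NUMBER()` picks arbitrarily between rows with the same `updated_at`, so if ties are possible, add a second column to the `ORDER BY` to make the result repeatable.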

3. Modeling

What it involves:

  • Designing for BI or ML 

Example:

  • Dashboard → aggregated table 
  • ML → detailed events 

👉 Depends on consumption

4. Consumption

Consumers:

  • BI 
  • ML 
  • APIs 

👉 Changing a schema breaks downstream consumers → you need contracts
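
A contract can start as something very small: the columns and types consumers rely on, checked before you publish. The sketch below assumes a hypothetical `CONTRACT` for a daily revenue table; real setups often use a schema registry or a validation library instead.

```python
# Hypothetical contract: the columns (and types) consumers depend on.
CONTRACT = {"order_date": str, "revenue": float}


def contract_violations(rows, contract):
    """Return human-readable violations; an empty list means the contract holds."""
    problems = []
    for i, row in enumerate(rows):
        missing = contract.keys() - row.keys()
        if missing:
            problems.append(f"row {i}: missing {sorted(missing)}")
        for col, expected_type in contract.items():
            if col in row and not isinstance(row[col], expected_type):
                problems.append(
                    f"row {i}: {col} is {type(row[col]).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return problems


good = [{"order_date": "2024-05-01", "revenue": 120.5}]
renamed = [{"date": "2024-05-01", "revenue": 120.5}]  # producer renamed a column

ok_result = contract_violations(good, CONTRACT)
bad_result = contract_violations(renamed, CONTRACT)
```

Extra columns are allowed on purpose: adding a column should not break consumers, while removing or renaming one should block the release.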

5. Operation (the most important part)

What it involves:

  • Alerts 
  • Debugging 
  • Reprocessing 

👉 This is where the real work happens
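
Alerts do not need to be sophisticated to be useful. A minimal sketch, assuming you can get a table's last load time and row count from somewhere (the function name and the thresholds below are illustrative, not a standard):

```python
from datetime import datetime, timedelta, timezone


def table_alerts(last_loaded_at, row_count, now,
                 max_age=timedelta(hours=24), min_rows=1):
    """Return alert messages for a table; an empty list means it looks healthy."""
    alerts = []
    age = now - last_loaded_at
    if age > max_age:
        alerts.append(f"stale: last load was {age} ago")
    if row_count < min_rows:
        alerts.append(f"low volume: {row_count} rows, expected at least {min_rows}")
    return alerts


# Fixed timestamps so the example is reproducible.
now = datetime(2024, 5, 2, 9, 0, tzinfo=timezone.utc)
healthy = table_alerts(datetime(2024, 5, 2, 3, 0, tzinfo=timezone.utc), 1200, now)
broken = table_alerts(datetime(2024, 4, 30, 23, 0, tzinfo=timezone.utc), 0, now)
```

Freshness and volume checks catch the quiet failures: the job that reported "OK" but loaded nothing, or did not run at all.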

Practical Example

E-commerce pipeline:

Source


# fetch_api and read_stream are placeholders for your own ingestion helpers
orders = fetch_api("/orders")
events = read_stream("user_events")

Raw


INSERT INTO raw_orders
SELECT *
FROM api_orders;

Curated


-- The result of this query is what the serving step reads as curated_orders
SELECT
  order_id,
  user_id,
  order_date,
  total_amount
FROM raw_orders
WHERE order_id IS NOT NULL;

Serving


SELECT
  order_date,
  SUM(total_amount) AS revenue
FROM curated_orders
GROUP BY order_date;

Consumption

  • Dashboard 
  • ML 

👉 The pipeline ends when someone actually uses the data

Common Mistakes

  • Assuming data is correct 
  • Not storing raw data 
  • Not validating outputs 
  • Breaking contracts 
  • Not handling failures 
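
"Not validating outputs" is the easiest of these to fix. A minimal sketch for the e-commerce example above: reconcile the serving table against the curated one before anyone sees it. The function name, the 0.01 tolerance, and the sample rows are assumptions for illustration.

```python
import math


def validate_daily_revenue(curated_orders, daily_revenue):
    """Compare the serving output with its input; return a list of errors."""
    errors = []

    dates = [r["order_date"] for r in daily_revenue]
    if len(dates) != len(set(dates)):
        errors.append("duplicate order_date in output")

    if any(r["revenue"] < 0 for r in daily_revenue):
        errors.append("negative revenue in output")

    # Totals must match: aggregation should not create or lose money.
    total_in = sum(o["total_amount"] for o in curated_orders)
    total_out = sum(r["revenue"] for r in daily_revenue)
    if not math.isclose(total_in, total_out, abs_tol=0.01):
        errors.append(f"revenue mismatch: input {total_in:.2f}, output {total_out:.2f}")

    return errors


curated = [
    {"order_id": 1, "order_date": "2024-05-01", "total_amount": 10.0},
    {"order_id": 2, "order_date": "2024-05-01", "total_amount": 5.0},
    {"order_id": 3, "order_date": "2024-05-02", "total_amount": 20.0},
]
complete = [
    {"order_date": "2024-05-01", "revenue": 15.0},
    {"order_date": "2024-05-02", "revenue": 20.0},
]
day_dropped = [{"order_date": "2024-05-01", "revenue": 15.0}]  # job "succeeded", one day missing

complete_errors = validate_daily_revenue(curated, complete)
dropped_errors = validate_daily_revenue(curated, day_dropped)
```

If the list is not empty, fail the run instead of publishing. A late dashboard is better than a wrong one.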

Checklist

  • Can I reprocess data? 
  • Do I store raw data? 
  • Do I have validations? 
  • Is it idempotent? 
  • Do I know who consumes it? 
  • Do I have alerts? 
  • Are costs controlled? 
  • Are logs useful? 
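
For the "Can I reprocess?" and "Is it idempotent?" questions, one common pattern is to replace a whole day of data inside a single transaction. The sketch below uses an in-memory SQLite database only so that it is runnable anywhere. The table layout follows the curated example above; `load_day` and the sample batch are hypothetical.

```python
import sqlite3


def load_day(conn, day, orders):
    """Replace all curated rows for one day. Running it twice leaves the same state."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM curated_orders WHERE order_date = ?", (day,))
        conn.executemany(
            "INSERT INTO curated_orders (order_id, user_id, order_date, total_amount) "
            "VALUES (?, ?, ?, ?)",
            [(o["order_id"], o["user_id"], day, o["total_amount"]) for o in orders],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE curated_orders "
    "(order_id INTEGER, user_id INTEGER, order_date TEXT, total_amount REAL)"
)

batch = [
    {"order_id": 1, "user_id": 7, "total_amount": 19.9},
    {"order_id": 2, "user_id": 8, "total_amount": 5.0},
]

load_day(conn, "2024-05-01", batch)
load_day(conn, "2024-05-01", batch)  # reprocessing the same day

row_count = conn.execute("SELECT COUNT(*) FROM curated_orders").fetchone()[0]
```

Because the delete and the insert commit together, a rerun after a failure never leaves duplicates or a half-loaded day.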

Conclusion

Being a Data Engineer in production is not just writing SQL.

It is:

  • building resilient systems 
  • anticipating failures 
  • balancing trade-offs 

👉 Your real value is making sure the data is correct and available every day

CTA

If you’re learning Data Engineering: start with real pipelines, not theory.

👉 Next step: understand Batch vs Streaming in production

Top comments (1)

Mauricio Montoya

Great article! As an aspiring data engineer, this gives me a “no fluff” view into the job. The article does not act as a deterrent, but rather as additional motivation to continue to learn and develop.