DEV Community

# dataengineering

Posts

Auto-Detecting CSV Schemas for Lightning-Fast ClickHouse Ingestion with Parquet

7
Comments
5 min read
Smart Invoice Analyzer — How I Automated Invoice Processing & Predicted Sales Using Machine Learning

3
Comments
2 min read
How I Streamed Live Binance L2 Order Book Data on AWS for ~$10/Month

5
Comments
14 min read
Stop Waiting for the Cloud: Building a Hybrid SQL+Python Data Pipeline Locally with DuckDB

Comments
5 min read
Decommissioning the Dinosaur: A 4-Phase Playbook for Migrating Your Legacy Data Warehouse to Databricks

Comments
4 min read
The Data Analytics Lifecycle

Comments
3 min read
Set up an open-source AI analyst for PostgreSQL in 2 minutes

1
Comments
5 min read
The Semantic Gap in Data Quality: Why Your Monitoring is Lying to You

1
Comments 1
7 min read
Building an Enterprise Patching Dashboard with AWS - A Complete Guide

5
Comments
5 min read
Azure Data Solutions: Data Factory, Synapse, Data Lake & Databricks Integration

2
Comments 1
4 min read
Building an Automated Data Pipeline: Injuries vs Performance in the Premier League

Comments
6 min read
2025-2026 Guide to Learning about Apache Iceberg, Data Lakehouse & Agentic AI

Comments
9 min read
Evolution of Processing: SPL One-Click Acceleration for Log-to-Metric Conversion

Comments
6 min read
My First Data Engineering Project: Building a Real-Time IoT Pipeline on Azure

Comments
6 min read
The Data Engineer’s Codex: From First Principles to the Modern Lakehouse

6
Comments
10 min read
Breaking Into Gaming Analytics: From 1 Billion Mobile Users to 5B Daily Events

Comments 1
6 min read
Building a Real-Time Data Lake on AWS: S3, Glue, and Athena in Production

1
Comments
5 min read
Embeddings and Vector Similarity: How Machines Understand Meaning

1
Comments
19 min read
Containerization for Data Engineering: A Practical Guide with Docker and Docker Compose

Comments
2 min read
Join OSA CON 2025: Two Days of Open‑Source Analytics and AI (Nov. 4–5)

Comments
3 min read
AWS Glue for ETL

Comments
5 min read
What to use for data preparation in report, query or analysis business?

5
Comments
10 min read
Optimizing Data Processing on AWS with Data Compaction

4
Comments
7 min read
Real-Time Earthquake CDC Pipeline

Comments
5 min read
Designing a Cost-Efficient Parallel Data Pipeline on AWS Using Lambda and SQS

3
Comments
6 min read