<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Xin Xu</title>
    <description>The latest articles on DEV Community by Xin Xu (@xin_xu_5c36b5326e7008e281).</description>
    <link>https://dev.to/xin_xu_5c36b5326e7008e281</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3770513%2F808d27bf-4c5e-45b0-95f9-17327a9f1d48.png</url>
      <title>DEV Community: Xin Xu</title>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/xin_xu_5c36b5326e7008e281"/>
    <language>en</language>
    <item>
      <title>Project: Building "Mini-C4" — A Production-Grade LLM Pre-training Pipeline 🏗️</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 10:22:36 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/project-building-mini-c4-a-production-grade-llm-pre-training-pipeline-hfl</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/project-building-mini-c4-a-production-grade-llm-pre-training-pipeline-hfl</guid>
      <description>&lt;h1&gt;
  
  
  Project: Building "Mini-C4" Pre-training Corpus 🏗️
&lt;/h1&gt;

&lt;p&gt;This project demonstrates how to build a miniaturized version of the &lt;strong&gt;C4 (Colossal Clean Crawled Corpus)&lt;/strong&gt; pipeline. Our mission: transform chaotic, raw web data (Common Crawl) into low-noise, deduplicated, high-quality text ready for LLM pre-training.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7dm8cy8b024b2n313zd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7dm8cy8b024b2n313zd.png" alt=" " width="800" height="626"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Project Brief
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Objective:&lt;/strong&gt; Build a pipeline to process raw Common Crawl data into a clean text corpus.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input:&lt;/strong&gt; Raw WARC files (&lt;code&gt;.warc.gz&lt;/code&gt;) containing HTTP headers, HTML source, and binary noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output:&lt;/strong&gt; Categorized JSONL files (&lt;code&gt;final_data.jsonl&lt;/code&gt;) featuring clean text, language labels, and &lt;strong&gt;Perplexity (PPL)&lt;/strong&gt; scores.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Challenges:&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extremely Low Signal-to-Noise Ratio:&lt;/strong&gt; Over 90% of raw web data consists of navbars, ads, SEO spam, and JS code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fuzzy Deduplication:&lt;/strong&gt; Identifying semantically similar documents across millions of records is computationally expensive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality Quantification:&lt;/strong&gt; How to distinguish "human-grade prose" from "machine-generated gibberish" without expensive LLM APIs.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Architecture Design
&lt;/h2&gt;

&lt;p&gt;We designed a &lt;strong&gt;Funnel-shaped pipeline&lt;/strong&gt; to filter noise layer by layer:&lt;/p&gt;

&lt;h3&gt;
  
  
  Tech Stack Decisions
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;th&gt;Rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Parsing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;warcio&lt;/code&gt;, &lt;code&gt;trafilatura&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;trafilatura&lt;/code&gt; excels at extracting main content (removing footers/ads) far better than BeautifulSoup.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compute&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Ray&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Python's &lt;code&gt;multiprocessing&lt;/code&gt; has high overhead for large shared states. Ray’s &lt;strong&gt;Actor Model&lt;/strong&gt; scales easily from multi-core to clusters.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deduplication&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MinHash LSH&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reduces pairwise comparison complexity from O(n²) to near-linear O(n) using Locality Sensitive Hashing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;KenLM&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;A lightweight N-gram model used by GPT-3/CCNet to measure text "naturalness" via Perplexity.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  3. Step-by-Step Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase I: Heuristic Cleaning &amp;amp; Extraction
&lt;/h3&gt;

&lt;p&gt;Raw WARC files are a mess. We use &lt;code&gt;warcio&lt;/code&gt; for streaming and &lt;code&gt;trafilatura&lt;/code&gt; to extract the "soul" of the webpage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code Insight: Streaming Processor&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;warcio.archiveiterator&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ArchiveIterator&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;trafilatura&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nc"&gt;ArchiveIterator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rec_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Filter for HTML only
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text/html&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http_headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract main body, ignoring comments and tables
&lt;/span&gt;        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;trafilatura&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content_stream&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;include_comments&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;🔍 The Cleaning Rules (Gopher/C4 Standards):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Symbol-to-Word Ratio:&lt;/strong&gt; If symbols like &lt;code&gt;{ } [ ]&lt;/code&gt; make up more than 10% of tokens, the page is likely code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Average Word Length:&lt;/strong&gt; High-quality English text usually averages 5-10 characters. Values &amp;gt; 15 suggest minified JS or URL lists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keyword Blocklist:&lt;/strong&gt; Drop pages containing "lorem ipsum", "enable cookies", or "403 forbidden".&lt;/li&gt;
&lt;/ol&gt;
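&lt;p&gt;The three rules above can be sketched as a single filter function. This is a minimal illustration of Gopher/C4-style heuristics, not the repo's exact implementation; the 10% and 15-character thresholds come straight from the list above.&lt;/p&gt;

```python
import re

BLOCKLIST = ("lorem ipsum", "enable cookies", "403 forbidden")

def passes_heuristics(text: str) -> bool:
    """Return True if `text` survives the three rules above."""
    words = text.split()
    if not words:
        return False
    # Rule 1: symbol-to-word ratio -- too many { } [ ] suggests code.
    symbols = len(re.findall(r"[{}\[\]]", text))
    if symbols / len(words) > 0.10:
        return False
    # Rule 2: average word length above 15 suggests minified JS or URL lists.
    if sum(len(w) for w in words) / len(words) > 15:
        return False
    # Rule 3: keyword blocklist catches boilerplate and error pages.
    lowered = text.lower()
    return not any(kw in lowered for kw in BLOCKLIST)
```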

&lt;h3&gt;
  
  
  Phase II: Distributed MinHash Deduplication
&lt;/h3&gt;

&lt;p&gt;To handle "mirrored" content, we use &lt;strong&gt;Ray&lt;/strong&gt; to parallelize the computation of MinHash signatures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code Insight: Ray Remote Tasks (Map-Reduce)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@ray.remote&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MinHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_perm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;span class="c1"&gt;# Map-Reduce: Dispatch batches to all CPU cores
&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;process_batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remote&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;batches&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
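&lt;p&gt;The signatures above still need the LSH step that groups near-duplicates into candidate pairs. In the pipeline this is &lt;code&gt;datasketch&lt;/code&gt;'s &lt;code&gt;MinHashLSH&lt;/code&gt;; the pure-Python sketch below (a simplified one-hash-family MinHash plus banding, with an illustrative 128-permutation / 16-band split) shows the mechanics:&lt;/p&gt;

```python
import hashlib
from collections import defaultdict

NUM_PERM, BANDS = 128, 16          # 16 bands x 8 rows per band (illustrative)
ROWS = NUM_PERM // BANDS

def minhash_signature(text: str) -> list:
    """Simplified MinHash: one salted hash per 'permutation' (assumes non-empty text)."""
    words = set(text.split())
    return [
        min(int(hashlib.md5(f"{p}:{w}".encode()).hexdigest(), 16) for w in words)
        for p in range(NUM_PERM)
    ]

def candidate_pairs(docs: dict) -> set:
    """Docs sharing any band of their signature become candidate duplicates."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(text)
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs
```

&lt;p&gt;Documents that share any band fall into the same bucket and are compared; unrelated documents almost never collide.&lt;/p&gt;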



&lt;h3&gt;
  
  
  Phase III: Quality Filtering (KenLM)
&lt;/h3&gt;

&lt;p&gt;We use a pre-trained &lt;strong&gt;KenLM&lt;/strong&gt; model to calculate Perplexity. Lower perplexity means more "natural" language; the scores below are length-normalized log-probabilities, so values closer to zero indicate lower perplexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📈 Tuning the Threshold:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Score &amp;gt; -5.0:&lt;/strong&gt; Wikipedia-grade, highly fluent content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score -5.0 to -6.0:&lt;/strong&gt; Standard blog posts and forum discussions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score &amp;lt; -6.5:&lt;/strong&gt; Broken sentences, machine translation failures, or SEO keyword lists (Discard).&lt;/li&gt;
&lt;/ul&gt;
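&lt;p&gt;Assuming each document already carries a length-normalized KenLM log-score (the values in the thresholds above), the filter itself reduces to a threshold function. The bucket names and the &lt;code&gt;ppl_score&lt;/code&gt; field are illustrative:&lt;/p&gt;

```python
def quality_bucket(score: float) -> str:
    """Map a per-token KenLM log-score to a quality bucket (thresholds above)."""
    if score > -5.0:
        return "high"      # Wikipedia-grade, highly fluent
    if score >= -6.5:
        return "mid"       # blogs/forums; -6.0 to -6.5 is a gray zone, kept here
    return "discard"       # broken sentences, SEO keyword lists

def filter_docs(docs):
    """Drop discard-grade docs, tagging survivors with their bucket."""
    kept = []
    for doc in docs:
        bucket = quality_bucket(doc["ppl_score"])
        if bucket != "discard":
            kept.append({**doc, "quality": bucket})
    return kept
```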




&lt;h2&gt;
  
  
  4. Performance &amp;amp; Showcase (Data Funnel)
&lt;/h2&gt;

&lt;p&gt;Results from processing a sample 1 GB WARC file:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;In (Docs)&lt;/th&gt;
&lt;th&gt;Out (Docs)&lt;/th&gt;
&lt;th&gt;Retention&lt;/th&gt;
&lt;th&gt;Main Loss Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Raw WARC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~35,000&lt;/td&gt;
&lt;td&gt;~10,000&lt;/td&gt;
&lt;td&gt;28%&lt;/td&gt;
&lt;td&gt;Non-HTML, Empty content.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Heuristics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10,000&lt;/td&gt;
&lt;td&gt;~6,500&lt;/td&gt;
&lt;td&gt;65%&lt;/td&gt;
&lt;td&gt;Code snippets, short text.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deduplication&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6,500&lt;/td&gt;
&lt;td&gt;~4,800&lt;/td&gt;
&lt;td&gt;73%&lt;/td&gt;
&lt;td&gt;Mirrored sites, templates.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quality Filter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4,800&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~3,900&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;81%&lt;/td&gt;
&lt;td&gt;Gibberish, non-English.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Final Yield&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;35,000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3,900&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~11%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Data Purity over Volume.&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  5. Scaling to Terabytes (The Next Steps)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;State Management:&lt;/strong&gt; Move &lt;code&gt;MinHashLSH&lt;/code&gt; indices from RAM to &lt;strong&gt;Redis&lt;/strong&gt; or &lt;strong&gt;Cassandra&lt;/strong&gt; to handle billions of records.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I/O Optimization:&lt;/strong&gt; Transition from local files to &lt;strong&gt;S3/MinIO&lt;/strong&gt; using &lt;strong&gt;Apache Arrow&lt;/strong&gt; for columnar streaming.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Global Sharding:&lt;/strong&gt; Follow the CCNet approach—shard data by hash buckets and deduplicate within shards to minimize cross-node communication.&lt;/li&gt;
&lt;/ol&gt;
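&lt;p&gt;Step 3 can be sketched in a few lines: hashing the normalized text means exact duplicates always land on the same shard, so each shard deduplicates independently with no cross-node traffic. The 64-shard count is illustrative:&lt;/p&gt;

```python
import hashlib

NUM_SHARDS = 64  # illustrative; chosen per cluster size in practice

def shard_for(paragraph: str) -> int:
    """Identical (normalized) paragraphs hash to the same shard."""
    normalized = " ".join(paragraph.lower().split())
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```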




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building Mini-C4 is a masterclass in &lt;strong&gt;Data Funneling&lt;/strong&gt;. It’s not about how much data you have, but how effectively you can discard the garbage.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Full Source Code:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Have you ever tried processing Common Crawl? What’s the weirdest thing you’ve found in a raw WARC file? Let’s talk in the comments! 👇&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>opensource</category>
      <category>learning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Recaptioning: Upgrading Your Image-Text Data for Better Model Alignment 🚀</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 10:14:31 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/recaptioning-upgrading-your-image-text-data-for-better-model-alignment-4fme</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/recaptioning-upgrading-your-image-text-data-for-better-model-alignment-4fme</guid>
      <description>&lt;h1&gt;
  
  
  Recaptioning: Engineering High-Quality Descriptions for Multi-modal Models 🚀
&lt;/h1&gt;

&lt;p&gt;In multi-modal AI, we often face the "Garbage In, Garbage Out" problem: scraped image captions are often too vague ("a pretty cup"), too long (exceeding the 77-token limit), or simply incorrect. &lt;strong&gt;Recaptioning&lt;/strong&gt; is the process of rewriting or regenerating these descriptions to ensure they are model-ready and semantically dense.&lt;/p&gt;

&lt;p&gt;Based on the &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt;, this post covers why you need recaptioning, the core strategies to implement it, and how to evaluate the results.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe013gub5zbandw5m3tev.png" alt=" " width="800" height="436"&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. Why Recaptioning is a Game Changer
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Improve Semantic Alignment:&lt;/strong&gt; Fix vague or fabricated descriptions so the text matches what is actually in the image.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adapt to Model Constraints:&lt;/strong&gt; Shorten long sentences to fit token limits (e.g., CLIP's 77-token bottleneck) without losing core info.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-dimensional Coverage:&lt;/strong&gt; Generate multiple captions covering "Appearance," "Texture," and "Context" to improve retrieval robustness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standardize Style:&lt;/strong&gt; Clean up slang, typos, and irregular formatting.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Core Strategies (From Simple to Advanced)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A. Rule-based Recaptioning (Low Cost)
&lt;/h3&gt;

&lt;p&gt;Best for small datasets where you have metadata (like OCR or Object Detection tags). Use Python and RegEx to standardize and merge tags into a clean string.&lt;/p&gt;
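&lt;p&gt;A minimal sketch of this route (the tag fields and output ordering are hypothetical; the point is normalizing and merging metadata into one clean caption):&lt;/p&gt;

```python
import re

def rule_based_caption(tags: dict) -> str:
    """Merge detector/OCR metadata into a standardized caption string."""
    parts = []
    for field in ("color", "material", "object", "context"):  # hypothetical fields
        value = tags.get(field, "").strip().lower()
        value = re.sub(r"[^a-z0-9 ]", "", value)   # strip stray punctuation
        value = re.sub(r"\s+", " ", value)          # collapse whitespace
        if value:
            parts.append(value)
    return " ".join(parts) if parts else "unlabeled image"
```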

&lt;h3&gt;
  
  
  B. Model-based Recaptioning (High Performance)
&lt;/h3&gt;

&lt;p&gt;Use Vision-Language Models (VLM) like &lt;strong&gt;BLIP-2&lt;/strong&gt; or &lt;strong&gt;LLaVA&lt;/strong&gt; to automatically generate detailed, accurate captions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation Example with BLIP-2:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Blip2Processor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Blip2ForConditionalGeneration&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Recaptioner&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Salesforce/blip2-opt-2.7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Blip2Processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Blip2ForConditionalGeneration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question: Describe this image accurately including color, material, and context. Answer:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Generating 3 diverse captions
&lt;/span&gt;        &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_return_sequences&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;do_sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;skip_special_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  C. Human-in-the-Loop (Highest Quality)
&lt;/h3&gt;

&lt;p&gt;For production datasets, use a hybrid approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Mass Generation:&lt;/strong&gt; Generate 5 candidates per image using LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLIP Filtering:&lt;/strong&gt; Automatically keep the top 2 captions based on CLIP similarity scores.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human Audit:&lt;/strong&gt; Randomly sample 5-10% for manual correction.&lt;/li&gt;
&lt;/ol&gt;
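&lt;p&gt;Step 2 boils down to ranking candidates by cosine similarity. In production the vectors come from a CLIP image/text encoder; here they are plain lists so the ranking logic stands on its own:&lt;/p&gt;

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_captions(image_emb, captions, caption_embs, k=2):
    """Keep the k captions whose embeddings best align with the image."""
    scored = sorted(
        zip(captions, caption_embs),
        key=lambda pair: cosine(image_emb, pair[1]),
        reverse=True,
    )
    return [caption for caption, _ in scored[:k]]
```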




&lt;h2&gt;
  
  
  3. Evaluation: Is Your New Caption Better?
&lt;/h2&gt;

&lt;p&gt;Don't guess—measure. Use &lt;strong&gt;CLIP Similarity&lt;/strong&gt; to quantify the alignment between the new text and the image.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic Alignment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLIP Score (Cosine Similarity)&lt;/td&gt;
&lt;td&gt;Higher than the original caption.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Text Quality&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Perplexity / Grammar Check&lt;/td&gt;
&lt;td&gt;Fluent, no hallucinations.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Downstream Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Recall@K in Retrieval Tasks&lt;/td&gt;
&lt;td&gt;Improved retrieval accuracy.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  4. Engineering Pitfalls &amp;amp; Tips
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hallucination:&lt;/strong&gt; Models might describe objects not present in the image. &lt;strong&gt;Solution:&lt;/strong&gt; Use a prompt that restricts the model to "only what you see."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Homogeneity:&lt;/strong&gt; Models often repeat the same phrases. &lt;strong&gt;Solution:&lt;/strong&gt; Increase &lt;code&gt;temperature&lt;/code&gt; (0.7-1.0) and use &lt;code&gt;repetition_penalty&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput:&lt;/strong&gt; Generating millions of captions is slow. &lt;strong&gt;Solution:&lt;/strong&gt; Use FP16/INT8 quantization and batch inference.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Recaptioning transforms "raw data" into "high-octane fuel" for multi-modal models. Whether you use simple rules or advanced VLMs, the goal remains the same: &lt;strong&gt;Precision, Adaptation, and Diversity.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For the full implementation guide and more multi-modal data tricks, visit the repo:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Have you tried recaptioning your datasets? Did you see a jump in model performance? Share your findings below! 👇&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>opensource</category>
      <category>learning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Image-Text Pairs: The Fuel for Multi-modal Large Language Models 🖼️✍️</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 10:12:54 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/image-text-pairs-the-fuel-for-multi-modal-large-language-models-4kle</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/image-text-pairs-the-fuel-for-multi-modal-large-language-models-4kle</guid>
      <description>&lt;h1&gt;
  
  
  Image-Text Pairs: Building the Foundation for Multi-modal AI 🖼️✍️
&lt;/h1&gt;

&lt;p&gt;In the era of Multi-modal Large Language Models (like CLIP, BLIP, and LLaVA), &lt;strong&gt;Image-Text Pairs&lt;/strong&gt; are the most critical data assets. Whether it's pre-training, fine-tuning, or evaluation, the quality of your image-text alignment directly determines the model's ability to "see" and "describe."&lt;/p&gt;

&lt;p&gt;Based on the &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt;, this post breaks down how to construct, validate, and pipeline multi-modal data for production-grade AI.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. What are Image-Text Pairs?
&lt;/h2&gt;

&lt;p&gt;An image-text pair consists of one image and one or more matching textual descriptions. The core requirement is &lt;strong&gt;Strong Semantic Alignment&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Scenarios
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Data Requirement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Image-Text Retrieval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Precise descriptions of core features, zero redundancy.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;V-L Pre-training&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Massive diversity (People, Landscapes, Goods) and varied styles.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Generative AI (Stable Diffusion)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rich detail (Colors, Textures, Actions) corresponding to every pixel.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  2. Building High-Quality Datasets
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A. Data Sourcing
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Open Datasets:&lt;/strong&gt; Start with standards like COCO Captions, Flickr30k, or LAION-400M. (Always check licenses for commercial use!)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual Annotation:&lt;/strong&gt; Use platforms like Label Studio. Rule #1: Describe the subject + attributes (e.g., "An orange tabby cat lying on a gray sofa").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated Captioning:&lt;/strong&gt; Use models like BLIP-2 or LLaVA to generate initial descriptions for unlabelled images, followed by human verification.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  B. Quality Validation Checklist
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Semantic Alignment:&lt;/strong&gt; Every claim in the text must be verifiable in the image. No hallucinations.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Uniqueness:&lt;/strong&gt; No identical descriptions for different images.&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Length Optimization:&lt;/strong&gt; For CLIP-style models, keep text within 77 tokens (the text encoder's context limit).&lt;/li&gt;
&lt;/ul&gt;
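&lt;p&gt;The uniqueness and length checks can be automated. A minimal sketch in plain Python (the length check here counts words as a crude proxy; a real pipeline would count tokenizer tokens, and semantic alignment still needs a model or a human reviewer):&lt;/p&gt;

```python
from collections import Counter

def validate_captions(records, max_words=77):
    """Flag checklist violations: identical captions reused across
    different images, and captions too long for CLIP-style encoders."""
    issues = []
    counts = Counter(t for r in records for t in r["texts"])
    for r in records:
        for text in r["texts"]:
            if counts[text] > 1:
                issues.append((r["image_id"], "duplicate caption"))
            # Word count is a rough proxy; count real tokenizer tokens
            # in production, since 77 is a token (not word) limit.
            if len(text.split()) > max_words:
                issues.append((r["image_id"], "caption too long"))
    return issues

records = [
    {"image_id": "img_001", "texts": ["A white ceramic mug with blue stripes"]},
    {"image_id": "img_002", "texts": ["A white ceramic mug with blue stripes"]},
]
print(validate_captions(records))
# → [('img_001', 'duplicate caption'), ('img_002', 'duplicate caption')]
```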




&lt;h2&gt;
  
  
  3. Engineering: Storage &amp;amp; Loading
&lt;/h2&gt;

&lt;h3&gt;
  
  
  I. Storage Formats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small Scale:&lt;/strong&gt; JSONL (Easy to read and extend).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large Scale:&lt;/strong&gt; Parquet or WebDataset (High compression, supports streaming/mmap).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;JSONL Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"image_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"img_001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"image_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"data/images/img_001.jpg"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"texts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"A white ceramic mug with blue stripes, 350ml capacity"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"quality_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.98&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
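&lt;p&gt;Reading JSONL back is one &lt;code&gt;json.loads&lt;/code&gt; per line. A defensive reader sketch that skips malformed lines instead of aborting the whole load (the required keys are just the ones from the example above):&lt;/p&gt;

```python
import json

def read_jsonl(path, required_keys=("image_id", "image_path", "texts")):
    """Stream a JSONL file, yielding only records that parse cleanly
    and carry the expected keys; corrupt lines are skipped."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # in production, log these instead of dropping silently
            if all(k in record for k in required_keys):
                yield record

# Round-trip demo: one good record, one corrupt line
with open("pairs.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({"image_id": "img_001",
                        "image_path": "data/images/img_001.jpg",
                        "texts": ["A white ceramic mug"]}) + "\n")
    f.write("{not valid json\n")

print(len(list(read_jsonl("pairs.jsonl"))))  # → 1
```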



&lt;h3&gt;
  
  
  II. High-Efficiency Loader (Python/PyTorch)
&lt;/h3&gt;

&lt;p&gt;Using a CLIP Processor to handle both image resizing and text tokenization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.utils.data&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DataLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CLIPProcessor&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ImageTextPairDataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jsonl_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image_root&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image_root&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jsonl_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__getitem__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image_root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])).&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Process both modalities at once
&lt;/span&gt;        &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;texts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;77&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;squeeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;

&lt;span class="c1"&gt;# Usage
&lt;/span&gt;&lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CLIPProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/clip-vit-base-patch32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dataloader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ImageTextPairDataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pairs.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Pitfalls &amp;amp; Solutions
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pitfall&lt;/th&gt;
&lt;th&gt;Engineering Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Weak Alignment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Create an "Annotation Style Guide" and perform 10%+ random spot checks.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Format Chaos&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standardize all images to RGB and specific resolutions (e.g., 224x224).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Slow Loading&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Use &lt;strong&gt;Memory Mapping (mmap)&lt;/strong&gt; for JSONL or switch to &lt;strong&gt;WebDataset&lt;/strong&gt; for sharded binary loading.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
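&lt;p&gt;The mmap idea from the table can be sketched with the standard library alone: scan the file once to record line-start offsets, then memory-map it and seek per lookup. This is a minimal single-file version of what sharded formats do at scale:&lt;/p&gt;

```python
import json
import mmap

class JsonlIndex:
    """Random access into JSONL without loading it into RAM: one pass
    records line-start offsets, then lookups seek into a memory map."""

    def __init__(self, path):
        self.f = open(path, "rb")
        self.mm = mmap.mmap(self.f.fileno(), 0, access=mmap.ACCESS_READ)
        self.offsets = [0]
        while True:
            nl = self.mm.find(b"\n", self.offsets[-1])
            if nl == -1:
                break
            self.offsets.append(nl + 1)
        # If the file ends with a newline, the last offset points past EOF
        if self.offsets[-1] >= len(self.mm):
            self.offsets.pop()

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        start = self.offsets[idx]
        end = self.mm.find(b"\n", start)
        end = len(self.mm) if end == -1 else end
        return json.loads(self.mm[start:end])
```

&lt;p&gt;For multi-machine training, sharded binary formats (WebDataset tar shards, Parquet row groups) apply the same offset idea across many files, trading per-record random access for fast sequential reads.&lt;/p&gt;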




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Image-text pairs are the "fuel" for multi-modal AI. The logic is simple but the execution is hard: &lt;strong&gt;Define Scenario → Standardize Construction → Optimize Data Pipeline.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For full source code and advanced multi-modal data strategies, visit our project:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you working with custom image-text data for your models? What's the biggest challenge you've faced—quality or scale? Let's discuss! 👇&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>opensource</category>
      <category>learning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Tokenization &amp; Serialization: The Unsung Heroes of LLM Development 🤖</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 10:03:50 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/tokenization-serialization-the-unsung-heroes-of-llm-development-2m3n</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/tokenization-serialization-the-unsung-heroes-of-llm-development-2m3n</guid>
      <description>&lt;h1&gt;
  
  
  Tokenization &amp;amp; Serialization: Mastering the Foundation of LLM Data Engineering 🤖
&lt;/h1&gt;

&lt;p&gt;In the lifecycle of Large Language Model (LLM) development, &lt;strong&gt;Tokenization&lt;/strong&gt; and &lt;strong&gt;Serialization&lt;/strong&gt; are the invisible bridges between raw data and model intelligence. One determines how a model "reads" text, while the other ensures that processed data is stored and transmitted efficiently.&lt;/p&gt;

&lt;p&gt;Based on the &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt;, this guide breaks down these core concepts with hands-on practice using the Hugging Face ecosystem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv64oi8q4h2yh6xe5uwfo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv64oi8q4h2yh6xe5uwfo.png" alt=" " width="750" height="750"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Core Concepts: Why Do They Matter?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A. Tokenization: The "Translator" for LLMs
&lt;/h3&gt;

&lt;p&gt;LLMs don't understand words; they understand numbers (integers). Tokenization is the process of converting natural language into discrete &lt;strong&gt;Tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Goal:&lt;/strong&gt; Balance &lt;strong&gt;Vocabulary Size&lt;/strong&gt; and &lt;strong&gt;Text Compression Ratio&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mainstream Algorithms:&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BPE (Byte Pair Encoding):&lt;/strong&gt; Used by GPT/LLaMA. Iteratively merges the highest-frequency byte pairs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WordPiece:&lt;/strong&gt; Used by BERT. Greedily splits words into the longest matching subwords.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unigram:&lt;/strong&gt; Used by T5. Selects the most probable subword segmentation based on a unigram language model.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
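&lt;p&gt;The BPE merge loop is easy to demystify with a toy implementation. This sketch works on pre-split symbol tuples rather than raw bytes, so it only illustrates the most-frequent-pair merge step, not a production trainer:&lt;/p&gt;

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a toy corpus. `words` maps a word,
    pre-split into a tuple of symbols, to its frequency."""
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge: fuse every occurrence of the best pair
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}
print(bpe_merges(corpus, 2))  # → [('l', 'o'), ('lo', 'w')]
```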

&lt;h3&gt;
  
  
  B. Serialization: Packaging Your Data
&lt;/h3&gt;

&lt;p&gt;Serialization converts in-memory objects (like tokenized datasets or model weights) into formats (JSON, Pickle, Arrow) for storage or transmission. &lt;strong&gt;Deserialization&lt;/strong&gt; is the reverse.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why use it?&lt;/strong&gt; Avoid repeating expensive preprocessing, enable cross-framework data sharing (PyTorch ↔ TensorFlow), and persist training checkpoints.&lt;/li&gt;
&lt;/ul&gt;
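&lt;p&gt;A quick round-trip with the standard library shows the trade-off: JSON is text-based and safe to load anywhere, while Pickle is compact binary but Python-only (and, as the pitfalls section notes, unsafe on untrusted input). The token IDs here are arbitrary placeholder values:&lt;/p&gt;

```python
import json
import pickle

tokenized = {"input_ids": [[101, 2023, 2003, 102]],
             "attention_mask": [[1, 1, 1, 1]]}

# JSON: text-based, human-readable, safe to deserialize anywhere
json_bytes = json.dumps(tokenized).encode("utf-8")
assert json.loads(json_bytes) == tokenized  # lossless round-trip

# Pickle: binary, Python-only; never load pickles from untrusted sources
pkl_bytes = pickle.dumps(tokenized)
assert pickle.loads(pkl_bytes) == tokenized
```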




&lt;h2&gt;
  
  
  2. Hands-on: Tokenization &amp;amp; Serialization with Hugging Face
&lt;/h2&gt;

&lt;h3&gt;
  
  
  I. Tokenization in Practice
&lt;/h3&gt;

&lt;p&gt;Using the LLaMA-2 tokenizer as an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Load Tokenizer (Use Fast version for speed)
&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta-llama/Llama-2-7b-hf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;use_fast&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_special_tokens&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pad_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[PAD]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Encoding Text
&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data Engineering is the backbone of AI!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;encoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Token IDs: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Decoded: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  II. Serialization Strategies
&lt;/h3&gt;

&lt;p&gt;Depending on your scale, you should choose different formats:&lt;/p&gt;

&lt;h4&gt;
  
  
  Option 1: JSON (Human-readable, Cross-platform)
&lt;/h4&gt;

&lt;p&gt;Best for small datasets or debugging.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()},&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Option 2: Apache Arrow (High-performance, Scalable)
&lt;/h4&gt;

&lt;p&gt;The industry standard for large-scale LLM training.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_dict&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;encoded&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()})&lt;/span&gt;
&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_to_disk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tokenized_dataset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Highly efficient binary format
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3. Pitfalls &amp;amp; Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🚨 Common Pitfalls
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tokenizer Mismatch:&lt;/strong&gt; Using a different tokenizer during inference than the one used in training leads to "garbage" outputs. &lt;strong&gt;Always&lt;/strong&gt; use &lt;code&gt;save_pretrained()&lt;/code&gt; to bundle the tokenizer with the model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incorrect Padding Side:&lt;/strong&gt; Decoder-only models like LLaMA need &lt;code&gt;padding_side="left"&lt;/code&gt; for batched generation (right-padding puts pad tokens between the prompt and the generated text), while encoder models like BERT conventionally pad on the right. Setting this incorrectly can confuse the model's attention mechanism.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pickle Security:&lt;/strong&gt; Never unpickle data from untrusted sources (it can execute malicious code). Use &lt;strong&gt;JSON&lt;/strong&gt; or &lt;strong&gt;Safetensors&lt;/strong&gt; for public data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ✅ Best Practices
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cache Processed Data:&lt;/strong&gt; For large corpora, tokenize once and serialize to &lt;strong&gt;Parquet&lt;/strong&gt; or &lt;strong&gt;Arrow&lt;/strong&gt;. Don't re-tokenize every time you start a training job.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify Consistency:&lt;/strong&gt; Always &lt;code&gt;decode&lt;/code&gt; a few serialized samples to ensure the tokens still represent the original text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Special Token Handling:&lt;/strong&gt; Ensure tokens like &lt;code&gt;[PAD]&lt;/code&gt;, &lt;code&gt;[BOS]&lt;/code&gt;, and &lt;code&gt;[EOS]&lt;/code&gt; are correctly defined and mapped in your vocabulary.&lt;/li&gt;
&lt;/ol&gt;
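&lt;p&gt;Best practice #2 can be wired into a test. A sketch with a stand-in word-level vocabulary (swap in your real tokenizer's &lt;code&gt;encode&lt;/code&gt;/&lt;code&gt;decode&lt;/code&gt;; the &lt;code&gt;spot_check&lt;/code&gt; helper is hypothetical, not from any library):&lt;/p&gt;

```python
import random

# Stand-in word-level tokenizer; a real pipeline would call
# tokenizer.encode / tokenizer.decode on the trained tokenizer.
vocab = {"data": 0, "engineering": 1, "is": 2, "fun": 3}
inv_vocab = {i: w for w, i in vocab.items()}

def encode(text):
    return [vocab[w] for w in text.lower().split()]

def decode(ids):
    return " ".join(inv_vocab[i] for i in ids)

def spot_check(samples, k=2, seed=0):
    """Decode a random subset of serialized samples and confirm the
    round-trip reproduces the original text exactly."""
    rng = random.Random(seed)
    for s in rng.sample(samples, min(k, len(samples))):
        assert decode(s["input_ids"]) == s["text"], f"round-trip mismatch: {s}"
    return True

samples = [{"text": t, "input_ids": encode(t)}
           for t in ["data engineering is fun", "engineering is data"]]
print(spot_check(samples))  # → True
```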




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Tokenization is the "first gate" for an LLM's understanding, while Serialization is the "infrastructure" that ensures your data pipeline is scalable and reproducible.&lt;/p&gt;

&lt;p&gt;If you found this helpful, check out the full code and advanced docs in our repository:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What’s your go-to serialization format for large datasets? Parquet, Arrow, or good old JSON? Let’s talk in the comments! 👇&lt;/strong&gt;&lt;/p&gt;




</description>
    </item>
    <item>
      <title>Why 80% of Data Engineering is Cleaning (and How to Do It Right)</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 10:01:28 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/why-80-of-data-engineering-is-cleaning-and-how-to-do-it-right-29nm</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/why-80-of-data-engineering-is-cleaning-and-how-to-do-it-right-29nm</guid>
      <description>&lt;h1&gt;
  
  
  Data Cleaning &amp;amp; Denoising: The "Battlefield" of Data Engineering 🧹
&lt;/h1&gt;

&lt;p&gt;It is an industry consensus that data engineers spend 60% to 80% of their time on data cleaning. Why? Because raw data is messy, and "garbage in, garbage out" is the absolute truth in data science.&lt;/p&gt;

&lt;p&gt;In this post, based on the &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt;, we’ll deconstruct the logic of industrial-grade data cleaning—moving from "just fixing bugs" to "building robust cleaning pipelines."&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzjizkfr6kl4h04hmfflh.png" alt=" " width="800" height="800"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Where Does the "Noise" Hide?
&lt;/h2&gt;

&lt;p&gt;According to the &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;Data Engineering Book&lt;/a&gt;, data quality is the prerequisite for data value. Noise typically falls into 5 categories:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Noise Type&lt;/th&gt;
&lt;th&gt;Symptoms&lt;/th&gt;
&lt;th&gt;Business Impact&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Missing Values&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Null addresses, missing age fields&lt;/td&gt;
&lt;td&gt;Failed deliveries, incomplete user segments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Outliers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$1M orders (avg is $100), 1000°C sensors&lt;/td&gt;
&lt;td&gt;Flawed sales forecasts, cost miscalculations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Duplicates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Double-submitted forms, sync errors&lt;/td&gt;
&lt;td&gt;Inflated user counts, duplicate revenue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inconsistency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"2024-05-01" vs "05/01/24"&lt;/td&gt;
&lt;td&gt;Aggregation failures, broken time-series&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Logic Conflicts&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Registration date &lt;em&gt;after&lt;/em&gt; purchase date&lt;/td&gt;
&lt;td&gt;Distorted behavior analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  2. The Methodology: Diagnosis-Treatment-Validation
&lt;/h2&gt;

&lt;p&gt;The handbook proposes a three-step closed-loop system for industrial data cleaning:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Data Profiling (Diagnosis)
&lt;/h3&gt;

&lt;p&gt;Never start cleaning without measuring. Use &lt;strong&gt;Pandas&lt;/strong&gt; for a quick health check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Missing Value Ratio
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Outlier Detection using Boxplot
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;box&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Logic Check: Reg_date should be before Order_date
&lt;/span&gt;&lt;span class="n"&gt;conflict_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reg_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Logic conflicts found: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;conflict_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Targeted Cleaning (Treatment)
&lt;/h3&gt;

&lt;p&gt;Cleaning should be context-aware. Don't just delete everything.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Missing Values:&lt;/strong&gt; Use the &lt;strong&gt;median&lt;/strong&gt; for skewed numerical data, the &lt;strong&gt;mode&lt;/strong&gt; for categorical fields, or &lt;strong&gt;model-based imputation&lt;/strong&gt; for high-value fields where the missingness itself matters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outliers:&lt;/strong&gt; Use &lt;strong&gt;winsorization&lt;/strong&gt; (clipping at, say, the 1st and 99th percentiles) or business-rule-based correction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duplicates:&lt;/strong&gt; Keep the first or last occurrence per business key, decided by a reliable timestamp.&lt;/li&gt;
&lt;/ul&gt;
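&lt;p&gt;The treatment rules above can be sketched in plain Python. This is a minimal, illustrative sketch; the field names (&lt;code&gt;order_id&lt;/code&gt;, &lt;code&gt;updated_at&lt;/code&gt;) and percentile bounds are assumptions, not part of the handbook:&lt;/p&gt;

```python
from statistics import median

def impute_median(values):
    """Fill missing (None) entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = median(observed)
    return [med if v is None else v for v in values]

def winsorize(values, lower_pct=0.01, upper_pct=0.99):
    """Clip extreme values to percentile bounds instead of deleting rows."""
    ordered = sorted(values)
    lo = ordered[int(lower_pct * (len(ordered) - 1))]
    hi = ordered[int(upper_pct * (len(ordered) - 1))]
    return [min(max(v, lo), hi) for v in values]

def dedupe_keep_last(records, key="order_id", ts="updated_at"):
    """Keep only the most recent record per business key."""
    latest = {}
    for rec in sorted(records, key=lambda r: r[ts]):
        latest[rec[key]] = rec
    return list(latest.values())

print(impute_median([10, 12, None, 11]))  # the gap is filled with the median
print(winsorize([10, 12, 11, 9999]))      # the 9999 outlier is clipped, not dropped
```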

&lt;h3&gt;
  
  
  Step 3: Validation
&lt;/h3&gt;

&lt;p&gt;Repeat the profiling step. Are the null ratios acceptable? Are the logic conflicts cleared? Does the cleaned data still represent the business reality?&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The 4 Principles of Engineering Excellence
&lt;/h2&gt;

&lt;p&gt;Data cleaning isn't a one-off script; it's a repeatable, automated process. Follow these rules:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Traceability:&lt;/strong&gt; Log every step. Know exactly how many records were dropped and why.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reusability:&lt;/strong&gt; Wrap your logic into functions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-intrusive:&lt;/strong&gt; Never modify the source file. Always output to a new "Cleaned" layer (Silver layer).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation:&lt;/strong&gt; Orchestrate your cleaning jobs using &lt;strong&gt;Airflow&lt;/strong&gt; or &lt;strong&gt;Prefect&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
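&lt;p&gt;Principle 1 (traceability) is cheap to implement: wrap each cleaning step so it logs its row counts. A minimal sketch with the standard library; the step and field names are hypothetical:&lt;/p&gt;

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("cleaning")

def traced(step_name):
    """Decorator: log how many records a cleaning step received and dropped."""
    def wrap(fn):
        def inner(records, *args, **kwargs):
            before = len(records)
            result = fn(records, *args, **kwargs)
            log.info("%s: %d in, %d out, %d dropped",
                     step_name, before, len(result), before - len(result))
            return result
        return inner
    return wrap

@traced("drop_negative_amounts")
def drop_negative_amounts(records):
    # Hypothetical step: negative order amounts are invalid by business rule.
    return [r for r in records if r["amount"] >= 0]
```

Every pipeline run now leaves an audit trail of exactly what each step removed.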

&lt;h3&gt;
  
  
  Example: A Reusable Cleaning Module
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_missing_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fill_rules&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    A reusable function for missing value imputation.
    :param fill_rules: e.g., {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;median&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;df_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;fill_rules&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;median&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;median&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Data cleaning is the foundation of your data "building." If the foundation is weak, everything built on top of it will eventually crack.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt; covers the entire pipeline, from ingestion to deployment, with industrial-grade insights. If you want to move from "writing scripts" to "designing systems," this repo is a goldmine.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Repo Link:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's the weirdest "dirty data" you've ever encountered in production? Let's share some horror stories in the comments! 👻👇&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>opensource</category>
      <category>learning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>High-Performance Data Processing: A Practical Guide from the Data Engineering Book</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 09:58:37 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/high-performance-data-processing-a-practical-guide-from-the-data-engineering-book-2eno</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/high-performance-data-processing-a-practical-guide-from-the-data-engineering-book-2eno</guid>
      <description>&lt;h1&gt;
  
  
  Data Processing &amp;amp; Transformation: Mastering ETL/ELT Workflows with Spark and Flink ⚡
&lt;/h1&gt;

&lt;p&gt;In data engineering, transformation is where raw data becomes valuable insight. Based on the &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt;, this post dives deep into the ETL vs. ELT paradigms, provides hands-on code for Spark (Batch) and Flink (Stream), and shares industry best practices for performance tuning.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. ETL vs. ELT: The Paradigm Shift
&lt;/h2&gt;

&lt;p&gt;The fundamental difference lies in &lt;strong&gt;where&lt;/strong&gt; and &lt;strong&gt;when&lt;/strong&gt; the data is transformed.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;ETL (Extract-Transform-Load)&lt;/th&gt;
&lt;th&gt;ELT (Extract-Load-Transform)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Workflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Extract → Transform (External Engine) → Load&lt;/td&gt;
&lt;td&gt;Extract → Load (Raw Lake) → Transform (In-situ)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Execution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Separate Compute (e.g., Spark Cluster)&lt;/td&gt;
&lt;td&gt;Target Engine (e.g., Snowflake, Delta Lake)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Schema&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Schema-on-Write&lt;/strong&gt; (Structured only)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Schema-on-Read&lt;/strong&gt; (Structured/Unstructured)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (Rigid rules)&lt;/td&gt;
&lt;td&gt;High (Agile exploration)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Use Case&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Traditional BI, Small datasets&lt;/td&gt;
&lt;td&gt;Big Data, ML, Data Lakes, Modern Lakehouse&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  2. Hands-on: Batch &amp;amp; Stream Transformation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A. Batch Processing with Spark (ELT Paradigm)
&lt;/h3&gt;

&lt;p&gt;In ELT, we load raw CSV data into a Delta Lake table first, then perform cleaning and aggregation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Spark_Batch_ELT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Extract &amp;amp; Load (Raw Layer)
&lt;/span&gt;&lt;span class="n"&gt;raw_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data/orders.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;raw_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./delta/raw/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Transform (Cleaning &amp;amp; Aggregation)
&lt;/span&gt;&lt;span class="n"&gt;clean_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./delta/raw/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropDuplicates&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agg_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clean_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;agg_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  B. Stream Processing with Flink (Real-time Transformation)
&lt;/h3&gt;

&lt;p&gt;Real-time UV/PV (unique visitors / page views) calculation from a Kafka stream, persisting results to Delta Lake.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Extracting from Kafka&lt;/span&gt;
&lt;span class="nc"&gt;FlinkKafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FlinkKafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"user_behavior"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SimpleStringSchema&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;props&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;addSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Transform: 5-minute Tumbling Window for UV calculation&lt;/span&gt;
&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Tuple2&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Long&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;uvStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;behaviorStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;keyBy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;behavior&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;behavior&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;window&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;TumblingProcessingTimeWindows&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;)))&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;aggregate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;UvAggregateFunction&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

&lt;span class="c1"&gt;// Load: Sink to Delta Lake&lt;/span&gt;
&lt;span class="n"&gt;uvStream&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sinkTo&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deltaSink&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3. Transformation Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ✅ Data Cleaning (The "Minimum Cleaning" Principle)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deduplication:&lt;/strong&gt; Use business keys (Order ID) for batch and time-windowed logic for streams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outlier Handling:&lt;/strong&gt; Instead of deleting records, flag them (e.g., &lt;code&gt;is_valid=false&lt;/code&gt;) to maintain auditability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Null Values:&lt;/strong&gt; Use &lt;code&gt;fillna()&lt;/code&gt; for non-critical fields; discard if primary keys are missing.&lt;/li&gt;
&lt;/ul&gt;
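&lt;p&gt;The "flag, don't delete" rule for outliers can be expressed as a small helper. An illustrative sketch; the &lt;code&gt;amount&lt;/code&gt; bounds are made-up values, not a recommendation:&lt;/p&gt;

```python
def flag_outliers(records, field="amount", lo=0, hi=100_000):
    """Mark out-of-range rows with is_valid=False instead of deleting them,
    so the raw record stays available for audits."""
    flagged = []
    for rec in records:
        rec = dict(rec)  # copy: never mutate the source records
        rec["is_valid"] = (rec[field] >= lo) and (hi >= rec[field])
        flagged.append(rec)
    return flagged
```

Downstream consumers then filter on `is_valid` while auditors can still see every original row.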

&lt;h3&gt;
  
  
  🔒 Data Masking &amp;amp; Privacy
&lt;/h3&gt;

&lt;p&gt;Compliance is non-negotiable (GDPR/CCPA). Use hashing or masking for sensitive fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phone Numbers:&lt;/strong&gt; &lt;code&gt;138****5678&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Emails:&lt;/strong&gt; &lt;code&gt;jo****@example.com&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
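&lt;p&gt;The two masking formats above are simple string operations. A sketch; production masking should also validate lengths and handle malformed inputs:&lt;/p&gt;

```python
def mask_phone(phone):
    """Keep the first 3 and last 4 digits: 13812345678 becomes 138****5678."""
    return phone[:3] + "****" + phone[-4:]

def mask_email(email):
    """Keep only the first two characters of the local part."""
    local, domain = email.split("@", 1)
    return local[:2] + "****@" + domain

print(mask_phone("13812345678"))       # 138****5678
print(mask_email("john@example.com"))  # jo****@example.com
```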

&lt;h3&gt;
  
  
  📏 Standardization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Naming:&lt;/strong&gt; Use &lt;code&gt;snake_case&lt;/code&gt; (e.g., &lt;code&gt;user_id&lt;/code&gt;) consistently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timezones:&lt;/strong&gt; Always standardize to &lt;strong&gt;UTC&lt;/strong&gt; (&lt;code&gt;yyyy-MM-dd HH:mm:ss&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Units:&lt;/strong&gt; Explicitly label units (e.g., &lt;code&gt;amount_usd&lt;/code&gt;, &lt;code&gt;weight_kg&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
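&lt;p&gt;Timezone normalization needs nothing beyond the standard library. A sketch; the UTC+8 offset below is just an example input:&lt;/p&gt;

```python
from datetime import datetime, timezone, timedelta

def to_utc_string(ts, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a timezone-aware datetime to the standard UTC string format."""
    return ts.astimezone(timezone.utc).strftime(fmt)

offset_8 = timezone(timedelta(hours=8))  # e.g. an event logged in UTC+8
local = datetime(2026, 2, 13, 18, 30, 0, tzinfo=offset_8)
print(to_utc_string(local))  # 2026-02-13 10:30:00
```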




&lt;h2&gt;
  
  
  4. Performance Tuning 101
&lt;/h2&gt;

&lt;p&gt;Performance is about matching resources to demand. Focus on these two levers:&lt;/p&gt;

&lt;h3&gt;
  
  
  I. Parallelism (Concurrency)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Rule of Thumb:&lt;/strong&gt; &lt;code&gt;Parallelism = Total Data Size / Task Processing Capacity&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Spark:&lt;/strong&gt; Adjust &lt;code&gt;spark.sql.shuffle.partitions&lt;/code&gt;. For 100GB of shuffled data, 400 to 800 partitions (roughly 128MB to 256MB each) is a good starting point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flink:&lt;/strong&gt; Set operator-level parallelism. Use &lt;code&gt;rebalance()&lt;/code&gt; to prevent data skew.&lt;/li&gt;
&lt;/ul&gt;
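&lt;p&gt;The rule of thumb translates directly into arithmetic. A sketch; the 128MB-256MB per-partition target is a common convention, not a hard rule:&lt;/p&gt;

```python
import math

def suggest_shuffle_partitions(total_bytes, target_partition_bytes=256 * 1024**2):
    """Parallelism = total data size / per-task capacity, rounded up."""
    return math.ceil(total_bytes / target_partition_bytes)

# 100GB of shuffle data with a 256MB target gives 400 partitions;
# tightening the target to 128MB gives 800 -- the range quoted above.
print(suggest_shuffle_partitions(100 * 1024**3))
```

The result would then be applied via `spark.conf.set("spark.sql.shuffle.partitions", n)`.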

&lt;h3&gt;
  
  
  II. Resource Allocation
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Spark Config&lt;/th&gt;
&lt;th&gt;Flink Config&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;spark.executor.memory&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;taskmanager.memory.process.size&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU Cores&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;spark.executor.cores&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;taskmanager.numberOfTaskSlots&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Summary: Correctness → Performance → Cost
&lt;/h2&gt;

&lt;p&gt;As the &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt; suggests: &lt;strong&gt;"First ensure correctness, then optimize performance, and finally reduce cost."&lt;/strong&gt; Never sacrifice data availability for a few seconds of speed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you team Spark or team Flink for your daily transformations? Let's settle the debate in the comments! 👇&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>opensource</category>
      <category>learning</category>
      <category>architecture</category>
    </item>
    <item>
      <title>From Kimball to Lakehouse: The Evolution of Data Storage (with Python Demo)</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 09:54:20 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/from-kimball-to-lakehouse-the-evolution-of-data-storage-with-python-demo-4dih</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/from-kimball-to-lakehouse-the-evolution-of-data-storage-with-python-demo-4dih</guid>
      <description>&lt;h1&gt;
  
  
  Data Storage Architecture: Deconstructing Warehouse, Lake, and Lakehouse 🏛️
&lt;/h1&gt;

&lt;p&gt;In modern data engineering, choosing the right storage architecture is critical. Based on the &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt;, this guide breaks down the core differences between traditional Warehouses, Data Lakes, and the modern Lakehouse, while providing a hands-on Delta Lake demo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foz8p54sgv2azxr38wzdw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foz8p54sgv2azxr38wzdw.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Warehouse vs. Lake vs. Lakehouse
&lt;/h2&gt;

&lt;p&gt;Understanding the core philosophy of each architecture is the first step toward a successful design.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Definition&lt;/th&gt;
&lt;th&gt;Design Philosophy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Warehouse (Kimball/Inmon)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Structured, integrated, non-volatile storage using Star/Snowflake schemas.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Schema-on-Write.&lt;/strong&gt; Optimized for fast BI reporting and business logic.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Lake&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A vast repository for raw data (Structured/Unstructured) with no strict schema.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Schema-on-Read.&lt;/strong&gt; Optimized for data exploration, ML, and low-cost storage.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Lakehouse&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A hybrid architecture bringing warehouse management to the data lake.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Best of both.&lt;/strong&gt; Retains lake flexibility with warehouse-level ACID transactions and governance.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  2. Core Storage Design Principles
&lt;/h2&gt;

&lt;p&gt;According to the &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;Data Engineering Book&lt;/a&gt;, a robust storage layer must balance &lt;strong&gt;maintainability, performance, and cost&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  A. Layering Strategy (The Medallion Architecture)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Raw (Bronze/ODS):&lt;/strong&gt; Stores data in its original form. Enables reprocessing if logic changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean (Silver/CDM):&lt;/strong&gt; Deduplicated, standardized, and filtered data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated (Gold/DWD):&lt;/strong&gt; Themed data organized by business subjects (User, Order, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregated (Platinum/DM):&lt;/strong&gt; Summarized data ready for BI dashboards.&lt;/li&gt;
&lt;/ul&gt;
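&lt;p&gt;A consistent path convention makes the layers self-documenting. A minimal sketch; the bucket name and key fields are hypothetical:&lt;/p&gt;

```python
# Hypothetical layer-to-path convention; adapt names to your own lake layout.
LAYER_PATHS = {
    "bronze": "s3://lake/bronze/{source}/{table}/dt={dt}/",
    "silver": "s3://lake/silver/{domain}/{table}/dt={dt}/",
    "gold":   "s3://lake/gold/{subject}/{table}/dt={dt}/",
}

def layer_path(layer, **parts):
    """Build the storage path for a table in a given layer."""
    return LAYER_PATHS[layer].format(**parts)

print(layer_path("bronze", source="crm", table="orders", dt="2026-02-13"))
```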

&lt;h3&gt;
  
  
  B. Partitioning Strategy
&lt;/h3&gt;

&lt;p&gt;Partitions reduce the amount of data scanned, directly boosting query performance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partition Keys:&lt;/strong&gt; Choose high-frequency filter fields (e.g., &lt;code&gt;dt&lt;/code&gt;, &lt;code&gt;region&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Granularity:&lt;/strong&gt; Avoid the "small file problem" by keeping partitions coarse enough (e.g., partition by &lt;code&gt;day&lt;/code&gt; rather than &lt;code&gt;second&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  C. Data Lifecycle Management (DLM)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hot Data (&amp;lt;7 days):&lt;/strong&gt; High-performance storage (SSD / Delta Lake active partitions).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warm Data (7 days - 3 months):&lt;/strong&gt; Standard object storage (S3 Standard).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cold Data (&amp;gt;3 months):&lt;/strong&gt; Archival storage (S3 Glacier) to minimize costs.&lt;/li&gt;
&lt;/ul&gt;
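
&lt;p&gt;The tiering rule above can be expressed as a small routing function. This is a sketch: the thresholds mirror the list, and the tier labels are illustrative, not product recommendations:&lt;/p&gt;

```python
# Route data to a storage tier by age, mirroring the thresholds above.
# Tier labels are illustrative.
def storage_tier(age_days):
    if age_days >= 90:       # older than roughly 3 months
        return "cold: archival storage (S3 Glacier)"
    if age_days >= 7:        # between 7 days and 3 months
        return "warm: standard object storage (S3 Standard)"
    return "hot: SSD / active Delta partitions"
```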




&lt;h2&gt;
  
  
  3. Hands-on: Building a Lakehouse with Delta Lake
&lt;/h2&gt;

&lt;p&gt;Delta Lake is the backbone of the Lakehouse architecture, providing ACID transactions and Schema Enforcement. Here is a Python/PySpark snippet to get you started:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta.tables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Initialize Spark with Delta Support
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DeltaLakehouseDemo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.extensions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;io.delta.sql.DeltaSparkSessionExtension&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.catalog.spark_catalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;org.apache.spark.sql.delta.catalog.DeltaCatalog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Ingest to Raw Layer (ODS)
&lt;/span&gt;&lt;span class="n"&gt;ods_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./lakehouse/ods/user_behavior&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-05-20&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;click&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-05-20&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purchase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ods_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Time Travel Capability
# Access a specific version of your data effortlessly
&lt;/span&gt;&lt;span class="n"&gt;df_v0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;versionAsOf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ods_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4. ACID Transaction: Atomic Updates
&lt;/span&gt;&lt;span class="n"&gt;delta_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ods_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;delta_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user1&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="s"&gt;view&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Updated Lakehouse Data:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ods_path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Decision Tree: Choosing Your Architecture
&lt;/h2&gt;

&lt;p&gt;Not every project needs a full Lakehouse. Use this decision tree from our handbook to decide:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Do you only need structured data for fixed BI reports?&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Yes:&lt;/em&gt; Traditional Data Warehouse (Snowflake/Redshift).&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;No:&lt;/em&gt; Proceed.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do you need to store unstructured data (logs, videos, JSON)?&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Yes:&lt;/em&gt; Proceed to a Data Lake or Lakehouse.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do you need ACID transactions and schema enforcement?&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;No:&lt;/em&gt; Pure Data Lake (S3 + Hive/Glue).&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Yes:&lt;/em&gt; &lt;strong&gt;Data Lakehouse&lt;/strong&gt; (Delta Lake / Iceberg).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
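
&lt;p&gt;The same decision tree, expressed as a small function (a sketch; the return strings are just labels):&lt;/p&gt;

```python
# The decision tree above as a function; the return strings are labels.
def choose_architecture(fixed_bi_only, needs_unstructured, needs_acid):
    if fixed_bi_only or not needs_unstructured:
        return "Data Warehouse (Snowflake/Redshift)"
    if needs_acid:
        return "Data Lakehouse (Delta Lake / Iceberg)"
    return "Data Lake (S3 + Hive/Glue)"
```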




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The evolution from Warehouses to Lakehouses represents a shift toward balancing agility with governance. By implementing layering, partitioning, and lifecycle management, you can build a storage layer that scales with your business.&lt;/p&gt;

&lt;p&gt;For the full architectural guide and more hands-on demos, visit our repository:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are you still using a traditional Data Warehouse, or have you migrated to a Lakehouse? Share your migration stories below! 👇&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>learning</category>
      <category>discuss</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How to Build Scalable Data Pipelines: Lessons from the Data Engineering Book</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 09:50:10 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/how-to-build-scalable-data-pipelines-lessons-from-the-data-engineering-book-2afn</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/how-to-build-scalable-data-pipelines-lessons-from-the-data-engineering-book-2afn</guid>
      <description>&lt;h1&gt;
  
  
  Data Ingestion 101: Building Robust Pipelines with CDC, Batch, and APIs 🛠️
&lt;/h1&gt;

&lt;p&gt;Data ingestion is the "first gateway" of data engineering. The stability and efficiency of your ingestion layer directly determine the quality of all downstream processing and analytics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlhrnwbyq4hbkmzl3hg6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlhrnwbyq4hbkmzl3hg6.png" alt=" " width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this guide, based on the open-source &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt;, we’ll explore how to handle different data sources, choose the right ingestion patterns, and implement a real-time CDC pipeline.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Understanding Your Data Sources
&lt;/h2&gt;

&lt;p&gt;We categorize data sources into two main dimensions: &lt;strong&gt;Form&lt;/strong&gt; and &lt;strong&gt;Latency&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  By Form
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structured:&lt;/strong&gt; Databases (MySQL, PostgreSQL), CSVs, or ERP exports with fixed schemas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semi-Structured:&lt;/strong&gt; JSON/XML logs, Kafka messages, and NoSQL (MongoDB). These require schema inference or flattening.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unstructured:&lt;/strong&gt; PDFs, images, and audio/video files.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  By Latency
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batch (Offline):&lt;/strong&gt; Daily/weekly reports or full database backups. High latency, but high data integrity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming (Real-time):&lt;/strong&gt; User clickstreams, payment logs, and DB change logs. Requires millisecond-level processing.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Core Ingestion Strategies
&lt;/h2&gt;

&lt;p&gt;Based on the &lt;a href="https://github.com/datascale-ai/data_engineering_book/" rel="noopener noreferrer"&gt;Data Engineering Book&lt;/a&gt;, there are three primary patterns:&lt;/p&gt;

&lt;h3&gt;
  
  
  A. CDC (Change Data Capture)
&lt;/h3&gt;

&lt;p&gt;The gold standard for database synchronization. It captures row-level changes (Insert/Update/Delete) from database logs (like MySQL Binlog) without impacting the production application.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Top Tool:&lt;/strong&gt; &lt;strong&gt;Flink CDC&lt;/strong&gt; (supports full + incremental sync).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  B. Batch Ingestion
&lt;/h3&gt;

&lt;p&gt;Standardized scheduled pulls for offline scenarios.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools:&lt;/strong&gt; &lt;strong&gt;DataX&lt;/strong&gt;, &lt;strong&gt;Apache Sqoop&lt;/strong&gt;, or even &lt;strong&gt;Python/Pandas&lt;/strong&gt; for smaller datasets.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  C. API Pulling
&lt;/h3&gt;

&lt;p&gt;The go-to method for 3rd-party SaaS (Stripe, Shopify, TikTok Ads).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key Challenges:&lt;/strong&gt; Handling OAuth2, pagination logic, and exponential backoff for rate limiting.&lt;/li&gt;
&lt;/ul&gt;
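
&lt;p&gt;Exponential backoff is worth spelling out, since naive retry loops get API clients throttled or banned. Below is a minimal sketch with an injected &lt;code&gt;sleep&lt;/code&gt; and a fake endpoint so the logic runs without a network; the retry parameters and function names are illustrative:&lt;/p&gt;

```python
import random

# Retry a rate-limited call with exponential backoff plus jitter.
# max_retries and base_delay are illustrative defaults; pass time.sleep
# as `sleep` in production (the default is a no-op so this runs instantly).
def fetch_with_backoff(call, max_retries=5, base_delay=1.0, sleep=None):
    sleep = sleep or (lambda seconds: None)
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # jitter de-synchronizes retries across parallel workers
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Fake endpoint that rate-limits the first two calls, then succeeds.
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] >= 3:
        return {"status": 200}
    raise RuntimeError("429 Too Many Requests")

result = fetch_with_backoff(flaky_call)   # succeeds on the 3rd attempt
```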




&lt;h2&gt;
  
  
  3. Hands-on: Real-time MySQL to Kafka Pipeline
&lt;/h2&gt;

&lt;p&gt;Let's implement a real-time sync using &lt;strong&gt;Flink CDC&lt;/strong&gt;. This setup captures every change in a MySQL table and streams it to Kafka as a JSON message.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Code (Java/Flink)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MySql2KafkaCDC&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;throws&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;StreamExecutionEnvironment&lt;/span&gt; &lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StreamExecutionEnvironment&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getExecutionEnvironment&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;enableCheckpointing&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Critical for preventing data loss&lt;/span&gt;

        &lt;span class="c1"&gt;// 1. Configure MySQL CDC Source&lt;/span&gt;
        &lt;span class="nc"&gt;MySqlSource&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;mySqlSource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MySqlSource&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;hostname&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"localhost"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3306&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;databaseList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"production_db"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;tableList&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"production_db.orders"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cdc_user"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cdc_password"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;deserializer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;JsonDebeziumDeserializationSchema&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="c1"&gt;// Convert to JSON&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// 2. Configure Kafka Sink&lt;/span&gt;
        &lt;span class="nc"&gt;KafkaSink&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;kafkaSink&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaSink&lt;/span&gt;&lt;span class="o"&gt;.&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setBootstrapServers&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"localhost:9092"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setRecordSerializer&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;KafkaRecordSerializationSchema&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
                        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setTopic&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"db_changes_orders"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setValueSerializationSchema&lt;/span&gt;&lt;span class="o"&gt;((&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getBytes&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// 3. Run the Pipeline&lt;/span&gt;
        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fromSource&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mySqlSource&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;WatermarkStrategy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;noWatermarks&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"MySQL-Source"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sinkTo&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaSink&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="n"&gt;env&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"MySQL to Kafka Real-time Sync"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. Common Pitfalls (And How to Avoid Them)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🚨 Data Loss
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scenario:&lt;/strong&gt; A Flink job restarts but doesn't have Checkpointing enabled, losing the current offset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; &lt;strong&gt;Always&lt;/strong&gt; enable persistent Checkpointing (S3/HDFS) and implement &lt;strong&gt;Idempotent Writes&lt;/strong&gt; at the sink.&lt;/li&gt;
&lt;/ul&gt;
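
&lt;p&gt;Idempotent writes are the half of the fix that is easy to get wrong. The sketch below shows the idea with an in-memory store keyed by primary key: replaying the same change event after a restart (at-least-once delivery) cannot create a duplicate. The store and event shape are hypothetical:&lt;/p&gt;

```python
# In-memory sketch of an idempotent sink: writes are keyed by the
# record's primary key, so replaying the same change event after a
# restart cannot create a duplicate. Store and event shape are made up.
store = {}

def idempotent_write(event):
    store[event["order_id"]] = event   # upsert: a replay just overwrites

evt = {"order_id": 42, "status": "paid"}
idempotent_write(evt)
idempotent_write(evt)   # redelivery after a checkpoint restore: harmless
```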

&lt;h3&gt;
  
  
  🐢 Data Lag
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scenario:&lt;/strong&gt; Binlog accumulation or insufficient Kafka partitions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Increase Flink parallelism and split synchronization for giant tables into separate jobs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🧩 Schema Drift
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scenario:&lt;/strong&gt; Upstream DB changes a column from &lt;code&gt;INT&lt;/code&gt; to &lt;code&gt;STRING&lt;/code&gt;, breaking your pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Use &lt;strong&gt;Schema Validation&lt;/strong&gt; tools (like &lt;em&gt;Great Expectations&lt;/em&gt;) at the ingestion layer to catch mismatches early.&lt;/li&gt;
&lt;/ul&gt;
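
&lt;p&gt;The idea behind such a check fits in a few lines of plain Python (this is &lt;em&gt;not&lt;/em&gt; the Great Expectations API, just the concept; the expected schema and the drifted record are invented for the example):&lt;/p&gt;

```python
# Plain-Python schema check at the ingestion boundary (the concept
# Great Expectations productizes; this is NOT its API). The expected
# schema and the drifted record are invented for the example.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "status": str}

def validate(record, schema=EXPECTED_SCHEMA):
    """Return (field, problem) pairs; an empty list means the row is OK."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append((field, "missing"))
        elif not isinstance(record[field], expected_type):
            problems.append((field, f"expected {expected_type.__name__}"))
    return problems

# Upstream drifted order_id from INT to STRING -- caught before loading.
issues = validate({"order_id": "A-1001", "amount": 9.99, "status": "paid"})
```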




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Ingestion is the first line of defense for your data system. Small leaks here become floods downstream.&lt;/p&gt;

&lt;p&gt;For the full Docker-compose environment (MySQL + Kafka + Flink) and complete source code, head over to our repository:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's your preferred tool for data ingestion? Are you a Flink CDC fan or do you prefer Airbyte/Meltano? Let's discuss below! 👇&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>opensource</category>
      <category>learning</category>
      <category>discuss</category>
    </item>
    <item>
      <title>The Modern Data Stack: A Guide from the Open-Source Data Engineering Book</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 09:40:07 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/the-modern-data-stack-a-guide-from-the-open-source-data-engineering-book-34a4</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/the-modern-data-stack-a-guide-from-the-open-source-data-engineering-book-34a4</guid>
      <description>&lt;h1&gt;
  
  
  Data Engineering Fundamentals: Definitions, Tech Stacks, and Mastery Roadmap 🏗️
&lt;/h1&gt;

&lt;p&gt;Data Engineering is the "infrastructure" of the big data world. However, many people still confuse it with Data Analysis or Data Science.&lt;/p&gt;

&lt;p&gt;In this post, we’ll use the open-source &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt; to deconstruct the core logic of data engineering—from its definition and tech stack to the competency model and a quick self-test.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feltjgne7962wv38yjwnf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feltjgne7962wv38yjwnf.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub Repo:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. What exactly is Data Engineering?
&lt;/h2&gt;

&lt;p&gt;In our handbook, we define Data Engineering as &lt;strong&gt;the engineering practice of turning data into assets.&lt;/strong&gt; The core goal is to build stable, scalable, and efficient pipelines that transform raw, fragmented, and heterogeneous data into structured, reusable, and high-availability assets.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "House" Analogy: DE vs. DA vs. DS
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Data Engineering (DE)&lt;/th&gt;
&lt;th&gt;Data Analytics (DA)&lt;/th&gt;
&lt;th&gt;Data Science (DS)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Goal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Build pipelines/foundations&lt;/td&gt;
&lt;td&gt;Interpret data/Business QA&lt;/td&gt;
&lt;td&gt;Build predictive models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data Warehouse, ETL, APIs&lt;/td&gt;
&lt;td&gt;Reports, Insights, Dashboards&lt;/td&gt;
&lt;td&gt;ML Models, AI Systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Analogy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;The Architect&lt;/strong&gt; (Builds the house)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;The Interior Designer&lt;/strong&gt; (Uses the house)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;The Scientist&lt;/strong&gt; (Optimizes house functions)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  2. Breaking Down the Modern Tech Stack
&lt;/h2&gt;

&lt;p&gt;We categorize the stack based on the "Data Flow Lifecycle" rather than just listing tools:&lt;/p&gt;

&lt;h3&gt;
  
  
  📥 Storage: The "Containers"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structured:&lt;/strong&gt; Data Warehouses (Snowflake, ClickHouse, BigQuery).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unstructured:&lt;/strong&gt; Data Lakes (S3, HDFS, MinIO).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unified:&lt;/strong&gt; &lt;strong&gt;Lakehouse&lt;/strong&gt; (Delta Lake, Iceberg, Hudi) — Solving the rigidity of warehouses and the chaos of lakes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ⚙️ Compute: The "Processing Center"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batch Processing:&lt;/strong&gt; Spark, Flink Batch — For heavy-duty offline processing (e.g., daily syncs).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stream Processing:&lt;/strong&gt; Flink, Kafka Streams — For real-time processing (e.g., live order monitoring).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight Compute:&lt;/strong&gt; Polars, Dask, Trino — High-performance tools for small-to-medium datasets.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🎼 Orchestration: The "Conductor"
&lt;/h3&gt;

&lt;p&gt;The "brain" that ensures tasks run in order (scheduling, retries, dependencies).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key Tools:&lt;/strong&gt; &lt;strong&gt;Apache Airflow&lt;/strong&gt; (The industry standard), Dagster, Prefect.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🛡️ Operations &amp;amp; Observability: The "Safety Net"
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Observability:&lt;/strong&gt; Prometheus + Grafana (Monitoring), ELK (Logging).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Quality:&lt;/strong&gt; Great Expectations, Soda — Checking for missing values or schema drift.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engineering Standards:&lt;/strong&gt; CI/CD (GitHub Actions), Environment Isolation.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. The Data Engineering Competency Model
&lt;/h2&gt;

&lt;p&gt;One of the highlights of the &lt;code&gt;data_engineering_book&lt;/code&gt; is the &lt;strong&gt;Growth Map&lt;/strong&gt;, moving beyond "tool-watching" to "capability-building":&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Foundational (The Essentials):&lt;/strong&gt; SQL (Window functions, CTEs), Data Modeling (Star/Snowflake schema), Linux/Python basics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core Engineering (Mid-Level):&lt;/strong&gt; Designing ETL/ELT pipelines, understanding Batch vs. Stream, and mastering CDC (Change Data Capture).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ecosystem &amp;amp; Business (Senior):&lt;/strong&gt; Abstracting business needs into data architectures and managing cross-team data contracts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expert Level:&lt;/strong&gt; Building automated data platforms, cost optimization (FinOps), and ensuring global compliance (GDPR/Data Privacy).&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🧠 Quick Quiz: Are you ready?
&lt;/h2&gt;

&lt;p&gt;These questions are pulled from Part 1 of our book. Can you answer them?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is the core difference between &lt;strong&gt;ETL&lt;/strong&gt; and &lt;strong&gt;ELT&lt;/strong&gt;? When should you use which?&lt;/li&gt;
&lt;li&gt;What are the pros and cons of &lt;strong&gt;Star Schema&lt;/strong&gt; vs. &lt;strong&gt;Snowflake Schema&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;What is a &lt;strong&gt;DAG&lt;/strong&gt; in Airflow, and how does it manage task dependencies?&lt;/li&gt;
&lt;li&gt;What problem does a &lt;strong&gt;Lakehouse&lt;/strong&gt; (e.g., Delta Lake) solve that a traditional Data Lake cannot?&lt;/li&gt;
&lt;li&gt;How do you validate &lt;strong&gt;Data Completeness&lt;/strong&gt; in a production pipeline?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;(Check the answers in our &lt;a href="https://datascale-ai.github.io/" rel="noopener noreferrer"&gt;GitHub Wiki/Docs&lt;/a&gt;)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Data Engineering is about moving from being a "tool user" to a "system designer." If you’re looking for a systematic path to master these skills, check out our repository.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you found this helpful, give us a Star ⭐️ on GitHub to support open-source education!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>learning</category>
      <category>showdev</category>
      <category>data</category>
    </item>
    <item>
      <title>Data Engineering for LLMs: A Comprehensive Open-Source Guide 🚀</title>
      <dc:creator>Xin Xu</dc:creator>
      <pubDate>Fri, 13 Feb 2026 09:32:30 +0000</pubDate>
      <link>https://dev.to/xin_xu_5c36b5326e7008e281/data-engineering-for-llms-a-comprehensive-open-source-guide-5b65</link>
      <guid>https://dev.to/xin_xu_5c36b5326e7008e281/data-engineering-for-llms-a-comprehensive-open-source-guide-5b65</guid>
      <description>&lt;h1&gt;
  
  
  Data Engineering for LLMs: The Open-Source Guide to High-Quality Data Pipelines 🚀
&lt;/h1&gt;

&lt;p&gt;In the era of Large Language Models (LLMs), we all know that &lt;strong&gt;"Data quality determines the model's upper limit."&lt;/strong&gt; However, most developers are still "crossing the river by feeling the stones" when it comes to LLM data engineering. Finding systematic resources for data collection, cleaning, alignment, and RAG pipelines is surprisingly difficult. Many end up with datasets that are either low quality or impossible to deploy in production.&lt;/p&gt;

&lt;p&gt;That’s why we created &lt;strong&gt;&lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;data_engineering_book&lt;/a&gt;&lt;/strong&gt; — a one-stop open-source guide for LLM data engineering, covering architecture, algorithms, and real-world projects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmn8nose8erq7skahljp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmn8nose8erq7skahljp.png" alt=" " width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt;&lt;br&gt;
👉 &lt;strong&gt;Live Docs:&lt;/strong&gt; &lt;a href="https://datascale-ai.github.io/" rel="noopener noreferrer"&gt;Read Online Here&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🛠 Why This Project?
&lt;/h2&gt;

&lt;p&gt;Current industry pain points are clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fragmented Knowledge:&lt;/strong&gt; Tutorials are scattered across random blogs and papers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model-Centric Bias:&lt;/strong&gt; Too much focus on "fine-tuning parameters" while ignoring the &lt;strong&gt;Data-Centric AI&lt;/strong&gt; core.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of Production Context:&lt;/strong&gt; Theory is great, but how do you scale a cleaning pipeline to billions of tokens?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our goal is to bridge this gap, helping you move from "using tools" to "building robust data lifecycles."&lt;/p&gt;




&lt;h2&gt;
  
  
  🏗 What’s Inside?
&lt;/h2&gt;

&lt;p&gt;The handbook is structured into 6 parts, covering 13 chapters and &lt;strong&gt;5 end-to-end production projects&lt;/strong&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  🗺 The Roadmap
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 1: Infrastructure &amp;amp; Core Concepts&lt;/strong&gt; (Modern stack selection)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 2: Text Pre-training&lt;/strong&gt; (Scraping, Cleaning, Tokenization)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3: Multi-modal Data&lt;/strong&gt; (Image-text pairs, Audio, Video)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 4: Alignment &amp;amp; Synthetic Data&lt;/strong&gt; (SFT, RLHF, and Synthetic generation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 5: Application-level Engineering&lt;/strong&gt; (Advanced RAG pipelines)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 6: Hands-on Projects&lt;/strong&gt; (Runnable enterprise-level code)&lt;/li&gt;
&lt;/ul&gt;
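&lt;p&gt;To make Part 2 concrete: the cleaning stage is mostly line-level heuristics. Below is a tiny C4-style filter sketch — the word-count threshold, terminal-punctuation rule, and "javascript" boilerplate marker are illustrative stand-ins, not the exact C4 rules:&lt;/p&gt;

```python
import re

def c4_style_clean(text):
    # Keep a line only if it looks like real prose rather than page chrome.
    kept = []
    for line in text.splitlines():
        line = line.strip()
        long_enough = len(line.split()) >= 3            # drop very short lines
        ends_sentence = bool(re.search(r'[.!?"]$', line))
        no_boilerplate = "javascript" not in line.lower()
        if long_enough and ends_sentence and no_boilerplate:
            kept.append(line)
    return "\n".join(kept)

raw = ("Click here\n"
       "Enable javascript to view this page.\n"
       "Data quality determines the model's upper limit.\n")
print(c4_style_clean(raw))  # only the last line survives
```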




&lt;h2&gt;
  
  
  💻 The Modern Tech Stack
&lt;/h2&gt;

&lt;p&gt;We don't just talk theory. We focus on tools used in production today:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Tech Stack&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Distributed Computing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ray Data, Apache Spark&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Parquet, WebDataset, Vector DBs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;NLP Processing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Trafilatura, KenLM, MinHash LSH&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-modal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLIP, ColPali, img2dataset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Versioning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;DVC, LakeFS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
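&lt;p&gt;Of the tools above, MinHash is the easiest to demystify in a few lines. This toy version approximates Jaccard similarity with salted hashes standing in for true permutations; a production pipeline would reach for a library such as &lt;code&gt;datasketch&lt;/code&gt; and add LSH banding for sub-linear lookup:&lt;/p&gt;

```python
import hashlib

def shingles(text, k=3):
    # Character k-grams as the document's feature set
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(features, num_perm=64):
    # For each "permutation" (a salt), keep the minimum hash over features;
    # matching minima between two signatures estimate Jaccard similarity.
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{f}".encode()).hexdigest(), 16)
            for f in features
        ))
    return sig

def similarity(sig_a, sig_b):
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash(shingles("the quick brown fox jumped over the lazy dog"))
c = minhash(shingles("completely different sentence about data pipelines"))
print(similarity(a, b), similarity(a, c))  # near-duplicates score much higher
```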




&lt;h2&gt;
  
  
  🚀 Hands-on Projects You Can Run
&lt;/h2&gt;

&lt;p&gt;The repo includes 5 full-stack projects with reusable code:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Mini-C4 Construction:&lt;/strong&gt; Build a pre-training dataset from scratch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legal Expert SFT:&lt;/strong&gt; High-quality instruction set generation for vertical domains.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-modal Instruction Sets:&lt;/strong&gt; Building visual-language datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthetic Data Pipeline:&lt;/strong&gt; Using LLMs to generate training data for LLMs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-modal RAG:&lt;/strong&gt; An enterprise-grade financial report assistant.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🌟 Support the Project
&lt;/h2&gt;

&lt;p&gt;This is a community-driven project maintained by the &lt;code&gt;datascale-ai&lt;/code&gt; team. It’s licensed under MIT and supports both &lt;strong&gt;English and Chinese&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you find this resource helpful for your AI journey, we’d love your support:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Star the Repo:&lt;/strong&gt; &lt;a href="https://github.com/datascale-ai/data_engineering_book" rel="noopener noreferrer"&gt;datascale-ai/data_engineering_book&lt;/a&gt; ⭐️&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contribute:&lt;/strong&gt; Open an Issue or PR if you have better ideas for data cleaning or RAG optimization!&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;What is the biggest challenge you've faced in your LLM data pipeline?&lt;/strong&gt; Let’s discuss in the comments! 👇&lt;/p&gt;




</description>
      <category>ai</category>
      <category>discuss</category>
      <category>architecture</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
